A Rust DataFrame implementation, built on Apache Arrow

Overview

Rust DataFrame

A dataframe implementation in Rust, powered by Apache Arrow.

What is a dataframe?

A dataframe is a 2-dimensional tabular data structure that is often used for computations and other data transformations. Each column of a dataframe typically holds values of a single data type, similar to a column in a SQL table.

Functionality

This project is inspired by Pandas and other dataframe libraries, but currently borrows its functions specifically from Apache Spark.

It mainly focuses on computation, and aims to include:

  • Scalar functions
  • Aggregate functions
  • Window functions
  • Array functions

As a point of reference, we use Apache Spark Python functions for function parity, and aim to be compatible with Apache Spark functions.

Eager vs Lazy Evaluation

The initial experiments of this project were to see if it's possible to create some form of dataframe. We're happy that this condition is met. However, the initial version relied on eager evaluation, which would make it slow and difficult to use in a REPL fashion.

We are mainly focusing on creating a process for lazy evaluation (the current LazyFrame), which involves reading an input's schema, then applying transformations on that schema until a materialising action is required. While we are still figuring this out, there might not be much visible progress, as most of this work is happening offline.

The plan is to provide a reasonable API for lazily transforming data, and the ability to apply some optimisations on the computation graph (e.g. predicate pushdown, rearranging computations).
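As a rough illustration of the idea above, here is a minimal sketch of a frame that tracks only a schema and a list of deferred transformations. All names (`DataType`, `Schema`, `Computation`, `LazySketch`) are assumptions made for this example and are not the crate's actual LazyFrame API. No data is read at any point; a materialising action would later execute the recorded plan.

```rust
// Hypothetical types for illustration only; not the crate's real API.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Float64,
    Utf8,
}

/// In this sketch a schema is an ordered list of (column name, type) pairs.
pub type Schema = Vec<(String, DataType)>;

/// A deferred transformation, described only in terms of the schema.
#[derive(Debug, Clone, PartialEq)]
pub enum Computation {
    /// Keep only the named columns, in the given order.
    Select(Vec<String>),
    /// Add a computed column (the expression itself is left out of the sketch).
    WithColumn { name: String, data_type: DataType },
    /// Remove a column.
    Drop(String),
}

pub struct LazySketch {
    schema: Schema,
    plan: Vec<Computation>,
}

impl LazySketch {
    /// Start from a source's schema without reading any data.
    pub fn from_schema(schema: Schema) -> Self {
        Self { schema, plan: Vec::new() }
    }

    /// Validate a transformation against the current schema, update the
    /// schema, and record the step for later execution.
    pub fn apply(mut self, c: Computation) -> Result<Self, String> {
        match &c {
            Computation::Select(names) => {
                let mut next = Vec::with_capacity(names.len());
                for n in names {
                    let field = self
                        .schema
                        .iter()
                        .find(|(name, _)| name == n)
                        .ok_or_else(|| format!("unknown column: {}", n))?;
                    next.push(field.clone());
                }
                self.schema = next;
            }
            Computation::WithColumn { name, data_type } => {
                if self.schema.iter().any(|(n, _)| n == name) {
                    return Err(format!("column already exists: {}", name));
                }
                self.schema.push((name.clone(), data_type.clone()));
            }
            Computation::Drop(name) => {
                let before = self.schema.len();
                self.schema.retain(|(n, _)| n != name);
                if self.schema.len() == before {
                    return Err(format!("unknown column: {}", name));
                }
            }
        }
        self.plan.push(c);
        Ok(self)
    }

    pub fn schema(&self) -> &Schema {
        &self.schema
    }

    pub fn plan(&self) -> &[Computation] {
        &self.plan
    }
}
```

Because every step is checked against the schema when it is added, mistakes such as selecting a missing column surface before any data is touched, and the recorded plan remains available for later optimisation.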

In the future, LazyFrame will probably be renamed to DataFrame, and the current eagerly evaluated DataFrame will be removed or made private.

The ongoing experiments on lazy evaluation are in the master branch, and we would appreciate some help 🙏🏾.

Non-Goals

Although we use Apache Spark as a reference, we do not intend to support distributed computation beyond a single machine.

Following Spark is a convenience to reduce bikeshedding, but we will probably provide a more idiomatic Rust API in the future.

Status

A low-level API can already be used for simple tasks that do not require aggregations, joins or sorts. A simpler API is currently not a priority until we have more capabilities to transform data.

One good potential immediate use of the library would be copying data from one supported data source to another (e.g. PostgreSQL to Arrow or CSV with minimal transformations).

Roadmap

  • Lazy evaluation (H1 2020)
    • Aggregations
    • Joins
    • Sorting
  • Adding compute fns (H1 2020)
  • Bindings to other languages (H2 2020)

IO

We are working on IO support, with priority for SQL read and write. PostgreSQL IO is supported using the binary protocol, although some data types are not yet supported (lists, structs, numeric, and a few other non-primitive types).

  • IO Support
    • CSV
      • Read
      • Write
    • JSON
      • Read
      • Write
    • Arrow IPC
      • Read File
      • Write File
    • Parquet
      • Read File
      • Write File
    • SQL (part of an effort to create generic DB traits)
      • PostgreSQL (Primitive and temporal types supported, PRs welcome for other types)
        • Read
        • Write
      • MSSQL (using tiberius)
        • Read
        • Write
      • MySQL
        • Read
        • Write

Functionality

  • DataFrame Operations

    • Select single column
    • Select subset of columns, drop columns
    • Add or remove columns
    • Rename columns
    • Create dataframe from record batches (a Vec<RecordBatch> as well as an iterator)
    • Sort dataframes
    • Grouped operations
    • Filter dataframes
    • Join dataframes
  • Scalar Functions

    • Trig functions (sin, cos, tan, asin, asinh, ...) (using the num crate where possible)
    • Basic arithmetic (add, mul, divide, subtract), implemented via Arrow
    • Date/Time functions
    • String functions
      • [-] Basic string manipulation
      • Regular expressions (leveraging regex)
      • Casting to and from strings (using Arrow compute's cast kernel)
    • Crypto/hash functions (md5, crc32, sha{x}, ...)
    • Other functions (that we haven't classified)
  • Aggregate Functions

    • Sum, max, min
    • Count
    • Statistical aggregations (mean, mode, median, stddev, ...)
  • Window Functions

    • Lead, lag
    • Rank, percent rank
    • Other
  • Array Functions

    • Compatibility with Spark 2.4 functions
    • Compatibility with Spark 3.0 functions

Performance

We plan on providing simple benchmarks in the near future. The current blockers are:

  • IO
    • Text format (CSV)
    • Binary format (Arrow IPC)
    • SQL
  • [-] Lazy operations
  • Aggregation
  • Joins
Comments
  • Fix/gh failing build

    Hi @nevi-me, I noticed the GH action was failing to build, so I added the component for rustfmt. I then noticed tests failing and #29, so here is the start of integrating Postgres into GH. It is still failing with relation "arrow_data" does not exist, but I figured you would know more about how you want to run setup for these tests.

    opened by jbpratt 11
  • partitioned parquet dataset as data source

    I am interested in working on adding support for loading a partitioned parquet dataset (multiple files) as a data source in a read transformation.

    What do you think is the best way to implement this? Here are a couple of options that come to mind:

    • add a new variant DataSourceType::ParquetDataset
    • extend DataSourceType::Parquet to also take a list of file paths
    • introduce a higher-level abstraction that represents a sequence of data sources, so other formats like CSV can reuse this feature as well
    opened by houqp 5
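To make the third option in the issue above more concrete, here is a small sketch in which a source is either a single file or a dataset of several files sharing one format. The names `FileFormat` and `SourceSketch` are hypothetical and chosen for this example; they do not mirror the crate's existing data source enum.

```rust
use std::path::PathBuf;

// Hypothetical types for illustration; not the crate's real data source enum.
#[derive(Debug, Clone, PartialEq)]
pub enum FileFormat {
    Csv,
    Parquet,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SourceSketch {
    /// A single file in the given format.
    File { format: FileFormat, path: PathBuf },
    /// Several files that share a format (and, by assumption, a schema).
    Dataset { format: FileFormat, paths: Vec<PathBuf> },
}

impl SourceSketch {
    /// A reader can treat both variants uniformly as a list of paths.
    pub fn paths(&self) -> &[PathBuf] {
        match self {
            SourceSketch::File { path, .. } => std::slice::from_ref(path),
            SourceSketch::Dataset { paths, .. } => paths.as_slice(),
        }
    }
}
```

Keeping the format separate from the single-file or multi-file distinction means CSV and other formats get dataset support without a dedicated variant per format.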
  • Implement Sort

    Add the ability to sort dataframe by one or multiple criteria. Before implementing this, look at what Arrow's doing, and whether it's possible to implement this functionality there.

    opened by nevi-me 5
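Sorting by multiple criteria, as requested in the issue above, can be sketched as a lexicographic sort over row indices. This example uses only the standard library and plain `i64` slices in place of Arrow arrays; `SortKey` and `lex_sort_indices` are hypothetical names, and all key columns are assumed to have the same length.

```rust
use std::cmp::Ordering;

/// One sort criterion: a column of values and a direction.
/// Hypothetical type for illustration; plain slices stand in for Arrow arrays.
pub struct SortKey<'a> {
    pub values: &'a [i64],
    pub descending: bool,
}

/// Returns row indices ordered by the keys in sequence.
/// Later keys are only consulted to break ties in earlier ones.
/// Assumes every key has the same number of values.
pub fn lex_sort_indices(keys: &[SortKey<'_>]) -> Vec<usize> {
    let len = keys.first().map_or(0, |k| k.values.len());
    let mut indices: Vec<usize> = (0..len).collect();
    indices.sort_by(|&a, &b| {
        for key in keys {
            let ord = key.values[a].cmp(&key.values[b]);
            let ord = if key.descending { ord.reverse() } else { ord };
            if ord != Ordering::Equal {
                return ord;
            }
        }
        Ordering::Equal
    });
    indices
}
```

The resulting indices can then be passed to a take-style operation to reorder every column of the dataframe consistently.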
  • Looking to contribute

    Hi - I appreciate the good work you guys are doing here and I'd like to contribute to this project. Is there a way to get in contact? My e-mail address is [email protected]

    opened by harris-chris 3
  • implement lex sort

    Adds initial sorting support, currently limited to Arrow's ArrowNumericType. More sort types are pending upstream work (floats, lists, structs, etc).

    Closes #16

    opened by nevi-me 1
  • Lazy DataSource Reader

    Add the ability to read a data source and create a deferred dataframe with the source's schema.

    There should be a common interface for:

    • [x] CSV
    • [x] JSON
    • [x] SQL
    • [x] Arrow File
    opened by nevi-me 1
  • [WIP] ChunkedArray, Column and Table

    ref #4

    This PR implements a ChunkedArray, Column and Table, which although not part of the Arrow format, are used in CPP, Python, Go and potentially other implementations.

    This is currently incomplete, and I need to think of:

    • How to make the functions module work with Column instead of PrimitiveArray<T>.
    • Where to use references to avoid copies and moves
    • How to zero-copy arrays and chunks thereof
    opened by nevi-me 1
  • Arrow Table

    A Table that works similarly to other Arrow impls (cpp, python, js, Java)

    The aim is to move this to the Arrow repo once I'm happy with it. There is a dependency on:

    • [x] https://issues.apache.org/jira/browse/ARROW-3954 (slice to array)
    • [x] https://issues.apache.org/jira/browse/ARROW-3706 (record batch reader trait)
    • [x] https://issues.apache.org/jira/browse/ARROW-3688 (append_values)
    opened by nevi-me 1
  • Rust Primitive to Arrow Native Type

    For some computations, we need the ability to convert primitive types to native types. I thought a simple std::convert::From would work, but it's not working.

    opened by nevi-me 1
  • Basic dataframe ops

    Should be able to:

    • [x] Create a dataframe from data [#3]
    • [x] Add columns to dataframe
    • [x] Remove columns from dataframe
    • [x] Create new columns as computations
    • [x] Select a subset of the dataframe by column names
    opened by nevi-me 1
  • multiple branch possible

    In table.rs:195-207 there seems to be the possibility of multiple branches on the variable bounded_len:

        pub fn take(&self, indices: &UInt32Array, chunk_size: usize) -> Result<Self> {
            let mut consumed_len = 0;
            let total_len = indices.len();
            let values = self.to_array()?;
            let mut outputs = vec![];
            while consumed_len < total_len {
                let bounded_len = if total_len < chunk_size {
                    total_len
                } else if consumed_len + chunk_size > total_len {
                    chunk_size
                } else {
                    total_len - consumed_len
                };
    
    opened by fgadaleta 0
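As quoted in the issue above, the last two branches appear to be swapped: with total_len = 10 and chunk_size = 4, the first iteration falls through to total_len - consumed_len and yields 10 rather than 4. The rest of `take` is not shown in the issue, so the following is only a sketch of the length calculation, assuming the intent is "at most chunk_size, and never past the end". `bounded_len` and `chunk_lengths` are hypothetical helper names.

```rust
/// Length of the next chunk: at most `chunk_size`, never past `total_len`.
pub fn bounded_len(total_len: usize, consumed_len: usize, chunk_size: usize) -> usize {
    chunk_size.min(total_len.saturating_sub(consumed_len))
}

/// Lengths of all chunks produced by repeatedly applying `bounded_len`.
pub fn chunk_lengths(total_len: usize, chunk_size: usize) -> Vec<usize> {
    // A zero chunk size would never make progress, so reject it up front.
    assert!(chunk_size > 0, "chunk_size must be greater than zero");
    let mut consumed_len = 0;
    let mut lengths = Vec::new();
    while consumed_len < total_len {
        let len = bounded_len(total_len, consumed_len, chunk_size);
        lengths.push(len);
        consumed_len += len;
    }
    lengths
}
```

A single `min` replaces the three-way branch, which also covers the case where total_len is smaller than chunk_size.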
  • Context to allow providing custom data sources, functions, etc.

    I was trying to keep things simple, avoiding a context, but the typesystem won't allow me to have custom sources without some stateful place to register them.

    Nothing special here, so I'll take a cue from DataFusion. What I'm interested in bikeshedding here is how to create an expressive API that allows data sources to declare their capabilities (e.g. I can pushdown sorts, filters, projections).

    If I can get it to work, I'd want to contribute it to DataFusion, as that's where I think a de-facto Rust data analysis library should be.

    opened by nevi-me 0
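One possible shape for the idea in the issue above is a context that holds registered sources, each of which reports which operations it can accept as pushdowns. This is a sketch under assumptions: `Capabilities`, `DataSourceSketch`, `ContextSketch`, `CsvLike` and `SqlLike` are all invented for the example and are not taken from this crate or from DataFusion.

```rust
use std::collections::HashMap;

/// What a source can evaluate itself. Hypothetical; defaults to nothing.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Capabilities {
    pub filters: bool,
    pub projections: bool,
    pub sorts: bool,
}

/// A data source that can describe itself to the planner.
pub trait DataSourceSketch {
    fn name(&self) -> &str;

    /// Sources that do not override this declare no pushdown support.
    fn capabilities(&self) -> Capabilities {
        Capabilities::default()
    }
}

/// A file-like source that relies on the default (no pushdowns).
pub struct CsvLike;

impl DataSourceSketch for CsvLike {
    fn name(&self) -> &str {
        "csv"
    }
}

/// A database-like source that can filter, project and sort by itself.
pub struct SqlLike;

impl DataSourceSketch for SqlLike {
    fn name(&self) -> &str {
        "sql"
    }

    fn capabilities(&self) -> Capabilities {
        Capabilities { filters: true, projections: true, sorts: true }
    }
}

/// The stateful place where custom sources are registered.
#[derive(Default)]
pub struct ContextSketch {
    sources: HashMap<String, Box<dyn DataSourceSketch>>,
}

impl ContextSketch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a source under the name it reports.
    pub fn register(&mut self, source: Box<dyn DataSourceSketch>) {
        let name = source.name().to_string();
        self.sources.insert(name, source);
    }

    /// Look up what a registered source can accept as pushdowns.
    pub fn capabilities(&self, name: &str) -> Option<Capabilities> {
        self.sources.get(name).map(|s| s.capabilities())
    }
}
```

A planner could consult `capabilities` before deciding whether to hand a filter or sort to the source or to evaluate it after reading.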
  • Extensibility of data source

    I am experimenting with evaluating a lazy frame with a custom data source. However, it looks like declaring Reader as a struct makes it hard to add support for custom data sources that shouldn't be part of the core dataframe code base.

    Would it make sense to change Reader and Writer into traits so that custom data source implementations can be fully decoupled from the core code base?

    opened by houqp 5
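The suggestion in the issue above could look roughly like the following, where `Reader` is a trait and a custom source lives entirely outside the core code. This is a sketch under assumptions: the trait methods, the `Batch` alias (a list of `i64` columns standing in for an Arrow record batch) and `InMemoryReader` are all hypothetical.

```rust
/// Stand-in for a record batch: a list of equally long columns.
/// Hypothetical alias for illustration only.
pub type Batch = Vec<Vec<i64>>;

/// What the core needs from any source: a schema and a stream of batches.
pub trait Reader {
    fn schema(&self) -> Vec<String>;
    fn next_batch(&mut self) -> Option<Batch>;
}

/// A custom source defined without touching the core code base.
pub struct InMemoryReader {
    columns: Vec<String>,
    batches: std::vec::IntoIter<Batch>,
}

impl InMemoryReader {
    pub fn new(columns: Vec<String>, batches: Vec<Batch>) -> Self {
        Self { columns, batches: batches.into_iter() }
    }
}

impl Reader for InMemoryReader {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }

    fn next_batch(&mut self) -> Option<Batch> {
        self.batches.next()
    }
}

/// Core-side code depends only on the trait, not on a concrete source.
pub fn count_rows(reader: &mut dyn Reader) -> usize {
    let mut rows = 0;
    while let Some(batch) = reader.next_batch() {
        // All columns in a batch have the same length, so the first suffices.
        rows += batch.first().map_or(0, |col| col.len());
    }
    rows
}
```

With this shape, evaluating a lazy frame only needs a `&mut dyn Reader`, so third-party sources can be plugged in without changes to the core crate.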
  • Apache Arrow Flight support

    Consider adding support for reading data from an Apache Arrow Flight server. Such support could take the form of a Flight client that fetches Arrow data and converts it into a table.

    opened by nevi-me 0
  • Plot out optimisations

    The lazy evaluation model seems to be fine for most operations, and can be usable when the existing holes are plugged. Next step is to plot out how optimisations on Vec<Computation> would work. I can try out simple optimisations such as reordering a filter and calculate to filter before calculating.

    opened by nevi-me 0
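The simple optimisation mentioned in the issue above, filtering before calculating, can be sketched as a pass over a list of steps. `Step` here is a hypothetical, much reduced stand-in for the crate's `Computation`, and the rule is an assumption made for this example: a filter may move ahead of a calculation only if it does not read the column that calculation produces.

```rust
/// Reduced stand-in for a computation step. Hypothetical for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// Produces a new column named `output`.
    Calculate { output: String },
    /// Keeps only rows that satisfy a predicate on `column`.
    Filter { column: String },
}

/// A filter may swap with the calculation before it only when it does not
/// depend on that calculation's output. Filters never pass other filters,
/// so their relative order is preserved.
fn can_swap(prev: &Step, current: &Step) -> bool {
    match (prev, current) {
        (Step::Calculate { output }, Step::Filter { column }) => output != column,
        _ => false,
    }
}

/// Move each filter as early in the plan as its dependencies allow.
pub fn push_filters_down(plan: Vec<Step>) -> Vec<Step> {
    let mut out: Vec<Step> = Vec::with_capacity(plan.len());
    for step in plan {
        out.push(step);
        // Bubble the newly added step backwards while swapping is safe.
        let mut i = out.len() - 1;
        while i > 0 && can_swap(&out[i - 1], &out[i]) {
            out.swap(i - 1, i);
            i -= 1;
        }
    }
    out
}
```

Running filters earlier means later calculations see fewer rows, which is the main benefit of this kind of reordering.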
  • Grouping and Aggregation Expressions

    In order to implement aggregations, we need to be able to group data. Like joins, the task of grouping probably belongs upstream, but we should be able to define how to group data.

    The LazyFrame might need some state (whether it's grouped or not) to prevent 'normal' calculations when it's in a grouped state. I don't want to implement a GroupedLazyFrame because we rely on mutating the &mut LazyFrame to add on computations.

    An aggregation should ideally take in multiple aggregations. A grouping should take in multiple columns, with columns that aren't grouped or aggregated, getting dropped.

    df-lazy-ops 
    opened by nevi-me 1
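The grouped-state idea from the issue above can be sketched with a single frame type that carries a state flag, instead of a separate grouped frame type. Everything here (`FrameState`, `GroupableFrame`, the method names and the `func(column)` naming of aggregate outputs) is an assumption made for this example, and only column names are tracked.

```rust
/// Whether the frame is currently grouped, and by which columns.
/// Hypothetical types for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum FrameState {
    Flat,
    Grouped(Vec<String>),
}

pub struct GroupableFrame {
    columns: Vec<String>,
    state: FrameState,
}

impl GroupableFrame {
    pub fn new(columns: &[&str]) -> Self {
        Self {
            columns: columns.iter().map(|c| c.to_string()).collect(),
            state: FrameState::Flat,
        }
    }

    pub fn columns(&self) -> &[String] {
        &self.columns
    }

    fn has_column(&self, name: &str) -> bool {
        self.columns.iter().any(|c| c.as_str() == name)
    }

    /// Enter the grouped state; takes multiple grouping columns.
    pub fn group_by(&mut self, keys: &[&str]) -> Result<(), String> {
        for key in keys {
            if !self.has_column(key) {
                return Err(format!("unknown column: {}", key));
            }
        }
        self.state = FrameState::Grouped(keys.iter().map(|k| k.to_string()).collect());
        Ok(())
    }

    /// A 'normal' calculation, rejected while the frame is grouped.
    pub fn calculate(&mut self, output: &str) -> Result<(), String> {
        if matches!(self.state, FrameState::Grouped(_)) {
            return Err("cannot calculate on a grouped frame".to_string());
        }
        self.columns.push(output.to_string());
        Ok(())
    }

    /// Takes multiple (function, column) aggregations. Columns that are
    /// neither grouped nor aggregated are dropped from the result.
    pub fn aggregate(&mut self, aggs: &[(&str, &str)]) -> Result<(), String> {
        let mut next = match &self.state {
            FrameState::Grouped(keys) => keys.clone(),
            FrameState::Flat => return Err("aggregate requires a grouped frame".to_string()),
        };
        for (func, col) in aggs {
            if !self.has_column(col) {
                return Err(format!("unknown column: {}", col));
            }
            next.push(format!("{}({})", func, col));
        }
        self.columns = next;
        self.state = FrameState::Flat;
        Ok(())
    }
}
```

Because every method mutates the same `&mut` frame, the existing pattern of appending computations is kept, while the state flag guards against calculations that make no sense on grouped data.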
Owner
Wakahisa