A Rust DataFrame implementation, built on Apache Arrow

Wakahisa

Last update: Nov 11, 2022

Comments

Fix/gh failing build

Hi @nevi-me, I noticed the GH action was failing to build, so I added the component for rustfmt. I then noticed tests failing and #29, so here is the start of integrating Postgres into GH. It is still failing with 'relation "arrow_data" does not exist', but I figured you would know more about how you want to run setup for these tests.

opened by jbpratt 11
partitioned parquet dataset as data source
I am interested in working on adding support for loading a partitioned parquet dataset (multiple files) as datasource in a read transformation.

What do you think is the best way to implement this? Here are a couple of options that come to mind:

add a new variant DataSrouceType::ParquetDataset

extend DataSrouceType::Parquet to also take a list of file paths

introduce a higher-level abstraction that represents a sequence of data sources, so other formats like CSV can reuse this feature as well
opened by houqp 5
Implement Sort

Add the ability to sort dataframe by one or multiple criteria. Before implementing this, look at what Arrow's doing, and whether it's possible to implement this functionality there.

opened by nevi-me 5
Looking to contribute

Hi - I appreciate the good work you guys are doing here and I'd like to contribute to this project. Is there a way to get in contact? My e-mail address is [email protected].

opened by harris-chris 3
implement lex sort

Adds initial sorting support, currently limited to Arrow's ArrowNumericType. More sort types are pending upstream work (floats, lists, structs, etc).

Closes #16
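
To illustrate what a lexicographic sort over numeric columns involves, here is a minimal sketch that works on plain slices rather than Arrow arrays. The function name `lex_sort_indices` and the `descending` flags are assumptions made for this example; they are not taken from the PR or from Arrow.

```rust
use std::cmp::Ordering;

/// Returns the row indices that sort the given columns lexicographically.
/// Each inner slice is one column; all columns are assumed to have equal length.
/// `descending[i]` reverses the order used for column `i`.
pub fn lex_sort_indices(columns: &[&[i64]], descending: &[bool]) -> Vec<usize> {
    let len = columns.first().map_or(0, |c| c.len());
    let mut indices: Vec<usize> = (0..len).collect();
    // Compare rows column by column; later columns only break ties.
    indices.sort_by(|&a, &b| {
        for (col, &desc) in columns.iter().zip(descending) {
            let ord = col[a].cmp(&col[b]);
            let ord = if desc { ord.reverse() } else { ord };
            if ord != Ordering::Equal {
                return ord;
            }
        }
        Ordering::Equal
    });
    indices
}
```

The result is a list of indices rather than sorted data, so the same indices can then be used to reorder every column of a frame.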

opened by nevi-me 1
Lazy DataSource Reader
Add the ability to read a data source and create a deferred dataframe with the source's schema.

There should be a common interface for:

[x] CSV

[x] JSON

[x] SQL

[x] Arrow File
opened by nevi-me 1
[WIP] ChunkedArray, Column and Table
ref #4

This PR implements a ChunkedArray, Column and Table, which although not part of the Arrow format, are used in CPP, Python, Go and potentially other implementations.

This is currently incomplete, and I need to think of:

How to make the functions module work with Column instead of PrimitiveArray<T>.

Where to use references to avoid copies and moves

How to zero-copy arrays and chunks thereof
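
One possible way to approach the zero-copy question is sketched below with standard-library types only. `Chunk` and this `ChunkedArray` are hypothetical stand-ins fixed to `i64`, not the types from the PR: each chunk shares its buffer through an `Arc` and records an offset and length, so slicing only clones handles.

```rust
use std::sync::Arc;

/// A window into a shared, immutable buffer.
#[derive(Clone)]
pub struct Chunk {
    data: Arc<[i64]>,
    offset: usize,
    len: usize,
}

impl Chunk {
    /// Building a chunk copies the values once into the shared buffer.
    pub fn new(values: Vec<i64>) -> Self {
        let len = values.len();
        Chunk { data: values.into(), offset: 0, len }
    }

    pub fn values(&self) -> &[i64] {
        &self.data[self.offset..self.offset + self.len]
    }
}

pub struct ChunkedArray {
    chunks: Vec<Chunk>,
}

impl ChunkedArray {
    pub fn new(chunks: Vec<Chunk>) -> Self {
        ChunkedArray { chunks }
    }

    pub fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len).sum()
    }

    pub fn num_chunks(&self) -> usize {
        self.chunks.len()
    }

    /// Slices across chunk boundaries without copying any values:
    /// only the `Arc` handles are cloned and the windows adjusted.
    pub fn slice(&self, mut offset: usize, mut length: usize) -> ChunkedArray {
        let mut out = Vec::new();
        for chunk in &self.chunks {
            if length == 0 {
                break;
            }
            if offset >= chunk.len {
                // The slice starts after this chunk; skip it entirely.
                offset -= chunk.len;
                continue;
            }
            let take = length.min(chunk.len - offset);
            out.push(Chunk {
                data: Arc::clone(&chunk.data),
                offset: chunk.offset + offset,
                len: take,
            });
            length -= take;
            offset = 0;
        }
        ChunkedArray { chunks: out }
    }

    /// Copies the values out, for inspection only.
    pub fn to_vec(&self) -> Vec<i64> {
        self.chunks.iter().flat_map(|c| c.values().iter().copied()).collect()
    }
}
```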
opened by nevi-me 1
Arrow Table
A Table that works similarly to other Arrow impls (cpp, python, js, Java)

The aim is to move this to the Arrow repo once I'm happy with it. There is a dependency on:

[x] https://issues.apache.org/jira/browse/ARROW-3954 (slice to array)

[x] https://issues.apache.org/jira/browse/ARROW-3706 (record batch reader trait)

[x] https://issues.apache.org/jira/browse/ARROW-3688 (append_values)
opened by nevi-me 1
Rust Primitive to Arrow Native Type

For some computations, we need the ability to convert primitive types to native types. I thought a simple std::convert::From would work, but it's not working.

opened by nevi-me 1
Basic dataframe ops
Should be able to:

[x] Create a dataframe from data [#3]

[x] Add columns to dataframe

[x] Remove columns from dataframe

[x] Create new columns as computations

[x] Select a subset of the dataframe by column names
opened by nevi-me 1

multiple branch possible

From table.rs:195-207 there seems to be the possibility of multiple branches on the variable bounded_len

    pub fn take(&self, indices: &UInt32Array, chunk_size: usize) -> Result<Self> {
        let mut consumed_len = 0;
        let total_len = indices.len();
        let values = self.to_array()?;
        let mut outputs = vec![];
        while consumed_len < total_len {
            let bounded_len = if total_len < chunk_size {
                total_len
            } else if consumed_len + chunk_size > total_len {
                chunk_size
            } else {
                total_len - consumed_len
            };
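
Reading the quoted branches, the last two appear to be swapped: when less than a full chunk remains the code picks `chunk_size`, and otherwise it picks the whole remainder. Assuming the intent is "take one chunk at a time, and whatever is left on the final pass", the three branches can collapse into a single `min`. The helper below is a hypothetical standalone sketch of that bound calculation, not the project's actual fix.

```rust
/// Splits `total_len` items into consecutive `(start, len)` windows
/// of at most `chunk_size` items each.
pub fn chunk_bounds(total_len: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut bounds = Vec::new();
    let mut consumed_len = 0;
    while consumed_len < total_len {
        // Never take more than one chunk, and never run past the end.
        let bounded_len = chunk_size.min(total_len - consumed_len);
        bounds.push((consumed_len, bounded_len));
        consumed_len += bounded_len;
    }
    bounds
}
```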

opened by fgadaleta 0

Context to allow providing custom data sources, functions, etc.

I was trying to keep things simple, avoiding a context, but the typesystem won't allow me to have custom sources without some stateful place to register them.

Nothing special here, so I'll take a cue from DataFusion. What I'm interested in bikeshedding here is how to create an expressive API that allows data sources to declare their capabilities (e.g. I can pushdown sorts, filters, projections).
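
As a starting point for that bikeshedding, one shape such an API could take is sketched below. Every name here (`Pushdown`, `DataSource`, `Context`, `CsvSource`, `SqlSource`) is hypothetical and chosen for illustration; none of it is DataFusion's or this crate's actual API.

```rust
use std::collections::HashMap;

/// Operations a source may be able to execute itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pushdown {
    Filter,
    Projection,
    Sort,
}

pub trait DataSource {
    fn name(&self) -> &str;

    /// By default a source declares no pushdown support.
    fn capabilities(&self) -> &[Pushdown] {
        &[]
    }

    fn supports(&self, op: Pushdown) -> bool {
        self.capabilities().contains(&op)
    }
}

/// The stateful place where custom sources are registered.
pub struct Context {
    sources: HashMap<String, Box<dyn DataSource>>,
}

impl Context {
    pub fn new() -> Self {
        Context { sources: HashMap::new() }
    }

    pub fn register(&mut self, source: Box<dyn DataSource>) {
        self.sources.insert(source.name().to_string(), source);
    }

    /// Returns `None` when no source with that name is registered.
    pub fn supports(&self, name: &str, op: Pushdown) -> Option<bool> {
        self.sources.get(name).map(|s| s.supports(op))
    }
}

/// A file-like source that cannot push anything down.
pub struct CsvSource;

impl DataSource for CsvSource {
    fn name(&self) -> &str {
        "csv"
    }
}

/// A database-like source that can run all three operations itself.
pub struct SqlSource;

impl DataSource for SqlSource {
    fn name(&self) -> &str {
        "sql"
    }

    fn capabilities(&self) -> &[Pushdown] {
        &[Pushdown::Filter, Pushdown::Projection, Pushdown::Sort]
    }
}
```

A planner could query the context before deciding whether to hand a sort or filter to the source or run it locally.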

If I can get it to work, I'd want to contribute it to DataFusion, as that's where I think a de-facto Rust data analysis library should be.

opened by nevi-me 0
extensibility of data source

I am experimenting with evaluating a lazy frame with a custom data source. However, it looks like Reader being declared as a struct makes it hard to add support for custom data sources that shouldn't be part of the dataframe core code base.

Would it make sense to change Reader and Writer into traits so that custom data source implementations can be fully decoupled from the core code base?
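
A rough sketch of what the trait-based split could look like follows. The method names, the `Batch` alias, and the `CounterSource` and `MemorySink` types are assumptions made for this example, not the crate's existing Reader and Writer.

```rust
/// Column-major stand-in for a record batch (hypothetical).
pub type Batch = Vec<Vec<i64>>;

pub trait Reader {
    fn schema(&self) -> Vec<String>;
    /// Returns the next batch, or `None` once the source is exhausted.
    fn next_batch(&mut self) -> Option<Batch>;
}

pub trait Writer {
    fn write_batch(&mut self, batch: &Batch);
}

/// Core code depends only on the traits, never on a concrete source.
/// Returns the number of rows copied.
pub fn copy_all(reader: &mut dyn Reader, writer: &mut dyn Writer) -> usize {
    let mut rows = 0;
    while let Some(batch) = reader.next_batch() {
        rows += batch.first().map_or(0, |c| c.len());
        writer.write_batch(&batch);
    }
    rows
}

/// A custom source that could live outside the core crate.
pub struct CounterSource {
    pub next: i64,
    pub end: i64,
    pub batch_size: usize,
}

impl Reader for CounterSource {
    fn schema(&self) -> Vec<String> {
        vec!["n".to_string()]
    }

    fn next_batch(&mut self) -> Option<Batch> {
        if self.next >= self.end {
            return None;
        }
        let stop = (self.next + self.batch_size as i64).min(self.end);
        let column: Vec<i64> = (self.next..stop).collect();
        self.next = stop;
        Some(vec![column])
    }
}

/// A custom sink that keeps every batch in memory.
pub struct MemorySink {
    pub batches: Vec<Batch>,
}

impl Writer for MemorySink {
    fn write_batch(&mut self, batch: &Batch) {
        self.batches.push(batch.clone());
    }
}
```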

opened by houqp 5
Apache Arrow Flight support

Consider adding support for reading data from an Apache Arrow Flight server. Such support could take the form of a Flight client that fetches Arrow data and converts it into a table.

opened by nevi-me 0
Plot out optimisations

The lazy evaluation model seems to be fine for most operations, and can be usable when the existing holes are plugged. Next step is to plot out how optimisations on Vec<Computation> would work. I can try out simple optimisations such as reordering a filter and calculate to filter before calculating.
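
The filter-before-calculate reordering mentioned above can be sketched as a small pass over the plan. The `Computation` enum here is a simplified, hypothetical stand-in for the crate's own type, and the pass only moves a filter ahead of a calculation when the filter does not read the column that calculation produces.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Computation {
    /// Adds column `output`, computed from `inputs`.
    Calculate { inputs: Vec<String>, output: String },
    /// Keeps only the rows where a predicate over `column` holds.
    Filter { column: String },
}

impl Computation {
    pub fn calculate(inputs: &[&str], output: &str) -> Self {
        Computation::Calculate {
            inputs: inputs.iter().map(|s| s.to_string()).collect(),
            output: output.to_string(),
        }
    }

    pub fn filter(column: &str) -> Self {
        Computation::Filter { column: column.to_string() }
    }
}

/// Moves each filter ahead of the calculations directly before it,
/// stopping at the first calculation whose output the filter reads.
pub fn filters_first(mut plan: Vec<Computation>) -> Vec<Computation> {
    for i in 1..plan.len() {
        let mut j = i;
        while j > 0 {
            let can_swap = match (&plan[j - 1], &plan[j]) {
                (Computation::Calculate { output, .. }, Computation::Filter { column }) => {
                    output != column
                }
                _ => false,
            };
            if !can_swap {
                break;
            }
            plan.swap(j - 1, j);
            j -= 1;
        }
    }
    plan
}
```

Filtering first means the later calculations run over fewer rows, which is the point of the reordering.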

opened by nevi-me 0
Grouping and Aggregation Expressions

In order to implement aggregations, we need to be able to group data. Like joins, the task of grouping probably belongs upstream, but we should be able to define how to group data.

The LazyFrame might need some state (whether it's grouped or not) to prevent 'normal' calculations when it's in a grouped state. I don't want to implement a GroupedLazyFrame because we rely on mutating the &mut LazyFrame to add on computations.

An aggregation step should ideally accept multiple aggregations. A grouping should accept multiple columns, with columns that are neither grouped nor aggregated being dropped.
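
To make the intended semantics concrete, here is an eager sketch over plain `i64` columns. It is not a LazyFrame implementation, and `Agg` and `group_aggregate` are names invented for this example. It groups by several key columns, applies several aggregations, and drops every column that is neither a key nor aggregated.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Agg {
    Sum,
    Min,
    Max,
    Count,
}

/// Groups rows by the columns at `keys` and applies each `(column, Agg)` pair.
/// Returns a map from group key to one aggregated value per requested pair.
pub fn group_aggregate(
    columns: &[Vec<i64>],
    keys: &[usize],
    aggs: &[(usize, Agg)],
) -> BTreeMap<Vec<i64>, Vec<i64>> {
    let num_rows = columns.first().map_or(0, |c| c.len());

    // First pass: collect the row indices that belong to each group.
    let mut groups: BTreeMap<Vec<i64>, Vec<usize>> = BTreeMap::new();
    for row in 0..num_rows {
        let key: Vec<i64> = keys.iter().map(|&k| columns[k][row]).collect();
        groups.entry(key).or_default().push(row);
    }

    // Second pass: evaluate every aggregation over each group's rows.
    groups
        .into_iter()
        .map(|(key, members)| {
            let out: Vec<i64> = aggs
                .iter()
                .map(|&(col, agg)| {
                    let values = members.iter().map(|&r| columns[col][r]);
                    match agg {
                        Agg::Sum => values.sum::<i64>(),
                        Agg::Min => values.min().expect("group has at least one row"),
                        Agg::Max => values.max().expect("group has at least one row"),
                        Agg::Count => values.count() as i64,
                    }
                })
                .collect();
            (key, out)
        })
        .collect()
}
```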
df-lazy-ops

opened by nevi-me 1

Owner

Wakahisa

GitHub

Fill Apache Arrow record batches from an ODBC data source in Rust.

arrow-odbc Fill Apache Arrow arrays from ODBC data sources. This crate is built on top of the arrow and odbc-api crates and enables you to read the dat

21 Dec 27, 2022

Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.

Polars Python Documentation | Rust Documentation | User Guide | Discord | StackOverflow Blazingly fast DataFrames in Rust, Python & Node.js Polars is

11.8k Jan 8, 2023

Apache Arrow DataFusion and Ballista query engines

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

2.9k Jan 2, 2023

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache Arrow Powering In-Memory Analytics Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enabl

10.9k Jan 6, 2023

🦖 Evolve your fixed length data files into Apache Arrow tables, fully parallelized!

Evolve your fixed length data files into Apache Arrow tables, fully parallelized! Overview ... Installation The easiest way to install evolut

3 Dec 22, 2023

Dataframe structure and operations in Rust

Utah Utah is a Rust crate backed by ndarray for type-conscious, tabular data manipulation with an expressive, functional interface. Note: This crate w

139 Sep 26, 2022

Rust DataFrame library

Polars Blazingly fast DataFrames in Rust & Python Polars is a blazingly fast DataFrames library implemented in Rust. Its memory model uses Apache Arro

11.9k Jan 8, 2023

DataFrame / Series data processing in Rust

black-jack While PRs are welcome, the approach taken only allows for concrete types (String, f64, i64, ...) I'm not sure this is the way to go. I want

30 Dec 10, 2022

DataFrame & its adaptors

Fabrix Fabrix is a lib crate that uses Polars Series and DataFrame as fundamental data structures, and is capable of communicating among different data

18 Dec 12, 2022

Provides multiple-dtype columnar storage, known as DataFrame in pandas/R

brassfibre Provides multiple-dtype columnar storage, known as DataFrame in pandas/R. Series Single-dtype 1-dimensional vector with label (index). Crea

21 Nov 28, 2022

A dataframe manipulation tool inspired by dplyr and powered by polars.

dply is a command line tool for viewing, querying, and writing csv and parquet files, inspired by dplyr and powered by polars. Usage overview A dply p

14 May 29, 2023

A new arguably faster implementation of Apache Spark from scratch in Rust

vega Previously known as native_spark. Documentation A new, arguably faster, implementation of Apache Spark from scratch in Rust. WIP Framework tested

2.1k Jan 5, 2023

Integration between arrow-rs and extendr

arrow_extendr arrow-extendr is a crate that facilitates the transfer of Apache Arrow memory between R and Rust. It utilizes extendr, the {nanoarrow} R

8 Nov 24, 2023

Arrow User-Defined Functions Framework on WebAssembly.

Arrow User-Defined Functions Framework on WebAssembly Example Build the WebAssembly module: cargo build --release -p arrow-udf-wasm-example --target w

3 Dec 14, 2023

Apache TinkerPop from Rust via Rucaja (JNI)

Apache TinkerPop from Rust An example showing how to call Apache TinkerPop from Rust via Rucaja (JNI). This repository contains two directories: java

8 Sep 27, 2022

A rust library built to support building time-series based projection models

TimeSeries TimeSeries is a framework for building analytical models in Rust that have a time dimension. Inspiration The inspiration for writing this i

12 Dec 7, 2022

A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

5k Jan 9, 2023

Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Parquet2 This is a re-write of the official parquet crate with performance, parallelism and safety in mind. The five main differentiators in compariso

237 Jan 1, 2023

bspipe A Rust implementation of Bidirectional Secure Pipe

2 Nov 14, 2022
