arrow-extendr
arrow-extendr is a crate that facilitates the transfer of Apache Arrow memory between R and Rust. It utilizes extendr, the {nanoarrow} R package, and arrow-rs.
Versioning
At present, different versions of arrow-rs are not compatible with one another. This means that if your crate uses arrow-rs version 48.0.1, then arrow-extendr must be built against that same version. For this reason, arrow-extendr releases follow arrow-rs version numbers, making it easy to match the version your crate requires (see the Cargo.toml sketch after the list below).
Versions:
- 48.0.1
- 49.0.0
- 49.0.0-geoarrow (not available on crates.io but is the current Git version)
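As an illustration, a minimal Cargo.toml sketch for a crate targeting arrow-rs 49.0.0 might pin both dependencies to the same release (the exact versions here are placeholders; adjust them to the release you need):

[dependencies]
# Pin arrow-rs and arrow-extendr to the same release so their Arrow types match
arrow = "49.0.0"
arrow-extendr = "49.0.0"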
Motivating Example
Say we have the following DBI connection, to which we will send queries using Arrow. The result of dbGetQueryArrow() is a nanoarrow_array_stream. We want to count the number of rows in each batch of the stream using Rust.
# adapted from https://github.com/r-dbi/DBI/blob/main/vignettes/DBI-arrow.Rmd
library(DBI)
con <- dbConnect(RSQLite::SQLite())
data <- data.frame(
  a = runif(10000, 0, 10),
  b = rnorm(10000, 4.5),
  c = sample(letters, 10000, TRUE)
)
dbWriteTable(con, "tbl", data)
We can write an extendr function that creates an ArrowArrayStreamReader from an &Robj. In the function we initialize a counter to track the total number of rows; for each chunk we print its row count and add it to the total.
use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow_extendr::from::FromArrowRobj;
use extendr_api::prelude::*;

/// @export
#[extendr]
fn process_stream(stream: Robj) -> i32 {
    // Build an ArrowArrayStreamReader from the nanoarrow_array_stream R object
    let rb = ArrowArrayStreamReader::from_arrow_robj(&stream)
        .unwrap();

    // Counter for the total number of rows across all chunks
    let mut n = 0;
    rprintln!("Processing `ArrowArrayStreamReader`...");

    for chunk in rb {
        let chunk_rows = chunk.unwrap().num_rows();
        rprintln!("Found {chunk_rows} rows");
        n += chunk_rows as i32;
    }

    n
}
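To make process_stream() callable from R, it also needs to be registered in the crate's extendr_module! block. A minimal sketch, assuming a hypothetical module name mypackage:

// The module name below is a placeholder; it should match your R package's name.
extendr_module! {
    mod mypackage;
    fn process_stream;
}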
We can then call this function on the output of dbGetQueryArrow() or other Arrow-based DBI functions.
query <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
process_stream(query)
#> Processing `ArrowArrayStreamReader`...
#> Found 256 rows
#> Found 256 rows
#> Found 256 rows
#> ... truncated ...
#> Found 256 rows
#> Found 256 rows
#> Found 143 rows
#> [1] 2959