Integration between arrow-rs and extendr

Overview

arrow_extendr

arrow-extendr is a crate that facilitates the transfer of Apache Arrow memory between R and Rust. It utilizes extendr, the {nanoarrow} R package, and arrow-rs.

Versioning

At present, versions of arrow-rs are not compatible with each other. This means if your crate uses arrow-rs version 48.0.1, then the arrow-extendr must also use that same version. As such, arrow-extendr uses the same versions as arrow-rs so that it is easy to match the required versions you need.

Versions:

  • 48.0.1
  • 49.0.0
  • 49.0.0-geoarrow (not available on crates.io but is the current Git version)

Motivating Example

Say we have the following DBI connection which we will send requests to using arrow. The result of dbGetQueryArrow() is a nanoarrow_array_stream. We want to count the number of rows in each batch of the steam using Rust.

# adapted from https://github.com/r-dbi/DBI/blob/main/vignettes/DBI-arrow.Rmd

library(DBI)
con <- dbConnect(RSQLite::SQLite())
data <- data.frame(
  a = runif(10000, 0, 10),
  b = rnorm(10000, 4.5),
  c = sample(letters, 10000, TRUE)
)

dbWriteTable(con, "tbl", data)

We can write an extendr function which creates an ArrowArrayStreamReader from an &Robj. In the function we instantiate a counter to keep track of the number of rows per chunk. For each chunk we print the number of rows.

#[extendr]
/// @export
fn process_stream(stream: Robj) -> i32 {
    let rb = ArrowArrayStreamReader::from_arrow_robj(&stream)
        .unwrap();

    let mut n = 0;

    rprintln!("Processing `ArrowArrayStreamReader`...");
    for chunk in rb {
        let chunk_rows = chunk.unwrap().num_rows();
        rprintln!("Found {chunk_rows} rows");
        n += chunk_rows as i32;
    }

    n
}

With this function we can use it on the output of dbGetQueryArrow() or other Arrow related DBI functions.

query <- dbGetQueryArrow(con, "SELECT * FROM tbl WHERE a < 3")
process_stream(query)
#> Processing `ArrowArrayStreamReader`...
#> Found 256 rows
#> Found 256 rows
#> Found 256 rows
#> ... truncated ...
#> Found 256 rows
#> Found 256 rows
#> Found 143 rows
#> [1] 2959
You might also like...
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations

ndarray The ndarray crate provides an n-dimensional container for general elements and for numerics. Please read the API documentation on docs.rs or t

Dataframe structure and operations in Rust

Utah Utah is a Rust crate backed by ndarray for type-conscious, tabular data manipulation with an expressive, functional interface. Note: This crate w

A Rust crate that reads and writes tfrecord files

tfrecord-rust The crate provides the functionality to serialize and deserialize TFRecord data format from TensorFlow. Features Provide both high level

Orkhon: ML Inference Framework and Server Runtime
Orkhon: ML Inference Framework and Server Runtime

Orkhon: ML Inference Framework and Server Runtime Latest Release License Build Status Downloads Gitter What is it? Orkhon is Rust framework for Machin

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference
Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Sonos' Neural Network inference engine. This project used to be called tfdeploy, or Tensorflow-deploy-rust. What ? tract is a Neural Network inference

ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python
ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾

enum_pipline Provides a way to use enums to describe and execute ordered data pipelines. 🦀 🐾 I needed a succinct way to describe 2d pixel map operat

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations
AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations. Built with Flutter and Rust.

Comments
  • Use nanoarrow instead of arrow

    Use nanoarrow instead of arrow

    I would probably encourage writers of Rust extensions to go through nanoarrow (e.g., via as_nanoarrow_array_stream() or as_nanoarrow_array()) rather than arrow directly.

    @paleolimbot

    opened by JosiahParry 14
  • Restructure: Remove R package

    Restructure: Remove R package

    Thanks for taking on the task of creating this! Perhaps this is because this is a very early experiment, but am I correct in understanding that ultimately this is aimed to be a Rust crate, not an R package? (as in arrow-rs' pyarrow feature)

    What I had imagined was a crate something like "extendr-nanoarrow". If such a crate does not exist, type conversion from arrow-rs to R would have to be created on individual R packages, correct?

    opened by eitsupi 9
  • Support {arrow} R package

    Support {arrow} R package

    I want to try having two traits IntoArrowRobj and IntoNanoArrowRobj. I think there is at least still some utility in being able to return an {arrow} R package object still.

    I think it may be slower to return a nano-arrow stream and then cast as a data.frame than it might be to return a recordbatch reader directly.

    opened by JosiahParry 0
  • Create releases for common arrow versions

    Create releases for common arrow versions

    Arrow uses a cargo deny approach to prevent compatability across arrow-rs versions. So if a library is using Arrow 47 and another 48, those are not compatible. The suggested approach is to create a release for each version of arrow we want to support. This means that I need to, at minimum, use 48.0.1 (DataFusion) and 49.0.0 current support.

    • [ ] 48.0.1
    • [ ] 49.0.0
    • [ ] 49-dev for geoarrow
    opened by JosiahParry 0
Releases(v49.0.0-geoarrow)
Owner
Josiah Parry
Social Scientist. Spatial Stats @ Esri
Josiah Parry
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache Arrow Powering In-Memory Analytics Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enabl

The Apache Software Foundation 10.9k Jan 6, 2023
A Rust DataFrame implementation, built on Apache Arrow

Rust DataFrame A dataframe implementation in Rust, powered by Apache Arrow. What is a dataframe? A dataframe is a 2-dimensional tabular data structure

Wakahisa 287 Nov 11, 2022
Official Rust implementation of Apache Arrow

Native Rust implementation of Apache Arrow Welcome to the implementation of Arrow, the popular in-memory columnar format, in Rust. This part of the Ar

The Apache Software Foundation 1.3k Jan 9, 2023
Fill Apache Arrow record batches from an ODBC data source in Rust.

arrow-odbc Fill Apache Arrow arrays from ODBC data sources. This crate is build on top of the arrow and odbc-api crate and enables you to read the dat

Markus Klein 21 Dec 27, 2022
Polars is a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.

Polars Python Documentation | Rust Documentation | User Guide | Discord | StackOverflow Blazingly fast DataFrames in Rust, Python & Node.js Polars is

null 11.8k Jan 8, 2023
Arrow User-Defined Functions Framework on WebAssembly.

Arrow User-Defined Functions Framework on WebAssembly Example Build the WebAssembly module: cargo build --release -p arrow-udf-wasm-example --target w

RisingWave Labs 3 Dec 14, 2023
🦖 Evolve your fixed length data files into Apache Arrow tables, fully parallelized!

?? Evolve your fixed length data files into Apache Arrow tables, fully parallelized! ?? Overview ... ?? Installation The easiest way to install evolut

Firelink Data 3 Dec 22, 2023
Fastest and safest Rust implementation of parquet. `unsafe` free. Integration-tested against pyarrow

Parquet2 This is a re-write of the official parquet crate with performance, parallelism and safety in mind. The five main differentiators in compariso

Jorge Leitao 237 Jan 1, 2023
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

flaco Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. ?? Have a gander at the initial benchmarks

Miles Granger 14 Oct 31, 2022
A fast, powerful, flexible and easy to use open source data analysis and manipulation tool written in Rust

fisher-rs fisher-rs is a Rust library that brings powerful data manipulation and analysis capabilities to Rust developers, inspired by the popular pan

Syed Vilayat Ali Rizvi 5 Aug 31, 2023