Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Overview

Apache Arrow


Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language-agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
  • Readers and writers for various widely-used file formats (such as Parquet and CSV)

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved.

Comments
  • ARROW-10224: [Python] Add support for Python 3.9 except macOS wheel and Windows wheel

    Adds support and testing for Python 3.9. I am looking for review, as this change may have touched too many things; I'm also looking to get the CI to test all the different environments.

    H/T: @kou, the documentation and #5685 for helping me get this off the ground.

    Component: Python 
    opened by terencehonles 182
  • ARROW-14892: [Python][C++] GCS Bindings

    Incorporates the GCS file system into Python, plus other bug fixes.

    Bugs/Other changes:

    • Add GCS bindings mostly based on AWS bindings in Python and associated unit tests
    • Tell() was incorrect: it double-counted when the stream was constructed with an offset.
    • The define in config.cmake was not set, which meant FileSystemFromUri was never tested and didn't compile; this is now fixed.
    • Refine the logic for GetFileInfo with a single path to recognize prefixes followed by a slash as a directory. This allows datasets to work as expected with a toy dataset generated on a local filesystem and copied to the cloud (I believe this is typical of how other systems write to GCS as well).
    • Switch the convention for creating directories to always end in "/" and make use of this as another indicator. From testing with a sample Iceberg table, it appears this is the convention used for Hive partitioning, so I assume this is common practice for other Hive-related writers (i.e. what we want to support).
    • Fix a bug introduced in https://github.com/apache/arrow/commit/a5e45cecb24229433b825dac64e0ffd10d400e8c which caused failures when a deletion occurred on a bucket (not an object in the bucket).
    • Ensure output streams are closed on destruction (this is consistent with S3)
    Component: C++ Component: Python 
    opened by emkornfield 137
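The directory conventions described in ARROW-14892 can be sketched in plain Python (looks_like_directory is an illustrative helper, not part of the GCS filesystem API): an object name ending in "/", or acting as a prefix followed by "/", is treated as a directory.

```python
# Hypothetical sketch of the two directory indicators described above.
def looks_like_directory(name, all_names):
    if name.endswith("/"):
        return True                      # zero-sized "dir/" placeholder object
    prefix = name.rstrip("/") + "/"
    # Any object under "name/" implies "name" behaves as a directory.
    return any(n.startswith(prefix) for n in all_names)

names = ["data/", "data/part-0.parquet"]
assert looks_like_directory("data", names)                 # prefix + "/"
assert not looks_like_directory("data/part-0.parquet", names)
```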
  • ARROW-16340: [C++][Python] Move all Python related code into PyArrow

    This PR moves the src/arrow/python directory into pyarrow and arranges for PyArrow to build it. The build on the Python side is done in two steps:

    1. _run_cmake_pyarrow_cpp(), where the C++ part of pyarrow (the part that was moved in the refactoring) is built first
    2. _run_cmake(), where pyarrow is built as before

    No changes are needed in the build process from the user's side to successfully build pyarrow after this refactoring. The tests for PyArrow C++ will, however, be moved into Cython and can currently be run with:

    $ pushd python/build/dist/temp
    $ ctest
    
    Component: C++ Component: Python Component: FlightRPC Component: Documentation 
    opened by AlenkaF 122
  • ARROW-17545: [C++][CI] Mandate C++17 instead of C++11

    This PR switches our build system to require C++17 instead of C++11.

    Because the conda packaging jobs are out of sync with the conda-forge files, the Windows conda packaging jobs are broken with this change. The related task (sync conda packaging files with conda-forge) is tracked in ARROW-17635.

    Component: R Component: Java Component: Parquet Component: C++ Component: Python Component: Ruby Component: C++ - Gandiva Component: GLib Component: MATLAB Component: Documentation 
    opened by pitrou 120
  • ARROW-12626: [C++] Support toolchain xsimd, update toolchain version to version 8.1.0

    This also updates pinned vcpkg to use xsimd 8.1.0.

    This also implements an automatic python-wheel-windows-vs2017 image update mechanism. We have a problem with "docker build" on Windows: "docker build" doesn't reuse pulled images as caches; it always rebuilds an image. This implements a manual reuse mechanism like the following:

    if ! docker pull; then
       docker build # build only when built images don't exist
    fi
    docker run
    

    But this doesn't work when ci/docker/python-wheel-windows-vs2017.dockerfile is updated but the pinned vcpkg revision isn't changed. In that case, "docker build" isn't run because "docker pull" succeeds.

    To make this mechanism work, this introduces "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION". We must bump it manually when we update ci/docker/python-wheel-windows-vs2017.dockerfile. "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" is used in the tag name, so "docker pull" fails for a new "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" and "docker build" is used instead.

    Component: C++ 
    opened by wesm 101
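The revision-tag mechanism described in ARROW-12626 can be sketched as follows; the stub functions stand in for real "docker pull"/"docker build" calls, and the tag naming is illustrative, not the project's actual scheme.

```shell
IMAGE_REVISION=2   # PYTHON_WHEEL_WINDOWS_IMAGE_REVISION: bump manually when the dockerfile changes
TAG="python-wheel-windows-vs2017:${IMAGE_REVISION}"

# Stubs: pretend only revision 1 exists in the remote registry.
docker_pull() { [ "$1" = "python-wheel-windows-vs2017:1" ]; }
docker_build() { echo "building $1"; }

if ! docker_pull "$TAG"; then
  docker_build "$TAG"   # runs because the new revision's tag cannot be pulled
fi
```

Bumping the revision changes the tag, the pull fails, and the image is rebuilt; otherwise the pulled image is reused as-is.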
  • ARROW-6920: [Packaging] Build python 3.8 wheels

    Adds Python 3.8 wheels.

    As far as I can tell, python3.8 isn't available for Conda yet (https://github.com/conda-forge/python-feedstock/pull/274), so that will have to be added later.

    opened by sjhewitt 95
  • ARROW-17635: [Python][CI] Sync conda recipe with the arrow-cpp feedstock

    Corresponds to status of feedstock as of https://github.com/conda-forge/arrow-cpp-feedstock/pull/848, minus obvious & intentional divergences in the setup here (with the exception of unpinning xsimd, which was pinned as of 9.0.0, but isn't anymore).

    opened by h-vetinari 73
  • ARROW-17692: [R] Add support for building with system AWS SDK C++

    This PR uses "pkg-config --static ... arrow" to collect build flags. It reports build flags suitable for the build options and libraries used by Apache Arrow C++, and it works with the system AWS SDK C++.

    Component: R Component: C++ 
    opened by thisisnic 71
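The pkg-config approach above relies on "--static" also reporting a library's private dependencies (such as the AWS SDK). A sketch using a toy arrow.pc file, since a real Arrow installation is assumed rather than available here (the Libs.private contents are illustrative):

```shell
# Create a minimal .pc file that declares a private static-link dependency.
dir="$(mktemp -d)"
cat > "$dir/arrow.pc" <<'EOF'
Name: Apache Arrow
Description: Arrow columnar format library
Version: 10.0.0
Libs: -larrow
Libs.private: -laws-cpp-sdk-s3
EOF

# Without --static only Libs is reported; with --static, Libs.private is appended.
PKG_CONFIG_PATH="$dir" pkg-config --static --libs arrow
```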
  • ARROW-15639 [C++][Python] UDF Scalar Function Implementation

    PR for Scalar UDF integration

    This is the first phase of UDF integration for Arrow. This version only includes ScalarFunctions. In future PRs, Vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions) and Aggregation UDFs will be integrated. This PR includes the following:

    • [x] UDF Python Scalar Function registration and usage
    • [x] UDF Python Scalar Function Examples
    • [x] UDF Python Scalar Function test cases
    • [x] UDF C++ Example extended from Compute Function Example
    • [x] Added an aggregation example (optional for this PR; if required, it can be removed and pushed in a different PR)
    Component: C++ Component: Python Component: GLib 
    opened by vibhatha 68
  • ARROW-16584: [Java] Java JNI with S3 support

    The macOS deployment target is changed from 10.11 to 10.13. See also the discussion on the mailing list: https://lists.apache.org/thread/pjgjrl716gvqzql586cnnoxb38nb0j5w

    opened by REASY 65
  • ARROW-14506: [C++] Conda support for google-cloud-cpp

    This PR adds support for google-cloud-cpp to the Conda files. Probably the most difficult change to grok is compiling with C++17 when using Conda:

    • Conda defaults all its builds to C++17; this bug goes into some detail as to why.
    • Arrow defaults to C++11 if no CMAKE_CXX_STANDARD argument is provided.
    • Abseil's ABI changes when used from C++11 vs. C++17, see https://github.com/abseil/abseil-cpp/issues/696
    • Therefore, one must compile with C++17 to use Abseil in Conda.
    • And because google-cloud-cpp has a direct dependency on Abseil, exposed through the headers, one must use C++17 to use google-cloud-cpp too.
    Component: C++ 
    opened by coryan 65
  • ARROW-12264: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

    This PR fixes the issue of handling NaNs in the Parquet predicate push-down. While computing the valid bounds for a column, if the max or min of the column is null, the range should ignore that.

    Component: C++ 
    opened by sanjibansg 1
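The rule described in ARROW-12264 can be sketched in plain Python (valid_bounds is a hypothetical helper, not the actual C++ implementation): a NaN or missing min/max statistic yields no usable range, so the row group must not be pruned.

```python
import math

# Hypothetical sketch: compute a usable (min, max) range from column
# statistics, treating NaN or missing bounds as "unknown range".
def valid_bounds(stat_min, stat_max):
    for v in (stat_min, stat_max):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            return None  # unknown range: the row group cannot be pruned
    return (stat_min, stat_max)

assert valid_bounds(1.0, float("nan")) is None   # NaN max: no pruning
assert valid_bounds(1.0, 5.0) == (1.0, 5.0)      # clean stats: usable range
```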
  • GH-15052: [C++][Parquet] Fixing DELTA_BINARY_PACKED when reading only 1

    This patch tries to fix https://github.com/apache/arrow/issues/15052. The problem is described here: https://github.com/apache/arrow/issues/15052#issuecomment-1367486164

    When reading 1 value, DeltaBitPackDecoder will not call InitBlock, causing it to always read last_value_.

    The problem seems to have been introduced in https://github.com/apache/arrow/pull/10627 and https://github.com/amol-/arrow/commit/d982bedcf5e03d44c01949b192da54a8c1e525d8

    I will add some tests tonight.

    • Closes: #15052
    Component: Parquet Component: C++ 
    opened by mapleFU 4
  • [Release][Docs][R] Update version information in patch release

    Describe the enhancement requested

    We updated the version information for the documentation manually in the 10.0.1 release: #14887

    We should automate it to reduce release cost. See also https://github.com/apache/arrow/pull/14887#issuecomment-1347332418 for an implementation idea.

    Component(s)

    R, Release

    Type: enhancement Component: R Component: Documentation Component: Release 
    opened by kou 0
  • [R][CI] pyarrow tests fail on macos 10.13 due to missing pyarrow wheel

    Describe the bug, including details regarding any error messages, version, and platform.

    see https://github.com/apache/arrow/issues/14829#issuecomment-1367506270

    Component(s)

    Continuous Integration, R

    Type: bug Component: R Component: Continuous Integration 
    opened by assignUser 0
Owner
The Apache Software Foundation
Fill Apache Arrow record batches from an ODBC data source in Rust.

arrow-odbc Fill Apache Arrow arrays from ODBC data sources. This crate is built on top of the arrow and odbc-api crates and enables you to read the dat

Markus Klein 21 Dec 27, 2022
Apache Arrow DataFusion and Ballista query engines

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

The Apache Software Foundation 2.9k Jan 2, 2023
A Rust DataFrame implementation, built on Apache Arrow

Rust DataFrame A dataframe implementation in Rust, powered by Apache Arrow. What is a dataframe? A dataframe is a 2-dimensional tabular data structure

Wakahisa 287 Nov 11, 2022
Official Rust implementation of Apache Arrow

Native Rust implementation of Apache Arrow Welcome to the implementation of Arrow, the popular in-memory columnar format, in Rust. This part of the Ar

The Apache Software Foundation 1.3k Jan 9, 2023
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

Datafuse Labs 5k Jan 9, 2023
DataFrame / Series data processing in Rust

black-jack While PRs are welcome, the approach taken only allows for concrete types (String, f64, i64, ...) I'm not sure this is the way to go. I want

Miles Granger 30 Dec 10, 2022
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, written in Rust

Datafuse Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture Datafuse is a Real-Time Data Processing & Analytics DBMS wit

Datafuse Labs 5k Jan 4, 2023
Dataflow is a data processing library, primarily for machine learning

Dataflow Dataflow is a data processing library, primarily for machine learning. It provides efficient pipeline primitives to build a directed acyclic

Sidekick AI 9 Dec 19, 2022
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

flaco Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀 Have a gander at the initial benchmarks

Miles Granger 14 Oct 31, 2022
Apache TinkerPop from Rust via Rucaja (JNI)

Apache TinkerPop from Rust An example showing how to call Apache TinkerPop from Rust via Rucaja (JNI). This repository contains two directories: java

null 8 Sep 27, 2022
A new arguably faster implementation of Apache Spark from scratch in Rust

vega Previously known as native_spark. Documentation A new, arguably faster, implementation of Apache Spark from scratch in Rust. WIP Framework tested

raja sekar 2.1k Jan 5, 2023
New generation decentralized data warehouse and streaming data pipeline

World's first decentralized real-time data warehouse, on your laptop Docs | Demo | Tutorials | Examples | FAQ | Chat Get Started Watch this introducto

kamu 184 Dec 22, 2022
This library provides a data view for reading and writing data in a byte array.

Docs This library provides a data view for reading and writing data in a byte array. This library requires feature(generic_const_exprs) to be enabled.

null 2 Nov 2, 2022
ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

SFU Database Group 939 Jan 5, 2023
Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾

enum_pipline Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾 I needed a succinct way to describe 2d pixel map operat

Ben Greenier 0 Oct 29, 2021
AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations. Built with Flutter and Rust.

null 30.7k Jan 7, 2023
An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

Memgraph 40 Dec 20, 2022
High-performance runtime for data analytics applications

Weld Documentation Weld is a language and runtime for improving the performance of data-intensive applications. It optimizes across libraries and func

Weld 2.9k Dec 28, 2022
A high-performance, high-reliability observability data pipeline.

Quickstart • Docs • Guides • Integrations • Chat • Download What is Vector? Vector is a high-performance, end-to-end (agent & aggregator) observabilit

Timber 12.1k Jan 2, 2023