Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Overview

Apache Arrow


Powering In-Memory Analytics

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Arrow is an Apache Software Foundation project. Learn more at arrow.apache.org.

What's in the Arrow libraries?

The reference Arrow libraries contain many distinct software components:

  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language-agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • IO interfaces to local and remote filesystems
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures
  • Readers and writers for various widely-used file formats (such as Parquet and CSV)

Implementation status

The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. See our current feature matrix on git master.

How to Contribute

Please read our latest project contribution guide.

Getting involved

Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved.

Comments
  • ARROW-10224: [Python] Add support for Python 3.9 except macOS wheel and Windows wheel

    Adds support and testing for Python 3.9. I am looking for review, as this change may have touched too many things; I'm also looking to get the CI to test all the different environments.

    H/T: @kou, the documentation and #5685 for helping me get this off the ground.

    Component: Python 
    opened by terencehonles 182
  • ARROW-14892: [Python][C++] GCS Bindings

    Incorporates the GCS file system into Python, plus other bug fixes.

    Bugs/Other changes:

    • Add GCS bindings mostly based on AWS bindings in Python and associated unit tests
    • Tell() was incorrect: it double-counted when the stream was constructed with an offset.
    • The define in config.cmake was not set, which meant FileSystemFromUri was never tested and didn't compile; this is now fixed.
    • Refine the logic for GetFileInfo with a single path to recognize prefixes followed by a slash as a directory. This allows datasets to work as expected with a toy dataset generated on a local filesystem and copied to the cloud (I believe this is typical of how other systems write to GCS as well).
    • Switch the convention for creating directories to always end in "/" and make use of this as another indicator. From testing with a sample Iceberg table, it appears this is the convention used for Hive partitioning, so I assume this is common practice for other Hive-related writers (i.e. what we want to support).
    • Fix a bug introduced in https://github.com/apache/arrow/commit/a5e45cecb24229433b825dac64e0ffd10d400e8c which caused failures when a deletion occurred on a bucket (not an object in the bucket).
    • Ensure output streams are closed on destruction (this is consistent with S3)
    Component: C++ Component: Python 
    opened by emkornfield 137
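The directory conventions described in ARROW-14892 can be sketched in plain Python (looks_like_directory is an illustrative helper, not part of the GCS filesystem API): an object name ending in "/", or acting as a prefix followed by "/", is treated as a directory.

```python
# Hypothetical sketch of the two directory indicators described above.
def looks_like_directory(name, all_names):
    if name.endswith("/"):
        return True                      # zero-sized "dir/" placeholder object
    prefix = name.rstrip("/") + "/"
    # Any object under "name/" implies "name" behaves as a directory.
    return any(n.startswith(prefix) for n in all_names)

names = ["data/", "data/part-0.parquet"]
assert looks_like_directory("data", names)                 # prefix + "/"
assert not looks_like_directory("data/part-0.parquet", names)
```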
  • ARROW-16340: [C++][Python] Move all Python related code into PyArrow

    This PR moves the src/arrow/python directory into pyarrow and arranges for PyArrow to build it. The build on the Python side is done in two steps:

    1. _run_cmake_pyarrow_cpp(), where the C++ part of pyarrow (the part that was moved in the refactoring) is built first
    2. _run_cmake(), where pyarrow is built as before

    No changes are needed in the build process from the user's side to successfully build pyarrow after this refactoring. The tests for PyArrow C++ will, however, be moved into Cython and can currently be run with:

    $ pushd python/build/dist/temp
    $ ctest
    
    Component: C++ Component: Python Component: FlightRPC Component: Documentation 
    opened by AlenkaF 122
  • ARROW-17545: [C++][CI] Mandate C++17 instead of C++11

    This PR switches our build system to require C++17 instead of C++11.

    Because the conda packaging jobs are out of sync with the conda-forge files, the Windows conda packaging jobs are broken with this change. The related task (sync conda packaging files with conda-forge) is tracked in ARROW-17635.

    Component: R Component: Java Component: Parquet Component: C++ Component: Python Component: Ruby Component: C++ - Gandiva Component: GLib Component: MATLAB Component: Documentation 
    opened by pitrou 120
  • ARROW-12626: [C++] Support toolchain xsimd, update toolchain version to version 8.1.0

    This also updates pinned vcpkg to use xsimd 8.1.0.

    This also implements an automatic python-wheel-windows-vs2017 image update mechanism. We have a problem with "docker build" on Windows: "docker build" doesn't reuse pulled images as caches; it always rebuilds an image. This implements a manual reuse mechanism like the following:

    if ! docker pull; then
       docker build # build only when built images don't exist
    fi
    docker run
    

    But this doesn't work when ci/docker/python-wheel-windows-vs2017.dockerfile is updated but the pinned vcpkg revision isn't changed. In that case, "docker build" isn't run because "docker pull" succeeds.

    To make this mechanism work, this introduces "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION". We must bump it manually when we update ci/docker/python-wheel-windows-vs2017.dockerfile. "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" is used in the tag name, so "docker pull" fails for a new "PYTHON_WHEEL_WINDOWS_IMAGE_REVISION" and "docker build" is used instead.

    Component: C++ 
    opened by wesm 101
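The revision-tag mechanism described in ARROW-12626 can be sketched as follows; the stub functions stand in for real "docker pull"/"docker build" calls, and the tag naming is illustrative, not the project's actual scheme.

```shell
IMAGE_REVISION=2   # PYTHON_WHEEL_WINDOWS_IMAGE_REVISION: bump manually when the dockerfile changes
TAG="python-wheel-windows-vs2017:${IMAGE_REVISION}"

# Stubs: pretend only revision 1 exists in the remote registry.
docker_pull() { [ "$1" = "python-wheel-windows-vs2017:1" ]; }
docker_build() { echo "building $1"; }

if ! docker_pull "$TAG"; then
  docker_build "$TAG"   # runs because the new revision's tag cannot be pulled
fi
```

Bumping the revision changes the tag, the pull fails, and the image is rebuilt; otherwise the pulled image is reused as-is.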
  • ARROW-6920: [Packaging] Build python 3.8 wheels

    Adds Python 3.8 wheels.

    As far as I can tell, python3.8 isn't available for Conda yet (https://github.com/conda-forge/python-feedstock/pull/274), so that will have to be added later.

    opened by sjhewitt 95
  • ARROW-17635: [Python][CI] Sync conda recipe with the arrow-cpp feedstock

    Corresponds to status of feedstock as of https://github.com/conda-forge/arrow-cpp-feedstock/pull/848, minus obvious & intentional divergences in the setup here (with the exception of unpinning xsimd, which was pinned as of 9.0.0, but isn't anymore).

    opened by h-vetinari 73
  • ARROW-17692: [R] Add support for building with system AWS SDK C++

    This PR uses "pkg-config --static ... arrow" to collect build flags. It reports build flags suitable for the build options and libraries used by Apache Arrow C++, and it works with the system AWS SDK C++.

    Component: R Component: C++ 
    opened by thisisnic 71
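The pkg-config approach above relies on "--static" also reporting a library's private dependencies (such as the AWS SDK). A sketch using a toy arrow.pc file, since a real Arrow installation is assumed rather than available here (the Libs.private contents are illustrative):

```shell
# Create a minimal .pc file that declares a private static-link dependency.
dir="$(mktemp -d)"
cat > "$dir/arrow.pc" <<'EOF'
Name: Apache Arrow
Description: Arrow columnar format library
Version: 10.0.0
Libs: -larrow
Libs.private: -laws-cpp-sdk-s3
EOF

# Without --static only Libs is reported; with --static, Libs.private is appended.
PKG_CONFIG_PATH="$dir" pkg-config --static --libs arrow
```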
  • ARROW-15639 [C++][Python] UDF Scalar Function Implementation

    PR for Scalar UDF integration

    This is the first phase of UDF integration for Arrow. This version only includes ScalarFunctions. In future PRs, Vector UDFs (using Arrow VectorFunction), UDTFs (user-defined table functions) and Aggregation UDFs will be integrated. This PR includes the following:

    • [x] UDF Python Scalar Function registration and usage
    • [x] UDF Python Scalar Function Examples
    • [x] UDF Python Scalar Function test cases
    • [x] UDF C++ Example extended from Compute Function Example
    • [x] Added an aggregation example (optional for this PR; if required, it can be removed and pushed in a different PR)
    Component: C++ Component: Python Component: GLib 
    opened by vibhatha 68
  • ARROW-16584: [Java] Java JNI with S3 support

    The macOS deployment target is changed from 10.11 to 10.13. See also the discussion on the mailing list: https://lists.apache.org/thread/pjgjrl716gvqzql586cnnoxb38nb0j5w

    opened by REASY 65
  • ARROW-14506: [C++] Conda support for google-cloud-cpp

    This PR adds support for google-cloud-cpp to the Conda files. Probably the most difficult change to grok is compiling with C++17 when using Conda:

    • Conda defaults all its builds to C++17; this bug goes into some detail as to why.
    • Arrow defaults to C++11 if no CMAKE_CXX_STANDARD argument is provided.
    • Abseil's ABI changes when used from C++11 vs. C++17, see https://github.com/abseil/abseil-cpp/issues/696
    • Therefore, one must compile with C++17 to use Abseil in Conda.
    • And because google-cloud-cpp has a direct dependency on Abseil, exposed through the headers, one must use C++17 to use google-cloud-cpp too.
    Component: C++ 
    opened by coryan 65
  • ARROW-12264: [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

    This PR fixes the issue of handling NaNs in the Parquet predicate push-down. While computing the valid bounds for a column, if the max or min of the column is null, the range should ignore that.

    Component: C++ 
    opened by sanjibansg 1
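The rule described in ARROW-12264 can be sketched in plain Python (valid_bounds is a hypothetical helper, not the actual C++ implementation): a NaN or missing min/max statistic yields no usable range, so the row group must not be pruned.

```python
import math

# Hypothetical sketch: compute a usable (min, max) range from column
# statistics, treating NaN or missing bounds as "unknown range".
def valid_bounds(stat_min, stat_max):
    for v in (stat_min, stat_max):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            return None  # unknown range: the row group cannot be pruned
    return (stat_min, stat_max)

assert valid_bounds(1.0, float("nan")) is None   # NaN max: no pruning
assert valid_bounds(1.0, 5.0) == (1.0, 5.0)      # clean stats: usable range
```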
  • GH-15052: [C++][Parquet] Fixing DELTA_BINARY_PACKED when reading only 1

    This patch tries to fix https://github.com/apache/arrow/issues/15052. The problem is described here: https://github.com/apache/arrow/issues/15052#issuecomment-1367486164

    When reading 1 value, DeltaBitPackDecoder will not call InitBlock, causing it to always read last_value_.

    The problem seems to have been introduced in https://github.com/apache/arrow/pull/10627 and https://github.com/amol-/arrow/commit/d982bedcf5e03d44c01949b192da54a8c1e525d8

    I will add some tests tonight.

    • Closes: #15052
    Component: Parquet Component: C++ 
    opened by mapleFU 4
  • [Release][Docs][R] Update version information in patch release

    Describe the enhancement requested

    We updated the version information for the documentation manually in the 10.0.1 release: #14887

    We should automate it to reduce release cost. See also https://github.com/apache/arrow/pull/14887#issuecomment-1347332418 for an implementation idea.

    Component(s)

    R, Release

    Type: enhancement Component: R Component: Documentation Component: Release 
    opened by kou 0
  • [R][CI] pyarrow tests fail on macos 10.13 due to missing pyarrow wheel

    Describe the bug, including details regarding any error messages, version, and platform.

    see https://github.com/apache/arrow/issues/14829#issuecomment-1367506270

    Component(s)

    Continuous Integration, R

    Type: bug Component: R Component: Continuous Integration 
    opened by assignUser 0
Owner
The Apache Software Foundation
Fill Apache Arrow record batches from an ODBC data source in Rust.

arrow-odbc Fill Apache Arrow arrays from ODBC data sources. This crate is built on top of the arrow and odbc-api crates and enables you to read the dat

Markus Klein 21 Dec 27, 2022
Apache Arrow DataFusion and Ballista query engines

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

The Apache Software Foundation 2.9k Jan 2, 2023
A Rust DataFrame implementation, built on Apache Arrow

Rust DataFrame A dataframe implementation in Rust, powered by Apache Arrow. What is a dataframe? A dataframe is a 2-dimensional tabular data structure

Wakahisa 287 Nov 11, 2022
Official Rust implementation of Apache Arrow

Native Rust implementation of Apache Arrow Welcome to the implementation of Arrow, the popular in-memory columnar format, in Rust. This part of the Ar

The Apache Software Foundation 1.3k Jan 9, 2023
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, built to make the Data Cloud easy

Datafuse Labs 5k Jan 9, 2023
DataFrame / Series data processing in Rust

black-jack While PRs are welcome, the approach taken only allows for concrete types (String, f64, i64, ...) I'm not sure this is the way to go. I want

Miles Granger 30 Dec 10, 2022
A Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture, written in Rust

Datafuse Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture Datafuse is a Real-Time Data Processing & Analytics DBMS wit

Datafuse Labs 5k Jan 4, 2023
Dataflow is a data processing library, primarily for machine learning

Dataflow Dataflow is a data processing library, primarily for machine learning. It provides efficient pipeline primitives to build a directed acyclic

Sidekick AI 9 Dec 19, 2022
Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀

flaco Perhaps the fastest and most memory efficient way to pull data from PostgreSQL into pandas and numpy. 🚀 Have a gander at the initial benchmarks

Miles Granger 14 Oct 31, 2022
Apache TinkerPop from Rust via Rucaja (JNI)

Apache TinkerPop from Rust An example showing how to call Apache TinkerPop from Rust via Rucaja (JNI). This repository contains two directories: java

null 8 Sep 27, 2022
A new arguably faster implementation of Apache Spark from scratch in Rust

vega Previously known as native_spark. Documentation A new, arguably faster, implementation of Apache Spark from scratch in Rust. WIP Framework tested

raja sekar 2.1k Jan 5, 2023
New generation decentralized data warehouse and streaming data pipeline

World's first decentralized real-time data warehouse, on your laptop Docs | Demo | Tutorials | Examples | FAQ | Chat Get Started Watch this introducto

kamu 184 Dec 22, 2022
This library provides a data view for reading and writing data in a byte array.

Docs This library provides a data view for reading and writing data in a byte array. This library requires feature(generic_const_exprs) to be enabled.

null 2 Nov 2, 2022
ConnectorX - Fastest library to load data from DB to DataFrames in Rust and Python

ConnectorX enables you to load data from databases into Python in the fastest and most memory efficient way.

SFU Database Group 939 Jan 5, 2023
Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾

enum_pipline Provides a way to use enums to describe and execute ordered data pipelines. 🦀🐾 I needed a succinct way to describe 2d pixel map operat

Ben Greenier 0 Oct 29, 2021
AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations

AppFlowy is an open-source alternative to Notion. You are in charge of your data and customizations. Built with Flutter and Rust.

null 30.7k Jan 7, 2023
An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

An example repository on how to start building graph applications on streaming data. Just clone and start building 💻 💪

Memgraph 40 Dec 20, 2022
High-performance runtime for data analytics applications

Weld Documentation Weld is a language and runtime for improving the performance of data-intensive applications. It optimizes across libraries and func

Weld 2.9k Dec 28, 2022
A high-performance, high-reliability observability data pipeline.

Quickstart • Docs • Guides • Integrations • Chat • Download What is Vector? Vector is a high-performance, end-to-end (agent & aggregator) observabilit

Timber 12.1k Jan 2, 2023