Arrowdantic is a small Python library backed by a mature Rust implementation of Apache Arrow

Jorge Leitao

Last update: Dec 21, 2022

Overview

Welcome to arrowdantic

Arrowdantic is a small Python library backed by a mature Rust implementation of Apache Arrow that can interoperate with

Parquet
Apache Arrow and
ODBC (databases).

For simple (but data-heavy) data engineering tasks, this package essentially replaces pyarrow: it supports reading from and writing to Parquet and Arrow at the same or higher performance and with higher safety (e.g. no segfaults).

Furthermore, it supports reading from and writing to ODBC compliant databases at the same or higher performance than turbodbc.

This package is also suitable for environments such as AWS Lambda functions. It takes 13 MB of disk space, compared to 82 MB taken by pyarrow.

Features

declare and access Arrow-backed arrays (integers, floats, boolean, string, binary)
read from and write to Apache Arrow IPC files
read from and write to Apache Parquet
read from and write to ODBC-compliant databases (e.g. PostgreSQL, MongoDB)
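
The examples in this README build nullable arrays such as ad.UInt32Array([1, None]). As a rough illustration of what "Arrow-backed" means for such an array, the sketch below encodes the same values by hand, following the Arrow columnar format: a little-endian values buffer plus a validity bitmap in which a set bit marks a non-null slot. It uses only the standard library; encode_uint32 and decode_uint32 are hypothetical helper names, not part of arrowdantic, and the sketch omits the buffer padding that real implementations add.

```python
import struct


def encode_uint32(values):
    """Encode optional ints as (values buffer, validity bitmap, null count)."""
    # Null slots get a placeholder 0; the format leaves their content unspecified.
    data = struct.pack(f"<{len(values)}I", *(0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            # Least-significant-bit numbering: slot i lives in bit (i % 8) of byte (i // 8).
            bitmap[i // 8] |= 1 << (i % 8)
    null_count = sum(v is None for v in values)
    return data, bytes(bitmap), null_count


def decode_uint32(data, bitmap, length):
    """Rebuild the list of optional ints from the two buffers."""
    raw = struct.unpack(f"<{length}I", data)
    return [raw[i] if (bitmap[i // 8] >> (i % 8)) & 1 else None for i in range(length)]


data, bitmap, null_count = encode_uint32([1, None])
assert decode_uint32(data, bitmap, 2) == [1, None]
```

Keeping nulls in a separate bitmap means the values buffer stays a plain, fixed-width array, which is part of why the same memory can be shared across languages without copying.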

Examples

Use parquet

import io
import arrowdantic as ad

original_arrays = [ad.UInt32Array([1, None])]

schema = ad.Schema(
    [ad.Field(f"c{i}", array.type, True) for i, array in enumerate(original_arrays)]
)

data = io.BytesIO()
with ad.ParquetFileWriter(data, schema) as writer:
    writer.write(ad.Chunk(original_arrays))
data.seek(0)

reader = ad.ParquetFileReader(data)
chunk = next(reader)
assert chunk.arrays() == original_arrays

Use Arrow files

import io
import arrowdantic as ad

original_arrays = [ad.UInt32Array([1, None])]

schema = ad.Schema(
    [ad.Field(f"c{i}", array.type, True) for i, array in enumerate(original_arrays)]
)

data = io.BytesIO()
with ad.ArrowFileWriter(data, schema) as writer:
    writer.write(ad.Chunk(original_arrays))
data.seek(0)

reader = ad.ArrowFileReader(data)
chunk = next(reader)
assert chunk.arrays() == original_arrays

Use ODBC

import arrowdantic as ad


arrays = [ad.Int32Array([1, None]), ad.StringArray(["aa", None])]

with ad.ODBCConnector(r"Driver={SQLite3};Database=sqlite-test.db") as con:
    # create an empty table with a schema
    con.execute("DROP TABLE IF EXISTS example;")
    con.execute("CREATE TABLE example (c1 INT, c2 TEXT);")

    # insert the arrays
    con.write("INSERT INTO example (c1, c2) VALUES (?, ?)", ad.Chunk(arrays))

    # read the arrays
    with con.execute("SELECT c1, c2 FROM example", 1024) as chunks:
        assert chunks.fields() == [
            ad.Field("c1", ad.DataType.int32(), True),
            ad.Field("c2", ad.DataType.string(), True),
        ]
        chunk = next(chunks)
assert chunk.arrays() == arrays

Comments

Added float32 and float64 to datatype

Hello,

Thanks for this great lib, it is very useful.

I added float32 and float64, which were missing (because I need them). It works and I will deploy it in production.

In the tests, for a reason I don't understand, float32 transforms 1.2 into 1.2000000476837158.

Do you have any idea why?

opened by blackrez 3
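
On the float32 question above: this looks like ordinary float32 rounding rather than a problem in the patch. 1.2 has no exact binary representation, and a 32-bit float keeps fewer significant bits than Python's 64-bit float, so the nearest float32 value prints differently once it is widened back to 64 bits. A minimal standard-library sketch (to_float32 is a hypothetical helper, not an arrowdantic function):

```python
import math
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


x = to_float32(1.2)
print(x)                                   # 1.2000000476837158
print(x == 1.2)                            # False
print(math.isclose(x, 1.2, rel_tol=1e-6))  # True
```

In tests, comparing float32 columns with a tolerance (or against values that are exactly representable, such as 1.5) avoids the mismatch.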

Using the main version of arrow2

Hello,

I tried to use the latest version of arrow2, but my build failed.

https://github.com/blackrez/arrowdantic/actions/runs/3563855994/jobs/5987131244

I think this is due to an ODBC function.

But the odbc_fix patch is not merged into master. What is the blocking point, and how can I help?

Thanks in advance (I'm starting to use it in production and it works great).

opened by blackrez 2

Fixed release build

Hello,

I saw that the release build is broken. I tried to fix it, but I couldn't build because maturin's manylinux 2010 target doesn't support aarch64 (I only have an aarch64 environment). So I had to migrate to manylinux 2014, and it works.
bug

opened by blackrez 1

Cannot install on Macbook M1

Hello,

I can't install arrowdantic on a MacBook M1; the build fails with a linker error.

        = note: ld: warning: directory not found for option '-L/Users/nabil/lib'
                ld: library not found for -lodbc
                clang: error: linker command failed with exit code 1 (use -v to see invocation)

unixodbc is installed with brew.

opened by blackrez 1
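
One possible cause, stated as an assumption rather than a confirmed diagnosis: on Apple Silicon, Homebrew installs under /opt/homebrew instead of /usr/local, so the linker may not search the directory that contains libodbc. The sketch below looks for libodbc under the usual Homebrew prefixes and prints a value that could be exported as LIBRARY_PATH (which clang consults at link time) before installing. find_odbc_lib_dirs is a hypothetical helper and the prefix list is an assumption about a default Homebrew setup.

```python
import os
from pathlib import Path

# Assumed default Homebrew prefixes: Apple Silicon first, then Intel Macs.
HOMEBREW_PREFIXES = ("/opt/homebrew", "/usr/local")


def find_odbc_lib_dirs(prefixes=HOMEBREW_PREFIXES):
    """Return the lib directories under the given prefixes that contain libodbc."""
    found = []
    for prefix in prefixes:
        lib_dir = Path(prefix) / "lib"
        # Matches libodbc.dylib, libodbc.2.dylib, libodbc.a, libodbc.so, ...
        if lib_dir.is_dir() and any(lib_dir.glob("libodbc.*")):
            found.append(str(lib_dir))
    return found


dirs = find_odbc_lib_dirs()
if dirs:
    print("LIBRARY_PATH candidate:", os.pathsep.join(dirs))
else:
    print("libodbc not found under", ", ".join(HOMEBREW_PREFIXES))
```

If a directory is reported, exporting it as LIBRARY_PATH in the shell before running pip install may let the link step find -lodbc; this has not been verified against this specific report.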

Add wheels for aarch64 and python 3.8, 3.9, 3.10 for linux

Currently, the manylinux build only pushes wheels for Python 3.7 on amd64.

Maturin is able to build for multiple architectures and multiple Python versions.

For example: https://github.com/ijl/orjson/blob/master/.github/workflows/linux-cross.yaml

opened by blackrez 0

Exporting to numpy & cloud file systems

Hi @jorgecarleitao,

Thanks for putting this library together -- it looks awesome! I had a few quick questions.

What is the easiest way to convert to / from numpy using arrowdantic?

Do you have any recommendations for Rust-backed libraries with Python bindings that can read Arrow files from cloud storage (e.g., S3 or GCS)?

opened by benjaminrwilson 1

Owner

Jorge Leitao

Open source contributor; PMC member of Apache Arrow
