transmute-free Rust library to work with the Arrow format

Jorge Leitao

Last update: Dec 30, 2022


Overview

Arrow2: Transmute-free Arrow

This repository contains a Rust library to work with the Arrow format. It is a re-write of the official Arrow crate using transmute-free operations. See FAQ for details.

See the guide.

Design

This repo and crate's primary goal is to offer a safe Rust implementation to interoperate with the Arrow format. As such, it:

MUST NOT implement any logical type other than the ones defined on the arrow specification, schema.fbs.
MUST lay out memory according to the arrow specification
MUST support reading from and writing to the C data interface at zero-copy.
MUST support reading from, and writing to, the IPC specification, which it MUST verify against golden files available here.

Design documents about each of the parts of this repo are available on their respective READMEs.

Run unit tests

git clone git@github.com:jorgecarleitao/arrow2.git
cd arrow2
cargo test

The test suite is a superset of all integration tests that the original implementation has against golden files from the arrow project. It currently makes no attempt to test the implementation against other implementations in arrow's master; it assumes that arrow's golden files are sufficient to cover the specification. This crate uses both little and big endian golden files, as it supports both endiannesses at IPC boundaries.

Features in this crate and not in the original

Uses Rust's compiler whenever possible to prove that memory reads are sound
MIRI checks on non-IO components (MIRI and file systems are a bit funny atm)
IPC supports big endian
More predictable JSON reader
Generalized parsing of CSV based on logical data types
conditional compilation based on cargo features to reduce dependencies and size
single repository dedicated to Rust
faster IPC reader (different design that avoids an extra copy of all data)

Features in the original not available in this crate

Parquet IO
some compute kernels
SIMD (no plans to support: favor auto-vectorization instead)

Roadmap

parquet IO
bring documentation up to speed
compute kernels
auto-vectorization of bitmap operations

How to develop

This is a normal Rust project. Clone and run tests with cargo test.

FAQ

Why?

The arrow crate uses Buffer, a generic struct to store contiguous memory regions (of bytes). This construct is used to store data from all arrays in the rust implementation. The simplest example is a buffer containing 1i32, that is represented as &[0u8, 0u8, 0u8, 1u8] or &[1u8, 0u8, 0u8, 0u8] depending on endianness.
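As a small illustration of the two byte layouts mentioned above, the following sketch uses only the standard library; the helper name `encodings` is hypothetical and not part of either crate.

```rust
/// Returns the big-endian and little-endian byte representations of `value`.
/// (Hypothetical helper, used only for illustration.)
fn encodings(value: i32) -> ([u8; 4], [u8; 4]) {
    (value.to_be_bytes(), value.to_le_bytes())
}

fn main() {
    let (big, little) = encodings(1);
    // Big-endian stores the most significant byte first.
    assert_eq!(big, [0u8, 0, 0, 1]);
    // Little-endian stores the least significant byte first.
    assert_eq!(little, [1u8, 0, 0, 0]);
    // The native order depends on the target platform.
    println!("native: {:?}", 1i32.to_ne_bytes());
}
```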

When users wish to read from a buffer, e.g. to perform a mathematical operation with its values, they need to interpret the buffer in the target type. Because Buffer is a contiguous region of bytes with no type information, users must transmute its data into the respective type.

Arrow currently transmutes buffers on almost all operations, and very often does not verify that there is type alignment nor correct length when we transmute it to a slice of type &[T].
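To make the two missing checks concrete, here is a minimal sketch of what a checked reinterpretation of bytes as a slice of i32 could look like. It is not code from either crate; the function names `checked_i32_view` and `as_bytes` are hypothetical.

```rust
use std::mem::{align_of, size_of};

/// Reinterprets `bytes` as `&[i32]` only when both the alignment and the
/// length allow it. (Hypothetical helper, not an API of arrow or arrow2.)
fn checked_i32_view(bytes: &[u8]) -> Option<&[i32]> {
    let aligned = bytes.as_ptr() as usize % align_of::<i32>() == 0;
    let whole = bytes.len() % size_of::<i32>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for i32, the length is a whole number
    // of i32 values, and every bit pattern is a valid i32.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr() as *const i32, bytes.len() / size_of::<i32>())
    })
}

/// Views a slice of i32 as raw bytes (always sound: u8 has alignment 1).
fn as_bytes(values: &[i32]) -> &[u8] {
    // SAFETY: every byte of an i32 is initialized and u8 has alignment 1.
    unsafe {
        std::slice::from_raw_parts(values.as_ptr() as *const u8, values.len() * size_of::<i32>())
    }
}

fn main() {
    let values = [0i32, 2i32];
    let bytes = as_bytes(&values);
    // Aligned and of the right length: the view is allowed.
    assert_eq!(checked_i32_view(bytes), Some(&values[..]));
    // Starting one byte in breaks alignment (and the length): rejected.
    assert_eq!(checked_i32_view(&bytes[1..]), None);
    // Aligned, but 7 bytes is not a whole number of i32 values: rejected.
    assert_eq!(checked_i32_view(&bytes[..7]), None);
}
```

Skipping either check and performing the cast anyway is what makes an unchecked transmute unsound.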

Just as an example, in v3.0.0, the following code compiles, does not panic, is unsound and results in UB:

let buffer = Buffer::from_slice_ref(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));

println!("{:?}", array.value(1));

Note how this initializes a buffer with bytes from i32, initializes an ArrayData with dynamic type Int64, and then creates a Float64Array from that ArrayData wrapped in an Arc. Float64Array's internals will essentially consume the pointer from the buffer, re-interpret it as f64, and offset it by 1.

Still within this example, if we were to use ArrayData's datatype, Int64, to transmute the buffer, we would be creating &[i64] out of a buffer created out of i32.

Any Rust developer acknowledges that this behavior goes very much against Rust's core premise that a function's behavior must not be undefined depending on whether the arguments are correct. The obvious observation is that transmute is one of the most unsafe Rust operations, and not allowing the compiler to verify the necessary invariants is a large burden for users and developers to take.

This simple example indicates a broader problem with the current design, that we now explore in detail.

Root cause analysis

At its core, Arrow's current design is centered around two main structs:

untyped Buffer
untyped ArrayData

1. untyped `Buffer`

The crate's buffers are untyped, which implies that once created, the type information is lost. Consequently, the compiler has no way of verifying that a certain read can be performed. As such, any read from it requires an alignment and size check at runtime. This is not only detrimental to performance, but also very cumbersome.

Over the past 4 months, I have identified and fixed more than 10 instances of unsound code derived from the misuse, within the crate itself, of Buffer. This hints that downstream dependencies that use this crate and this API are likely to be even more affected.

2. untyped `ArrayData`

ArrayData is a struct containing buffers and child data that does not differentiate which type of array it represents at compile time.

Consequently, all buffer reads from ArrayData's buffers are effectively unsafe, as they require certain invariants to hold. These invariants are strictly related to ArrayData::datatype: this enum differentiates how to transmute the ArrayData::buffers. For example, an ArrayData::datatype equal to DataType::UInt32 implies that the buffer should be transmuted to u32.

The challenge with the above struct is that it is not possible to prove that ArrayData's creation is sound at compile time. As the sample above showed, there was nothing wrong, during compilation, with passing a buffer with i32 to an ArrayData expecting i64. We could of course check it at runtime, and we should, but we are defeating the whole purpose of using a type system as powerful as the one Rust offers.

The main consequence of this observation is that the current code has a significant maintenance cost, as we have to rigorously check the types of the buffers we are working with. The example above shows that, even with that rigour, we fail to identify obvious problems at runtime.

Overall, there are many instances of our code where we expose public APIs marked as safe that are unsafe and lead to undefined behavior if used incorrectly. This goes against the core goals of the Rust language, and significantly weakens the core premise of Arrow's Rust implementation: that the compiler and borrow checker prove many of the memory safety concerns that we may have.

Equally important, the inability of the compiler to prove certain invariants is detrimental to performance. As an example, the implementation of the take kernel in this repo is semantically equivalent to the current master, but 1.3-2x faster.

How?

Contrary to the original implementation, this implementation does not transmute byte buffers based on runtime types, and instead requires all buffers to be typed (in Rust's sense of a generic).

This removes many existing bugs and enables the compiler to prove all type invariants with the only exception of FFI and IPC boundaries.
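A minimal sketch of the idea, assuming nothing about arrow2's actual types: `TypedBuffer` and `Float64Column` below are hypothetical names used only to show how a generic parameter lets the compiler reject the mismatched buffer from the earlier example.

```rust
/// Hypothetical typed buffer: the element type is part of the Rust type.
#[derive(Debug, Clone, PartialEq)]
struct TypedBuffer<T> {
    values: Vec<T>,
}

impl<T: Copy> TypedBuffer<T> {
    fn from_slice(values: &[T]) -> Self {
        Self { values: values.to_vec() }
    }

    fn as_slice(&self) -> &[T] {
        &self.values
    }
}

/// Hypothetical array of f64 that can only be built from a buffer of f64.
struct Float64Column {
    values: TypedBuffer<f64>,
}

impl Float64Column {
    fn new(values: TypedBuffer<f64>) -> Self {
        Self { values }
    }

    /// Bounds-checked read; no reinterpretation of bytes is involved.
    fn value(&self, i: usize) -> Option<f64> {
        self.values.as_slice().get(i).copied()
    }
}

fn main() {
    let ints = TypedBuffer::from_slice(&[0i32, 2i32]);
    // The next line would not compile: expected `TypedBuffer<f64>`,
    // found `TypedBuffer<i32>`.
    // let column = Float64Column::new(ints);
    assert_eq!(ints.as_slice(), &[0, 2]);

    let floats = TypedBuffer::from_slice(&[0.0f64, 2.0]);
    let column = Float64Column::new(floats);
    assert_eq!(column.value(1), Some(2.0));
    assert_eq!(column.value(10), None);
}
```

The mismatch that was undefined behavior with an untyped buffer becomes a compile error here, and an out-of-bounds index yields None instead of reading arbitrary memory.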

This crate also has a different design towards arrays' offsets that removes many out-of-bounds reads that result from using byte and slot offsets interchangeably.

This crate's design of primitive types is also more explicit about its logical and physical representation, enabling support for Timestamps with timezones and a safe implementation of the Interval type.

Consequently, this crate is easier to use, develop, maintain, and debug.

Any plans to merge with the Apache Arrow project?

Yes. The primary reason to have this repo and crate is to be able to prototype and mature a fundamentally different design based on a transmute-free implementation. This requires breaking backward compatibility and losing features, which is impossible to achieve on the Arrow repo.

Furthermore, the arrow project currently has a release mechanism that is unsuitable for this type of work:

The Apache Arrow project has a single git repository with all 10+ implementations, including C++, Python, C#, Julia and Rust, as well as execution engines such as Gandiva and DataFusion. A git ref corresponds to all of them, and a commit is about any/all of them.

The implication is that this work would require a prohibitive number of Jira issues for each PR to the crate, as well as an inhumane number of PRs, reviews, etc.

Another consequence is that it is impossible to release a different design of the arrow crate without breaking every dependency within the project which makes it difficult to iterate.

A release of Apache Arrow consists of a release of all implementations of the arrow format at once, with the same version. It is currently at 3.0.0.

This implies that the crate version is independent of the changelog or its API stability, which violates SemVer. This procedure makes the crate incompatible with Rust's (and many others') ecosystem, which relies heavily on SemVer to constrain software versions.

Secondly, this implies the arrow crate is versioned as >0.x. This places expectations about API stability that are incompatible with this effort.

Comments

Replaced own allocator by `std::Vec`.

This PR moves the custom allocator to a feature gate cache_aligned, replacing it by std::Vec in the default, full, etc.

This allows users to create Buffer and MutableBuffer directly from a Vec at zero cost (MutableBuffer becomes a thin wrapper of Vec). This opens a whole range of possibilities related to crates that assume that the backing container is a Vec.

Note that this is fully compatible with arrow spec; we just do not follow the recommendation that the allocations follow 64 bytes (which neither this nor arrow-rs was following, since we have been using 128 byte alignments in x86_64). I was unable to observe any performance difference between 128-byte, 64-byte and std's (i.e. aligned with T) allocations.

This is backward incompatible, see #449 for details. Closes #449.
feature

opened by jorgecarleitao 17
Read Decimal from Parquet File

Hi,

So far we have the capability to write decimal to parquet. I wonder if we can implement reading decimal value from parquet file as well.

Thank you very much.
good first issue feature

opened by potter420 15
Added COW semantics to `Buffer`, `Bitmap` and some arrays
I made this PR to have some context for the discussion. I understand this hits the core of all memory operations, so we must ensure it's correct.

This allows for a PrimitiveArray to get a MutablePrimitiveArray zero copy. Allowing for mutations, pushes etc.

This conversion from PrimitiveArray to MutablePrimitiveArray can only be done if

The Arc<> pointer is not shared. This is checked with Arc::get_mut and the borrow checker

The data is allocated in Rust by a Vec and not by an FFI.

The data has an offset of 0 (let's keep it simple).

Both the validity Bitmap and the values Buffer<T> must satisfy the above requirements.
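The conditions above can be sketched with a toy type. This is not the PR's implementation: `SharedBuffer` is a hypothetical stand-in, and it uses `Arc::try_unwrap` rather than `Arc::get_mut` for the uniqueness check, because that makes it easy to hand back the owned `Vec`. The FFI condition is not modelled, since the toy type is always backed by a `Vec`.

```rust
use std::sync::Arc;

/// Hypothetical immutable buffer: a shared Vec plus an offset into it.
#[derive(Debug)]
struct SharedBuffer<T> {
    data: Arc<Vec<T>>,
    offset: usize,
}

impl<T> SharedBuffer<T> {
    fn new(values: Vec<T>) -> Self {
        Self { data: Arc::new(values), offset: 0 }
    }

    /// Returns the backing Vec without copying when the Arc is not shared
    /// and the offset is 0. Otherwise the buffer is returned unchanged.
    fn into_mut(self) -> Result<Vec<T>, Self> {
        if self.offset != 0 {
            return Err(self);
        }
        match Arc::try_unwrap(self.data) {
            Ok(values) => Ok(values),
            Err(data) => Err(Self { data, offset: 0 }),
        }
    }
}

fn main() {
    // Unique owner: conversion succeeds and reuses the same allocation.
    let buffer = SharedBuffer::new(vec![1, 2, 3]);
    let ptr = buffer.data.as_ptr();
    let mut values = buffer.into_mut().unwrap();
    assert_eq!(values.as_ptr(), ptr);
    values.push(4);
    assert_eq!(values, vec![1, 2, 3, 4]);

    // Shared owner: conversion is refused, so a caller would have to copy.
    let first = SharedBuffer::new(vec![1]);
    let _second = SharedBuffer { data: first.data.clone(), offset: 0 };
    assert!(first.into_mut().is_err());
}
```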

feature
opened by ritchie46 14
Added support for `Extension`

This PR adds support for Arrow's "Extension type".

This is represented by DataType::Extension and ExtensionArray, respectively.

For now, extension arrays are only supported to be shared via IPC (i.e. not FFI, since metadata is still not supported in FFI).

This PR adds an example demonstrating how to use it.

All of this is pending passing integration tests, as usual, as well as more tests covering the feature.
feature

opened by jorgecarleitao 13
Simplified reading parquet
This PR simplifies the code to read parquet, making it a bit more future proof and opening the doors to improve performance in writing by re-using buffers (improvements upstream).

I do not observe differences in performance (vs main) in the following parquet configurations:

single page vs multiple pages

compressed vs uncompressed

different types

It generates flamegraphs that imo are quite optimized:

This corresponds to

cargo flamegraph --features io_parquet,io_parquet_compression \
    --example parquet_read fixtures/pyarrow3/v1/multi/snappy/benches_1048576.parquet \
    1 0

i.e. reading a f64 column from a single row group with 1M rows with a page size of 1Mb (default in pyarrow).

(same but for a utf8 column (column index 2))

The majority of the time is used deserializing the data to arrow, which means that the main gains to have continue to be on that front.

Backward incompatible

The API to read parquet now uses FallibleStreamingIterator instead of StreamingIterator (of Result<Page>). As before, we re-export these APIs in io::parquet::read.

The API to write parquet now expects the user to compress the pages. This is only relevant when not using RowGroupIterator (i.e. in parallelizing). This is now enforced by the type system (DataPage vs CompressedDataPage), so that we do not get it wrong.

backwards-incompatible
opened by jorgecarleitao 12
IPC's `StreamReader` may abort due to excessive memory by overflowing a `usize`d variable

Hi,

I've been trying to use this crate as means to transfer real time streaming data between pyarrow and a Rust process. However, I've been having hard-to-debug issues with the data readout on the Rust side - basically after a few seconds of stream I always get a message similar to: memory allocation of 18446744071884874941 bytes failed and the process aborts without a traceback (Windows 10).

To debug it I forked and inserted a few logging statements in stream.rs which revealed this line as the culprit. Basically, at some point while my application runs meta_len becomes negative, and when it's cast to a usize it obviously wraps around and causes the memory overflow. Now, if meta_len is negative it also means that meta_size is corrupt as well, but I'm still not quite sure what causes this corruption.
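The wrap-around can be reproduced in isolation. The sketch below assumes the length is read as an i32, which the report does not state; the function names are hypothetical. Under that assumption, the value -1824676675 maps to exactly the byte count in the error message on a 64-bit target, and a checked conversion turns the same input into an error instead.

```rust
use std::convert::TryFrom;

/// What an `as` cast does to a negative length on a 64-bit target:
/// the value is sign-extended and then reinterpreted as unsigned.
fn wrapped_len(meta_len: i32) -> u64 {
    meta_len as u64
}

/// A checked alternative: negative lengths become an error.
/// (Hypothetical helper, not the crate's API.)
fn checked_len(meta_len: i32) -> Result<usize, String> {
    usize::try_from(meta_len)
        .map_err(|_| format!("invalid (negative) metadata length: {}", meta_len))
}

fn main() {
    // 2^64 - 1824676675 = 18446744071884874941
    assert_eq!(wrapped_len(-1824676675), 18446744071884874941u64);
    assert!(checked_len(-1824676675).is_err());
    assert_eq!(checked_len(16), Ok(16));
}
```

With a checked conversion the reader can return an error to the caller, who can then decide whether to retry or stop, rather than aborting on an impossible allocation.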

During the live streaming I get this error pretty reliably, but if I try to re-run this script without the Python side of things, i.e. only reading from the Arrow IPC file and not writing anything new to it, the file parses correctly. In other words, it appears that concurrent access to the file of the Python process - which writes the data - and the Rust process causes this issue. I was under the impression that Arrow's NativeFile format should help in these situations, but I might've interpreted the pyarrow docs incorrectly. I'm not quite sure how to lock a file between two different processes (if that's even possible), so I'm kinda lost with respect to that issue.

With all of this background, my question is a bit more straight-forward: Assuming that there happened some corruption at the meta_size level, how should I recover from it? Should I move the file Cursor back a few steps? Should I try to re-read the data? Or is aborting the streaming all that's left?

I hope my issue and specific question are clear enough. As a side note, I tried using arrow-rs's IPC implementation, but they have more severe bugs which are much less subtle than this one.

opened by HagaiHargil 12

IPC writing writes all values in sliced arrays with offsets.

I am new to Rust and Arrow (so apologies for my lack of clarity in explaining this issue) and have a use case where I need to transform columnar record batches to row-oriented record batches, i.e. a single record batch for each row.

Full code to reproduce: https://github.com/dectl/arrow2-slice-bug

It is expected that the rows vector of arrays would each contain values for a single row as I sliced the columns using a row number offset with a length of 1.

In this example, I read 5 rows from a csv file and create a record batch. The row-oriented record batches produced are all the same as the original record batch of 5. Looking at the debug output for rows[0] below, it appears that the arrays point to the full record batch buffer, containing data for all 5 rows, rather than the slice of data for the single row. Is there perhaps an issue with the offsets in this implementation or am I doing something wrong?

RecordBatch { schema: Schema { fields: [Field { name: "trip_id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_pickup_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_dropoff_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "passenger_count", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_distance", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "fare_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "tip_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "total_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "payment_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9da480, len: 6, data: [0, 36, 72, 108, 144, 180] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9da780, len: 180, data: [97, 49, 55, 99, 53, 50, 97, 49, 45, 100, 53, 52, 49, 45, 52, 54, 99, 56, 45, 57, 100, 49, 101, 45, 51, 51, 51, 57, 97, 56, 55, 49, 53, 100, 50, 48, 48, 49, 101, 100, 100, 48, 52, 56, 45, 97, 48, 100, 48, 45, 52, 100, 49, 49, 45, 56, 102, 99, 99, 45, 99, 54, 100, 51, 52, 48, 51, 51, 52, 57, 53, 50, 56, 55, 49, 54, 97, 56, 97, 98, 45, 98, 99, 56, 51, 45, 52, 53, 50, 56, 45, 97, 57, 98, 100, 45, 54, 55, 101, 48, 51, 53, 49, 54, 50, 49, 51, 100, 99, 56, 99, 99, 48, 48, 57, 97, 45, 57, 57, 97, 52, 45, 52, 49, 
48, 101, 45, 56, 56, 56, 54, 45, 52, 101, 55, 57, 56, 101, 57, 49, 56, 51, 52, 50, 101, 50, 54, 54, 54, 97, 57, 101, 45, 53, 102, 55, 48, 45, 52, 54, 98, 53, 45, 57, 55, 98, 99, 45, 53, 55, 53, 98, 100, 52, 56, 100, 53, 57, 57, 100] }, offset: 0, length: 180 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9daa80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9dac80, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 50, 58, 51, 48, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 53, 58, 53, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 49, 58, 51, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 54, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 49, 57, 58, 53, 55] }, offset: 0, length: 95 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9dae80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9db080, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 52, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 54, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 49, 58, 49, 52, 58, 50, 49, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 51, 48, 58, 53, 54] }, offset: 0, length: 95 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4180, len: 5, data: [0.0, 1.28, 2.47, 6.3, 2.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: 
Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4380, len: 5, data: [3.5, 20.0, 10.5, 21.0, 10.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4580, len: 5, data: [0.01, 4.06, 3.54, 0.0, 0.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4780, len: 5, data: [4.81, 24.36, 15.34, 25.05, 11.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4980, len: 5, data: [1.0, 1.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4c80, len: 5, data: [1.0, 2.0, 1.0, 1.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }] }

This seems wrong: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }

use std::path::Path;
use std::sync::Arc;

use arrow2::io::csv::read;
use arrow2::array::Array;
use arrow2::record_batch::RecordBatch;


fn main() {
    let mut reader = read::ReaderBuilder::new().from_path(&Path::new("sample.csv")).unwrap();
    let schema = Arc::new(read::infer_schema(&mut reader, Some(10), true, &read::infer).unwrap());
    
    let mut rows = vec![read::ByteRecord::default(); 5];
    read::read_rows(&mut reader, 0, &mut rows).unwrap();
    let rows = rows.as_slice();

    let record_batch = read::deserialize_batch(
        rows,
        schema.fields(),
        None,
        0,
        read::deserialize_column,
    ).unwrap();

    println!("full record batch:");
    println!("{:?}", record_batch);

    // Convert columnar record batch to a vector of "row" (single record) record batches
    let mut rows = Vec::new();

    for i in 0..record_batch.num_rows() {
        let mut row = Vec::new();
        for column in record_batch.columns() {
            let arr: Arc<dyn Array> = column.slice(i, 1).into();
            row.push(arr);
        }
        rows.push(row);
    }
    
    println!("row record batches:");
    for (i, row) in rows.into_iter().enumerate() {
        println!("row {}", i);
        let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
        println!("{:?}", row_record_batch);
    }
}

Below is my equivalent code for the official arrow-rs implementation, which appears to work as expected.

use std::fs::File;
use std::path::Path;

use arrow::csv;
use arrow::util::pretty::print_batches;
use arrow::record_batch::RecordBatch;


fn main() {
    let file = File::open(&Path::new("sample.csv")).unwrap();
    let builder = csv::ReaderBuilder::new()
        .has_header(true)
        .infer_schema(Some(10));
    let mut csv = builder.build(file).unwrap();

    let record_batch = csv.next().unwrap().unwrap();
    let schema = record_batch.schema().clone();

    println!("full record batch:");
    print_batches(&[record_batch.clone()]).unwrap();

    // Convert columnar record batch to a vector of "row" (single record) record batches
    let mut rows = Vec::new();


    for i in 0..record_batch.num_rows() {
        let mut row = Vec::new();
        for column in record_batch.columns() {
            row.push(column.slice(i, 1));
        }
        rows.push(row);
    }
    

    // Using new record_batch.slice() method now implemented in SNAPSHOT-5.0.0
    /*
    for i in 0..record_batch.num_rows() {
        rows.push(record_batch.slice(i, 1));
    }
    */

    println!("row record batches:");
    for row in rows {
        let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
        print_batches(&[row_record_batch]).unwrap();
    }

    /*
    for row in rows {
        print_batches(&[row]).unwrap();
    }
     */
}

opened by declark1 12

Support to read/write from/to ODBC

This PR adds support to reading from, and writing to, an ODBC driver.

I anticipate this to be one of the most efficient ways of loading data into arrow, as odbc-api offers an API to load data in a columnar format whereby most buffers are copied back-to-back into arrow (even when nulls are present). Variable-length data and validity need a small O(N) deserialization, so this is not as fast as Arrow IPC (but likely much faster than parquet).
feature

opened by jorgecarleitao 11
Consider removing `RecordBatch`
For historical reasons, we have RecordBatch. RecordBatch represents a collection of columns with a schema.

I see a couple of problems with RecordBatch:

it mixes metadata (Schema) with data (Array). In all IO cases we have, the Schema is known when the metadata from the file is read, way before data is read. I.e. the user has access to the Schema very early, and does not really need to pass it to an iterator or stream of data for the stream to contain the metadata. However, it is required to do so by our APIs, because our APIs currently return a RecordBatch (and thus need a schema on them) even though all the schemas are the same.

it is not part of the arrow spec. A RecordBatch is only mentioned in the IPC, and it does not contain a schema (only columns)

it is a struct that can easily be recreated by users that need it

It indirectly drives design decisions to use it as the data carrier, even though it is not a good one. For example, in DataFusion (apache/arrow-datafusion) the physical nodes return a stream of RecordBatch, which requires piping schemas all the way to the physical nodes so that they can in turn use them to create a RecordBatch. This could have been replaced by Vec<Arc<dyn Array>>, or even more exotic carriers (e.g. an enum with a scalar and vector variants).

help wanted no-changelog investigation
opened by jorgecarleitao 11
Added example showing parallel writes to parquet (x num_cores)

Encoding + compression is embarrassingly parallel across columns, and thus results in a speedup factor equal to the number of available cores, up to the number of columns to be written.
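A minimal sketch of per-column parallelism using only the standard library (scoped threads, Rust 1.63+). It is not the example added by this PR: `encode_column` is a hypothetical stand-in for the real encoding and compression step, and it spawns one thread per column rather than capping at the number of cores.

```rust
use std::thread;

/// Stand-in for "encode + compress one column" (hypothetical; the real
/// work would produce parquet pages).
fn encode_column(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encodes every column on its own thread and returns the results in
/// column order.
fn encode_columns_parallel(columns: &[Vec<i64>]) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        // Spawn all workers first so that they run concurrently.
        let handles: Vec<_> = columns
            .iter()
            .map(|column| scope.spawn(move || encode_column(column)))
            .collect();
        // Joining in order keeps the output aligned with the input.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("encoder thread panicked"))
            .collect()
    })
}

fn main() {
    let columns = vec![vec![1i64, 2, 3], vec![4i64, 5], vec![6i64]];
    let parallel = encode_columns_parallel(&columns);
    let sequential: Vec<Vec<u8>> = columns.iter().map(|c| encode_column(c)).collect();
    assert_eq!(parallel, sequential);
}
```

Because columns share no state during encoding, the only coordination needed is collecting the results in order before writing them out.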
documentation

opened by jorgecarleitao 11
Add support to read JSON
Currently we write and read JSON lines.

https://jsonlines.org/

I believe it would be a minor modification to also be able to read and write JSON. We could wrap the JSON Line values in an array and separate them with a , instead of a new line char.

E.g. now we write:

JSON Lines

{c1:"a"}
{c1:null}

JSON

[{c1:"a"}, {c1:null}]
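As a rough sketch of the proposed wrapping, the function below (a hypothetical name, not part of arrow2) joins non-empty lines with a comma and encloses them in brackets. It works on text only and does not validate that each line is well-formed JSON.

```rust
/// Wraps JSON Lines text into a single JSON array by joining the
/// non-empty lines with ", ". Lines are not parsed or validated.
fn json_lines_to_array(lines: &str) -> String {
    let records: Vec<&str> = lines
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect();
    format!("[{}]", records.join(", "))
}

fn main() {
    let input = "{\"c1\":\"a\"}\n{\"c1\":null}\n";
    let output = json_lines_to_array(input);
    assert_eq!(output, "[{\"c1\":\"a\"}, {\"c1\":null}]");
    println!("{}", output);
}
```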

Would this fit the scope of arrow2, as this is something different than what pyarrow does? I don't mean that we should drop the JSON Lines functionality, but that we also allow reading and writing JSON.
enhancement no-changelog
opened by ritchie46 10
Dependencies update

There's a few dependencies that are quite outdated (including some hefty ones, like zstd). The only one left outdated is odbc-api since I'm unsure about that.

opened by aldanor 1
StructArray convenience methods (redux)
I see #61 is marked as completed, but the methods discussed don't seem to exist. Besides that, I'd guess that most of the time when a caller knows a certain field exists, they probably also know its type. So I'd love to be able to do something like this:

arr.child::<PrimitiveArray<u64>>("id").unwrap()

Thoughts? I can take a stab at a PR, if it seems worthwhile.
opened by hohav 0
Reusing chunks when writing?

I'm currently using IPC file writer, where the write() method requires you pass a Chunk<Box<dyn Any>> (note: specifically a box).

If I have a bunch of MutablePrimitiveArray columns that I collect the data to until the chunk fills up, the only choice I have is to move them (i.e. std::mem::take) into Chunk, which will then require reallocation once the data for the next chunk gets collected. There's no feasible way of getting the original arrays out of Chunk either (you can get a &PrimitiveArray by downcasting, but you can't convert it back to a concrete sized type).

Wonder if it's an oversight of some sort (in which case, where exactly, in the IPC API?) or whether there's a good reason? E.g. write() requires Chunk<Box<dyn Any>> whereas it could probably accept a Chunk<impl AsRef<dyn Any>>?

opened by aldanor 0
fix csv infer_schema on empty fields

The default infer function provided by src/io/csv/utils.rs infers an empty field as DataType::Utf8. If the same column also contains non-empty fields with valid DataType::Float64 data, the column in infer_schema will then contain both DataTypes. When merge_schema is called it will decide that DataType::Utf8 has precedence over DataType::Float64 and set the column type to DataType::Utf8.

If we instead decline to infer anything for a field without any data, we will in the end get DataType::Float64 if the column contained other fields with valid f64 data. If we only had empty data for the column, it will still default to DataType::Utf8.
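The proposed rule can be sketched independently of the crate. The enum and functions below are hypothetical and cover only the two types discussed here; they are not the code in src/io/csv/utils.rs.

```rust
/// The two types discussed in this PR (a hypothetical, reduced enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Inferred {
    Float64,
    Utf8,
}

/// Infers the type of a single field, declining to infer for empty fields.
fn infer_field(field: &str) -> Option<Inferred> {
    if field.is_empty() {
        None
    } else if field.parse::<f64>().is_ok() {
        Some(Inferred::Float64)
    } else {
        Some(Inferred::Utf8)
    }
}

/// Merges the per-field results of one column: Utf8 takes precedence over
/// Float64, and a column with no evidence at all defaults to Utf8.
fn infer_column(fields: &[&str]) -> Inferred {
    let mut seen_float = false;
    for field in fields {
        match infer_field(field) {
            Some(Inferred::Utf8) => return Inferred::Utf8,
            Some(Inferred::Float64) => seen_float = true,
            None => {} // empty field: contributes nothing
        }
    }
    if seen_float {
        Inferred::Float64
    } else {
        Inferred::Utf8
    }
}

fn main() {
    // Empty fields no longer force the column to Utf8.
    assert_eq!(infer_column(&["1.5", "", "2"]), Inferred::Float64);
    // A column with only empty fields still defaults to Utf8.
    assert_eq!(infer_column(&["", ""]), Inferred::Utf8);
    // Genuine text still wins over numbers.
    assert_eq!(infer_column(&["1.5", "abc"]), Inferred::Utf8);
}
```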

opened by tripokey 3
Arrow2 read parquet file did not reuse the page decoder buffer to array
Let's look at this code in
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226

State::Required(page) => {
    values.extend(
        page.values
            .by_ref()
            .map(decode)
            .map(self.op)
            .take(remaining),
    );
}

It has an extra memcpy in values.extend and decode; I think we could optimize it by cloning a Buffer instead.

The first step would be to change

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Vec<u8>,
    ...
}

to

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Buffer<u8>,
    ...
}

@jorgecarleitao what do you think about this?

I found arrow-rs had addressed this improvement in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138
opened by sundy-li 3

Releases(v0.15.0)

v0.15.0(Dec 18, 2022)
A new release is here, adding a number of new features and improvements to arrow2. Thank you to everyone that contributed to it!

This release adds support to a new format, the "record" JSON format, contributed by @AnIrishDuck, a new trait TryExtendFromSelf to efficiently concatenate an array into an existing mutable array, and multiple improvements by @sundy-li and @ritchie46 to performance. Finally, we have a new API OffsetsBuffer and Offsets proposed by @ritchie46 to allow creating variable sized-arrays without having to check for offsets.

This release also features a number of contributions from first contributors:

@benesch made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1271

@RinChanNOWWW made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1287

@datapythonista made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1290

@sandflee made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1286

@Samrose-Ahmed made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1279

@jondo2010 made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1300

@cyr made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1318

@universalmind303 made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1321

Thank you everyone for the great work this year, and happy festivities everyone!

Full Changelog

Breaking changes:

Added values' capacity to MutableBinaryArray::reserve #1277

Removed from_data from all arrays #1328 (jorgecarleitao)

Added Offsets and OffsetsBuffer #1316 (jorgecarleitao)

Bumped parquet2 dependency #1304 (ritchie46)

Added data_pagesize_limit to write parquet pages #1303 (sundy-li)

Bumped arrow-format to 0.8 #1298 (Xuanwo)

Improved iterators #1270 (jorgecarleitao)

New features:

Added TryExtendFromSelf #1278 (jorgecarleitao)

Added support for JSON ser/de records layout #1275 (AnIrishDuck)

Fixed bugs:

Parquet writes all values of sliced arrays? #1323

Avro schema: Invalid record names #1269

Fixed writing nested/sliced arrays to parquet #1326 (ritchie46)

Fixed failing to accept dictionary full of nulls #1312 (ritchie46)

Added support for Extension types in ffi #1300 (jondo2010)

Fixed error in memory usage of sliced binary/list/utf8arrays #1293 (ritchie46)

Fixed descending ordering when specifying nulls first #1286 (sandflee)

Added avro record names when converting arrow schema to avro #1279 (Samrose-Ahmed)

Enhancements:

Fixed clippy #1336 (jorgecarleitao)

Improved UnionArray #1331 (jorgecarleitao)

Bumped json-deserializer version #1321 (universalmind303)

Removed flushing during arrow IPC writing to improve performance when using a buffered writer #1318 (cyr)

Improved performance of check_indexes #1313 (ritchie46)

Improved performance of checking offsets (~-64% to -73%) #1305 (ritchie46)

Added reserve to pushable containers in parquet extend_from_decoder #1301 (ritchie46)

Optimized slicing #1285 (jorgecarleitao)

Improved ZipValidity iterators #1284 (ritchie46)

Added MutableBinaryValuesArray #1276 (jorgecarleitao)

Documentation updates:

Fixed link from the API to the guide #1290 (datapythonista)

v0.14.1(Sep 27, 2022)
A couple of backward-compatible bug fixes and improvements that everyone benefits from :)

Thank you @cjermain, @shaeqahmed and @ozgrakkurt! 🙇

Full Changelog

Fixed bugs:

Potential bug in reading lists from avro? #1252

Removed un-used code #1258 (jorgecarleitao)

Fixed error reading unbounded Avro list #1253 (jorgecarleitao)

Add missing call to try_push_valid for nested avro deserialization #1248 (shaeqahmed)

Enhancements:

Bump json_deserializer version to 0.4.1 #1261 (cjermain)

Fixed clippy for 1.60 #1259 (jorgecarleitao)

Added BinaryArray::into_mut and double-ended support for its iterator #1255 (ozgrakkurt)

Testing updates:

Improved test for nullable struct read from Avro #1250 (jorgecarleitao)

v0.14.0(Sep 12, 2022)
Another release of arrow2 is here!

Besides API improvements to reading IPC and parquet, there are two main new features, the ability to memory map arrow files (check out https://jorgecarleitao.github.io/arrow2/v0.14.0/guide/io/ipc_mmap.html) and support for decimal 256.

The following people made their first contribution to this crate:

@daniel-martinez-maqueda-sap made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1204

@AnIrishDuck made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1211

@samkaufman made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1213

@teymour-aldridge made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1225

@poga made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1234

@knil-sama made their first contribution in https://github.com/jorgecarleitao/arrow2/pull/1237

Thank you everyone for all the issues, PRs and ideas!

Full Changelog

Breaking changes:

Removed Count (parquet statistics) #1217 (jorgecarleitao)

Exposed parquet indexed page filtering to FileReader #1216 (jorgecarleitao)

Simpler IPC API #1208 (jorgecarleitao)

Migrated Avro code to avro-schema repo #1199 (jorgecarleitao)

Added support for decimal 256 #1194 (jorgecarleitao)

New features:

Added support for decoding delta-length-encoded binary (parquet) #1228 (jorgecarleitao)

Added support to read and write Parquet's delta-bitpacked (integer encoding) #1226 (jorgecarleitao)

Added support for parquet sidecar to FileReader #1215 (jorgecarleitao)

Write 64bit aligned IPC files #1201 (jorgecarleitao)

Added support to mmap IPC format #1197 (jorgecarleitao)

Added MutableStructArray #1196 (hohav)

Fixed bugs:

Stack overflow in parquet RowGroupReader with groups_filter #1206

Fixed comparison and validity kernels #1243 (ritchie46)

Fixed reading nested stats #1240 (jorgecarleitao)

FileSink now closes the underlying writer. #1213 (samkaufman)

Fixed JSON infer order #1212 (jorgecarleitao)

Fixed StackOverflow in skipping many parquet row groups #1210 (jorgecarleitao)

Fix escaped like wildcards #1204 (daniel-martinez-maqueda-sap)

Removed println :( #1203 (jorgecarleitao)

Enhancements:

Added schema to FileReader #1246 (jorgecarleitao)

Simpler nested parquet read #1241 (jorgecarleitao)

Removed unneeded code #1229 (jorgecarleitao)

Improved MutableStruct::push #1223 (hohav)

Reduced binary size #1221 (jorgecarleitao)

Added utf8 <> binary cast #1220 (jorgecarleitao)

split parquet compression backend features #1207 (ritchie46)

Improved API of mmap #1205 (ritchie46)

Added MutableArray::reserve #1202 (jorgecarleitao)

Delayed dict #1185 (jorgecarleitao)

Documentation updates:

Fixed guide and improved examples #1247 (jorgecarleitao)

Added documentation on parquet compatibility under TimeUnit. #1238 (TurnOfACard)

Fixed typo in error message for impl StructArray #1237 (knil-sama)

Fixed incorrect command in doc for generating ORC files #1234 (poga)

Improved github page generation #1233 (jorgecarleitao)

Fix a typo in the docs #1225 (teymour-aldridge)

Fix some doc links/typos #1211 (AnIrishDuck)

Testing updates:

Fixed clippy warnings #1227 (jorgecarleitao)

Updated integration test #1214 (jorgecarleitao)

v0.13.1(Aug 4, 2022)
Full Changelog

Thanks @daniel-martinez-maqueda-sap!

Fixed bugs:

Fix escaped like wildcards #1204 (daniel-martinez-maqueda-sap)

Removed println :( #1203 (jorgecarleitao)

v0.13.0(Jul 31, 2022)
A new version (0.13) is now available on crates.io! 🎉🎉🎉🎉

This is another large release of arrow2. Among the many, many changes (see below), it is worth noting:

Added copy-on-write API to perform operations in place, improving performance of expressions like (a + b) * 2 by a factor of 2-10x

Added support to read from Apache ORC format

Added support for projection and limit pushdown when reading from Arrow IPC format

Added support for f16
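As a rough illustration of what projection and limit pushdown mean for a reader, here is a minimal sketch in plain Rust. It does not use arrow2's IPC API: `Batch` and `read_with_pushdown` are hypothetical names, and in-memory vectors stand in for batches stored in a file. The point is that only the requested columns are materialized and reading stops as soon as `limit` rows have been produced.

```rust
/// A hypothetical stored batch: one `Vec<i64>` per column, all of equal length.
pub type Batch = Vec<Vec<i64>>;

/// Reads `batches` while applying a column projection and a row limit.
/// Columns outside `projection` are never copied, and batches after the
/// limit is reached are never visited.
/// Panics if a projected index is out of range for a batch.
pub fn read_with_pushdown(batches: &[Batch], projection: &[usize], limit: usize) -> Vec<Batch> {
    let mut remaining = limit;
    let mut out = Vec::new();
    for batch in batches {
        if remaining == 0 {
            break; // limit pushdown: stop reading early
        }
        let rows = batch.first().map_or(0, |column| column.len());
        let take = rows.min(remaining);
        // projection pushdown: only the selected columns are materialized
        let projected: Batch = projection
            .iter()
            .map(|&i| batch[i][..take].to_vec())
            .collect();
        out.push(projected);
        remaining -= take;
    }
    out
}
```

For three batches of three rows each, a projection of `[1]` with a limit of 4 yields the second column of the first batch in full, one row from the second batch, and never touches the third.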

Thank you to the numerous contributors, both via PRs and issues, that resulted in this fantastic release 🙇

Breaking changes:

Made nested argument of array_to_pages non-owning #1174

Replaced Result by panic in boolean comparison #1159 (jorgecarleitao)

Improved dictionary invariants #1137 (jorgecarleitao)

Change signature of PrimitiveScalar::value to return reference #1129 (ncpenke)

Removed need to pass encodings by value #1123 (ritchie46)

Removed unused NativeType::to_ne_bytes #1112 (jorgecarleitao)

Avoid clone in with_validity #1104 (jorgecarleitao)

Reduced need of unsafe in FFI #1100 (jorgecarleitao)

Removed Buffer::into_mut and make_mut functions #1089 (jorgecarleitao)

Renamed Bitmap::null_count to Bitmap::unset_bits #1087 (jorgecarleitao)

Made chunk_size optional in parquet's column_iter_to_arrays #1055 (jorgecarleitao)

Migrated from Arc<dyn Array> to Box<dyn Array> #1042 (jorgecarleitao)

New features:

Added support to read ORC #1189 (jorgecarleitao)

Added support for limit pushdown to IPC reading #1135 (jorgecarleitao)

Added support to write and read Intervals from and to parquet #1122 (jorgecarleitao)

Added support to write FixedSizeBinary to Avro #1118 (jorgecarleitao)

Added support for projections in reading IPC streams #1097 (joshuataylor)

Added support to write parquet _metadata sidecar #1063 (jorgecarleitao)

Added cow APIs (2x-10x vs non-cow) #1061 (jorgecarleitao)

Added support to read and write f16 #1051 (jorgecarleitao)

Fixed bugs:

Fixed not-implemented error when reading plain, after-dict pages for fix-len-binary from parquet #1192 (jorgecarleitao)

Fixed error in decoding nested multi-page columns from parquet #1188 (jorgecarleitao)

Fixed error in counting items in nested parquet #1182 (jorgecarleitao)

Fixed reading stats from int96 parquet #1181 (jorgecarleitao)

Fixed limit pushdown in parquet #1180 (jorgecarleitao)

use FnOnce for PrimitiveArray::apply_validity #1176 (ritchie46)

release memory on predicate with 0% selectivity #1163 (ritchie46)

Fixed error in reading Struct<List<...>> from parquet #1150 (jorgecarleitao)

Fixed IPC projection #1149 (ritchie46)

Fixed casting dictionary keys #1143 (ritchie46)

Fixed reading arrays from parquet with required children #1140 (jorgecarleitao)

Fixed panic in deserializing nested statistics #1139 (jorgecarleitao)

Aligned name of FixedSizeBinaryArray::values_iter #1117 (jorgecarleitao)

Fixed error in FixedSizeListArray::new_null #1114 (jorgecarleitao)

Fixed panic in writing dictionaries to parquet #1113 (jorgecarleitao)

Fixed error in reading chunked parquet #1108 (jorgecarleitao)

Raise error when invalid fields are passed to flight #1093 (jorgecarleitao)

Made IPC projection not sort projection #1082 (jorgecarleitao)

Fixed error in chunked_mut bitmap #1081 (jorgecarleitao)

Fixed panic in bitmap assign_mut #1078 (ritchie46)

Panic-free read of IPC files #1075 (jorgecarleitao)

Bumped parquet2 (minor) requirement #1071 (jorgecarleitao)

Fixed divide by zero on reading empty row group #1062 (jorgecarleitao)

Fixed missing validation of number of encodings passed when writing to parquet #1057 (jorgecarleitao)

Enhancements:

Improved performance of reading Binary from parquet #1190 (ritchie46)

Bumped to latest nightly #1186 (gyscos)

Improved error message #1179 (jorgecarleitao)

Added support to read and write nested dictionaries to parquet #1175 (jorgecarleitao)

Added MutableUtf8Array::into_data #1170 (ritchie46)

Added Default for Utf8Array #1169 (ritchie46)

fix(parquet): allow to read other logical types from parquet #1168 (sundy-li)

fix(parquet): enforce to use ParquetTimeUnit::Nanoseconds for PhysicalType::Int96 #1167 (sundy-li)

Added constructor MutableFixedSizeListArray::new_from #1161 (hohav)

Removed unneeded Default constraint #1157 (hohav)

Improved checks to safety invariants in FFI #1154 (jorgecarleitao)

Removed un-needed indirection #1153 (jorgecarleitao)

Soften generic constraint of Buffer #1152 (sundy-li)

Use ahash by default #1148 (ritchie46)

Reduced bound checks #1142 (ritchie46)

Moved Bytes to own crate #1141 (jorgecarleitao)

Fixed clippy for 1.62 #1134 (Xuanwo)

Cleaned example #1130 (jorgecarleitao)

Removed O(N) clone in writing CSV #1128 (jorgecarleitao)

Avoid zeroed allocation in reading avro #1127 (jorgecarleitao)

Reduced allocations of reading bitmaps from IPC #1126 (jorgecarleitao)

Improved performance of reading from IPC #1125 (jorgecarleitao)

Improved parquet read performance #1124 (jorgecarleitao)

Optimized write nulls to Avro #1119 (jorgecarleitao)

Made row_group::get_field_columns public #1110 (ritchie46)

Removed some panics reading invalid parquet files #1106 (jorgecarleitao)

Reduced reallocations when reading from IPC (~12%) #1105 (ritchie46)

Exposed utilities in io::flight #1094 (jorgecarleitao)

Accept decoding parquet's i64 into u32 written by pyarrow #1090 (jorgecarleitao)

Simplified code #1088 (jorgecarleitao)

Removed un-necessary allocation in assign_ops #1085 (jorgecarleitao)

Replaced some macros by generics #1084 (jorgecarleitao)

Improved performance of Bitmap::make_mut with offset #1079 (jorgecarleitao)

Implemented Default for PrimitiveArray #1073 (ritchie46)

Expose share counts in Buffer #1072 (ritchie46)

Added compute::arity_assign #1070 (jorgecarleitao)

Improved performance in lexical write (~5%) #1067 (ritchie46)

Added cast to/from Null from/to every type #1066 (jorgecarleitao)

prevent unneeded offset check #1059 (ritchie46)

Documentation updates:

Fixed parquet write example #1193 (rajasekarv)

Improved docs #1164 (jorgecarleitao)

Minor cleanup of internal namings #1160 (jorgecarleitao)

Added example reading Avro produced by Kafka #1151 (jorgecarleitao)

Updated license wording #1138 (jorgecarleitao)

Fixed wrong package name in examples #1133 (Xuanwo)

Improved example #1131 (jorgecarleitao)

Added more tests #1111 (jorgecarleitao)

Improved examples #1109 (jorgecarleitao)

Improved internal docs #1107 (jorgecarleitao)

Added notes about creating parquet files and submodules in the development documentation #1096 (joshuataylor)

Improved docs for BooleanArray #1083 (jorgecarleitao)

Added missing link to guide #1065 (jorgecarleitao)

Improve Docs Readability #1054 (ryanrussell)

Testing updates:

Temporary skip decimal256 integration tests #1198 (jorgecarleitao)

Simplified code #1183 (jorgecarleitao)

Made kafka schema_id u32 in example #1162 (jorgecarleitao)

Added more tests #1158 (jorgecarleitao)

Bumped MIRI #1156 (jorgecarleitao)

Simplified code in flight integration tests #1136 (jorgecarleitao)

Added more tests for nested parquet #1121 (jorgecarleitao)

Added more tests for reading and writing CSV #1120 (jorgecarleitao)

Added test for scalar division #1115 (jorgecarleitao)

Added more tests #1103 (jorgecarleitao)

Enabled more integration tests with pyarrow #1102 (jorgecarleitao)

Simplified Bytes (internal) #1099 (jorgecarleitao)

Updated patch to arrow integration tests #1068 (jorgecarleitao)

Added more tests #1064 (jorgecarleitao)

v0.12.0(Jun 5, 2022)
A new version of arrow2 is now available on crates.io. 🎉🎉🎉

See below for all the great things that were released 🚀. But before that, thank you so much to everyone that contributed to this release: 🙇

@ahmedriza, @dexterduck, @GPSnoopy, @HaoYang670, @SimonSchneider, @TurnOfACard, @aptr322, @arxra, @b41sh, @cjermain, @dbr, @jorgecarleitao, @ritchie46

Breaking changes:

Require one encoding per parquet column on write #1012

Bumped parquet2 #1035 (jorgecarleitao)

Improved performance of deserializing JSON (2x) #1024 (jorgecarleitao)

Remove from_trusted_len_* from Buffer #1020 (jorgecarleitao)

Bumped arrow-format #1011 (jorgecarleitao)

Replace fn Offset::is_large() as const Offset::IS_LARGE #1002 (HaoYang670)

Renamed ArrowError to Error #993 (jorgecarleitao)

New features:

Added support to deserialize MapArray from parquet #1045 (jorgecarleitao)

Added support for random access reads from IPC #1034 (jorgecarleitao)

Added support for custom sort build_compare_fn #1016 (b41sh)

Added support to write nested parquet #1007 (jorgecarleitao)

Added support for deserializing JSON from iterator #989 (cjermain)

Fixed bugs:

Writing of ListArray does not preserve all values #1008

Write a two-dimensional list to parquet file failed #992

Writing to Parquet fails for extension types that contain lists #830

Fixed using lower limit than size of first parquet row group #1046 (arxra)

Fixed error in consuming sliced FixedSizeBinary from C data interface (FFI) #1026 (jorgecarleitao)

Fixed lexsort limit equal or greater than row_count #1021 (b41sh)

Fixed error in reading nested parquet structs #1015 (jorgecarleitao)

Fixed panic on debug print of invalid timezones #1013 (jorgecarleitao)

Treat empty timezone string as no-timezone #1009 (dbr)

Fixed encoding of NaN to json #990 (SimonSchneider)

Fixed error in writing ListArray to parquet #984 (jorgecarleitao)

Fixed decoding Binary Plain pages with dictionary pages #982 (aptr322)

Enhancements:

Added Debug and PartialEq for MapArray #1043 (jorgecarleitao)

Exposed compression levels for parquet #1041 (ritchie46)

Added .arced/.boxed to arrays #1040 (jorgecarleitao)

Added utility to create encodings #1018 (jorgecarleitao)

Made parquet_to_arrow_schema public #1006 (martingallagher)

Sped up min_max_boolean for the case where all values are null #1005 (HaoYang670)

Simplified min_max_string and min_max_binary #1004 (HaoYang670)

Added support for Decimal in build_compare #998 (GPSnoopy)

remove accidental quadratic null_count #991 (ritchie46)

Aligns MutableDictionaryArray's with MutablePrimitiveArrays with TryPush #981 (TurnOfACard)

Documentation updates:

Cleaned docs for BinaryArray #1047 (jorgecarleitao)

Improved API docs for MutableBitmap #1025 (jorgecarleitao)

Improved docs for bitmap #1022 (jorgecarleitao)

Improved API docs for PrimitiveArray and Utf8Array #1017 (jorgecarleitao)

Fixed dev guide #1003 (jorgecarleitao)

Testing updates:

Added more tests #1029 (jorgecarleitao)

Moved coverage reporting to cargo-llvm-cov #1028 (jorgecarleitao)

Added more tests (increase coverage) #1027 (jorgecarleitao)

Moved tests from lib to tests #1001 (jorgecarleitao)

Allowed feature-specific test runs #985 (jorgecarleitao)

v0.11.2(May 5, 2022)
Full Changelog

New features:

Added support to append to existing IPC Arrow file #972 (jorgecarleitao)

Added pop to utf8/binary/fixedSize MutableArray #966 (ygf11)

Added support for union scalars #930 (ncpenke)

Fixed bugs:

Added support to read nested binary from parquet #978 (jorgecarleitao)

Fixed empty reader panic for NDJSON type infer #974 (Roberto-XY)

Prevented stack overflow (SO) in large parquet files #973 (ritchie46)

Fixed API bug in async read of IPC metadata #969 (jorgecarleitao)

Fixed writing required list to parquet #968 (jorgecarleitao)

Enhancements:

Added support to deserialize LargeList and Uint data types from Parquet #979 (b41sh)

Made reading of IPC dictionaries lazy #971 (jorgecarleitao)

Allowed creating IPC FileWriter without writing to the file #970 (jorgecarleitao)

v0.11.0(Apr 27, 2022)
Arrow2 v0.11.0 is out!! 🎉🎉🎉

This release mainly focuses on improving the parquet support introduced in the previous one. In particular, we now have the main ingredients to read indexed parquet pages, which allow skipping the deserialization of individual pages, and from this version onwards parquet files are written with page indexes. There is still some work to do on the frontend API to skip pages via statistics, which is left for the next version.
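To make the idea of skipping pages concrete, here is a minimal sketch in plain Rust. It is not arrow2's or parquet2's API: `PageStats` and `pages_to_read` are hypothetical names, and the sketch assumes a page index that records a minimum and maximum value per page. Given a filter range, only pages whose statistics overlap it need to be deserialized.

```rust
/// Hypothetical per-page statistics, as a page index might record them.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PageStats {
    pub min: i64,
    pub max: i64,
}

/// Returns the indices of pages whose `[min, max]` range overlaps `[lo, hi]`.
/// Pages that are not returned cannot contain a matching value, so a reader
/// could skip them without deserializing them.
pub fn pages_to_read(index: &[PageStats], lo: i64, hi: i64) -> Vec<usize> {
    index
        .iter()
        .enumerate()
        .filter(|(_, page)| page.max >= lo && page.min <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

For pages covering 0-9, 10-19 and 20-29, a filter on the range 12-21 selects only the second and third pages.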

This version also contains multiple bug fixes.

Thanks everyone that contributed to this release (individual PRs below)! 🙇

Changelog

Full Changelog

Breaking changes:

Refactored parquet statistics deserialization #962 (jorgecarleitao)

Made GroupFilter Send + Sync #947 (jorgecarleitao)

New features:

Added support for non-ordered projections to IPC reading #961 (jorgecarleitao)

Added support for reading indexed parquet pages #923 (jorgecarleitao)

Fixed bugs:

Parquet regression: exceptions.ArrowErrorException: NotYetImplemented("Can't read Dictionary(UInt32, LargeUtf8, false) from parquet") #955

Reading Parquet binary column panics during deserialization 'attempt to subtract with overflow' #944

Reading Parquet file written by pyarrow with lz4 compression fails with OutOfSpec("Thrift out of range") #940

Issues when trying to create a parquet file with FixedSizedListArray #691

Fixed bug in writing csv with buffer resizing #965 (ritchie46)

Fixed bug in reading binary parquet #945 (jorgecarleitao)

Fixed error in writing fixedSizeListArray to parquet #941 (jorgecarleitao)

Fixed support to read dict nested binary parquet #924 (jorgecarleitao)

Enhancements:

Reduced memory usage in reading parquet #964 (jorgecarleitao)

Simpler IPC code #939 (jorgecarleitao)

don't allocate string when writing to csv #935 (ritchie46)

Removed un-needed generic parameter #927 (jorgecarleitao)

update to odbc-api 0.36.0 #925 (pacman82)

Documentation updates:

Fixed example of parallel read via rayon #958 (jorgecarleitao)

Fixed guide deployment #931 (jorgecarleitao)

Typo fix #919 (bkmgit)

Testing updates:

Fixed patch of integration tests #960 (jorgecarleitao)

Added test for MapArray #942 (jorgecarleitao)

Fixed wrong clippy warning #938 (jorgecarleitao)

v0.10.0(Mar 12, 2022)
Arrow2 0.10.0 is out! 🚀🚀🚀🚀🚀

Continuing to break ground, this constitutes one of the most feature-rich releases of this crate so far!

Thank you to everyone for the impressive work over the past 2.5 months that makes arrow2 so feature rich, safe, fast, and easy to use! 🙇

Here are the main headlines:

Copy on Write

So far, whenever we applied a transformation to an array, we had to create a new array. When multiple operations were used (e.g. c1 x 2 + 1), it led to the following compute pattern:

1. allocate new region

2. compute

3. allocate new region

4. compute

This was identified by @sundy-li on #741 and addressed by @ritchie46 on #794.

Users can now re-use Arced arrays, with semantics similar to std::sync::Arc::get_mut. As expected, if the array is being used in multiple places, it returns None and users still need to allocate a new region (mutation requires exclusive ownership).

This is being used in Polars to further re-use allocated regions and therefore reduce both memory pressure and wasted compute cycles allocating new regions.
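The pattern can be sketched with the standard library alone. The function below is a hypothetical illustration, not arrow2's API: it uses a plain `Arc<Vec<i32>>` in place of an array and applies the `c1 x 2 + 1` example, mutating in place when `Arc::get_mut` succeeds and allocating a new region only when the data is shared.

```rust
use std::sync::Arc;

/// Computes `v * 2 + 1` for every value (the `c1 x 2 + 1` example).
/// Reuses the existing allocation when this `Arc` is the only owner;
/// otherwise allocates a new region and leaves the shared data untouched.
pub fn times_two_plus_one(mut values: Arc<Vec<i32>>) -> Arc<Vec<i32>> {
    if let Some(inner) = Arc::get_mut(&mut values) {
        // Exclusive owner: mutate in place, no new allocation.
        inner.iter_mut().for_each(|v| *v = *v * 2 + 1);
        return values;
    }
    // Shared: `get_mut` returned `None`, so fall back to a fresh allocation.
    Arc::new(values.iter().map(|v| v * 2 + 1).collect())
}
```

When the caller holds the only reference, the returned `Arc` points at the same heap buffer as the input. When another clone of the `Arc` exists, that clone keeps seeing the original values.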

Support for ODBC

This release now supports reading from, and writing to, any ODBC driver.

This builds on top of the superb odbc-api created by @pacman82, which allows this crate to use the columnar format provided by the ODBC specification.

Given a performant ODBC driver, this is expected to be the fastest way to load data to the Arrow format, as many operations are simple memcopies.

Check out the example and guide for details on how to use it!

async support for writing to Arrow's IPC

Until now, we had limited support for writing to Arrow IPC asynchronously. @dexterduck closed this gap on #878, offering complete async support for both Arrow files and Arrow streams, including implementations of futures::Stream and futures::Sink for them!

Migrated to std::simd

After some back and forth with the working group of the portable simd project, this release replaces packed_simd2 by std::simd. This resulted in no performance difference but allows us to leverage the great work that is happening on std::simd.

Support to Serde metadata

A common pain point in using arrow2's logical types is that they are quite rich, making them sometimes difficult to visualize or represent in e.g. JSON. @houqp closed this with #858, which adds compatibility with Serde for schema-related structs in this crate (PhysicalType, DataType, Field, Schema).

Support for Arrow C stream interface

Arrow has an experimental specification for an FFI to iterators of arrow arrays. This release now fully supports this interface.

Made crate deny(missing_docs)

This makes us developers more conscious about documenting APIs, thereby giving users more context about them. We have also started documenting whether IO-related APIs are CPU-bounded or IO-bounded, so that users know which ones block async contexts.

Changelog

Full Changelog

Breaking changes:

Renamed Ffi_ArrowArray and Ffi_ArrowSchema #859

Improved performance and stability of writing to CSV #866 (ritchie46)

Simplified API for writing to JSON #864 (jorgecarleitao)

Simplified API to import from FFI #854 (jorgecarleitao)

Simplified compute (lower/upper) #847 (jorgecarleitao)

Simplified inferring arrow schema from a parquet schema #819 (jorgecarleitao)

Bumped parquet and aligned API to fit into it #795 (jorgecarleitao)

New features:

Added GrowableUnion #902 (jorgecarleitao)

Added cast to months_days_ns #900 (jorgecarleitao)

Added support for hash of month_day_ns arrays #899 (jorgecarleitao)

IPC sink types and IPC file stream #878 (dexterduck)

implemented futures::Sink for parquet async writer #877 (dexterduck)

Added try_new and new to all arrays #873 (jorgecarleitao)

Added support for datatypes serde #858 (houqp)

Added support to the Arrow C stream interface (read and write) #857 (jorgecarleitao)

Support to read/write from/to ODBC #849 (jorgecarleitao)

Added operators that include validities in comparisons #846 (ritchie46)

Added support to read and write Decimal128 to Avro #837 (potter420)

Added support to read Arrow streams asynchronously #832 (jorgecarleitao)

Added support to write LargeUtf8 and LargeBinary to Avro #828 (illumination-k)

Added support for pushdown projection in reading Avro #827 (jorgecarleitao)

Added support to read Avro's structs #826 (jorgecarleitao)

Added support to write largeUtf8/Binary to Avro #825 (jorgecarleitao)

Added json serialization of timestamp/date32/date64 #814 (ritchie46)

Added BooleanArray::from_trusted_len_values_iter_unchecked #799 (ritchie46)

Added MutableUtf8Array::extend_values #798 (ritchie46)

Added COW semantics to Buffer, Bitmap and some arrays #794 (ritchie46)

Added support to read parquet row groups in chunks #789 (jorgecarleitao)

Added scalar bitwise ops #788 (jorgecarleitao)

Migrated to portable simd #747 (jorgecarleitao)

Fixed bugs:

Fixed edge case in reading multiple parquet pages #904 (jorgecarleitao)

Bug fix in offset for sliced unions #891 (ncpenke)

Fix edge case in reading nested parquet #884 (jorgecarleitao)

Fixed unsoundness of #[derive(Clone)] for FFI structs #882 (jorgecarleitao)

Fixed json writing of dates and datetimes #867 (jorgecarleitao)

Fixed reading parquet with timezone #862 (jorgecarleitao)

Fixed error in writing compressed IPC arrow #855 (jorgecarleitao)

Fixed wrong null_count when slicing a sliced Bitmap #848 (satlank)

Fixed error in writing compressed IPC files #840 (jorgecarleitao)

Fixed float to i128 cast #817 (houqp)

fix unescaped '"' in json writing #812 (ritchie46)

Fixed reading parquet binary dict page #791 (danburkert)

Enhancements:

Add FixedSizeBinaryScalar #782

Use more idiomatic versions #898 (jorgecarleitao)

Added support for min/max for decimal #897 (jorgecarleitao)

Made FixedSizeList::try_push_valid public and added new_with_field #887 (ncpenke)

Added MutableFixedList::mut_values #886 (jorgecarleitao)

Made IPC IO use try_new #879 (jorgecarleitao)

expose ListValuesIter #874 (ritchie46)

Bumped crc #856 (jorgecarleitao)

DRY parquet reading #845 (jorgecarleitao)

Refactored (internal) fmt #842 (jorgecarleitao)

Bumped zstd #841 (jorgecarleitao)

inline push #835 (ritchie46)

Increased API consistency for COW and respective docs #833 (jorgecarleitao)

Improved flexibility of reading parquet #820 (jorgecarleitao)

Small improvement to deserializing fixed-len parquet statistics. #818 (jorgecarleitao)

Added support for other timestamp units from parquet #803 (jorgecarleitao)

Added more into_mut implementations #801 (ritchie46)

Added FixedSizeListScalar and FixedSizeBinaryScalar #786 (illumination-k)

DRY parquet module #785 (jorgecarleitao)

Documentation updates:

Improved documentation #860 (jorgecarleitao)

Made crate deny(missing_docs) #808 (jorgecarleitao)

Fixed doc for Bitmap::set_bit #802 (yjshen)

Fixed dyn Array::slice docstring #792 (ritchie46)

Testing updates:

Simpler code (DRY) #901 (jorgecarleitao)

Fixed integration test #885 (jorgecarleitao)

Simplified code to generate parquet files for tests #883 (jorgecarleitao)

Removed un-needed unsafe #843 (jorgecarleitao)

Added more tests #810 (jorgecarleitao)

Reduced code duplication #805 (jorgecarleitao)

upgrade to clap 3.0 #797 (Jimexist)

Simplified avro reading and added more tests #737 (jorgecarleitao)

v0.9.0(Jan 14, 2022)
A new release is here! 🎉🎉🎉🎉 This release has four major improvements:

It is now backed by std's Vec, thus making it

zero-copy with the rest of Rust's ecosystem

use less unsafe

more ergonomic

faster to compile

(no difference in performance)

It now supports reading from, and writing to, Apache Avro, both sync and async

flatbuffers dependency was replaced by planus, a re-implementation of the flatbuffers specification in Rust (you should check out that project, awesome work by @kristoff3r and @TethysSvensson)

lower risk of unsoundness

easier-to-maintain code base

Improved security and general maintenance:

Made most of the crate #[forbid(unsafe_code)]

significantly reduced the use of unsafe via a dependency on bytemuck

made most of the parsing of Arrow IPC panic-free, to reduce the risk of DoS from untrusted data
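The panic-free parsing mentioned in the last point can be illustrated with a small sketch in plain Rust. It is not arrow2's IPC reader and does not follow the IPC format: `ReadError` and `read_length_prefixed` are hypothetical names, and the input is assumed to be a little-endian u32 length followed by that many bytes. The point is that a length taken from untrusted data is only ever used through checked operations, so malformed input produces an error rather than a panic.

```rust
/// Errors returned instead of panicking on malformed input (hypothetical).
#[derive(Debug, PartialEq)]
pub enum ReadError {
    UnexpectedEof,
    LengthOutOfBounds,
}

/// Reads a little-endian u32 length prefix followed by that many bytes.
/// The untrusted length is only used via `checked_add` and `get`, so
/// truncated or malicious input yields an `Err` instead of a panic.
pub fn read_length_prefixed(data: &[u8]) -> Result<&[u8], ReadError> {
    // `get` returns `None` instead of panicking when fewer than 4 bytes exist.
    let prefix = data.get(..4).ok_or(ReadError::UnexpectedEof)?;
    // `prefix` has exactly 4 elements here, so these indexes cannot fail.
    let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
    // Guard against overflow on targets where `usize` is 32 bits.
    let end = 4usize.checked_add(len).ok_or(ReadError::LengthOutOfBounds)?;
    data.get(4..end).ok_or(ReadError::LengthOutOfBounds)
}
```

A buffer that declares a payload of 2 bytes and contains at least that many returns the payload; a buffer shorter than the prefix, or one that declares more bytes than it holds, returns an error.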

A big thanks to all contributors (listed below) and our users for all the dedication, hard work, and patience. 🙇

Breaking changes:

Added number of rows read in CSV inference #765 (jorgecarleitao)

Refactored nullif #753 (jorgecarleitao)

Migrated to latest parquet2 #752 (jorgecarleitao)

Replace flatbuffers dependency by Planus #732 (jorgecarleitao)

Simplified Schema and Field #728 (jorgecarleitao)

Replaced RecordBatch by Chunk #717 (jorgecarleitao)

Removed Option from fields' metadata #715 (jorgecarleitao)

Moved dict_id to IPC-specific IO #713 (jorgecarleitao)

Moved is_ordered from Field to DataType::Dictionary #711 (jorgecarleitao)

Refactored JSON writing (5-10x) #709 (jorgecarleitao)

Made Avro read API use Block and CompressedBlock #698 (jorgecarleitao)

Simplified most traits #696 (jorgecarleitao)

Replaced Display by Debug for Array #694 (jorgecarleitao)

Replaced MutableBuffer by std::Vec #693 (jorgecarleitao)

Simplified Utf8Scalar and BinaryScalar #660 (jorgecarleitao)

Simplified Primitive and Boolean scalar #648 (jorgecarleitao)

New features:

Add and_scalar and or_scalar for boolean_kleene #662

Add lower and upper support for string #635

Added support to cast decimal #761 (jorgecarleitao)

Added support to deserialize JSON (!= NDJSON) #758 (jorgecarleitao)

Added support to infer nested json structs #750 (jorgecarleitao)

Added support to compare intervals #746 (jorgecarleitao)

Added any and all kernel #739 (ritchie46)

Added support to write Avro async #736 (jorgecarleitao)

Added support to write interval to Avro #734 (jorgecarleitao)

Added and_scalar and or_scalar for boolean kleene #723 (silathdiir)

Added and_scalar and or_scalar for boolean #707 (silathdiir)

Refactored JSON read to split IO-bounded from CPU-bounded tasks #706 (jorgecarleitao)

Added more conversions from parquet #701 (jorgecarleitao)

Added support for compressed Avro write #699 (jorgecarleitao)

Added support to write to Avro #690 (jorgecarleitao)

Added dynamic version of negation #685 (jorgecarleitao)

Added support to read dictionary-encoded required parquet pages #683 (mdrach)

Added upper #664 (Xuanwo)

Added lower #641 (Xuanwo)

Added support for async read of Avro #620 (jorgecarleitao)

Fixed bugs:

Pyarrow and Arrow2 don't agree on Timestamp resolution #700

Writing compressed dictionary in parquet corrupts the files #667

Replaced assert by error in IPC read #748 (jorgecarleitao)

Made all panics in IPC read errors #722 (jorgecarleitao)

Fixed error in compare booleans #721 (jorgecarleitao)

Fixed error in dispatching scalar arithmetics #682 (jorgecarleitao)

Fixed error in reading negative decimals from parquet #679 (mdrach)

Made IPC reader less restrictive #678 (jorgecarleitao)

Fixed error in trait constraint in compute #665 (jorgecarleitao)

Fixed performance regression of CSV reading #657 (jorgecarleitao)

Fixed filter of predicate with validity #653 (ritchie46)

Made Scalar: Send+Sync #644 (jorgecarleitao)

Enhancements:

Feature: JSON IO? #712

Simplified code #760 (jorgecarleitao)

Added iterator of values of FixedBinaryArray #757 (jorgecarleitao)

Remove un-needed unsafe #756 (jorgecarleitao)

Replaced un-needed unsafe #755 (jorgecarleitao)

Made IO #[forbid(unsafe)] #749 (jorgecarleitao)

Improved reading nullable Avro arrays #727 (Igosuki)

Allow to create primitive array by vec without extra memcopy #710 (sundy-li)

Removed requirement of use Array to access primitives' data_type #697 (jorgecarleitao)

Cleaned up trait usage and added forbid_unsafe to parts #695 (jorgecarleitao)

Migrated from avro-rs to avro-schema #692 (jorgecarleitao)

Added MutablePrimitiveArray::extend_constant #689 (jorgecarleitao)

Do not write validity without nulls in IPC #688 (jorgecarleitao)

DRY code via macro #681 (jorgecarleitao)

Made dyn Array and Scalar usable in #[derive(PartialEq)] #680 (jorgecarleitao)

Made IPC ZSTD-compressed consumable by pyarrow #675 (jorgecarleitao)

Simplified trait bounds in arithmetics #671 (jorgecarleitao)

Improved performance of reading utf8 required from parquet (-15%) #670 (jorgecarleitao)

Avoid double utf8 checks on MutableUtf8 -> Utf8 #655 (jorgecarleitao)

Made Buffer::offset public #652 (ritchie46)

Improved performance in cast Primitive to Binary/String (2x) #646 (sundy-li)

Made Filter: Send+Sync #645 (jorgecarleitao)

Made API to create field accept String #643 (jorgecarleitao)

Documentation updates:

Fixed clippy (coming from 1.58) #763 (jorgecarleitao)

Described how to run part of the tests #762 (jorgecarleitao)

Improved README #735 (jorgecarleitao)

clarify boolean value in DataType::Dictionary #718 (ritchie46)

readme typo #687 (max-sixty)

Added example to read parquet in parallel with rayon #658 (jorgecarleitao)

Added documentation to Bitmap::as_slice #654 (ritchie46)

Testing updates:

Improved json tests #742 (jorgecarleitao)

Added integration tests for writing compressed parquet #740 (jorgecarleitao)

Updated patch for integration test #731 (jorgecarleitao)

Added cargo check to benchmarks #730 (sundy-li)

More tests to CSV writing #724 (jorgecarleitao)

Added integration tests for other compressions with parquet from pyarrow #674 (jorgecarleitao)

Bumped nightly in CI #672 (jorgecarleitao)

Invalidate caches from CI. #656 (jorgecarleitao)

v0.8.0(Nov 27, 2021)
A new release is here 🚀🚀🚀

This release has so many important new features and bug fixes that it is best summarized as: thank you, everyone, for all the issues and PRs that resulted in this release (in order of appearance) 🙇🙇🙇🙇:

yjshen

sundy-li

simonvandel

illumination-k

Dandandan

flaneur2020

1aguna

ritchie46

yjhmelody

Full Changelog

Breaking changes:

Made CSV write options use chrono formatting by default #624

Add compression to IpcWriteOptions #570

Made cast accept CastOptions parameter #569

Simplified ArrowError #640 (jorgecarleitao)

Use DynComparator for lexsort and partition #637 (yjshen)

Split "compute" feature #634 (jorgecarleitao)

Removed unneeded trait. #628 (jorgecarleitao)

Sealed 2 traits to forbid downstream implementations #621 (jorgecarleitao)

Simplified arithmetics compute #607 (jorgecarleitao)

Refactored comparison Operator #604 (jorgecarleitao)

Simplified dictionary indexes #584 (jorgecarleitao)

Simplified IPC APIs #576 (jorgecarleitao)

Simplified IPC stream writer / remove finish on drop from stream writer #575 (jorgecarleitao)

Simplified trait in compute. #572 (jorgecarleitao)

Compute: add partial option into CastOptions #561 (sundy-li)

Introduced UnionMode enum #557 (simonvandel)

Changed DataType::FixedSize*(i32) to DataType::FixedSize*(usize) #556 (simonvandel)

New features:

Added support to write timestamps with timezones for CSV #623 (jorgecarleitao)

Added support to read Avro files' metadata asynchronously #614 (jorgecarleitao)

Added iterator for StructArray #613 (illumination-k)

Added support to read snappy-compressed Avro #612 (jorgecarleitao)

Added support to read decimal from csv #602 (jorgecarleitao)

Added support to cast NullArray to all other types #589 (flaneur2020)

Added support for dictionaries in nested types over IPC #587 (jorgecarleitao)

Added support to write Arrow IPC streams asynchronously #577 (jorgecarleitao)

Added support to write compressed Arrow IPC (feather v2) #566 (jorgecarleitao)

Added support for ffi for FixedSizeList and FixedSizeBinary #565 (jorgecarleitao)

Added support for async csv reading. #562 (jorgecarleitao)

Added support for bitwise operations #553 (1aguna)

Added support to read StructArray from parquet #547 (jorgecarleitao)

Fixed bugs:

Fixed error in reading nullable from Avro. #631 (jorgecarleitao)

Fixed error in union FFI #625 (jorgecarleitao)

Fixed error in computing projection in io::ipc::read::reader::FileReader #596 (illumination-k)

Fixed error in compressing IPC LZ4 #593 (jorgecarleitao)

Fixed growable of dictionaries negative keys #582 (ritchie46)

Made substring kernel on utf8 take chars into account. #568 (ritchie46)

Fixed error in passing sliced arrays via FFI #564 (jorgecarleitao)

Enhancements:

Faster take with null values (2-3x) #633 (jorgecarleitao)

Improved error message for missing feature in compressed parquet #632 (jorgecarleitao)

Added to conversion to FixedSizeBinary #622 (ritchie46)

Bumped comfy-table #618 (jorgecarleitao)

Made MutableArray Send + Sync #617 (jorgecarleitao)

Removed most of allocations in IPC reading #611 (jorgecarleitao)

Speed up boolean comparison kernels (~3x) #610 (Dandandan)

Improved performance of decimal arithmetics #605 (jorgecarleitao)

Simplified traits and added documentation #603 (jorgecarleitao)

Improved performance of is_not_null. #600 (jorgecarleitao)

Added len to every array #599 (jorgecarleitao)

Added support for NullArray at FFI. #598 (jorgecarleitao)

Optimized MutableBinaryArray #597 (jorgecarleitao)

Speedup/simplify bitwise operations (avoid extra allocation) #586 (Dandandan)

Improved performance of bitmap::from_trusted (3x) #578 (jorgecarleitao)

Made bitmap not cache null count #563 (jorgecarleitao)

Avoided redundant checks in creating an Utf8Array from MutableUtf8Array #560 (jorgecarleitao)

Avoid unnecessary allocations #559 (simonvandel)

Surfaced errors in reading from avro #558 (jorgecarleitao)

Documentation updates:

Simplified example #619 (jorgecarleitao)

Made example of parallel parquet write be over multiple batches #544 (jorgecarleitao)

Testing updates:

Cleaned up benches #636 (jorgecarleitao)

Ignored tests code in coverage report #615 (yjhmelody)

Added more tests #601 (jorgecarleitao)

Mitigated RUSTSEC-2020-0159 #595 (jorgecarleitao)

Added more tests #591 (jorgecarleitao)

v0.7.0(Oct 29, 2021)
Another release is here 🚀🚀🚀

As usual, a bunch of optimizations as well as some work on two main fronts:

make the crate smaller and easier to compile

support for nested parquet reads

Thank you to all contributors (names below) for the amazing contributions!

Breaking changes:

Simplified reading parquet #532 (jorgecarleitao)

Change IPC FileReader to own the underlying reader #518 (blakesmith)

Migrate to arrow_format crate #517 (jorgecarleitao)

New features:

Added read of 2-level nested lists from parquet #548 (jorgecarleitao)

add dictionary serialization for csv-writer #515 (ritchie46)

Added checked_negate and wrapping_negate for PrimitiveArray #506 (yjhmelody)

Fixed bugs:

Fixed error in reading fixed len binary from parquet #549 (jorgecarleitao)

Fixed ffi of sliced arrays #540 (jorgecarleitao)

Fixed s3 example #536 (jorgecarleitao)

Fixed error in writing compressed parquet dict pages #523 (jorgecarleitao)

Validity taken into account when writing StructArray to json #511 (VasanthakumarV)

Enhancements:

Bumped Prost and Tonic #550 (PsiACE)

Speedup scalar boolean operations #546 (Dandandan)

Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) #542 (Dandandan)

Exposed missing APIs to write parquet in parallel #539 (jorgecarleitao)

improve utf8 init validity #530 (ritchie46)

export missing BinaryValueIter #526 (yjhmelody)

Documentation updates:

Added more IPC documentation #534 (HagaiHargil)

Fixed clippy and fmt #521 (ritchie46)

Testing updates:

Added more tests for utf8 #543 (jorgecarleitao)

Ignored RUSTSEC-2020-0071 and RUSTSEC-2020-0159 #537 (jorgecarleitao)

Improved parquet read benches #533 (jorgecarleitao)

Added fmt and clippy checks to CI. #522 (xudong963)

v0.6.2(Oct 9, 2021)
Small release with two minor but relevant bug fixes and a new feature.

Full Changelog

New features:

Added wrapping version arithmetics for PrimitiveArray #496 (yjhmelody)
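
For readers unfamiliar with the term, the difference between wrapping and checked arithmetic on nullable values can be sketched in plain std Rust. This is only an illustration of the semantics, not arrow2's API: the functions `wrapping_add` and `checked_add` below are hypothetical helpers, and `Option<i8>` stands in for a nullable slot of a primitive array.

```rust
/// Element-wise wrapping addition over two nullable `i8` columns.
/// `None` plays the role of a null slot; this is plain std Rust, not arrow2's API.
fn wrapping_add(lhs: &[Option<i8>], rhs: &[Option<i8>]) -> Vec<Option<i8>> {
    assert_eq!(lhs.len(), rhs.len(), "columns must have equal length");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| match (l, r) {
            // Overflow wraps around (two's complement) instead of panicking.
            (Some(l), Some(r)) => Some(l.wrapping_add(*r)),
            // A null on either side yields a null.
            _ => None,
        })
        .collect()
}

/// Checked variant for contrast: an overflowing slot becomes a null.
fn checked_add(lhs: &[Option<i8>], rhs: &[Option<i8>]) -> Vec<Option<i8>> {
    assert_eq!(lhs.len(), rhs.len(), "columns must have equal length");
    lhs.iter()
        .zip(rhs)
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => l.checked_add(*r),
            _ => None,
        })
        .collect()
}
```

With inputs `[Some(127), None, Some(1)]` and `[Some(1), Some(2), Some(2)]`, the wrapping variant keeps the first slot valid (it wraps to `-128`), while the checked variant turns it into a null.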

Fixed bugs:

Do not check offsets or utf8 validity in ffi (#505) #510 (NilsBarlaug)

Made try_push_valid public again #509 (ritchie46)

Enhancements:

Use static-typed equal functions directly #507 (yjhmelody)

v0.6.1(Oct 7, 2021)
(in crates as 0.6.1: I made a mistake in publishing). Anyways, another big release is here!

There are just too many improvements for a 22-day release - let's try to capture the important mentions:

Buffer and MutableBuffer are now compatible with Rust's std::Vec with no strings attached: everything continues to work, including FFI with the rest of the ecosystem! You can recover the previous behavior (cache-aligned allocations) via the feature cache_aligned.

Added broad support to timestamp with timezones. Kudos to @VasanthakumarV for all the help.

Added read Decimal from parquet. Kudos to @potter420 for the contribution.

More improvements to performance. Kudos to @Dandandan and @ritchie46.

Support for reading Avro via feature io_avro

Full Changelog

Breaking changes:

Bring MutableFixedSizeListArray to the spec used by the rest of the Mutable API #475

Removed ALIGNMENT invariant from [Mutable]Buffer #449

Un-nested compute::arithmetics::basic #461 (jorgecarleitao)

Added more serialization options for csv writer. #453 (ritchie46)

Changed validity from &Option<Bitmap> to Option<&Bitmap>. #431 (jorgecarleitao)

Bumped parquet2 #422 (jorgecarleitao)

Changed IPC FileWriter to own the writer. #420 (yjshen)

Made DynComparator Send+Sync #414 (yjshen)

New features:

Read Decimal from Parquet File #444

Add IO read for Avro #401

Added support to read Avro logical types, List, Enum, Duration and Fixed. #493 (jorgecarleitao)

Added read Decimal from parquet #489 (potter420)

Implement BitXor trait for Bitmap #485 (houqp)

Added extend/extend_unchecked for MutableBooleanArray #478 (VasanthakumarV)

expose shrink_to_fit to mutable arrays #467 (ritchie46)

Added support for DataType::Map and MapArray #464 (jorgecarleitao)

Extract parts of datetime #433 (VasanthakumarV)

Added support to add an interval to a timestamp #417 (jorgecarleitao)

Added support to read Avro. #406 (jorgecarleitao)

Replaced own allocator by std::Vec. #385 (jorgecarleitao)

Fixed bugs:

crash in parquet read #459

Made writing stream to parquet require a non-static lifetime #471 (GrandChaman)

Made importing from FFI unsafe #458 (jorgecarleitao)

Fixed panic in division using nulls. #438 (jorgecarleitao)

Fixed error writing dictionary extension to IPC #397 (jorgecarleitao)

Fixed error in extending MutableBitmap #393 (jorgecarleitao)

Enhancements:

Some compare functions are not exported #349

Investigate how to add support for timezones in timestamp #23

Made hash work for extension type #487 (jorgecarleitao)

Added extend/extend_unchecked for MutableBinaryArray #486 (VasanthakumarV)

Improved inference and deserialization of CSV #483 (jorgecarleitao)

Added GrowableFixedSizeList and improved MutableFixedSizeListArray #470 (jorgecarleitao)

Added MutableBitmap::shrink_to_fit #468 (jorgecarleitao)

Added MutableArray::as_box #450 (sd2k)

Improved performance of sum aggregation via aligned loads (-10%) #445 (ritchie46)

Removed assert from MutableBuffer::set_len #443 (ritchie46)

Optimized null_count #442 (ritchie46)

Improved performance of list iterator (- 10-20%) #441 (ritchie46)

Improved performance of PrimitiveGrowable for nulls (-10%) #434 (jorgecarleitao)

Allowed accessing validity without importing Array #432 (jorgecarleitao)

Optimize hashing using ahash and multiversion (-30%) #428 (Dandandan)

Improved performance of iterator of Utf8Array and BinaryArray (3-4x) #427 (jorgecarleitao)

Improved performance of utf8 validation of large strings via simdutf8 (-40%) #426 (Dandandan)

Added reading of parquet required dictionary-encoded binary. #419 (jorgecarleitao)

Add extend/extend_unchecked for MutableUtf8Array #413 (VasanthakumarV)

Added support to extract hours and years from timestamps with timezone #412 (jorgecarleitao)

Added io_csv_read and io_csv_write feature #408 (ritchie46)

Improve comparison docs and re-export the array-comparing function #404 (HagaiHargil)

Added support to read dict-encoded required primitive types from parquet #402 (Dandandan)

Added Array::with_validity #399 (ritchie46)

Documentation updates:

Improved documentation #491 (jorgecarleitao)

Added more API docs. #479 (jorgecarleitao)

Added more documentation #476 (jorgecarleitao)

Improved documentation #462 (jorgecarleitao)

Added example showing parallel writes to parquet (x num_cores) #436 (jorgecarleitao)

Improved documentation #430 (jorgecarleitao)

[0.5] The docs io module has no submodules #390

Made docs be compiled with feature full #391 (jorgecarleitao)

Testing updates:

DRY via macro. #477 (jorgecarleitao)

DRY of type check and len check code in compute #474 (yjhmelody)

Added property testing #460 (jorgecarleitao)

Added fmt to CI. #455 (jorgecarleitao)

Simplified CI #452 (jorgecarleitao)

fix filter kernels bench #440 (ritchie46)

Reduced number of combinations in feature tests. #429 (jorgecarleitao)

Move tests from src/compute/ to tests/ #423 (VasanthakumarV)

Skipped some feature permutations. #411 (jorgecarleitao)

Added tests to some invariants of unsafe #403 (jorgecarleitao)

Added support to read and write extension types to and from parquet #396 (jorgecarleitao)

Fix testing of SIMD #394 (jorgecarleitao)

v0.5.3(Sep 14, 2021)
A new release is here, containing bug fixes and backward-compatible enhancements.

Thank you to all involved in the testing and development that resulted in this version!

Full Changelog

New features:

Added support to read and write extension types to and from parquet #396 (jorgecarleitao)

Fixed bugs:

Fixed error writing dictionary extension to IPC #397 (jorgecarleitao)

Fixed error in extending MutableBitmap #393 (jorgecarleitao)

Enhancements:

Added support to read dict-encoded required primitive types from parquet #402 (Dandandan)

Added Array::with_validity #399 (ritchie46)

Testing updates:

Fix testing of SIMD #394 (jorgecarleitao)

v0.5.2(Sep 9, 2021)
Hot fix release to make the API docs contain all optional features.

Full Changelog

Documentation updates:

[0.5] The docs io module has no submodules #390

Made docs be compiled with feature full #391 (jorgecarleitao)

v0.5.0(Sep 8, 2021)
A new release is here! 🎉🎉🎉

This one is marked by further alignment with the arrow specification. Of special mention:

✅ Added full support for async parquet write (by @GrandChaman)

✅ Added fast extend_*values to MutablePrimitiveArray (by @ritchie46)

✅ Added support for compute to BinaryArray (by @zhyass)

✅ Added support to extension types (IPC, FFI, etc.) (by @jorgecarleitao)

✅ Added support for the brand new MONTH_DAY_NANO interval type (by @jorgecarleitao)

🚀 Improved performance of the calculation of null counts by 5x (by @jorgecarleitao)

🔧 Made cargo features not default (by @jorgecarleitao)

As usual, there is a small number of backward incompatible changes. See associated issues below, which include the migration paths to each of them.

Full Changelog

Breaking changes:

Added Extension to DataType #361

MonthDayNano added to enum IntervalUnit #360

Make io::parquet::write::write_* return size of file in bytes #354

Renamed bitmap::utils::null_count to bitmap::utils::count_zeros #342

Made GroupFilter optional in parquet's RecordReader and added method to set it. #386 (jorgecarleitao)

Removed PartialOrd and Ord of all enums in datatypes #379 (jorgecarleitao)

Made cargo features not default #369 (jorgecarleitao)

Prepare APIs for extension types #357 (jorgecarleitao)

New features:

Added support for async parquet write #372 (GrandChaman)

Add support to extension types in FFI #363 (jorgecarleitao)

Added support for field's metadata via FFI #362 (jorgecarleitao)

Added support for Extension (logical) type #359 (jorgecarleitao)

Added support for compute to BinaryArray #346 (zhyass)

Added support for reading binary from CSV #337 (jorgecarleitao)

Added support for MONTH_DAY_NANO interval type #268 (jorgecarleitao)

Fixed bugs:

Parquet read skips a few rows at the end of the page #373

parquet_read fails when a column has too many rows with string values #366

parquet_read panics with index_out_of_bounds #351

Fixed error in MutableBitmap::push_unchecked #384 (jorgecarleitao)

Fixed display of timestamp with tz. #375 (jorgecarleitao)

Enhancements:

Added extend_*values to MutablePrimitiveArray #383 (ritchie46)

Improved performance of writing to CSV (20-25%) #382 (jorgecarleitao)

Bumped lexical-core #378 (jorgecarleitao)

Fixed casting of utf8 <> Timestamp with and without timezone #376 (jorgecarleitao)

Added Send+Sync to MutableBuffer #368 (jorgecarleitao)

Improved performance of unary _not_ for aligned bitmaps (3x) #365 (jorgecarleitao)

Reduced dependencies within num #353 (jorgecarleitao)

Bumped to parquet2 v0.4 #352 (jorgecarleitao)

Bumped tonic and prost in flight #344 (PsiACE)

Improved null count calculation (5x) #343 (jorgecarleitao)

Improved perf of deserializing integers from json (30%) #340 (jorgecarleitao)

Simplified code of json schema inference #339 (jorgecarleitao)

Documentation updates:

Moved guide examples to examples/ #387 (jorgecarleitao)

Added more docs. #358 (jorgecarleitao)

Improved API docs. #355 (jorgecarleitao)

Testing updates:

Moved tests to tests/ #389 (jorgecarleitao)

Moved compute tests to tests/ #388 (jorgecarleitao)

Added more tests. #380 (jorgecarleitao)

Pinned nightly in SIMD tests #364 (jorgecarleitao)

Improved benches for take #348 (jorgecarleitao)

Made IPC integration tests run tests that are not run by arrow-rs #278 (jorgecarleitao)

v0.4.0(Aug 24, 2021)
A new release is here! 🎉🎉🎉

This one is marked by a lot of enhancements to existing functionality. Of special mention:

🚀 improved performance of integer division by 4x-10x via strength reduction (@sundy-li and @ritchie46)

🚀 improved performance of concatenating nullable arrays by 4x

🚀 improved performance of comparisons by 2x-14x

🔧 moved most tests to a separate directory

🔧 Increased test coverage to over 80%

🔧 Made multiversion, lexical-core and serde-derive dependencies optional

✅ Added support for UnionArray (including FFI and IPC tests)

✅ Added support for FFI of Field

(full list below)
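
The integer-division speedup mentioned in the highlights relies on the observation that dividing many values by the same scalar lets the expensive part of the division be precomputed once. A minimal sketch of that general idea, in plain std Rust, follows. It is not the code from the release: `FastDivisor` and this `div_scalar` are hypothetical names, the sketch only covers `u32`, and it uses the well-known multiply-and-shift formulation.

```rust
/// Precomputed state for repeatedly dividing `u32` values by the same divisor.
struct FastDivisor {
    divisor: u32,
    /// ceil(2^64 / divisor); unused when divisor == 1.
    magic: u64,
}

impl FastDivisor {
    fn new(divisor: u32) -> Self {
        assert!(divisor != 0, "division by zero");
        // floor((2^64 - 1) / d) + 1 equals ceil(2^64 / d) for every d >= 2.
        // For d == 1 the value does not fit in 64 bits, so `divide` special-cases it.
        let magic = (u64::MAX / divisor as u64).wrapping_add(1);
        Self { divisor, magic }
    }

    /// Computes `n / divisor` with one widening multiplication and a shift.
    fn divide(&self, n: u32) -> u32 {
        if self.divisor == 1 {
            return n;
        }
        ((self.magic as u128 * n as u128) >> 64) as u32
    }
}

/// Divides every element of `values` by `divisor`, paying the setup cost only once.
fn div_scalar(values: &[u32], divisor: u32) -> Vec<u32> {
    let fast = FastDivisor::new(divisor);
    values.iter().map(|&v| fast.divide(v)).collect()
}
```

For example, `div_scalar(&[9, 10, 11], 3)` yields `[3, 3, 3]`, the same as plain `/`, but the per-element work is a multiplication and a shift rather than a hardware division.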

As usual, there is a small number of backward incompatible changes. The associated issues include the migration paths.

Finally, thank you to all contributors and reporters 🙇 In particular, thank you to polars and datafuse teams for the 🐛 reports. They help tremendously 💯

Full Changelog

Breaking changes:

Change dictionary iterator of values from Arrays of one element to Scalars #335

Align FFI API with arrow's C++ API #328

Make *_compare_scalar not return Result #316

Make io::print, get_value_display and get_display not return Result #286

Add MetadataVersion to IPC interfaces #282

Change DataType::Union to enable round trips in IPC #281

Removed clone requirement in StructArray -> RecordBatch #307 (jorgecarleitao)

Fixed error in reading a non-finished IPC stream. #302 (jorgecarleitao)

Generalized ZipIterator to accept a BitmapIter #296 (jorgecarleitao)

New features:

Added API to FFI Field #321 (jorgecarleitao)

Added compare_scalar #317 (jorgecarleitao)

Add UnionArray #283 (jorgecarleitao)

Fixed bugs:

SliceIterator of last bytes is not correct #292

Fixed error in displaying dictionaries with nulls in values #334 (jorgecarleitao)

Fixed error in dict equality #333 (jorgecarleitao)

Fixed small inconsistencies between compute::cast and compute::can_cast #295 (jorgecarleitao)

Removed order implementation for days_ms / Interval(DayTime) #285 (jorgecarleitao)

Enhancements:

Added support for remaining non-nested datatypes #336 (jorgecarleitao)

Made multiversion and lexical-core optional #324 (jorgecarleitao)

Improved performance of utf8 comparison (1.7x-4x) #322 (jorgecarleitao)

Improved performance of boolean comparison (5x-14x) #318 (jorgecarleitao)

Added trait TryPush #314 (jorgecarleitao)

Added cast date32 -> i64 and date64 -> i32 #308 (ritchie46)

Improved performance of comparison with SIMD feature flag (2x-3.5x) #305 (jorgecarleitao)

Added support to read json to BinaryArray #304 (jorgecarleitao)

Improved MutableFixedSizeBinaryArray #303 (jorgecarleitao)

Improved MutablePrimitiveArray and MutableUtf8Array #299 (jorgecarleitao)

Improved MutableBooleanArray #297 (jorgecarleitao)

Improved performance of concatenating non-aligned validities (15x) #291 (jorgecarleitao)

Added support for timestamps with tz and interval to io::print::write #287 (jorgecarleitao)

Improved debug repr of buffers and bitmaps. #284 (jorgecarleitao)

Cleaned up internals of json integration #280 (jorgecarleitao)

Removed serde_derive dependency #279 (jorgecarleitao)

Simplified IPC code. #277 (jorgecarleitao)

Reduced dependencies from comfy-table and enabled wasm on io_print feature. #276 (jorgecarleitao)

Improve performance of rem_scalar/div_scalar for integer types (4x-10x) #275 (ritchie46)

Documentation updates:

Cleaned examples and docs from old API. #330 (jorgecarleitao)

Improved documentation #306 (jorgecarleitao)

Testing updates:

Improved naming of testing workflows #315 (jorgecarleitao)

Added tests to scalar API #300 (jorgecarleitao)

Made CSV and JSON tests not use files. #290 (jorgecarleitao)

Moved tests to integration tests #289 (jorgecarleitao)

Closed issues:

Make parquet_read_record support async #331

Panic due to SIMD comparison #312

Bitmap::mutable line 155 may Panic/segfault #309

IPC's StreamReader may abort due to excessive memory by overflowing a usize variable #301

Improve performance of rem_scalar/div_scalar for integer types (4x-10x) #259

v0.3.0(Aug 11, 2021)
This is the first Rust implementation to support reading parquet on wasm, which opens up a whole new range of possibilities by allowing reading parquet directly on a browser without having to communicate data to a server.

This is also the first time an implementation of arrow in Rust offers native async support to read parquet, enabling consumers to perform ranged queries against blob storage without blocking. Check out an example here

Finally, this implementation now offers a Scalar API that we can build upon to support arrow's more exotic types such as maps and unions.

Thanks a lot to @sundy-li, @PsiACE, @ritchie46, @ghuls and @Dandandan for the contributions and discussions and to @Darksonn for all the patience and help that unblocked me when working on the async parquet support.

Full Changelog

Breaking changes:

Renamed sum to sum_primitive #273

Moved trait Index from array::Index to types::Index #272

Added optional projection to IPC FileReader #271

Added optional page_filter to parquet's RecordReader and get_page_iterator #270

Renamed parquet's CompressionCodec to Compression #269

New features:

Added support for FFI of dictionary-encoded arrays #267 (jorgecarleitao)

Added support for projection pushdown on IPC files #264 (jorgecarleitao)

Added support to read parquet asynchronously #260 (jorgecarleitao)

Added support to filter parquet pages. #256 (jorgecarleitao)

Added wrapping_cast to cast kernels #254 (sundy-li)

Added support to parquet IO on wasm32 #239 (jorgecarleitao)

Added support to round-trip dictionary arrays on parquet #232 (jorgecarleitao)

Added Scalar API #56 (jorgecarleitao)

Fixed bugs:

Fixed error in computing remainder of chunk iterator #262 (jorgecarleitao)

Fixed error in slicing bitmap. #250 (jorgecarleitao)

Enhancements:

Improve the performance in cast kernel using AsPrimitive trait in generic dispatch #252

Poor performance in sort::sort_to_indices with limit option in arrow2 #245

Support loading Feather v2 (IPC) files with more than 1 million tables #231

Migrated to parquet2 v0.3 #265 (jorgecarleitao)

Added more tests to cast and min/max #253 (jorgecarleitao)

Prettytable is unmaintained. Change to comfy-table #251 (PsiACE)

Added IndexRange to remove checks in hot loops #247 (jorgecarleitao)

Make merge_sort_slices MergeSortSlices public #243 (sundy-li)

Documentation updates:

Added example and guide section on compute #242 (jorgecarleitao)

Closed issues:

Allow projection pushdown to IPC files #261

Add support to write dictionary-encoded pages #211

Make IpcWriteOptions easier to find. #120

v0.2.0(Jul 30, 2021)
Full Changelog

Breaking changes:

Simplified new signature of growable API #238 (jorgecarleitao)

Add support to merge sort with a limit #222 (sundy-li)

Generalized sort to accept indices other than i32. #220 (jorgecarleitao)

Added support for limited sort #218 (jorgecarleitao)

New features:

Merge sort support limit option #221

Introduce limit option to sort #215

Added support for take of interval of days_ms #219 (jorgecarleitao)

Added FFI for remaining types #213 (jorgecarleitao)

Fixed bugs:

Filter operations on sliced utf8 arrays are incorrect #233

Fixed error in slicing bitmap. #237 (jorgecarleitao)

Fixed nested FFI. #212 (jorgecarleitao)

Enhancements:

Avoid materialization of indices in filter_record_batch for single arrays #234

Add integration tests for writing to parquet #80

Short-circuited boolean evaluation in GrowableList #228 (ritchie46)

Add extra inlining to speed up take #226 (Dandandan)

Removed un-needed unsafe #225 (jorgecarleitao)

Documentation updates:

Add documentation to guide #96

Add git submodule command to correct the test doc #223 (sundy-li)

Added badges to README #216 (sundy-li)

Clarified differences with arrow crate #210 (alamb)

Clarified differences with arrow crate #209 (alamb)

* This Changelog was automatically generated by github_changelog_generator