transmute-free Rust library to work with the Arrow format

Arrow2: Transmute-free Arrow

This repository contains a Rust library to work with the Arrow format. It is a re-write of the official Arrow crate using transmute-free operations. See FAQ for details.

See the guide.

Design

This repo and crate's primary goal is to offer a safe Rust implementation to interoperate with the Arrow format. As such, it:

  • MUST NOT implement any logical type other than the ones defined in the Arrow specification, schema.fbs.
  • MUST lay out memory according to the Arrow specification.
  • MUST support reading from and writing to the C data interface with zero copy.
  • MUST support reading from, and writing to, the IPC specification, which it MUST verify against golden files available here.

Design documents about each of the parts of this repo are available on their respective READMEs.

Run unit tests

git clone git@github.com:jorgecarleitao/arrow2.git
cd arrow2
cargo test

The test suite is a superset of all integration tests that the original implementation has against golden files from the arrow project. It currently makes no attempt to test the implementation against other implementations in arrow's master; it assumes that arrow's golden files are sufficient to cover the specification. This crate uses both little- and big-endian golden files, as it supports both endiannesses at IPC boundaries.

Features in this crate and not in the original

  • Uses Rust's compiler whenever possible to prove that memory reads are sound
  • MIRI checks on non-IO components (MIRI and file systems are a bit funny atm)
  • IPC supports big endian
  • More predictable JSON reader
  • Generalized parsing of CSV based on logical data types
  • Conditional compilation based on cargo features to reduce dependencies and size
  • Single repository dedicated to Rust
  • Faster IPC reader (different design that avoids an extra copy of all data)

Features in the original not available in this crate

  • Parquet IO
  • some compute kernels
  • SIMD (no plans to support: favor auto-vectorization instead)

Roadmap

  1. parquet IO
  2. bring documentation up to speed
  3. compute kernels
  4. auto-vectorization of bitmap operations

How to develop

This is a normal Rust project. Clone and run tests with cargo test.

FAQ

Why?

The arrow crate uses Buffer, a generic struct to store contiguous memory regions (of bytes). This construct is used to store data from all arrays in the Rust implementation. The simplest example is a buffer containing 1i32, which is represented as &[0u8, 0u8, 0u8, 1u8] or &[1u8, 0u8, 0u8, 0u8] depending on endianness.
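As a concrete illustration in plain std Rust (not the arrow crate's API), the same i32 value maps to different byte sequences depending on endianness:

fn main() {
    let value: i32 = 1;
    // On a little-endian machine the buffer holds [1, 0, 0, 0];
    // on a big-endian machine it holds [0, 0, 0, 1].
    assert_eq!(value.to_le_bytes(), [1u8, 0u8, 0u8, 0u8]);
    assert_eq!(value.to_be_bytes(), [0u8, 0u8, 0u8, 1u8]);
}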

When a user wishes to read from a buffer, e.g. to perform a mathematical operation with its values, it needs to interpret the buffer as the target type. Because Buffer is a contiguous region of bytes with no type information, users must transmute its data into the respective type.

Arrow currently transmutes buffers on almost all operations, and very often does not verify type alignment or length when transmuting to a slice of type &[T].

Just as an example, in v3.0.0 the following code compiles, does not panic, is unsound, and results in undefined behavior:

let buffer = Buffer::from_slice_ref(&[0i32, 2i32]);
let data = ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer], vec![]);
let array = Float64Array::from(Arc::new(data));

println!("{:?}", array.value(1));

Note how this initializes a buffer with bytes from i32, initializes an ArrayData with dynamic type Int64, and then builds a Float64Array from an Arc<ArrayData>. Float64Array's internals will essentially consume the pointer from the buffer, re-interpret it as f64, and offset it by 1.

Still within this example, if we were to use ArrayData's datatype, Int64, to transmute the buffer, we would be creating &[i64] out of a buffer created out of i32.

Any Rust developer acknowledges that this behavior goes very much against Rust's core premise that a safe function's behavior must not be undefined, regardless of its arguments. The obvious observation is that transmute is one of the most unsafe Rust operations, and not allowing the compiler to verify the necessary invariants places a large burden on users and developers.

This simple example indicates a broader problem with the current design, that we now explore in detail.

Root cause analysis

At its core, Arrow's current design is centered around two main structs:

  1. untyped Buffer
  2. untyped ArrayData

1. untyped Buffer

The crate's buffers are untyped, which implies that once created, the type information is lost. Consequently, the compiler has no way of verifying that a certain read can be performed. As such, any read from it requires an alignment and size check at runtime. This is not only detrimental to performance, but also very cumbersome.
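To make this concrete, here is a minimal sketch in plain std Rust (not the arrow crate's code) of the kind of runtime check an untyped read needs before the bytes can be viewed as &[i32]:

/// Reinterprets raw bytes as `&[i32]`, rejecting misaligned or ill-sized input.
fn read_as_i32(bytes: &[u8]) -> Option<&[i32]> {
    // `align_to` splits the slice into an unaligned prefix, an aligned middle,
    // and an unaligned suffix; the view is only usable if both ends are empty.
    let (prefix, values, suffix) = unsafe { bytes.align_to::<i32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(values)
    } else {
        None
    }
}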

For the past 4 months, I have identified and fixed more than 10 instances of unsound code derived from the misuse, within the crate itself, of Buffer. This suggests that downstream dependencies using this API are likely to be even more affected.

2. untyped ArrayData

ArrayData is a struct containing buffers and child data that does not differentiate which type of array it represents at compile time.

Consequently, all buffer reads from ArrayData's buffers are effectively unsafe, as they require certain invariants to hold. These invariants are strictly related to ArrayData::datatype: this enum differentiates how to transmute the ArrayData::buffers. For example, an ArrayData::datatype equal to DataType::UInt32 implies that the buffer should be transmuted to u32.

The challenge with the above struct is that it is not possible to prove that ArrayData's creation is sound at compile time. As the example above showed, there was nothing wrong, during compilation, with passing a buffer of i32 to an ArrayData expecting i64. We could of course check it at runtime, and we should, but doing so defeats the whole purpose of using a type system as powerful as Rust's.
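For illustration only (this is a hypothetical helper, not arrow-rs's actual validation code), such a runtime check would have to compare the buffer's byte length against the width implied by the datatype:

/// Checks that a buffer of `buffer_len` bytes can back `len` values of a type
/// that is `byte_width` bytes wide (e.g. 8 for Int64). Illustration only.
fn validate_buffer(byte_width: usize, len: usize, buffer_len: usize) -> Result<(), String> {
    let required = byte_width
        .checked_mul(len)
        .ok_or_else(|| "length overflow".to_string())?;
    if buffer_len >= required {
        Ok(())
    } else {
        Err(format!("buffer has {} bytes but {} are required", buffer_len, required))
    }
}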

The main consequence of this observation is that the current code has a significant maintenance cost, as we have to rigorously check the types of the buffers we are working with. The example above shows that, even with that rigour, we still fail to identify obvious problems until runtime.

Overall, there are many instances in our code where we expose public APIs marked as safe that are in fact unsafe and lead to undefined behavior if used incorrectly. This goes against the core goals of the Rust language and significantly weakens the Rust Arrow implementation's core premise that the compiler and borrow checker prove many of the memory safety concerns that we may have.

Equally important, the inability of the compiler to prove certain invariants is detrimental to performance. As an example, the implementation of the take kernel in this repo is semantically equivalent to the current master, but 1.3-2x faster.

How?

Contrary to the original implementation, this implementation does not transmute byte buffers based on runtime types, and instead requires all buffers to be typed (in Rust's sense of a generic).

This removes many existing bugs and enables the compiler to prove all type invariants with the only exception of FFI and IPC boundaries.
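In a stripped-down sketch (illustrative only, not arrow2's actual definitions), the element type travels with the buffer, so the compiler rejects mismatched reads:

/// A buffer that knows its element type at compile time.
struct TypedBuffer<T> {
    data: Vec<T>,
}

impl<T> TypedBuffer<T> {
    fn from_vec(data: Vec<T>) -> Self {
        Self { data }
    }

    /// No transmute and no runtime alignment check: the type is already known.
    fn as_slice(&self) -> &[T] {
        &self.data
    }
}

fn main() {
    let buffer = TypedBuffer::from_vec(vec![0i32, 2i32]);
    let values: &[i32] = buffer.as_slice();
    // `let wrong: &[i64] = buffer.as_slice();` would not compile.
    assert_eq!(values, &[0i32, 2]);
}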

This crate also has a different design for arrays' offsets that removes many out-of-bounds reads caused by using byte and slot offsets interchangeably.

This crate's design of primitive types is also more explicit about their logical and physical representation, enabling support for Timestamps with timezones and a safe implementation of the Interval type.

Consequently, this crate is easier to use, develop, maintain, and debug.

Any plans to merge with the Apache Arrow project?

Yes. The primary reason to have this repo and crate is to be able to prototype and mature using a fundamentally different design based on a transmute-free implementation. This requires breaking backward compatibility and losing features, which is impossible to achieve on the Arrow repo.

Furthermore, the arrow project currently has a release mechanism that is unsuitable for this type of work:

  • The Apache Arrow project has a single git repository with all 10+ implementations, ranging from C++, Python, C#, Julia and Rust to execution engines such as Gandiva and DataFusion. A git ref corresponds to all of them, and a commit can be about any or all of them.

The implication is that this work would require a prohibitive number of Jira issues for each PR to the crate, as well as an inhumane number of PRs, reviews, etc.

Another consequence is that it is impossible to release a different design of the arrow crate without breaking every dependency within the project, which makes it difficult to iterate.

  • A release of Apache Arrow consists of a release of all implementations of the arrow format at once, with the same version. It is currently at 3.0.0.

This implies that the crate version is independent of the changelog or its API stability, which violates SemVer. This procedure makes the crate incompatible with Rust's (and many others') ecosystem, which heavily relies on SemVer to constrain software versions.

Secondly, this implies that the arrow crate is versioned as >0.x. This places expectations about API stability that are incompatible with this effort.

Comments
  • Replaced own allocator by `std::Vec`.

    This PR moves the custom allocator to a feature gate cache_aligned, replacing it with std::Vec in the default, full, etc. feature sets.

    This allows users to create Buffer and MutableBuffer directly from a Vec at zero cost (MutableBuffer becomes a thin wrapper of Vec). This opens a whole range of possibilities related to crates that assume that the backing container is a Vec.
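    A hedged sketch of what this enables, assuming `Buffer<T>: From<Vec<T>>` as described in this PR:

    use arrow2::buffer::Buffer;

    fn main() {
        // Assuming the From<Vec<T>> conversion above, the Vec's allocation is
        // reused as-is: no copy and no custom allocator involved.
        let values: Vec<u32> = (0..1_000).collect();
        let buffer: Buffer<u32> = values.into();
        assert_eq!(buffer.len(), 1_000);
    }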

    Note that this is fully compatible with the arrow spec; we just do not follow the recommendation that allocations be 64-byte aligned (which neither this crate nor arrow-rs was following, since we have been using 128-byte alignment on x86_64). I was unable to observe any performance difference between 128-byte, 64-byte, and std's (i.e. aligned with T) allocations.

    This is backward incompatible, see #449 for details. Closes #449.

    feature 
    opened by jorgecarleitao 17
  • Read Decimal from Parquet File

    Hi,

    So far we have the capability to write decimals to parquet. I wonder if we can implement reading decimal values from a parquet file as well.

    Thank you very much.

    good first issue feature 
    opened by potter420 15
  • Added COW semantics to `Buffer`, `Bitmap` and some arrays

    I'm making this PR to have some context for the discussion. I understand this hits the core of all memory operations, so we must ensure it's correct.

    This allows a PrimitiveArray to be converted into a MutablePrimitiveArray at zero copy, allowing mutations, pushes, etc.

    This conversion from PrimitiveArray to MutablePrimitiveArray can only be done if (see the sketch after this list):

    • The Arc<> pointer is not shared. This is checked with Arc::get_mut and the borrow checker
    • The data is allocated in Rust by a Vec and not by an FFI.
    • The data has an offset of 0 (let's keep it simple).
    • Both the validity Bitmap and the values Buffer<T> satisfy the above requirements.
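    A minimal sketch of the uniqueness check in plain std Rust (not this PR's actual code): mutate in place only when the Arc is uniquely owned, otherwise copy.

    use std::sync::Arc;

    fn push_value(values: &mut Arc<Vec<f64>>, v: f64) {
        if let Some(inner) = Arc::get_mut(values) {
            // Sole owner: reuse the existing allocation (the zero-copy path).
            inner.push(v);
        } else {
            // Shared: clone the data so existing readers are unaffected.
            let mut copied = values.as_ref().clone();
            copied.push(v);
            *values = Arc::new(copied);
        }
    }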
    feature 
    opened by ritchie46 14
  • Added support for `Extension`

    This PR adds support for Arrow's "Extension type".

    This is represented by DataType::Extension and ExtensionArray, respectively.

    For now, extension arrays can only be shared via IPC (i.e. not FFI, since metadata is still not supported in FFI).

    This PR adds an example demonstrating how to use it.

    All of this is pending passing integration tests, as usual, as well as more tests covering the feature.

    feature 
    opened by jorgecarleitao 13
  • Simplified reading parquet

    This PR simplifies the code to read parquet, making it a bit more future proof and opening the doors to improve performance in writing by re-using buffers (improvements upstream).

    I do not observe differences in performance (vs main) in the following parquet configurations:

    • single page vs multiple pages
    • compressed vs uncompressed
    • different types

    It generates flamegraphs that imo are quite optimized:

    [Flamegraph screenshot]

    This corresponds to

    cargo flamegraph --features io_parquet,io_parquet_compression \
        --example parquet_read fixtures/pyarrow3/v1/multi/snappy/benches_1048576.parquet \
        1 0
    

    i.e. reading a f64 column from a single row group with 1M rows with a page size of 1Mb (default in pyarrow).

    [Flamegraph screenshot]

    (same but for a utf8 column (column index 2))

    The majority of the time is spent deserializing the data to arrow, which means that the main gains to be had continue to be on that front.

    Backward incompatible

    • The API to read parquet now uses FallibleStreamingIterator instead of StreamingIterator (of Result<Page>). As before, we re-export these APIs in io::parquet::read.
    • The API to write parquet now expects the user to decompress the pages. This is only relevant when not using RowGroupIterator (i.e. when parallelizing). This is now enforced by the type system (DataPage vs CompressedDataPage), so that we do not get it wrong.
    backwards-incompatible 
    opened by jorgecarleitao 12
  • IPC's `StreamReader` may abort due to excessive memory by overflowing a `usize`d variable

    Hi,

    I've been trying to use this crate as means to transfer real time streaming data between pyarrow and a Rust process. However, I've been having hard-to-debug issues with the data readout on the Rust side - basically after a few seconds of stream I always get a message similar to: memory allocation of 18446744071884874941 bytes failed and the process aborts without a traceback (Windows 10).

    To debug it I forked and inserted a few logging statements in stream.rs, which revealed this line as the culprit. Basically, at some point during my application meta_len becomes negative, and when it's cast to a usize it obviously wraps around and causes the memory overflow. Now, if meta_len is negative it also means that meta_size is corrupt as well, but I'm still not quite sure what causes this corruption.

    During the live streaming I get this error pretty reliably, but if I try to re-run this script without the Python side of things, i.e. only reading from the Arrow IPC file and not writing anything new to it, the file parses correctly. In other words, it appears that concurrent access to the file by the Python process (which writes the data) and the Rust process causes this issue. I was under the impression that Arrow's NativeFile format should help in these situations, but I might've interpreted the pyarrow docs incorrectly. I'm not quite sure how to lock a file between two different processes (if that's even possible), so I'm kinda lost with respect to that issue.

    With all of this background, my question is a bit more straight-forward: Assuming that there happened some corruption at the meta_size level, how should I recover from it? Should I move the file Cursor back a few steps? Should I try to re-read the data? Or is aborting the streaming all that's left?

    I hope my issue and specific question are clear enough. As a side note, I tried using arrow-rs's IPC implementation, but they have more severe bugs which are much less subtle than this one.
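    For reference, a small sketch (not this crate's code) of the defensive conversion being discussed, assuming the metadata length is read as an i32: it rejects a negative value instead of letting an `as usize` cast wrap around.

    use std::convert::TryFrom;

    fn checked_meta_len(meta_len: i32) -> Result<usize, String> {
        // `try_from` fails on negative values, turning a corrupt stream into a
        // recoverable error instead of a huge allocation request.
        usize::try_from(meta_len).map_err(|_| format!("invalid IPC metadata length: {}", meta_len))
    }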

    opened by HagaiHargil 12
  • IPC writing writes all values in sliced arrays with offsets.

    I am new to Rust and Arrow (so apologies for my lack of clarity in explaining this issue) and have a use case where I need to transform columnar record batches to row-oriented record batches, i.e. a single record batch for each row.

    Full code to reproduce: https://github.com/dectl/arrow2-slice-bug

    It is expected that the rows vector of arrays would each contain values for a single row as I sliced the columns using a row number offset with a length of 1.

    In this example, I read 5 rows from a csv file and create a record batch. The row-oriented record batches produced are all the same as the original record batch of 5. Looking at the debug output for rows[0] below, it appears that the arrays point to the full record batch buffer, containing data for all 5 rows, rather than the slice of data for the single row. Is there perhaps an issue with the offsets in this implementation or am I doing something wrong?

    RecordBatch { schema: Schema { fields: [Field { name: "trip_id", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_pickup_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "lpep_dropoff_datetime", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "passenger_count", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_distance", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "fare_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "tip_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "total_amount", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "payment_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "trip_type", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9da480, len: 6, data: [0, 36, 72, 108, 144, 180] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9da780, len: 180, data: [97, 49, 55, 99, 53, 50, 97, 49, 45, 100, 53, 52, 49, 45, 52, 54, 99, 56, 45, 57, 100, 49, 101, 45, 51, 51, 51, 57, 97, 56, 55, 49, 53, 100, 50, 48, 48, 49, 101, 100, 100, 48, 52, 56, 45, 97, 48, 100, 48, 45, 52, 100, 49, 49, 45, 56, 102, 99, 99, 45, 99, 54, 100, 51, 52, 48, 51, 51, 52, 57, 53, 50, 56, 55, 49, 54, 97, 56, 97, 98, 45, 98, 99, 56, 51, 45, 52, 53, 50, 56, 45, 97, 57, 98, 100, 45, 54, 55, 101, 48, 51, 53, 49, 54, 50, 49, 51, 100, 99, 56, 99, 99, 48, 48, 57, 97, 45, 57, 57, 97, 52, 45, 52, 49, 48, 101, 45, 56, 56, 56, 54, 45, 52, 101, 55, 57, 56, 101, 57, 49, 56, 51, 52, 50, 101, 50, 54, 54, 54, 97, 57, 101, 45, 53, 102, 55, 48, 45, 52, 54, 98, 53, 45, 57, 55, 98, 99, 45, 53, 55, 53, 98, 100, 52, 56, 100, 53, 57, 57, 100] }, offset: 0, length: 180 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9daa80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9dac80, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 50, 58, 51, 48, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 53, 58, 53, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 52, 49, 58, 51, 56, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 54, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 49, 57, 58, 53, 55] }, offset: 0, length: 95 }, validity: None, offset: 0 }, Utf8Array { data_type: Utf8, offsets: Buffer { data: Bytes { ptr: 0x555aec9dae80, len: 6, data: [0, 19, 38, 57, 76, 95] }, offset: 0, length: 2 }, values: Buffer { data: Bytes { ptr: 0x555aec9db080, len: 95, data: [50, 48, 49, 57, 45, 49, 50, 45, 49, 56, 32, 49, 53, 58, 53, 52, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 54, 58, 51, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 53, 50, 58, 52, 57, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 49, 58, 49, 52, 58, 50, 49, 50, 48, 50, 48, 45, 48, 49, 45, 48, 49, 32, 48, 48, 58, 51, 48, 58, 53, 54] }, offset: 0, 
length: 95 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4180, len: 5, data: [0.0, 1.28, 2.47, 6.3, 2.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4380, len: 5, data: [3.5, 20.0, 10.5, 21.0, 10.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4580, len: 5, data: [0.01, 4.06, 3.54, 0.0, 0.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4780, len: 5, data: [4.81, 24.36, 15.34, 25.05, 11.3] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4980, len: 5, data: [1.0, 1.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }, PrimitiveArray { data_type: Float64, values: Buffer { data: Bytes { ptr: 0x555aec9f4c80, len: 5, data: [1.0, 2.0, 1.0, 1.0, 1.0] }, offset: 0, length: 1 }, validity: None, offset: 0 }] }
    

    This seems wrong: Bytes { ptr: 0x555aec9db300, len: 5, data: [5.0, 2.0, 1.0, 2.0, 1.0] }, offset: 0, length: 1 }

    use std::path::Path;
    use std::sync::Arc;
    
    use arrow2::io::csv::read;
    use arrow2::array::Array;
    use arrow2::record_batch::RecordBatch;
    
    
    fn main() {
        let mut reader = read::ReaderBuilder::new().from_path(&Path::new("sample.csv")).unwrap();
        let schema = Arc::new(read::infer_schema(&mut reader, Some(10), true, &read::infer).unwrap());
        
        let mut rows = vec![read::ByteRecord::default(); 5];
        read::read_rows(&mut reader, 0, &mut rows).unwrap();
        let rows = rows.as_slice();
    
        let record_batch = read::deserialize_batch(
            rows,
            schema.fields(),
            None,
            0,
            read::deserialize_column,
        ).unwrap();
    
        println!("full record batch:");
        println!("{:?}", record_batch);
    
        // Convert columnar record batch to a vector of "row" (single record) record batches
        let mut rows = Vec::new();
    
        for i in 0..record_batch.num_rows() {
            let mut row = Vec::new();
            for column in record_batch.columns() {
                let arr: Arc<dyn Array> = column.slice(i, 1).into();
                row.push(arr);
            }
            rows.push(row);
        }
        
        println!("row record batches:");
        for (i, row) in rows.into_iter().enumerate() {
            println!("row {}", i);
            let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
            println!("{:?}", row_record_batch);
        }
    }
    
    

    Below is my equivalent code for the official arrow-rs implementation, which appears to work as expected.

    use std::fs::File;
    use std::path::Path;
    
    use arrow::csv;
    use arrow::util::pretty::print_batches;
    use arrow::record_batch::RecordBatch;
    
    
    fn main() {
        let file = File::open(&Path::new("sample.csv")).unwrap();
        let builder = csv::ReaderBuilder::new()
            .has_header(true)
            .infer_schema(Some(10));
        let mut csv = builder.build(file).unwrap();
    
        let record_batch = csv.next().unwrap().unwrap();
        let schema = record_batch.schema().clone();
    
        println!("full record batch:");
        print_batches(&[record_batch.clone()]).unwrap();
    
        // Convert columnar record batch to a vector of "row" (single record) record batches
        let mut rows = Vec::new();
    
    
        for i in 0..record_batch.num_rows() {
            let mut row = Vec::new();
            for column in record_batch.columns() {
                row.push(column.slice(i, 1));
            }
            rows.push(row);
        }
        
    
        // Using new record_batch.slice() method now implemented in SNAPSHOT-5.0.0
        /*
        for i in 0..record_batch.num_rows() {
            rows.push(record_batch.slice(i, 1));
        }
        */
    
        println!("row record batches:");
        for row in rows {
            let row_record_batch = RecordBatch::try_new(schema.clone(), row).unwrap();
            print_batches(&[row_record_batch]).unwrap();
        }
    
        /*
        for row in rows {
            print_batches(&[row]).unwrap();
        }
         */
    }
    
    opened by declark1 12
  • Support to read/write from/to ODBC

    This PR adds support for reading from, and writing to, an ODBC driver.

    I anticipate this to be one of the most efficient ways of loading data into arrow, as odbc-api offers an API to load data in a columnar format whereby most buffers are copied back-to-back into arrow (even when nulls are present). Variable-length values and validity need a small O(N) deserialization, so it is not as fast as Arrow IPC (but likely much faster than parquet).

    feature 
    opened by jorgecarleitao 11
  • Consider removing `RecordBatch`

    For historical reasons, we have RecordBatch. RecordBatch represents a collection of columns with a schema.

    I see a couple of problems with RecordBatch:

    1. it mixes metadata (Schema) with data (Array). In all IO cases we have, the Schema is known when the metadata from the file is read, way before data is read. I.e. the user has access to the Schema very early, and does not really need to pass it to an iterator or stream of data for the stream to contain the metadata. However, it is required to do so by our APIs, because our APIs currently return a RecordBatch (and thus need a schema on them) even though all the schemas are the same.

    2. it is not part of the arrow spec. A RecordBatch is only mentioned in the IPC specification, and there it does not contain a schema (only columns).

    3. it is a struct that can easily be recreated by users that need it

    4. It indirectly drives design decisions to use it as the data carrier, even though it is not a good one. For example, in DataFusion (apache/arrow-datafusion) the physical nodes return a stream of RecordBatch, which requires piping schemas all the way to the physical nodes so that they can in turn use them to create a RecordBatch. This could have been replaced by Vec<Arc<dyn Array>>, or even more exotic carriers (e.g. an enum with scalar and vector variants); a hypothetical sketch follows below.
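    A hypothetical sketch of that alternative (the names are illustrative, not an existing arrow2 API): the schema is handed out once, and the stream itself carries only columns.

    use std::sync::Arc;

    use arrow2::array::Array;
    use arrow2::datatypes::Schema;
    use arrow2::error::Result;

    /// Columns without an attached schema.
    type Columns = Vec<Arc<dyn Array>>;

    /// The reader exposes the schema once; every item afterwards is just data.
    trait ColumnSource {
        fn schema(&self) -> &Schema;
        fn next_columns(&mut self) -> Option<Result<Columns>>;
    }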

    help wanted no-changelog investigation 
    opened by jorgecarleitao 11
  • Added example showing parallel writes to parquet (x num_cores)

    Encoding + compression is embarrassingly parallel across columns, and thus results in a speedup factor equal to the number of available cores, up to the number of columns to be written.

    documentation 
    opened by jorgecarleitao 11
  • Add support to read JSON

    Currently we write and read JSON lines.

    https://jsonlines.org/

    I believe it would be a minor modification to also be able to read and write JSON. We could wrap the JSON Lines values in an array and separate them with a comma instead of a newline character.

    E.g. now we write:

    JSON Lines

    {c1:"a"}
    {c1:null}
    

    JSON

    [{c1:"a"},
    {c1:null}]
    

    Would this fit the scope of arrow2, as this is something different from what pyarrow does? I don't mean that we should drop the JSON Lines functionality, but that we also allow reading and writing JSON.

    enhancement no-changelog 
    opened by ritchie46 10
  • Dependencies update

    There are a few dependencies that are quite outdated (including some hefty ones, like zstd). The only one left outdated is odbc-api, since I'm unsure about that one.

    opened by aldanor 1
  • StructArray convenience methods (redux)

    I see #61 is marked as completed, but the methods discussed don't seem to exist. Besides that, I'd guess that most of the time when a caller knows a certain field exists, they probably also know its type. So I'd love to be able to do something like this:

    arr.child::<PrimitiveArray<u64>>("id").unwrap()
    

    Thoughts? I can take a stab at a PR, if it seems worthwhile.
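    A rough sketch of how such a helper could look (the `fields`/`values` accessors and the free-function form are assumptions, not an existing API):

    use arrow2::array::{Array, StructArray};

    /// Looks up a child column by name and downcasts it to a concrete array type.
    fn child<'a, A: 'static>(arr: &'a StructArray, name: &str) -> Option<&'a A> {
        let index = arr.fields().iter().position(|field| field.name == name)?;
        arr.values()[index].as_any().downcast_ref::<A>()
    }

    // Intended usage, mirroring the example above:
    // let id = child::<PrimitiveArray<u64>>(&arr, "id").unwrap();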

    opened by hohav 0
  • Reusing chunks when writing?

    I'm currently using the IPC file writer, where the write() method requires you to pass a Chunk<Box<dyn Array>> (note: specifically a box).

    If I have a bunch of MutablePrimitiveArray columns that I collect the data to until the chunk fills up, the only choice I have is to move them (i.e. std::mem::take) into Chunk, which will then require reallocation once the data for the next chunk gets collected. There's no feasible way of getting the original arrays out of Chunk either (you can get a &PrimitiveArray by downcasting, but you can't convert it back to a concrete sized type).

    Wonder if it's an oversight of some sort (in which case, where exactly, in the IPC API?) or whether there's a good reason? E.g. write() requires Chunk<Box<dyn Array>> whereas it could probably accept a Chunk<impl AsRef<dyn Array>>?

    opened by aldanor 0
  • fix csv infer_schema on empty fields

    The default infer function provided by src/io/csv/utils.rs infers an empty field as DataType::Utf8. If the same column contains non-empty fields with valid DataType::Float64 data, the column in infer_schema will then contain both DataTypes. When merge_schema is called it will decide that DataType::Utf8 has precedence over DataType::Float64 and set the column type to DataType::Utf8.

    If we instead decline to infer anything for a field without any data, we will in the end get DataType::Float64 if the column contained other fields with valid f64 data; if the column only had empty data, it will still default to DataType::Utf8.
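    A stand-alone sketch of the proposed rule (types and names are illustrative, not arrow2's): an empty cell contributes no type, so it never forces Utf8 onto an otherwise numeric column.

    #[derive(Clone, Copy, PartialEq, Debug)]
    enum Inferred {
        Float64,
        Utf8,
    }

    /// Empty cells decline to infer; non-empty cells infer Float64 or Utf8.
    fn infer_cell(bytes: &[u8]) -> Option<Inferred> {
        if bytes.is_empty() {
            None
        } else if std::str::from_utf8(bytes).ok()?.parse::<f64>().is_ok() {
            Some(Inferred::Float64)
        } else {
            Some(Inferred::Utf8)
        }
    }

    /// Utf8 still wins over Float64; a column of only empty cells defaults to Utf8.
    fn merge_column(cells: impl Iterator<Item = Option<Inferred>>) -> Inferred {
        cells
            .flatten()
            .fold(None, |acc, t| match (acc, t) {
                (Some(Inferred::Utf8), _) | (_, Inferred::Utf8) => Some(Inferred::Utf8),
                _ => Some(Inferred::Float64),
            })
            .unwrap_or(Inferred::Utf8)
    }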

    opened by tripokey 3
  • Arrow2 read parquet file did not reuse the page decoder buffer to array

    Let's look at this code in
    https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226

    State::Required(page) => {
        values.extend(
            page.values
                .by_ref()
                .map(decode)
                .map(self.op)
                .take(remaining),
        );
    }
    

    There is an extra memcpy in values.extend and decode; I think we could optimize it by using a Buffer clone.

    The first motivation is to move

    #[derive(Debug, Clone)]
    pub struct DataPage {
        pub(super) header: DataPageHeader,
        pub(super) buffer: Vec<u8>,
        ...
    }
    

    to

    #[derive(Debug, Clone)]
    pub struct DataPage {
        pub(super) header: DataPageHeader,
        pub(super) buffer: Buffer<u8>,
        ...
    }
    

    @jorgecarleitao what do you think about this?

    I found arrow-rs had addressed this improvement in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138

    opened by sundy-li 3