This crate lets you write a struct in Rust and derive from it a struct of arrays laid out in memory according to the Arrow format.

Overview

Arrow2-derive - derive for Arrow2

...
is_deleted: bool,
a1: Option<f64>,
a2: i64,
// binary
a3: Option<Vec<u8>>,
// date32
a4: NaiveDate,
// optional list array of optional strings
nullable_list: Option<Vec<Option<String>>>,
...

In the example above, the derived struct is

#[derive(Default, Debug)]
pub struct FooArray {
    name: MutableUtf8Array<i32>,
    is_deleted: MutableBooleanArray,
    a1: MutablePrimitiveArray<f64>,
    a2: MutablePrimitiveArray<i64>,
    a3: MutableBinaryArray<i32>,
    nullable_list: MutableListArray<i32, MutableUtf8Array<i32>>,
    required_list: MutableListArray<i32, MutableUtf8Array<i32>>,
    other_list: MutableListArray<i32, MutablePrimitiveArray<i32>>,
}

FooArray::push lays the data out in memory according to the Arrow spec, so the result can be used with the IPC, FFI, and other interfaces supported by arrow2.
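To make the struct-of-arrays idea concrete, here is a minimal std-only sketch of what such a `push` does conceptually. The names `Row` and `RowColumns` are hypothetical, and this is not the code the derive macro generates: it uses plain `Vec`s instead of arrow2 mutable arrays, and a `Vec<bool>` where Arrow uses a packed validity bitmap.

```rust
// Hypothetical row type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    name: Option<String>,
    is_deleted: bool,
    a2: i64,
}

// Hand-written column-oriented counterpart: one buffer (or set of buffers) per field.
#[derive(Debug, Default)]
struct RowColumns {
    // Utf8-like column: concatenated bytes, offsets into them, and one validity flag per row.
    name_values: Vec<u8>,
    name_offsets: Vec<i32>,
    name_validity: Vec<bool>,
    is_deleted: Vec<bool>,
    a2: Vec<i64>,
}

impl RowColumns {
    fn new() -> Self {
        // Offsets always start with 0, so row i spans offsets[i]..offsets[i + 1].
        RowColumns { name_offsets: vec![0], ..Default::default() }
    }

    // Appends one row by appending to every column.
    fn push(&mut self, row: &Row) {
        match &row.name {
            Some(s) => {
                self.name_values.extend_from_slice(s.as_bytes());
                self.name_validity.push(true);
            }
            // A null string adds no bytes; it only repeats the previous offset.
            None => self.name_validity.push(false),
        }
        self.name_offsets.push(self.name_values.len() as i32);
        self.is_deleted.push(row.is_deleted);
        self.a2.push(row.a2);
    }

    fn len(&self) -> usize {
        self.a2.len()
    }
}

fn main() {
    let mut cols = RowColumns::new();
    cols.push(&Row { name: Some("foo".to_string()), is_deleted: false, a2: 1 });
    cols.push(&Row { name: None, is_deleted: true, a2: 2 });
    println!("{} rows, offsets {:?}", cols.len(), cols.name_offsets);
}
```

Each field ends up in its own contiguous buffer, which is what allows the columns to be handed to columnar consumers without a per-row copy.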

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • trait bound not satisfied

    Hey! This seems to be exactly what I need for translating structured log data that I've packed into structs cleanly into the arrow format, so I'm really excited by the potential of this crate.

    I seem to be getting an issue in my project:

    error[E0277]: the trait bound `arrow2::array::struct_::StructArray: From<LogDataArray>` is not satisfied
      --> src/main.rs:37:35
       |
    37 | #[derive(Debug, Clone, PartialEq, StructOfArrow)]
       |                                   ^^^^^^^^^^^^^ the trait `From<LogDataArray>` is not implemented for `arrow2::array::struct_::StructArray`
       | 
      ::: /home/weaton/.cargo/git/checkouts/arrow2-derive-1c9d82f1d5ff8f2d/d4f7231/src/lib.rs:9:24
       |
    9  | pub trait ArrowStruct: Into<StructArray> {
       |                        ----------------- required by this bound in `ArrowStruct`
       |
       = help: the following implementations were found:
                 <arrow2::array::struct_::StructArray as From<arrow2::array::growable::structure::GrowableStruct<'a>>>
                 <arrow2::array::struct_::StructArray as From<arrow2::record_batch::RecordBatch>>
       = note: required because of the requirements on the impl of `Into<arrow2::array::struct_::StructArray>` for `LogDataArray`
       = note: this error originates in the derive macro `StructOfArrow` (in Nightly builds, run with -Z macro-backtrace for more info)
    

    Any idea what could be causing this? When I drop my struct definition into your test suite it compiles, which is causing me to scratch my head.

    Here's my Cargo.toml entry. Am I missing something silly?

    arrow2-derive = { git = "https://github.com/jorgecarleitao/arrow2-derive.git", branch = "main" }
    

    Thanks again for your work on this!

    question 
    opened by wseaton 10
  • Support conversion of rust struct to an arrow2 chunk

    Created from the discussion in https://github.com/jorgecarleitao/arrow2/issues/1092.

    A rust struct can conceptually represent either an Arrow Struct or an arrow2::Chunk (a column group). The arrow2::Chunk is important since it's used in the deserialization/serialization API for parquet and flight conversion.

    We can extend the arrow2_convert::TryIntoArrow and arrow2_convert::FromArrow traits to convert to/from arrow2::Chunk, but there are two possible mappings from a vector of structs, Vec<S> to Chunk:

    1. The Chunk has a single field of type Struct
    2. The Chunk contains the same number of fields as the struct.

    1 can be easily supported by wrapping an arrow2::Array in a Chunk.

    2 has a couple of approaches:

    a. A new derive macro to generate the mapping to a Chunk (e.g. ArrowChunk or ArrowRoot).
    b. A helper method that converts an arrow2::StructArray to a Chunk by unwrapping the fields.
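The difference between the two mappings can be sketched with std-only stand-ins. `Column` and `Chunk` below are hypothetical toy types, not the arrow2 ones; `wrap` corresponds to mapping 1 and `flatten` to approach 2b.

```rust
// Toy stand-in for an Arrow array; a struct column owns its named child columns.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
    Struct(Vec<(String, Column)>),
}

// Toy stand-in for a chunk: an ordered group of columns.
#[derive(Debug, PartialEq)]
struct Chunk {
    columns: Vec<Column>,
}

// Mapping 1: the chunk has a single column holding the whole struct.
fn wrap(struct_column: Column) -> Chunk {
    Chunk { columns: vec![struct_column] }
}

// Mapping 2 (approach b): unwrap the struct so each field becomes its own column.
// Returns None when the input is not a struct column.
fn flatten(struct_column: Column) -> Option<Chunk> {
    match struct_column {
        Column::Struct(children) => Some(Chunk {
            columns: children.into_iter().map(|(_, column)| column).collect(),
        }),
        _ => None,
    }
}

fn example() -> Column {
    Column::Struct(vec![
        ("id".to_string(), Column::Int64(vec![1, 2])),
        ("name".to_string(), Column::Utf8(vec!["a".to_string(), "b".to_string()])),
    ])
}

fn main() {
    println!("wrapped: {} column(s)", wrap(example()).columns.len());
    println!("flattened: {} column(s)", flatten(example()).unwrap().columns.len());
}
```

In the flattened form each field is a separate top-level column, which is the shape that per-column serialization would operate on.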

    One related use-case that could guide this design is to support generic typed versions of the arrow2 csv, json, parquet, and flight serialize/deserialize methods, where the schema is specified by a rust struct (opened https://github.com/DataEngineeringLabs/arrow2-convert/issues/41 for this). To achieve this, it would be useful to access the deserialize/serialize methods of each column separately for parallelism which is cleaner via 2a.

    opened by ncpenke 8
  • Add support for large types

    The complexity is mostly in the serialize path since for deserialize we can just look at the arrow type (LargeList, LargeUtf8, etc) and cast to the appropriate array type.

    A couple of ways I can think of to support this for serialize:

    1. Only support i64 offsets, and provide a conversion method that converts large types to small types in another pass

    2. Support an attribute either on a container or per field to use the large offset.
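As a rough illustration of option 1, the extra pass that turns large (i64) offsets into small (i32) ones is essentially a checked narrowing of the offsets buffer. The function name below is hypothetical and the sketch works on a plain slice rather than on arrow2 arrays.

```rust
use std::convert::TryFrom;

// Narrows i64 offsets to i32 offsets.
// Returns None if any offset does not fit in an i32, in which case the data
// has to stay in its large form.
fn large_to_small_offsets(offsets: &[i64]) -> Option<Vec<i32>> {
    offsets.iter().map(|&offset| i32::try_from(offset).ok()).collect()
}

fn main() {
    println!("{:?}", large_to_small_offsets(&[0, 3, 7]));
    println!("{:?}", large_to_small_offsets(&[0, i64::from(i32::MAX) + 1]));
}
```

The values buffer itself would not need to change in such a pass; only the offsets are narrowed, and the conversion has to fail rather than truncate when the data is too large.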

    opened by ncpenke 7
  • Renaming crates before publishing?

    One thing that stands out is that we have two crates, arrow2_derive and derive_internals, and both need to be published.

    If we were to follow the typical convention, the derive_internals crate should actually be arrow2_derive, and the current arrow2_derive, which contains the recently added traits should be called something else, perhaps arrow2_convert?

    There is some additional functionality that could go into arrow2_convert, providing helper APIs that are higher-level than what the arrow2 crate offers.

    @jorgecarleitao thoughts?

    opened by ncpenke 2
  • Fix #20 Prepare crate for publishing

    • Rename to arrow2_convert
    • Improve error reporting in proc macro
    • Add tests for error reporting in proc macro
    • Beginnings of macro attributes for serialize only and deserialize only
    • Beginnings of optionally enabling serialize/deserialize
    • More modular tests
    • Add licenses and symlinks to readme and licenses
    • Updated documentation
    • Follow the serde serializer pattern for passing in mutable arrays, which enables serializing borrowed values
    opened by ncpenke 1
  • [question] Why are union variants without payload mapped to a bool array instead of a null array?

    The following enum gets mapped to a bool array currently:

    enum Foo {
        Variant1,
    }
    

    Couldn't it be mapped to a null array instead? I've only recently started using Arrow and your lib, so I'm trying to figure out the landscape of how things are done and how to take best advantage of what's available.

    opened by douglas-raillard-arm 0
  • Crash while serializing

    thread 'main' panicked at 'attempt to subtract with overflow', src/analysis/tasks.rs:270:75
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/std/src/panicking.rs:575:5
       1: core::panicking::panic_fmt
                 at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:65:14
       2: core::panicking::panic
                 at /rustc/e0098a5cc3a87d857e597af824d0ce1ed1ad85e0/library/core/src/panicking.rs:114:5
       3: <lisa_rust_analysis::analysis::tasks::MutableTaskStateArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
                 at ./tools/analysis/src/analysis/tasks.rs:270:75
       4: <lisa_rust_analysis::analysis::tasks::TaskState as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
                 at ./tools/analysis/src/analysis/tasks.rs:270:75
       5: <lisa_rust_analysis::analysis::tasks::MutableTasksStatesRowArray as arrow2::array::TryPush<core::option::Option<__T>>>::try_push
                 at ./tools/analysis/src/analysis/tasks.rs:575:24
       6: <lisa_rust_analysis::analysis::tasks::TasksStatesRow as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize
                 at ./tools/analysis/src/analysis/tasks.rs:575:24
       7: arrow2_convert::serialize::arrow_serialize_extend_internal
                 at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:328:9
       8: arrow2_convert::serialize::arrow_serialize_to_mutable_array
                 at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:343:5
       9: <Collection as arrow2_convert::serialize::TryIntoArrow<alloc::boxed::Box<dyn arrow2::array::Array>,Element>>::try_into_arrow
                 at /home/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow2_convert-0.3.2/src/serialize.rs:427:12
    

    This crashes in debug, and compiling in --release mode silently leads to a corrupted file that pyarrow chokes on when using pyarrow.compute.struct_field:

    pyarrow.lib.ArrowIndexError: Index -1 out of bounds
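The debug/release difference reported here matches how Rust treats integer overflow by default: debug builds panic, while release builds wrap around. The sketch below is not the generated `try_push` code; the helper names are hypothetical and only illustrate how an unchecked `len - 1` on an empty buffer can turn into an index that reads as -1 once narrowed to a signed type.

```rust
// Index of the most recently pushed element, or None when nothing was pushed yet.
fn last_index_checked(len: usize) -> Option<usize> {
    len.checked_sub(1)
}

// What `len - 1` evaluates to when overflow checks are disabled
// (the default in release builds): the value wraps around.
fn last_index_wrapping(len: usize) -> usize {
    len.wrapping_sub(1)
}

fn main() {
    println!("{:?}", last_index_checked(0));
    // Truncating the wrapped value to a 32-bit signed integer yields -1.
    println!("{}", last_index_wrapping(0) as i32);
}
```

Using `checked_sub` (or returning an error) makes the failure explicit in both build profiles instead of producing a silently corrupted index.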
    
    opened by douglas-raillard-arm 5
  • derive(ArrowField) breaks in macros

    I'm trying to either:

    • apply #[derive(ArrowField)] to a struct defined in a macro_rules
    • apply my macro_rules on a struct using #[derive(ArrowField)]

    The struct definition is parsed using the pattern from The Little Book of Rust Macros: https://veykril.github.io/tlborm/decl-macros/building-blocks/parsing.html#struct

    use arrow2_convert::ArrowField;
    
    macro_rules! mymacro {
        (
            $( #[$meta:meta] )*
            //  ^~~~attributes~~~~^
                $vis:vis struct $name:ident {
                    $(
                        $( #[$field_meta:meta] )*
                        //          ^~~~field attributes~~~!^
                            $field_vis:vis $field_name:ident : $field_ty:ty
                        //          ^~~~~~~~~~~~~~~~~a single field~~~~~~~~~~~~~~~^
                    ),*
                        $(,)? }
        ) => {
            $( #[$meta] )*
                $vis struct $name {
                    $(
                        $( #[$field_meta] )*
                            $field_vis $field_name : $field_ty
                    ),*
                }
        }
    }
    
    mymacro! {
        #[derive(ArrowField)]
        struct Foo2 {
            myfield: u8,
        }
    }
    
    

    Sadly this fails with:

    error: proc-macro derive panicked
      --> src/main.rs:97:14
       |
    97 |     #[derive(ArrowField)]
       |              ^^^^^^^^^^
       |
       = help: message: Only types are supported atm
    
    error: could not compile `dataframe` due to previous error
    

    The only macro that seemed to work is this one:

    macro_rules! mymacro2 {
        ($($tts:tt)*) => { $($tts)* }
    }
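A plausible explanation for why only the `tt` pass-through works: once `macro_rules!` captures something as a `ty` fragment, it is forwarded as a single opaque unit rather than as the original tokens, whereas `tt` captures are forwarded transparently. A derive macro downstream would then likely see the field type wrapped in an invisible group instead of a bare path. The std-only sketch below demonstrates the opacity itself; the macro names are hypothetical.

```rust
// Matches the literal token `u8`, and falls back to any other type.
macro_rules! is_u8 {
    (u8) => { true };
    ($other:ty) => { false };
}

// Captures the input as a `ty` fragment before forwarding it.
macro_rules! via_ty {
    ($t:ty) => { is_u8!($t) };
}

// Captures the input as raw token trees before forwarding it.
macro_rules! via_tt {
    ($($t:tt)*) => { is_u8!($($t)*) };
}

fn main() {
    // Direct call and `tt` forwarding still see the literal `u8` token.
    assert!(is_u8!(u8));
    assert!(via_tt!(u8));
    // After a `ty` capture the fragment is opaque, so the `(u8)` rule no longer matches.
    assert!(!via_ty!(u8));
    println!("ok");
}
```

If this is indeed the cause, a proc macro can handle it by unwrapping the group around the type before inspecting it.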
    

    Other derive macros such as JsonSchema don't seem to have this issue: https://docs.rs/schemars/latest/schemars/

    EDIT: This also fails, but differently:

    mymacro! {
        struct Foo2 {
            myfield: u8,
        }
    }
    
    #[derive(ArrowField)]
    struct Foo3(Foo2);
    

    fails with:

    error: proc-macro derive panicked
       --> src/main.rs:114:10
        |
    114 | #[derive(ArrowField)]
        |          ^^^^^^^^^^
        |
        = help: message: called `Option::unwrap()` on a `None` value
    
    opened by douglas-raillard-arm 3
  • enable customizing list inner child element name?

    When Spark outputs a parquet file, I believe it always uses the inner list item name of element as opposed to item:

    message spark_schema {
      ....
      OPTIONAL group mylistcolumn (LIST) {
        REPEATED group list {
          OPTIONAL BYTE_ARRAY element (UTF8);
        }
      }
      ...
    }
    

    It appears this crate (or one of its dependencies, perhaps arrow2 itself?), is always assuming that the inner field name of a list is item rather than element.

    Expected: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "item", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

    Actual: Struct([Field { name: "mylistcolumn", data_type: List(Field { name: "element", data_type: Int32, is_nullable: false, metadata: {} }), is_nullable: false, metadata: {} }])

    I'm guessing this is because of this line of code?

    https://github.com/DataEngineeringLabs/arrow2-convert/blob/7d9e13254a74b853019ad2e731814bdb16284932/arrow2_convert/src/field.rs#L214

    1. If this is controlled by arrow2-convert, can we perhaps customize this via an annotation on the struct member?
    2. Should the default be re-evaluated if parquet-mr / Spark uses element?
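The mismatch can be reproduced with a toy model of the schema types. `DataType`, `Field`, and `same_shape` below are hypothetical stand-ins, not the arrow2 definitions: they show that a derived equality check includes the child field name, and what a comparison that ignores only that name could look like.

```rust
// Toy stand-ins for schema types; the list variant carries its child field.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    is_nullable: bool,
}

// Compares two data types while ignoring the name of list child fields.
fn same_shape(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(x), DataType::List(y)) => {
            x.is_nullable == y.is_nullable && same_shape(&x.data_type, &y.data_type)
        }
        _ => a == b,
    }
}

fn list_of_int32(child_name: &str) -> DataType {
    DataType::List(Box::new(Field {
        name: child_name.to_string(),
        data_type: DataType::Int32,
        is_nullable: false,
    }))
}

fn main() {
    let item = list_of_int32("item");
    let element = list_of_int32("element");
    println!("strict equality: {}", item == element);
    println!("same shape: {}", same_shape(&item, &element));
}
```

Whether to relax the comparison or to make the child name configurable per field is the design question raised in the two points above; the sketch only shows that the two schemas differ in nothing but that name.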

    P.S. Likely not related, but I ran into a very similar error in this other crate as well: https://github.com/timvw/qv/issues/31

    opened by AlJohri 3
  • improve "Data type mismatch" error message

    Currently the error message doesn't give enough information to debug what's going on.

    https://github.com/DataEngineeringLabs/arrow2-convert/blob/6c37e294ee7bab90ea94637e6bda79440a527f3b/arrow2_convert/src/deserialize.rs#L305-L307

    Something like this helped me debug an issue I was running into:

    Err(arrow2::error::Error::InvalidArgumentError(format!(
        "Data type mismatch. Expected: {:?} | Found: {:?}",
        &<ArrowType as ArrowField>::data_type(),
        arr.data_type()
    )))
    

    It produced an error message that looks like:

    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Data type mismatch. Expected: Struct([...redacted...]) | Found: Struct(...redacted...)")', examples/read_parquet_specific_columns.rs:95:79
    

    Could potentially make it even better by doing a diff.
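A diff along those lines could report only the first field that differs instead of dumping both schemas. The sketch below is std-only and hypothetical: fields are modeled as `(name, type)` string pairs rather than arrow2 `Field`s, and `first_mismatch` is not an existing function in either crate.

```rust
// Returns a description of the first difference between two field lists,
// or None when they are identical.
fn first_mismatch(expected: &[(&str, &str)], found: &[(&str, &str)]) -> Option<String> {
    if expected.len() != found.len() {
        return Some(format!(
            "field count differs: expected {}, found {}",
            expected.len(),
            found.len()
        ));
    }
    for (i, (e, f)) in expected.iter().zip(found.iter()).enumerate() {
        if e != f {
            return Some(format!(
                "field {}: expected {}: {}, found {}: {}",
                i, e.0, e.1, f.0, f.1
            ));
        }
    }
    None
}

fn main() {
    let expected = [("id", "Int64"), ("content", "Utf8")];
    let found = [("id", "Int64"), ("content", "LargeUtf8")];
    println!("{:?}", first_mismatch(&expected, &found));
}
```

For nested types the same idea would need to recurse into child fields so that the message points at the innermost difference.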

    opened by AlJohri 2
  • Is it possible to run `try_into_collection` on a `Chunk` instead of an `Array`?

    Starting with the parquet_read_parallel example from arrow2, I am trying to deserialize a Chunk into a Vec of structs.

    Using the deserialize_parallel function as defined in the above example, the following code currently works for me:

    
    pub struct Document {
        content: String,
    }
    
    ...
    let chunk = deserialize_parallel(&mut columns)?;
    let array = StructArray::new(
        DataType::Struct(fields.clone()),
        chunk.arrays().to_vec(),
        None,
    );
    let documents: Vec<Document> = array.to_boxed().try_into_collection().unwrap();
    

    Questions:

    1. With the currently exposed APIs in arrow2 and arrow2-convert, is there a better way to convert the Chunk into a Struct? I think the extra conversion from Chunk to StructArray with the to_boxed at the end is perhaps not the most efficient.
    2. Would it be possible to expose TryIntoCollection::try_into_collection directly on the Chunk as well?
    opened by AlJohri 1
Releases(v0.3.2)
  • v0.3.2(Sep 29, 2022)

    What's Changed

    • Upgrade to arrow2 v0.14 @ncpenke https://github.com/DataEngineeringLabs/arrow2-convert/pull/66

    Full Changelog: https://github.com/DataEngineeringLabs/arrow2-convert/compare/v0.3.0...v0.3.1

  • v0.3.0(Aug 25, 2022)

    Thank you @nielsmeima and @teymour-aldridge for your contributions to this release!

    Features and Enhancements

    • Add support for converting to Chunk @ncpenke https://github.com/DataEngineeringLabs/arrow2-convert/pull/44
    • Add support for i128 @ncpenke https://github.com/DataEngineeringLabs/arrow2-convert/pull/48
    • Add support for enums @ncpenke https://github.com/DataEngineeringLabs/arrow2-convert/pull/37
    • Flatten chunks @nielsmeima https://github.com/DataEngineeringLabs/arrow2-convert/pull/56
    • Serialize escaped Rust identifiers unescaped. @teymour-aldridge https://github.com/DataEngineeringLabs/arrow2-convert/pull/59
    • Update arrow2 version. @teymour-aldridge https://github.com/DataEngineeringLabs/arrow2-convert/pull/61

    Full Changelog: https://github.com/DataEngineeringLabs/arrow2-convert/compare/v0.2.0...v0.3.0

  • v0.2.0(Jun 13, 2022)

  • v0.1.0(Mar 3, 2022)

Owner
Jorge Leitao
Open source contributor; PMC member of Apache Arrow