Official Rust implementation of Apache Arrow

Overview

Native Rust implementation of Apache Arrow

Welcome to the Rust implementation of Apache Arrow, the popular in-memory columnar format.

This part of the Arrow project is divided into five main components:

  • Arrow: core functionality (memory layout, arrays, low-level computations) (README)
  • Parquet: Parquet support (README)
  • Arrow Flight: transfer of Arrow data between processes (README)
  • DataFusion: in-memory query engine with SQL support (README)
  • Ballista: distributed query execution (README)

Independently, they support a vast array of functionality for in-memory computations.

Together, they allow users to write an SQL query or a DataFrame (using the datafusion crate), run it against a parquet file (using the parquet crate), evaluate it in memory using Arrow's columnar format (using the arrow crate), and send it to another process (using the arrow-flight crate).
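
As an illustration of that workflow, here is a minimal sketch that uses datafusion to run SQL over a Parquet file (the file path is hypothetical and the exact API differs between datafusion versions):

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Register a Parquet file as a table, then query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_parquet("example", "data/example.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a").await?;
    // The results are Arrow record batches; print them (or send them on with arrow-flight).
    df.show().await?;
    Ok(())
}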

Generally speaking, the arrow crate offers functionality to develop code that uses Arrow arrays (a small sketch follows the list below), and datafusion offers most operations typically found in SQL, with the notable exceptions of:

  • join
  • window functions
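
For a taste of the arrow crate's array APIs mentioned above, here is a minimal sketch that builds a RecordBatch and applies a low-level compute kernel (illustrative only, assuming a reasonably recent arrow release):

use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::aggregate::sum;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Build a single-column, in-memory columnar batch.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let values = Int32Array::from(vec![1, 2, 3]);
    let total = sum(&values); // low-level aggregation kernel
    let column: ArrayRef = Arc::new(values);
    let batch = RecordBatch::try_new(schema, vec![column])?;
    println!("{} rows, sum of column a = {:?}", batch.num_rows(), total);
    Ok(())
}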

There are too many features to enumerate here, but some notable mentions:

  • Arrow implements all formats in the specification except certain dictionaries
  • Arrow supports SIMD for some of its vertical operations
  • DataFusion supports async execution
  • DataFusion supports user-defined functions, aggregates, and whole execution nodes

You can find more details about each crate in their respective READMEs.

Arrow Rust Community

We use the official ASF Slack for informal discussions and coordination. This is a great place to meet other contributors and get guidance on where to contribute. Join us in the arrow-rust channel.

We use ASF JIRA as the system of record for new features and bug fixes and this plays a critical role in the release process.

For design discussions we generally collaborate on Google documents and file a JIRA linking to the document.

There is also a bi-weekly Rust-specific sync call for the Arrow Rust community. This is hosted on Google Meet at https://meet.google.com/ctp-yujs-aee on alternate Wednesdays at 09:00 US/Pacific, 12:00 US/Eastern. During US daylight saving time this corresponds to 16:00 UTC, and at other times it is 17:00 UTC.

Developer's guide to Arrow Rust

How to compile

This is a standard cargo project with workspaces. To build it, you need to have Rust and Cargo installed:

cd /rust && cargo build

You can also use Rust's official Docker image:

docker run --rm -v $(pwd)/rust:/rust -it rust /bin/bash -c "cd /rust && cargo build"

The command above assumes that you are in the root directory of the project, not in the same directory as this README.md.

You can also compile specific workspaces:

cd /rust/arrow && cargo build

Git Submodules

Before running tests and examples, it is necessary to set up the local development environment.

The tests rely on test data that is contained in git submodules.

To pull down this data run the following:

git submodule update --init

This populates data in two git submodules: parquet-testing and testing.

By default, cargo test will look for these directories at their standard location. The following environment variables can be used to override the location:

# Optionally specify a different location for test data
export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd)
export ARROW_TEST_DATA=$(cd ../testing/data; pwd)

From here on, this is a pure Rust project and cargo can be used to run tests, benchmarks, docs and examples as usual.

Running the tests

Run tests using the Rust standard cargo test command:

# run all tests.
cargo test


# run only tests for the arrow crate
cargo test -p arrow

Code Formatting

Our CI uses rustfmt to check code formatting. Before submitting a PR be sure to run the following and check for lint issues:

cargo +stable fmt --all -- --check

Clippy Lints

We recommend using clippy for checking lints during development. While we do not yet enforce clippy checks, we recommend not introducing new clippy errors or warnings.

Run the following to check for clippy lints:

cargo clippy

If you use Visual Studio Code with the rust-analyzer plugin, you can enable clippy to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881.

One of the concerns with clippy is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a clippy lint, you may disable the lint and briefly justify it.

Search for allow(clippy:: in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints on the lowest unit possible.

  • If you are introducing a line that returns a lint warning or error, you may disable the lint on that line.
  • If you have several lints on a function or module, you may disable the lint on the function or module.
  • If a lint is pervasive across multiple modules, you may disable it at the crate level.
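
For example, a lint can be allowed on a single item like this (the lint and justification shown are purely illustrative):

// Allow one specific clippy lint on just this function, with a short justification.
#[allow(clippy::too_many_arguments)] // signature is dictated by an external interface
fn example(a: i32, b: i32, c: i32, d: i32, e: i32, f: i32, g: i32, h: i32) -> i32 {
    a + b + c + d + e + f + g + h
}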

Git Pre-Commit Hook

You can use a git pre-commit hook to automate various pre-commit checks and formatting.

Suppose you are in the root directory of the project.

First check if the file already exists:

ls -l .git/hooks/pre-commit

If the file already exists, check its link source or contents first to avoid accidentally overwriting it. If it does not exist, symlink pre-commit.sh as .git/hooks/pre-commit:

ln -s  ../../rust/pre-commit.sh .git/hooks/pre-commit

If you want to commit without running the checks, pass --no-verify to git commit:

git commit --no-verify -m "... commit message ..."
Comments
  • Define eq_dyn_scalar API

    Which issue does this PR close?

    Working on this in relation to #984 and #1068 with the end goal being to finalize how we want eq_dyn_scalar to work.

    Rationale for this change

    What changes are included in this PR?

    Are there any user-facing changes?

    parquet arrow arrow-flight 
    opened by matthewmturner 26
  • Decimal Precision Validation

    Which part is this question about

    Generally the approach taken by this crate is that a given ArrayData and by extension Array only contains valid data. For example, a StringArray is valid UTF-8 with each index at a codepoint boundary, a dictionary array only has valid indexes, etc... This allows eliding bound checks on access within kernels.

    However, in order for this to be sound, it must be impossible to create invalid ArrayData using safe APIs. This means that safe APIs must either:

    • Generate valid data by construction - e.g. the builder APIs
    • Validate data - e.g. ArrayData::try_new

    For the examples above incorrect validation can very clearly lead to UB. The situation for decimal values is a bit more confused, in particular I'm not really clear on what the implications of a value that exceeds the precision actually are. However, some notes:

    • As far as I can tell we don't protect against overflow of normal integer types
    • We don't have any decimal arithmetic kernels (yet)
    • The decimal types are fixed bit width and so the precision isn't used to impact their representation

    Describe your question

    My question boils down to:

    • What is the purpose of the precision argument? Is it just for interoperability with other non-arrow representations?
    • Is there a requirement to saturate/error at the bounds of the precision, or can we simply overflow/saturate at the bounds of the underlying representation
    • Does validating the precision on ingest to ArrayData actually elide any validation when performing computation?

    The answers to this will dictate if we can just take a relaxed attitude to precision, and let users opt into validation if they care, and otherwise simply ignore it.

    I tried to understand what the C++ implementation is doing, but I honestly got lost. It almost looks like it is performing floating point operations and then rounding them back, which seems surprising...

    Additional context

    question 
    opened by tustvold 24
  • Add `async` into doc features

    Signed-off-by: remzi [email protected]

    Which issue does this PR close?

    Closes #1307. Closes https://github.com/apache/arrow-rs/issues/1617

    Rationale for this change

    What changes are included in this PR?

    Add async to default enabled features, so that the link arrow::async_reader is active.

    Are there any user-facing changes?

    parquet 
    opened by HaoYang670 23
  • Replace azure sdk with custom implementation

    Which issue does this PR close?

    closes #2176

    Rationale for this change

    See https://github.com/apache/arrow-rs/issues/2176

    What changes are included in this PR?

    Replaces the azure sdk with a custom implementation based on reqwest. So far this is a rough draft, and surely needs cleanup and some more work on the auth part. I tried to make the aws and azure implementations look as comparable as possible. ~~I also pulled in a new dependency on the oauth2 crate. Will evaluate a bit more when cleaning up auth, but my feeling was that implementing the oauth flows manually could be another significant piece of work.~~

    Any feedback is highly welcome.

    cc @tustvold @alamb

    Are there any user-facing changes?

    Not that I'm aware of, but there is a possibility

    api-change object-store 
    opened by roeap 22
  • Speed up `Decimal256` validation based on bytes comparison and add benchmark test

    Which issue does this PR close?

    Closes https://github.com/apache/arrow-rs/issues/2320

    Rationale for this change

    What changes are included in this PR?

    Are there any user-facing changes?

    parquet arrow 
    opened by liukun4515 20
  • support compression for IPC

    Which issue does this PR close?

    Closes #1709 Closes #70

    Rationale for this change

    What changes are included in this PR?

    Are there any user-facing changes?

    arrow 
    opened by liukun4515 20
  • Split up Arrow Crate

    TL;DR: rather than fighting entropy, let's just brute-force compilation

    Is your feature request related to a problem or challenge? Please describe what you are trying to do.

    The arrow crate is getting rather large, and is starting to show up as a non-trivial bottleneck when compiling code, see #2170. There have been some efforts to reduce the amount of generated code, see #1858, but this is going to be a perpetual losing battle against new feature additions.

    I think there are a couple of problems currently:

    1. Limited build parallelism, especially if codegen-units is set low
    2. Upstream crates have to "depend" on functionality they don't need, e.g. parquet depending on compute kernels
    3. Minor changes force large amounts of recompilation, with incremental compilation only helping marginally
    4. Codegen is rarely linear in complexity, consequently larger codegen units take longer than the same amount of code in smaller units

    All these conspire to often result in an arrow shaped hole in compilation, where CPUs are left idle.

    Some numbers from my local machine

    • Release with default features: 232 seconds
    • Release with default features without comparison kernels: 150 seconds
    • Release with default features without compute kernels: 70 seconds
    • Release without default features without compute kernels: 60 seconds

    The vast majority of the time all bar a single core is idle.

    Describe the solution you'd like

    I would like to propose we split up the arrow crate, into a number of sub-crates that are then re-exported by the top-level arrow crate. Users can then choose to depend on the batteries included arrow crate, or more granular crates.

    Initially I would propose the following split:

    • arrow-csv: CSV reader support
    • arrow-ipc: IPC support
    • arrow-json: JSON support (related to #2300)
    • arrow-compute: contents of compute module
    • arrow-test: arrow test_utils (not published)
    • arrow-core: everything else

    There is definitely scope for splitting up the crates further after this; in particular the comparison kernels might be a good candidate to live on their own, but I think let's start small and go from there. I suspect there is a fair amount of disentangling that will be necessary to achieve this.

    Describe alternatives you've considered

    Feature flags are another way this can be handled, however, they have a couple of limitations:

    • It is impractical to test the full combinatorial explosion of combinations, which allows for bugs to sneak through
    • They are unified for a target, which limits build parallelism: just because, say, DataFusion depends on arrow with CSV support shouldn't force the parquet crate to wait for that to compile before it can start compiling
    • Poor UX:
      • Discoverability is limited, it can be hard to determine what features gate what functionality
      • Hard to determine if the feature flag set is minimal, no equivalent of cargo-udeps
      • It can be a non-trivial detective exercise to determine why a given feature is being enabled
      • Necessitate counter-intuitive hacks to play nicely in multi-crate workspaces - see workspace hack

    Additional context

    @Jimexist recently drove an initiative to do something similar to DataFusion which has worked very well - https://github.com/apache/arrow-datafusion/issues/1750

    FYI @alamb @jhorstmann @nevi-me

    enhancement 
    opened by tustvold 19
  • FFI: ArrowArray::try_from_raw shouldn't clone

    Guys, I'm not sure if my understanding is right, but I think this commit will break the design and create a memory leak.

    If we clone the FFI struct, then we need to free the pointer ourselves, but if we free the FFI_ArrowArray, the data in this Array will also be freed. That means we cannot free the pointer until the data has been used and is ready to be freed, but in reality we cannot hold on to this otherwise useless pointer in a big project for that long, which creates a memory leak.

    As to the question @viirya raised in #1333: when managing memory, the one who allocates it should free it, which means in our case we need to allocate the struct in Rust, pass the pointer to Java, and then also free the memory in Rust.

    • You can check my code here: https://github.com/wangfenjin/duckdb-rs/blob/5083d39a4147f8017613304ae5f217a88ac42c2e/src/raw_statement.rs#L58
    • When I try to upgrade to version 10, a memory leak is detected and there is no easy way to fix it.

    I suggest we revert this commit. cc @alamb @sunchao

    Originally posted by @wangfenjin in https://github.com/apache/arrow-rs/issues/1334#issuecomment-1064828113

    arrow bug 
    opened by wangfenjin 19
  • Add FFI for Arrow C Stream Interface

    Which issue does this PR close?

    Closes #1348.

    Rationale for this change

    What changes are included in this PR?

    Are there any user-facing changes?

    arrow 
    opened by viirya 19
  • Change Field::metadata to HashMap

    Is your feature request related to a problem or challenge? Please describe what you are trying to do.

    Currently Schema::metadata is HashMap<String, String>, whereas Field::metadata is Option<BTreeMap<String, String>>. This is not only inconsistent, but it is unclear why there is an additional Option

    Describe the solution you'd like

    I would like to change Field::metadata to a HashMap for consistency with Schema

    Describe alternatives you've considered

    Additional context

    parquet arrow enhancement 
    opened by tustvold 18
  • `SchemaResult` in IPC deviates from other implementations

    Describe the bug

    The SchemaResult produced by SchemaAsIpc can be converted to a Schema by the Rust implementation of Apache Arrow Flight, but not by other implementations of Apache Arrow Flight (tested with the Go, Java, and Python implementations).

    To Reproduce

    For the Rust server, implement FlightService.get_schema() as:

    async fn get_schema(
        &self,
        _request: Request<FlightDescriptor>,
    ) -> Result<Response<SchemaResult>, Status> {
        let tid = Field::new("tid", DataType::Int32, false);
        let timestamp = Field::new("timestamp", DataType::Timestamp(TimeUnit::Millisecond, None), false);
        let value = Field::new("value", DataType::Float32, false);
        let schema = Schema::new(vec![tid, timestamp, value]);
    
        let options = IpcWriteOptions::default();
        let schema_as_ipc = SchemaAsIpc::new(&schema, &options);
        let schema_result: SchemaResult = schema_as_ipc.into();
    
        Ok(Response::new(schema_result))
    }
    

    Attempt to retrieve and print the Schema using the following Rust code:

    use arrow::ipc::convert;
    use arrow_flight::flight_service_client::FlightServiceClient;
    use arrow_flight::FlightDescriptor;
    use tokio::runtime::Runtime;
    use tonic::Request;
    
    fn main() {
        let tokio = Runtime::new().unwrap();
    
        tokio.block_on(async {
            let mut flight_service_client = FlightServiceClient::connect("grpc://127.0.0.1:9999").await.unwrap();
            let flight_descriptor = FlightDescriptor::new_path(vec!["".to_owned()]);
            let request = Request::new(flight_descriptor);
            let schema_result = flight_service_client.get_schema(request).await.unwrap().into_inner();
            let schema = convert::schema_from_bytes(&schema_result.schema).unwrap();
            dbg!(schema);
        });
    }
    

    Attempt to retrieve and print the Schema using the following Python code:

    from pyarrow import flight
    client = flight.FlightClient('grpc://127.0.0.1:9999')
    descriptor = flight.FlightDescriptor.for_path("")
    schema_result = client.get_schema(descriptor)
    print(schema_result.schema)
    

    Expected behavior

    The Rust code should successfully retrieve and print the Schema while the Python code should fail due to the following OSError being raised:

    Traceback (most recent call last):
      File "get_schema.py", line 5, in <module>
        print(schema_result.schema)
      File "pyarrow/_flight.pyx", line 720, in pyarrow._flight.SchemaResult.schema.__get__
      File "pyarrow/_flight.pyx", line 80, in pyarrow._flight.check_flight_status
      File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
    OSError: Invalid flatbuffers message.
    

    Additional context

    As the issue was first discovered using a client written in Go, we first raised apache/arrow#13853 as we believed the problem was in the Go implementation of Apache Arrow Flight. But from the comments provided by @zeroshade on that issue, it seems that the Rust implementation of Apache Arrow Flight deviates from the other implementations in how a Schema is serialized. For example, both the C++ and Go implementations include the continuation indicator (0xFFFFFFFF) followed by the message length as a 32-bit integer before the Schema:

    use arrow_flight::{SchemaAsIpc, SchemaResult};
    use arrow::datatypes::{Schema, TimeUnit, Field, DataType};
    use arrow::ipc::writer::IpcWriteOptions;
    
    fn main() {
        let tid = Field::new("tid", DataType::Int32, false);
        let timestamp = Field::new("timestamp", DataType::Timestamp(TimeUnit::Millisecond, None), false);
        let value = Field::new("value", DataType::Float32, false);
        let schema = Schema::new(vec![tid, timestamp, value]);
    
        let options = IpcWriteOptions::default();
        let schema_as_ipc = SchemaAsIpc::new(&schema, &options);
        let schema_result: SchemaResult = schema_as_ipc.into();
        dbg!(schema_result);
    }
    

    16 0 0 0 0 0 10 0 14 0 12 0 11 0 4 0 10 0 0 0 20 0 0 0 0 0 0 1 4 0 10 0 12 0 0 0 8 0 4 0 10 0 0 0 8 0 0 0 8 0 0 0 0 0 0 0 3 0 0 0 136 0 0 0 52 0 0 0 4 0 0 0 148 255 255 255 16 0 0 0 20 0 0 0 0 0 0 3 16 0 0 0 206 255 255 255 0 0 1 0 0 0 0 0 5 0 0 0 118 97 108 117 101 0 0 0 192 255 255 255 28 0 0 0 12 0 0 0 0 0 0 10 32 0 0 0 0 0 0 0 0 0 6 0 8 0 6 0 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 9 0 0 0 116 105 109 101 115 116 97 109 112 0 0 0 16 0 20 0 16 0 0 0 15 0 4 0 0 0 8 0 16 0 0 0 24 0 0 0 32 0 0 0 0 0 0 2 28 0 0 0 8 0 12 0 4 0 11 0 8 0 0 0 32 0 0 0 0 0 0 1 0 0 0 0 3 0 0 0 116 105 100 0

    #include <iostream>
    
    #include "arrow/type.h"
    #include "arrow/buffer.h"
    #include "arrow/ipc/writer.h"
    
    int main() {
      std::shared_ptr<arrow::Field> tid = arrow::field("tid", arrow::int32());
      std::shared_ptr<arrow::Field> timestamp = arrow::field("timestamp", arrow::timestamp(arrow::TimeUnit::MILLI));
      std::shared_ptr<arrow::Field> value = arrow::field("value", arrow::float32());
    
      std::shared_ptr<arrow::Schema> schema_ptr = arrow::schema({tid, timestamp, value});
      arrow::Schema schema = *schema_ptr.get();
      std::shared_ptr<arrow::Buffer> serialized_schema = arrow::ipc::SerializeSchema(schema).ValueOrDie();
    
      size_t serialized_schema_size = serialized_schema->size();
      for (int index = 0; index < serialized_schema_size; index++) {
        std::cout << unsigned((*serialized_schema)[index]) << ' ';
      }
      std::cout << std::endl;
    }
    

    255 255 255 255 224 0 0 0 16 0 0 0 0 0 10 0 12 0 6 0 5 0 8 0 10 0 0 0 0 1 4 0 12 0 0 0 8 0 8 0 0 0 4 0 8 0 0 0 4 0 0 0 3 0 0 0 124 0 0 0 52 0 0 0 4 0 0 0 160 255 255 255 0 0 1 3 16 0 0 0 24 0 0 0 4 0 0 0 0 0 0 0 5 0 0 0 118 97 108 117 101 0 0 0 210 255 255 255 0 0 1 0 204 255 255 255 0 0 1 10 16 0 0 0 32 0 0 0 4 0 0 0 0 0 0 0 9 0 0 0 116 105 109 101 115 116 97 109 112 0 6 0 8 0 6 0 6 0 0 0 0 0 1 0 16 0 20 0 8 0 6 0 7 0 12 0 0 0 16 0 16 0 0 0 0 0 1 2 16 0 0 0 28 0 0 0 4 0 0 0 0 0 0 0 3 0 0 0 116 105 100 0 8 0 12 0 8 0 7 0 8 0 0 0 0 0 0 1 32 0 0 0

    package main
    
    import (
        "fmt"
        "github.com/apache/arrow/go/arrow"
        "github.com/apache/arrow/go/arrow/flight"
        "github.com/apache/arrow/go/arrow/memory"
    )
    
    func main() {
         schema := arrow.NewSchema(
    		[]arrow.Field{
    			{Name: "tid", Type: arrow.PrimitiveTypes.Int32},
    			{Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ms},
    			{Name: "value", Type: arrow.PrimitiveTypes.Float32},
    		},
    		nil,
    	)
        serialized_schema := flight.SerializeSchema(schema,memory.DefaultAllocator)
        fmt.Println(serialized_schema)
    }
    

    255 255 255 255 248 0 0 0 16 0 0 0 0 0 10 0 12 0 10 0 9 0 4 0 10 0 0 0 16 0 0 0 0 1 4 0 8 0 8 0 0 0 4 0 8 0 0 0 4 0 0 0 3 0 0 0 148 0 0 0 60 0 0 0 4 0 0 0 136 255 255 255 16 0 0 0 24 0 0 0 0 0 0 3 24 0 0 0 0 0 0 0 0 0 6 0 8 0 6 0 6 0 0 0 0 0 1 0 5 0 0 0 118 97 108 117 101 0 0 0 188 255 255 255 16 0 0 0 24 0 0 0 0 0 0 10 36 0 0 0 0 0 0 0 8 0 12 0 10 0 4 0 8 0 0 0 8 0 0 0 0 0 1 0 3 0 0 0 85 84 67 0 9 0 0 0 116 105 109 101 115 116 97 109 112 0 0 0 16 0 20 0 16 0 0 0 15 0 8 0 0 0 4 0 16 0 0 0 16 0 0 0 24 0 0 0 0 0 0 2 28 0 0 0 0 0 0 0 8 0 12 0 8 0 7 0 8 0 0 0 0 0 0 1 32 0 0 0 3 0 0 0 116 105 100 0 255 255 255 255 0 0 0 0

    arrow arrow-flight bug help wanted 
    opened by skejserjensen 18
  • Use concurrency groups instead of the cancel workflow

    Is your feature request related to a problem or challenge? Please describe what you are trying to do.

    The cancel.yml workflow is no longer necessary, as GitHub has integrated this feature into GitHub Actions: https://docs.github.com/en/actions/using-jobs/using-concurrency#example-only-cancel-in-progress-jobs-or-runs-for-the-current-workflow

    Describe the solution you'd like

    Add concurrency groups to the separate workflows so that the cancel action used in cancel.yml as an external dependency can be removed. Example: https://github.com/apache/arrow/blob/master/.github/workflows/cpp.yml#L44-L46

    enhancement 
    opened by assignUser 0
  • Fix: Added support to cast string without time

    Which issue does this PR close?

    Closes #3492

    Rationale for this change

    Support casting strings like 2022-01-08

    What changes are included in this PR?

    arrow-rs/arrow-cast/src/parse.rs

    Are there any user-facing changes?

    No

    arrow 
    opened by csphile 0
  • Support casting strings like `'2001-01-01'` to timestamp

    Is your feature request related to a problem or challenge? Please describe what you are trying to do.

    We are trying to use '2001-01-01' as a timestamp (as an argument to the date_bin function in DataFusion).

    However, we get the following error

    Arrow error: Cast error: Error parsing '2001-01-01' as timestamp
    

    As a workaround we can add 00:00:00 to the end and that works:

    '2001-01-01 00:00:00'

    Describe the solution you'd like

    I would like '2001-01-01' to be parsed the same as '2001-01-01 00:00:00'

    Describe alternatives you've considered

    I can special case this downstream in datafusion

    Additional context

    I believe this can be achieved by adding the appropriate support to string_to_timestamp_nanos and adding a few tests:

    https://github.com/apache/arrow-rs/blob/c74665808439cb7020fb1cfb74b376a136c73259/arrow-cast/src/parse.rs#L23-L71
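
    For reference, the kind of fallback being suggested could look roughly like this (an illustrative sketch assuming a recent chrono, not the actual patch; the helper name is made up):

    use chrono::NaiveDate;

    /// Hypothetical helper: interpret a date-only string as midnight UTC on that date
    /// and return nanoseconds since the Unix epoch.
    fn date_only_to_timestamp_nanos(s: &str) -> Option<i64> {
        let date = NaiveDate::parse_from_str(s, "%Y-%m-%d").ok()?;
        date.and_hms_opt(0, 0, 0)?.and_utc().timestamp_nanos_opt()
    }

    fn main() {
        // 2001-01-01T00:00:00Z is 978307200 seconds after the Unix epoch.
        assert_eq!(
            date_only_to_timestamp_nanos("2001-01-01"),
            Some(978_307_200_000_000_000)
        );
    }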

    good first issue enhancement 
    opened by alamb 4
  • Fix negative interval prettyprint

    Which issue does this PR close?

    Related to https://github.com/apache/arrow-datafusion/issues/4220

    Rationale for this change

    Currently, the nanoseconds/milliseconds part can print its own minus sign when negative, which can double up with the sign of the seconds and is in any case incorrectly placed (after the decimal point). This fixes the output to carry only a single sign, before the seconds, when the interval is negative.
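
    Roughly, the intended behaviour can be sketched like this (a simplified illustration, not the code in this PR):

    // Simplified sketch: emit a single sign before the seconds, never on the fractional part.
    fn format_secs_nanos(total_nanos: i64) -> String {
        let sign = if total_nanos < 0 { "-" } else { "" };
        let abs = total_nanos.unsigned_abs();
        format!("{sign}{}.{:09}", abs / 1_000_000_000, abs % 1_000_000_000)
    }

    fn main() {
        // The bug described above could instead produce a second sign on the fractional part.
        assert_eq!(format_secs_nanos(-1_500_000_000), "-1.500000000");
    }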

    What changes are included in this PR?

    Are there any user-facing changes?

    arrow 
    opened by Jefffrey 0
  • Support GCP Workload Identity

    Is your feature request related to a problem or challenge? Please describe what you are trying to do.

    Currently the object_store crate only supports obtaining credentials using a provided service account; it would be beneficial if it could also optionally obtain credentials from its environment. This would be consistent with the behaviour of the aws and azure implementations, and avoids requiring users to handle sensitive long-term service account credentials.

    Describe the solution you'd like

    If no service account is specified, it should fall back to trying to get credentials from a metadata endpoint.

    This is documented here
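
    For reference, a rough sketch of such a fallback (the endpoint and Metadata-Flavor header come from GCP's metadata-server documentation; this is not the object_store implementation):

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct TokenResponse {
        access_token: String,
        expires_in: u64,
    }

    /// Fetch an access token from the GCE metadata server.
    /// This endpoint is only reachable from inside GCP (e.g. GCE, or GKE with Workload Identity).
    async fn metadata_token(client: &reqwest::Client) -> Result<TokenResponse, reqwest::Error> {
        client
            .get("http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token")
            .header("Metadata-Flavor", "Google")
            .send()
            .await?
            .error_for_status()?
            .json()
            .await
    }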

    Describe alternatives you've considered

    Additional context

    good first issue enhancement help wanted object-store 
    opened by tustvold 2
  • feat: Allow providing a service account key directly for GCS

    Which issue does this PR close?

    Closes https://github.com/apache/arrow-rs/issues/3488

    Rationale for this change

    Use case:

    We're storing service account keys externally to where the object store client is being created. We do not want to have to write the key to a file before creating the object store client. This change allows for providing the key directly.

    What changes are included in this PR?

    Adds an appropriate method to the GCS object store builder for supplying the service account key directly. Only one of service account path or service account key may be provided, otherwise build will return an appropriate error.

    Are there any user-facing changes?

    An additional method on GCS object store builder.

    There are currently no breaking changes, however I believe the ServiceAccount variant for the GoogleConfigKey should be renamed to ServiceAccountPath to better represent what that option is for. I held off on making that change because I saw that the changelog was already generated for 0.5.3 which includes the new GoogleConfigKey stuff, making that a breaking change. If that's an acceptable breaking change, I'm down to go ahead and do that in this PR as well.

    object-store 
    opened by scsmithr 2