# Parquet2
This is a re-write of the official parquet crate with performance, parallelism and safety in mind.

The main differentiators in comparison with parquet are:
- does not use `unsafe`
- delegates parallelism downstream
- decouples reading (IO intensive) from computing (CPU intensive)
- delegates decompressing and decoding batches downstream
- is faster (10-20x when reading to arrow format)
- is integration-tested against pyarrow 3 and (py)spark 3
The overall idea is to offer the ability to read compressed parquet pages together with a toolkit to decompress them into the consumer's favourite in-memory format. This allows this crate's iterators to perform minimal CPU work, thereby maximizing throughput. It is up to consumers to decide whether they want to take advantage of this through parallelism at the expense of memory usage (e.g. decompress and deserialize pages in threads) or not.
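A minimal, library-agnostic sketch of this split (the `CompressedPage` alias and both sides of the channel below are placeholders, not parquet2 APIs): the IO side feeds a bounded channel while another thread does the CPU-heavy work, and the channel bound caps how many compressed pages sit in memory at once.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Placeholder for a compressed parquet page; in parquet2 this would be the item
// yielded by the page iterator.
type CompressedPage = Vec<u8>;

fn main() {
    // Bounded channel: at most 4 compressed pages are buffered, bounding memory usage.
    let (sender, receiver) = sync_channel::<CompressedPage>(4);

    // CPU side: decompressing/decoding/deserializing happens in a worker thread.
    let worker = thread::spawn(move || {
        let mut total = 0usize;
        for page in receiver {
            total += page.len(); // stand-in for decompress + decode + deserialize
        }
        total
    });

    // IO side: the "reader" only produces (compressed) pages and hands them over.
    for i in 0..10u8 {
        let page: CompressedPage = vec![i; 1024]; // stand-in for a page read from disk
        sender.send(page).unwrap();
    }
    drop(sender); // close the channel so the worker terminates

    println!("worker processed {} bytes", worker.join().unwrap());
}
```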
## Functionality implemented
- Read dictionary pages
- Read and write V1 pages
- Read and write V2 pages
- Compression and de-compression (all)
## Functionality not (yet) implemented

The parquet format has multiple encoding strategies for its different physical types. This crate currently decodes almost all of them and encodes to only a subset of them.

### Supported decoding

Delta-encodings are still experimental, as I have been unable to generate large pages encoded with them from Spark, thereby hindering robust integration tests.

### Encoding
## Organization

- `read`: read metadata and pages
- `write`: write metadata and pages
- `metadata`: parquet files metadata (e.g. `FileMetaData`)
- `schema`: types metadata declaration (e.g. `ConvertedType`)
- `types`: physical type declaration (i.e. how things are represented in memory). So far unused.
- `compression`: compression (e.g. Gzip)
- `error`: errors declaration
- `serialization`: convert from bytes to Rust native typed buffers (`Vec<Option<T>>`).
Note that `serialization` is not very robust; it serves as a playground to understand the specification and how to serialize and deserialize pages.
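To make the `Vec<Option<T>>` target concrete: for a flat (non-nested) optional column, deserialization essentially zips decoded definition levels with decoded values, where a definition level equal to the maximum (1 here) means the value is present. The function below is an illustrative sketch, not the `serialization` module's actual code.

```rust
// Illustrative only: combine definition levels with decoded values into a
// `Vec<Option<T>>` for a flat optional column (maximum definition level = 1).
fn to_optional_vec<T>(def_levels: &[u32], mut values: impl Iterator<Item = T>) -> Vec<Option<T>> {
    def_levels
        .iter()
        .map(|&level| if level == 1 { values.next() } else { None })
        .collect()
}

fn main() {
    let def_levels = [1, 0, 1, 1, 0];
    let values = vec![10i32, 20, 30]; // only the non-null slots carry values
    let column = to_optional_vec(&def_levels, values.into_iter());
    assert_eq!(column, vec![Some(10), None, Some(20), Some(30), None]);
}
```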
## Run unit tests

There are tests run against parquet files generated by pyarrow. To ignore these, use `PARQUET2_IGNORE_PYARROW_TESTS= cargo test`. To run them, you will first need to run

```bash
python3 -m venv venv
venv/bin/pip install pip --upgrade
venv/bin/pip install pyarrow==3
venv/bin/python integration/write_pyarrow.py
```

This is only needed once (and again after any change to `integration/write_pyarrow.py`).
## How to use

```rust
use std::fs::File;
use parquet2::read::{Page, read_metadata, get_page_iterator};

let mut file = File::open("testing/parquet-testing/data/alltypes_plain.parquet").unwrap();

// here we read the metadata.
let metadata = read_metadata(&mut file)?;

// Here we get an iterator of pages (each page has its own data).
// This can be heavily parallelized; not even the same `file` is needed here...
// feel free to wrap `metadata` in an `Arc`.
let row_group = 0;
let column = 0;
let mut iter = get_page_iterator(&metadata, row_group, column, &mut file)?;

// A page. It is just (compressed) bytes at this point.
let page = iter.next().unwrap().unwrap();
println!("{:#?}", page);

// From here, we can do different things. One of them is to convert its buffers to native Rust.
// This consumes the page.
use parquet2::serialization::native::page_to_array;
let array = page_to_array(page, &descriptor).unwrap();
```
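Note that `descriptor` in the last line is not defined in the snippet: it is the column's descriptor (its physical type plus definition/repetition levels), obtained from the file metadata for the same row group and column. The exact accessors may differ between versions, so treat the line below as a sketch rather than the definitive call:

```rust
// Sketch (accessor names may differ between parquet2 versions): the descriptor
// is stored alongside the column chunk's metadata of the chosen row group.
let descriptor = metadata.row_groups[row_group].columns()[column].descriptor().clone();
```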
## How to implement page readers
In general, the in-memory format used to consume parquet pages strongly influences how the pages should be deserialized. As such, this crate does not commit to a particular in-memory format. Consumers are responsible for converting pages to their target in-memory format.
There is an implementation that uses the arrow format here.
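For contrast with the `Vec<Option<T>>` sketch above, the example below (purely illustrative, not part of parquet2 or arrow) deserializes the same decoded data into an Arrow-like layout of densely packed values plus a validity vector, showing how the target in-memory format changes what the deserialization step has to do.

```rust
// Illustrative consumer-side target: densely packed values plus a validity vector.
struct NullableColumn<T> {
    values: Vec<T>,      // one slot per row; a default value where the row is null
    validity: Vec<bool>, // true when the row is non-null
}

// Unlike `Vec<Option<T>>`, nulls still occupy a dense slot in `values`.
fn deserialize_flat<T: Default>(
    def_levels: &[u32],
    mut values: impl Iterator<Item = T>,
) -> NullableColumn<T> {
    let mut out = NullableColumn { values: Vec::new(), validity: Vec::new() };
    for &level in def_levels {
        if level == 1 {
            out.values.push(values.next().expect("a value for every non-null slot"));
            out.validity.push(true);
        } else {
            out.values.push(T::default());
            out.validity.push(false);
        }
    }
    out
}

fn main() {
    let column = deserialize_flat(&[1, 0, 1], vec![7i64, 9].into_iter());
    assert_eq!(column.values, vec![7, 0, 9]);
    assert_eq!(column.validity, vec![true, false, true]);
}
```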
## Higher Parallelism

The function above creates an iterator, `iter`, over the pages of one column of a row group. In Arrow, a row group corresponds to a `RecordBatch`, divided in Parquet pages. Converting a page into an in-memory format is typically expensive, so it is worth considering how to distribute that work across threads. E.g.:
```rust
let mut handles = vec![];
for column in columns {
    let compressed_pages = get_page_iterator(&metadata, row_group, column, &mut file)?
        .collect::<Result<Vec<_>, _>>()?;
    // Each compressed page owns a buffer; cloning it would be expensive(!). We move the
    // pages into the thread so that their memory is released at the end of the processing.
    handles.push(thread::spawn(move || {
        page_iter_to_array(compressed_pages.into_iter())
    }));
}
let columns_from_all_groups: Vec<_> = handles
    .into_iter()
    .map(|handle| handle.join().unwrap())
    .collect();
```
This will read the file as quickly as possible on the main thread and send the CPU-intensive work to other threads, thereby maximizing IO read throughput (at the cost of storing multiple compressed pages in memory; buffering is also an option here).
## Decoding flow
Generally, a parquet file is read as follows:
- Read metadata
- Seek a row group and column
- Iterate over the (compressed) pages within that (row group, column)

This is IO-intensive: it requires parsing thrift and seeking within the file.
Once a compressed page is loaded into memory, it can be decompressed, decoded and deserialized into a specific in-memory format. All of these operations are CPU-intensive and are thus left to consumers to perform, as they may want to send this work to threads.
```
read -> compressed page -> decompressed page -> decoded bytes -> deserialized
```
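The flow above can be pictured as a chain of small functions. The types and bodies below are illustrative placeholders, not parquet2 APIs; only the first stage touches the file (IO), everything else is CPU work that can be moved to other threads.

```rust
// Illustrative stage signatures for the flow above; all types are placeholders.
struct CompressedPage(Vec<u8>);
struct DecompressedPage(Vec<u8>);
struct DecodedValues(Vec<i32>);  // e.g. PLAIN-decoded physical values
struct Column(Vec<Option<i32>>); // the consumer's in-memory format

fn read_page() -> CompressedPage {
    // IO: seek to the page and read its (compressed) bytes
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&1i32.to_le_bytes());
    bytes.extend_from_slice(&2i32.to_le_bytes());
    CompressedPage(bytes)
}

fn decompress(page: CompressedPage) -> DecompressedPage {
    // CPU: e.g. snappy/gzip; here the "compression" is the identity
    DecompressedPage(page.0)
}

fn decode(page: DecompressedPage) -> DecodedValues {
    // CPU: e.g. PLAIN-encoded i32s are little-endian 4-byte values
    DecodedValues(
        page.0
            .chunks_exact(4)
            .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

fn deserialize(values: DecodedValues) -> Column {
    // CPU: build the target in-memory representation (no nulls in this toy example)
    Column(values.0.into_iter().map(Some).collect())
}

fn main() {
    let column = deserialize(decode(decompress(read_page())));
    assert_eq!(column.0, vec![Some(1), Some(2)]);
}
```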
## License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
## Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.