Lossless compressor and decompressor for numerical data using quantiles

Martin

Last update: Dec 14, 2022

Related tags

Compression quantile-compression

Overview

Quantile Compression

This Rust library compresses and decompresses sequences of numerical data. It currently supports the following data types: i32, i64, u32, u64, f32, f64. Smaller data types like i16 can be compressed efficiently by casting to i32. Timestamp support may come in the future.
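As an illustration of the casting approach for smaller types, the sketch below widens i16 values to i32 before compression and narrows them back afterwards. The helper names `widen` and `narrow` are hypothetical and are not part of the library; only the standard library is used.

```rust
use std::convert::TryFrom;

/// Widen i16 values to i32 so they can be passed to an i32 compressor.
/// (Hypothetical helper, not part of the library.)
fn widen(values: &[i16]) -> Vec<i32> {
    values.iter().map(|&v| i32::from(v)).collect()
}

/// Narrow decompressed i32 values back to i16.
/// Returns None if any value does not fit in an i16.
fn narrow(values: &[i32]) -> Option<Vec<i16>> {
    values.iter().map(|&v| i16::try_from(v).ok()).collect()
}

fn main() {
    let original: Vec<i16> = vec![-32768, -1, 0, 7, 32767];
    let widened = widen(&original);
    // ... compress and decompress `widened` with an i32 compressor here ...
    let restored = narrow(&widened).expect("values fit in i16");
    assert_eq!(original, restored);
    println!("round trip ok for {} values", restored.len());
}
```

Widening is lossless, so the round trip recovers the original values exactly.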

On natural data, its output is typically 25-40% smaller than what gzip -9 produces; it compresses much faster and decompresses equally quickly.

The intended use case for this algorithm is compressing columnar data, especially for use by Spark and other execution engines.

This IS:

lossless
order-preserving
moderately fast

This is NOT:

lossy
for multisets
optimal for time series with high mutual information between consecutive elements

Usage

See the following basic usage. To run something right away, see the example.

use q_compress::{BitReader, I64Compressor, I64Decompressor};

fn main() {
  // your data
  let mut my_ints = Vec::new();
  for i in 0..100000 {
    my_ints.push(i as i64);
  }
  
  // compress
  // max_depth is basically compression level - 6 is generally good.
  // Max depths between 4 and 8 are reasonable.
  let max_depth = 6;
  let compressor = I64Compressor::train(&my_ints, max_depth).expect("failed to train");
  let bytes = compressor.compress(&my_ints).expect("out of range");
  println!("compressed down to {} bytes", bytes.len());
  
  // decompress
  let bit_reader = &mut BitReader::from(bytes);
  let decompressor = I64Decompressor::from_reader(bit_reader).expect("couldn't read compression scheme");
  let recovered = decompressor.decompress(bit_reader);
  println!("got back {} ints from {} to {}", recovered.len(), recovered[0], recovered.last().unwrap());
}

Method

This works by describing each number with a range and an offset. The range specifies an inclusive range [lower, upper] that the number might be in, and the offset specifies the exact position within that range. The compressor chooses a prefix for each range via Huffman codes.
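The range-plus-offset idea can be sketched as follows. This is a simplified illustration, not the library's implementation: the `Range` struct, the hand-assigned prefixes, the fixed-width offset, and the use of '0'/'1' strings instead of packed bits are all assumptions made for readability.

```rust
/// One inclusive range with a hypothetical, hand-assigned prefix code.
/// Bits are written as '0'/'1' characters for readability.
struct Range {
    lower: i64,
    upper: i64,
    prefix: &'static str,
}

/// Size of the range minus one, i.e. the largest possible offset.
fn span(r: &Range) -> u64 {
    r.upper.wrapping_sub(r.lower) as u64
}

/// Number of bits needed to express any offset in [0, upper - lower].
fn offset_bits(r: &Range) -> u32 {
    64 - span(r).leading_zeros()
}

/// Encode `n` as its range's prefix followed by a fixed-width offset.
fn encode(ranges: &[Range], n: i64) -> Option<String> {
    let r = ranges.iter().find(|r| r.lower <= n && n <= r.upper)?;
    let offset = n.wrapping_sub(r.lower) as u64;
    let mut out = String::from(r.prefix);
    for i in (0..offset_bits(r)).rev() {
        out.push(if (offset >> i) & 1 == 1 { '1' } else { '0' });
    }
    Some(out)
}

/// Decode a single number from exactly one encoded bit string.
fn decode(ranges: &[Range], bits: &str) -> Option<i64> {
    // The prefixes are prefix-free, so at most one range matches.
    let r = ranges.iter().find(|r| bits.starts_with(r.prefix))?;
    let rest = &bits[r.prefix.len()..];
    if rest.len() != offset_bits(r) as usize {
        return None;
    }
    let offset = if rest.is_empty() { 0 } else { u64::from_str_radix(rest, 2).ok()? };
    if offset > span(r) {
        return None;
    }
    Some(r.lower.wrapping_add(offset as i64))
}

/// Three example ranges; the more common range gets the shorter prefix.
fn example_ranges() -> Vec<Range> {
    vec![
        Range { lower: 0, upper: 7, prefix: "0" },
        Range { lower: 8, upper: 1023, prefix: "10" },
        Range { lower: 1024, upper: 1_000_000, prefix: "11" },
    ]
}

fn main() {
    let ranges = example_ranges();
    let bits = encode(&ranges, 5).expect("5 is covered by a range");
    println!("5 -> {} -> {:?}", bits, decode(&ranges, &bits));
}
```

In this sketch, numbers in a narrow, frequent range cost few bits (a short prefix and a short offset), while numbers in a wide, rare range cost more.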

For data sampled from a random distribution, this compression algorithm can reduce byte size to near the theoretical limit of the distribution's Shannon entropy. Ideally, it encodes a number k in b bits where 2^-b ~= P(k). Plotting Q(k) = 2^-b against P(k) shows how close quantile compression gets to the ideal, in this example with max_depth=3:

[Figure: Q(k) = 2^-b compared with the true distribution P(k), max_depth=3]

The inefficiency of quantile compression in bits per number is the KL divergence from the approximated distribution Q to the true distribution P.
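To make the relationship between entropy, code length, and KL divergence concrete, here is a small numeric sketch. The four-symbol distribution and the code lengths are made up for illustration, and the function names are hypothetical; Q is taken to be 2^-b for each code length b, as in the text.

```rust
/// Shannon entropy of P in bits per number.
fn entropy_bits(p: &[f64]) -> f64 {
    let mut total = 0.0;
    for &pk in p {
        if pk > 0.0 {
            total -= pk * pk.log2();
        }
    }
    total
}

/// Expected code length in bits per number when symbol k costs code_lens[k] bits.
fn expected_code_bits(p: &[f64], code_lens: &[u32]) -> f64 {
    p.iter().zip(code_lens).map(|(&pk, &b)| pk * b as f64).sum()
}

/// KL divergence D(P || Q) in bits, where Q(k) = 2^-b for code length b.
/// Since log2 Q(k) = -b, each term is P(k) * (log2 P(k) + b).
fn kl_divergence_bits(p: &[f64], code_lens: &[u32]) -> f64 {
    let mut total = 0.0;
    for (&pk, &b) in p.iter().zip(code_lens) {
        if pk > 0.0 {
            total += pk * (pk.log2() + b as f64);
        }
    }
    total
}

fn main() {
    // Made-up distribution over four values.
    let p = [0.5, 0.25, 0.125, 0.125];
    // A flat 2-bit code, i.e. Q(k) = 1/4 for every value.
    let flat = [2u32, 2, 2, 2];
    let h = entropy_bits(&p);
    let cost = expected_code_bits(&p, &flat);
    let kl = kl_divergence_bits(&p, &flat);
    // The expected cost equals the entropy plus the KL divergence.
    println!("entropy = {h}, cost = {cost}, kl = {kl}");
}
```

For this example the entropy is 1.75 bits, the flat code costs 2 bits per number, and the difference of 0.25 bits is the KL divergence. Code lengths of 1, 2, 3, and 3 bits would match P exactly and bring the divergence to zero.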

`.qco` File Format

Quantile-compressed files consist of a lightweight header (usually <1KB) followed by many short number blocks, each of which usually encodes a single number.

The header is expected to start with a magic sequence of 4 bytes for "qco!" in ASCII. The next byte encodes the data type (e.g. i64). The next few bytes encode the count of numbers in the file, and then the count of ranges (or prefixes) used to compress. It then contains metadata for each range: lower and upper bound, the length of its prefix in bits, and the prefix itself. The next bit for each range encodes whether it uses repetitions, which are only used for the most common range in a sparse distribution. If so, the file then encodes a "jumpstart", which is used in number blocks to describe how many repetitions of the range to use.
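The first two header fields can be checked with a few lines of code. The sketch below validates the magic bytes and returns the raw data type byte along with the remaining bytes. It stops there on purpose: the byte-to-type mapping and the widths of the later fields are not given in this description, so they are not guessed. The names `read_header_start` and `HeaderError` are hypothetical.

```rust
/// The 4-byte ASCII magic sequence described above.
const MAGIC: &[u8; 4] = b"qco!";

#[derive(Debug, PartialEq)]
enum HeaderError {
    TooShort,
    BadMagic,
}

/// Validate the magic bytes and return (data type byte, remaining bytes).
/// The meaning of the data type byte is left uninterpreted.
fn read_header_start(bytes: &[u8]) -> Result<(u8, &[u8]), HeaderError> {
    if bytes.len() < MAGIC.len() + 1 {
        return Err(HeaderError::TooShort);
    }
    if !bytes.starts_with(MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    Ok((bytes[4], &bytes[5..]))
}

fn main() {
    // A made-up byte sequence: magic, one data type byte, then further data.
    let mut file = b"qco!".to_vec();
    file.extend_from_slice(&[3, 0xAB, 0xCD]);
    match read_header_start(&file) {
        Ok((dtype, rest)) => println!("data type byte {}, {} bytes remain", dtype, rest.len()),
        Err(e) => println!("not a .qco file: {:?}", e),
    }
}
```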

Each number block has just 2 or 3 parts. First comes the prefix, which indicates a range to use. If that range uses repetitions, a varint for the exact number of repetitions follows, leveraging the jumpstart from earlier. Then an offset (for each repetition if necessary) follows, specifying the exact value within the range.

Comments

Rework BitReader to use byte slices

Currently BitReader requires callers to give ownership of the data. Since it actually never needs to have ownership, rework this so that BitReader instead stores a reference to a byte slice. This allows callers not having to do unnecessary copies of the raw data.

Note that this changes the public API (I haven't updated the changelog accordingly).
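For readers unfamiliar with the borrowing approach described in this comment, here is a minimal sketch of a bit reader that holds a reference to a byte slice instead of owning the data. It is not the crate's BitReader: the name `SliceBitReader`, its methods, and the most-significant-bit-first order are assumptions made for illustration.

```rust
/// A hypothetical bit reader that borrows its input rather than owning it.
struct SliceBitReader<'a> {
    bytes: &'a [u8],
    bit_pos: usize,
}

impl<'a> SliceBitReader<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        SliceBitReader { bytes, bit_pos: 0 }
    }

    /// Read one bit, most significant bit of each byte first (assumed order).
    fn read_bit(&mut self) -> Option<bool> {
        let byte = *self.bytes.get(self.bit_pos / 8)?;
        let shift = 7 - (self.bit_pos % 8);
        self.bit_pos += 1;
        Some((byte >> shift) & 1 == 1)
    }

    /// Read `n` bits (at most 64) as an unsigned integer.
    /// Returns None, without consuming anything, if too few bits remain.
    fn read_bits(&mut self, n: u32) -> Option<u64> {
        if n > 64 || self.bit_pos + n as usize > self.bytes.len() * 8 {
            return None;
        }
        let mut value = 0u64;
        for _ in 0..n {
            value = (value << 1) | self.read_bit()? as u64;
        }
        Some(value)
    }
}

fn main() {
    let data = vec![0b1010_0000u8, 0xFF];
    let mut reader = SliceBitReader::new(&data);
    println!("{:?}", reader.read_bits(3));
    // The caller still owns `data`; no copy was made for the reader.
    println!("{} bytes still owned by the caller", data.len());
}
```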

opened by mcuelenaere 7

Wrapped Format Help

I have a use case for wrapped formats and would like some help, since the example is not clear to me.

I have a structure like the following

#[derive(Default, Debug, Clone, PartialEq)]
pub struct Timeseries {
    pub metric_name: MetricName,
    pub values: Vec<f64>,
    pub timestamps: Arc<Vec<i64>>, // Arc used instead of Rc since Rc is !Send
}

In certain cases, I have a Vec<Timeseries> which cover exactly the same timestamp range (note the Arc), and I'd like to serialize it as efficiently as possible. In practice, I'd like to serialize the timestamps once (delta level 2), then iteratively compress each value chunk.

Any pointers?

opened by ccollie 5

Add boolean support

Support for 8-bit (1 byte) booleans via a new q_compress::types::boolean8::B8DataType type.

Some references:

https://users.rust-lang.org/t/solved-what-is-byte-value-for-boolean/47347

https://doc.rust-lang.org/std/primitive.bool.html
opened by Kommandat 3

C++ API

Hi,

I have a time series data set made up of int16_t, char, double, int32_t, and int64_t values. The int64_t is a timestamp in nanoseconds. We are planning to migrate to zstd, as that seemed to be the best choice. I wanted to try Qcompress as well. Is it production ready for large data (multiple TBs read and written every day)? Also, the application is in C++, so is a corresponding library available?

opened by alphanso 1

Streaming Decompression Bug

It seems the decompressors sometimes get into unhealthy states when reaching the end of the available compressed data, causing either panics or silent decompression errors. The auto_decompress, simple_decompress, and chunked decompression APIs are unaffected.

opened by mwlon 0

Small data compression performance

Reduced number of prefixes used for small data. Default compression ratio should balance compression time vs compressed size for all n (defaulting to 256 was especially slow in the 100 < n < 16k range). E.g. in the 2^10 <= n < 2^12 range this uses up to 64 prefixes by default.

Improved prefix optimization speed by about 30%.

Added a new dataset of decimal-valued floats.

opened by mwlon 0

Support `u8`/`i8`?

The code supports all Rust native numeric types except u8/i8 right now. I don't have a use case for these yet, but I can imagine someone having one in the future; this is a placeholder issue for that situation.

Downsides of adding this support are a bit more code with a bit more maintenance, and perhaps a bit more confusion about what this compressor is intended for — not for data that isn't in some sense "time series".

opened by BartMassey 1

Owner

Martin

GitHub

A utility that can download JavaScript and TypeScript module graphs and store them locally in a special zip file.

eszip A utility that can download JavaScript and TypeScript module graphs and store them locally in a special zip file. To create a new archive: > esz

162 Dec 24, 2022

Basic (and naïve) LZW and Huffman compression algorithms in Rust.

Naive implementation of the LZW and Huffman compression algorithms. To run, install the Rust toolchain. Cargo may be used to compile the source. Examp

9 May 22, 2023

A Brotli implementation in pure and safe Rust

Brotli-rs - Brotli decompression in pure, safe Rust Documentation Compression provides a <Read>-struct to wrap a Brotli-compressed stream. A consumer

59 Oct 7, 2022

SIMD Floating point and integer compressed vector library

compressed_vec Floating point and integer compressed vector library, SIMD-enabled for fast processing/iteration over compressed representations. This

56 Nov 24, 2022

DEFLATE, gzip, and zlib bindings for Rust

flate2 A streaming compression/decompression library DEFLATE-based streams in Rust. This crate by default uses the miniz_oxide crate, a port of miniz.

619 Jan 8, 2023

Like pigz, but rust - a cross platform, fast, compression and decompression tool.

crabz Like pigz, but rust. A cross platform, fast, compression and decompression tool. Synopsis This is currently a proof of concept CLI tool using

232 Jan 2, 2023

Convenience library for reading and writing compressed files/streams

compress_io Convenience library for reading and writing compressed files/streams The aim of compress_io is to make it simple for an application to sup

0 Dec 16, 2021

A simple rust library to read and write Zip archives, which is also my pet project for learning Rust

rust-zip A simple rust library to read and write Zip archives, which is also my pet project for learning Rust. At the moment you can list the files in

2 Jan 5, 2022

A Rust application that compress files and folders

Quick Storer This is a Rust application that compress files and folders. Usage Download or build the binary and place it on your desktop, or any other

1 Feb 2, 2022

Obvious Unified Compression Helper is a CLI tool to help you compress and decompress files of several formats

Ouch! ouch stands for Obvious Unified Compression Helper and is a CLI tool to help you compress and decompress files of several formats. Features Usag

734 Dec 30, 2022

Brotli compressor and decompressor written in rust that optionally avoids the stdlib

rust-brotli What's new in 3.2 into_inner conversions for both Reader and Writer classes What's new in 3.0 A fully compatible FFI for drop-in compatibi

659 Dec 29, 2022

A quasi-lossless Balkanoidal meta-lingual compressor.

A quasi-lossless Balkanoidal meta-lingual compressor. Background It has long been accepted that Serbian is a compact variant of Russian, with less lib

3 Aug 17, 2022

Easy c̵̰͠r̵̛̠ö̴̪s̶̩̒s̵̭̀-t̶̲͝h̶̯̚r̵̺͐e̷̖̽ḁ̴̍d̶̖̔ ȓ̵͙ė̶͎ḟ̴͙e̸̖͛r̶̖͗ë̶̱́ṉ̵̒ĉ̷̥e̷͚̍ s̷̹͌h̷̲̉a̵̭͋r̷̫̊ḭ̵̊n̷̬͂g̵̦̃ f̶̻̊ơ̵̜ṟ̸̈́ R̵̞̋ù̵̺s̷̖̅ţ̸͗!̸̼͋

Rust S̵̓i̸̓n̵̉ I̴n̴f̶e̸r̵n̷a̴l mutability! Howdy, friendly Rust developer! Ever had a value get m̵̯̅ð̶͊v̴̮̾ê̴̼͘d away right under your nose just when

294 Dec 23, 2022

A rustc plugin to check for numerical instability

Herbie lint for Rust What This plugin can add warnings or errors to your crate when using a numerically unstable floating point expression. Quick exam

172 Oct 31, 2022

A library and application for lossless, format-preserving, two-pass optimization and repair of Vorbis data, reducing its size without altering any audio information.

OptiVorbis A library and application for lossless, format-preserving, two-pass optimization and repair of Vorbis data, reducing its size without alter

27 Jan 3, 2023

A Rust encoder/decoder for Dominic Szablewski's QOI format for fast, lossless image compression.

Oxipng - a multithreaded lossless PNG compression optimizer

Fast encoder/decoder for the lossless DTM 16 bit image format

A Rust no-std (de)compressor based on PAQ

The H3 Compressor: A compression scheme tailored for H3 cell indexes.
