Entropy Encoding notebook. Simple implementations of the "tANS" encoder/decoder.

Overview

EntropyEncoding Experiments

This repository contains my Entropy Encoding notebook. Entropy encoding is an efficient lossless data compression scheme.

Huffman coding is a well known technique that uses a path in a binary decision tree to represent a symbol in some alphabet. Symbols are encoded using a sequence of complete bits, because the path in the binary tree is made of decisions to go left or right.

According to information theory the optimal cost in bits for some symbol is "-log2(P(sym))", which is usually not a round number. For a symbol probability of 33%, we get -log2(1/3) = 1.58 bits, but in Huffman encoding we would need to send two bits. Huffman encoding is optimal if all of the symbol frequencies are a power of two, but this is not the typical case. Arithmetic coding allows us to save fractional bits of information and get close to Shannon's limit.

This is a simple example of an Arithmetic Encoder that can handle non-power-of-two probabilities for an alphabet of 3 characters, with equal symbol frequencies. This example uses a state variable to combine bits from different symbols to overcome the two bits per symbol barrier.

fn encode(state : &mut u64, sym : u64) {
    *state *= 3;
    *state += sym;
}

fn decode(state : &mut u64) -> u64 {
    let sym = *state % 3;
    *state /= 3;
    sym
}

#[test]
fn test_encode_base3() {
    let mut state = 0;
    encode(&mut state, 2);
    encode(&mut state, 1);
    encode(&mut state, 0);

    assert_eq!(decode(&mut state), 0);
    assert_eq!(decode(&mut state), 1);
    assert_eq!(decode(&mut state), 2);
}

Asymmetric numeral systems (ANS) is a relatively new approach that generalizes the example above. ANS uses a state machine to represent bit fractions, and uses an implementation that does not require expensive arithmetic operations.

I spent some time reading about asymmetric numeral systems (ANS) and wrote some code to help me learn the topic. In this repository I implemented two varients of the tANS encoder. The first implementation is a very simple encoder, and the second implementation is a simplified version of the efficient implementation by Yann Collet: FSE.

The code is structured like a research notebook with unit tests that drive the code. The two interesting files are simple.rs and fse.rs. The Simple file implements a basic encoder that is not very memory efficient. The FSE file implements a subset of the the advanced FSE encoder that Yann wrote. This encoder has many optimizations that makes the C code more difficult to follow. I tried to remove some of the optimizations to make the code more readable, and I also attached a diagram that explains the flow of information within the encoder (below).

This unit test is a good starting point for exploring the implementation of the encoder.

#[test]
fn test_round_trip_simple_encoder() {
    let text =  "entropy encoding is typically the last stage of a compression algorithm";
    let input: Vec<u8> = text.as_bytes().to_vec();

    // Define an encoder with 8bit symbols, and 12bit states.
    let mut enc = Simple::<256, 4096>::new();
    // Initialize the encoder based on the statistical properties of the input.
    enc.from_data(&input);
    // Encode the test.
    enc.encode_data(&input);
    // Print the compressed binary representation.
    enc.bitvector.dump();
    // Decode the data.
    let out = enc.decode_data();

    println!("Decoded {:?}", out);
    println!("Input length = {}", 8 * input.len());
    println!("Compressed length = {}", enc.bitvector.len());
    assert_eq!(out, input);
}

Program output:

running 1 test
[00000001111001111101010000110101
.10010110000100101010001000110110
.11011001000101000101110010000011
....
.11001100110]
Decoded [101, 110, 116, 114, 111, 112, 121, 32, 101, 110,  .... ]
Input length = 568
Compressed length = 427
test simple::test_round_trip_simple_encoder ... ok

Here are a few useful links that I used while learning the topic:

  • Yann's Blog post describing FSE: Link
  • Python notebook that implements FSE: Link
  • Excellent blog post that explains the encoding technique: Link.
  • Blog posts by Charles Bloom: Link.
  • Academic paper by Jarek Duda that: Link

This char shows the flow of information in the FSE encoder:

You might also like...
TLV-C encoding support.

TLV-C: Tag - Length - Value - Checksum TLV-C is a variant on the traditional [TLV] format that adds a whole mess of checksums and whatnot. Why, you as

A binary encoder / decoder implementation in Rust.
A binary encoder / decoder implementation in Rust.

Bincode A compact encoder / decoder pair that uses a binary zero-fluff encoding scheme. The size of the encoded object will be the same or smaller tha

A Rust encoder/decoder for Dominic Szablewski's QOI format for fast, lossless image compression.

QOI - The “Quite OK Image” format This is a Rust encoder and decoder for Dominic Szablewski's QOI format for fast, lossless image compression. See the

mico (minimalistic config file format) encoder and decoder

mico This library implements a parser and emitter for mico (minimalistic config file format). Format example: Name: mico Description: minimalistic con

A binary encoder / decoder implementation in Rust.

Bincode A compact encoder / decoder pair that uses a binary zero-fluff encoding scheme. The size of the encoded object will be the same or smaller tha

A basic rust QOI decoder/encoder

libqoi A basic rust QOI decoder/encoder. Why QOI QOI is a lossless image format with a one page specification. It can achieve better compression than

Fast encoder/decoder for the lossless DTM 16 bit image format

DTM Image Format Fast encoder/decoder for the DTM image format. The DTM image format is a 16-bit lossless image format supporting one to four channels

Free Rust-only Xbox ADPCM encoder and decoder

XbadPCM Safe (and optionally no-std) Rust crate for encoding and decoding Xbox ADPCM blocks. Decoding example Here is example code for decoding stereo

decoder (and encoder) for quaternions sent from a joycon. Based heavily on reverse engineering done by hexkyz.

joycon-quat decoder (and encoder) for quaternions sent from a joycon. Based heavily on reverse engineering done by hexkyz. for those who want to use i

A notebook app integrated with todo lists utility. Developed with Rust, WebAssembly, Yew and Trunk.

Flow.er A notebook app integrated with todo-list utility. Project flow.er is a Rust WASM app running in browser. Taking advantage of Yew and Trunk, it

Rust-battery - Rust crate providing cross-platform information about the notebook batteries.

battery Rust crate providing cross-platform information about the notebook batteries. Table of contents Overview Supported platforms Install Examples

A tool to deserialize data from an input encoding, transform it and serialize it back into an output encoding.

dts A simple tool to deserialize data from an input encoding, transform it and serialize it back into an output encoding. Requires rust = 1.56.0. Ins

Databento Binary Encoding (DBZ) - Fast message encoding and storage format for market data

dbz A library (dbz-lib) and CLI tool (dbz-cli) for working with Databento Binary Encoding (DBZ) files. Python bindings for dbz-lib are provided in the

Private swaps for Secret Network using a private entropy pool & differential privacy.

WIP SecretSwap: Anon Edition Private swaps for Secret Network! Uses private entropy pool for differential privacy when reporting pools sizes. Swap amo

Private swaps for Secret Network using a private entropy pool & differential privacy.

WIP SecretSwap: Anon Edition Private swaps for Secret Network! Uses private entropy pool for differential privacy when reporting pools sizes. Swap amo

A password entropy calculator.

paspio — pasvorta entropio A (naive) password entropy calculator. Refrain from using this as a sole measure of password strength, it should be used in

Efficient state-based CRDT replication and anti-entropy

Merkle Search Tree This crate implements a Merkle Search Tree as described in the 2019 paper Merkle Search Trees: Efficient State-Based CRDTs in Open

The fastest and safest AV1 encoder.

rav1e The fastest and safest AV1 encoder. Table of Content Overview Features Documentation Releases Building Dependency: NASM Release binary Unstable

A DHCP parser and encoder for DHCPv4/DHCPv6

dhcproto A DHCP parser and encoder for DHCPv4/DHCPv6. dhcproto aims to be a functionally complete DHCP implementation. Many common option types are im

Owner
Nadav Rotem
Nadav Rotem
Free Rust-only Xbox ADPCM encoder and decoder

XbadPCM Safe (and optionally no-std) Rust crate for encoding and decoding Xbox ADPCM blocks. Decoding example Here is example code for decoding stereo

Snowy 5 Nov 20, 2022
Implementation of Bencode encoding written in rust

Rust Bencode Implementation of Bencode encoding written in rust. Project Status Not in active developement due to lack of time and other priorities. I

Arjan Topolovec 32 Aug 6, 2022
Encoding and decoding support for BSON in Rust

bson-rs Encoding and decoding support for BSON in Rust Index Overview of BSON Format Usage BSON Values BSON Documents Modeling BSON with strongly type

mongodb 304 Dec 30, 2022
A Gecko-oriented implementation of the Encoding Standard in Rust

encoding_rs encoding_rs an implementation of the (non-JavaScript parts of) the Encoding Standard written in Rust and used in Gecko (starting with Fire

Henri Sivonen 284 Dec 13, 2022
Character encoding support for Rust

Encoding 0.3.0-dev Character encoding support for Rust. (also known as rust-encoding) It is based on WHATWG Encoding Standard, and also provides an ad

Kang Seonghoon 264 Dec 14, 2022
A HTML entity encoding library for Rust

A HTML entity encoding library for Rust Example usage All example assume a extern crate htmlescape; and use htmlescape::{relevant functions here}; is

Viktor Dahl 41 Nov 1, 2022
A TOML encoding/decoding library for Rust

toml-rs A TOML decoder and encoder for Rust. This library is currently compliant with the v0.5.0 version of TOML. This library will also likely contin

Alex Crichton 1k Dec 30, 2022
Variable-length signed and unsigned integer encoding that is byte-orderable for Rust

ordered-varint Provides variable-length signed and unsigned integer encoding that is byte-orderable. This crate provides the Variable trait which enco

Khonsu Labs 7 Dec 6, 2022
A series of compact encoding schemes for building small and fast parsers and serializers

A series of compact encoding schemes for building small and fast parsers and serializers

Manfred Kröhnert 2 Feb 5, 2022
Astro Format is a library for efficiently encoding and decoding a set of bytes into a single buffer format.

Astro Format is a library for efficiently transcoding arrays into a single buffer and native rust types into strings

Stelar Labs 1 Aug 13, 2022