Entropy Encoding notebook. Simple implementations of the "tANS" encoder/decoder.

Nadav Rotem

Last update: Dec 21, 2022

EntropyEncoding Experiments

This repository contains my Entropy Encoding notebook. Entropy encoding is an efficient lossless data compression scheme.

Huffman coding is a well-known technique that represents each symbol of an alphabet as a path in a binary decision tree. Each symbol is encoded with a whole number of bits, because the path is a sequence of decisions to go left or right.

According to information theory, the optimal cost in bits of a symbol is "-log2(P(sym))", which is usually not a whole number. For a symbol probability of 1/3 (about 33%), we get -log2(1/3) = 1.58 bits, but a Huffman code has to round this to a whole number of bits, and would typically send two. Huffman coding is optimal only when every symbol probability is a negative power of two (1/2, 1/4, 1/8, ...), which is not the typical case. Arithmetic coding allows us to save fractional bits of information and get close to Shannon's limit.
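
As a quick check of these numbers, the sketch below computes the information content of a symbol and compares it with the average length of a Huffman code for three equally likely symbols. The function names are made up for illustration, and the code lengths (1, 2, 2) are one valid Huffman assignment for such an alphabet.

```rust
/// Optimal cost, in bits, of a symbol with probability `p`.
fn optimal_bits(p: f64) -> f64 {
    -p.log2()
}

/// Average bits per symbol for code lengths `lens` under probabilities `probs`.
fn average_length(probs: &[f64], lens: &[u32]) -> f64 {
    probs.iter().zip(lens).map(|(p, l)| p * f64::from(*l)).sum()
}

#[test]
fn test_fractional_bits() {
    let p = 1.0 / 3.0;
    // -log2(1/3) is about 1.585 bits.
    assert!((optimal_bits(p) - 1.585).abs() < 1e-3);
    // Code lengths 1, 2 and 2 average to 5/3 bits per symbol,
    // which is above the optimal cost.
    let avg = average_length(&[p, p, p], &[1, 2, 2]);
    assert!((avg - 5.0 / 3.0).abs() < 1e-9);
    assert!(avg > optimal_bits(p));
}
```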

This is a simple example of an arithmetic-style encoder that handles non-power-of-two probabilities for an alphabet of 3 symbols with equal frequencies. It uses a state variable to combine bits from different symbols, which lets it get below the two-bits-per-symbol barrier.

fn encode(state : &mut u64, sym : u64) {
    *state *= 3;
    *state += sym;
}

fn decode(state : &mut u64) -> u64 {
    let sym = *state % 3;
    *state /= 3;
    sym
}

#[test]
fn test_encode_base3() {
    let mut state = 0;
    encode(&mut state, 2);
    encode(&mut state, 1);
    encode(&mut state, 0);

    assert_eq!(decode(&mut state), 0);
    assert_eq!(decode(&mut state), 1);
    assert_eq!(decode(&mut state), 2);
}
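
To make the fractional-bit saving concrete, the sketch below packs 40 base-3 symbols into a single 64-bit state, which works out to 1.6 bits per symbol instead of 2. The helpers `push_trit` and `pop_trit` are hypothetical names that repeat the `encode` and `decode` operations above so the snippet compiles on its own; the input data is arbitrary.

```rust
/// Same operation as `encode` above: append one base-3 digit to the state.
fn push_trit(state: &mut u64, sym: u64) {
    *state = *state * 3 + sym;
}

/// Same operation as `decode` above: remove the most recently added digit.
fn pop_trit(state: &mut u64) -> u64 {
    let sym = *state % 3;
    *state /= 3;
    sym
}

#[test]
fn test_forty_symbols_fit_in_64_bits() {
    // 3^40 fits in a u64 but 3^41 does not, so 40 symbols is the limit.
    assert!(3u64.checked_pow(40).is_some());
    assert!(3u64.checked_pow(41).is_none());

    // Arbitrary test input made of the symbols 0, 1 and 2.
    let input: Vec<u64> = (0..40).map(|i| (i * 7 + 2) % 3).collect();
    let mut state = 0u64;
    for &sym in &input {
        push_trit(&mut state, sym);
    }
    // Symbols come back in reverse order, like a stack.
    let mut output: Vec<u64> = (0..40).map(|_| pop_trit(&mut state)).collect();
    output.reverse();
    assert_eq!(output, input);
}
```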

Asymmetric numeral systems (ANS) is a relatively new approach that generalizes the example above. ANS uses a state machine to represent fractional bits, and its implementation does not require expensive arithmetic operations.
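
One way to see the generalization is to let the symbols have unequal frequencies. The sketch below uses the range variant of ANS (rANS) rather than the table-based tANS that this repository implements, because its update rule fits in one line. The frequency table, the function names, and the lack of renormalization (the whole message has to fit in one u64) are simplifying assumptions for illustration.

```rust
// Hypothetical frequencies for a 3-symbol alphabet. They sum to M.
const FREQ: [u64; 3] = [1, 2, 5];
// Cumulative frequencies: CUM[s] is the sum of FREQ[0..s].
const CUM: [u64; 3] = [0, 1, 3];
const M: u64 = 8;

/// Push `sym` onto the state. Frequent symbols grow the state less.
fn rans_encode(state: &mut u64, sym: usize) {
    let (f, c) = (FREQ[sym], CUM[sym]);
    *state = (*state / f) * M + c + (*state % f);
}

/// Pop the most recently pushed symbol. Exact inverse of `rans_encode`.
fn rans_decode(state: &mut u64) -> usize {
    let slot = *state % M;
    // The symbol is the last one whose cumulative range starts at or below `slot`.
    let sym = (0..3).rfind(|&s| CUM[s] <= slot).unwrap();
    *state = FREQ[sym] * (*state / M) + slot - CUM[sym];
    sym
}

#[test]
fn test_rans_round_trip() {
    let input = [2, 2, 1, 2, 0, 2, 2, 1, 2, 2, 2, 1];
    let mut state = 0u64;
    for &sym in &input {
        rans_encode(&mut state, sym);
    }
    // Decoding returns the symbols in reverse order.
    let mut output: Vec<usize> = input.iter().map(|_| rans_decode(&mut state)).collect();
    output.reverse();
    assert_eq!(output, input);

    // Twenty copies of the most frequent symbol stay well under the
    // 40 bits that a fixed 2 bits per symbol would cost.
    let mut state = 0u64;
    for _ in 0..20 {
        rans_encode(&mut state, 2);
    }
    assert!(state < (1 << 20));
}
```

With FREQ set to [1, 1, 1] and M set to 3, the update rule reduces to the base-3 example above.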

I spent some time reading about asymmetric numeral systems (ANS) and wrote some code to help me learn the topic. In this repository I implemented two variants of the tANS encoder. The first implementation is a very simple encoder, and the second is a simplified version of FSE, the efficient implementation by Yann Collet.

The code is structured like a research notebook with unit tests that drive the code. The two interesting files are simple.rs and fse.rs. The Simple file implements a basic encoder that is not very memory efficient. The FSE file implements a subset of the advanced FSE encoder that Yann wrote. That encoder has many optimizations that make the C code more difficult to follow. I tried to remove some of the optimizations to make the code more readable, and I also attached a diagram that explains the flow of information within the encoder (below).
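
For readers who have not seen a table-based ANS coder before, here is a minimal sketch of the idea with only 8 states. It is not the code from simple.rs or fse.rs: the table size, the frequencies, the symbol spread, and all names are assumptions chosen to keep the example short, and the bitstream is a plain `Vec<bool>` used as a stack.

```rust
const L: usize = 8; // number of states; a state is a value in [L, 2L)
const FREQ: [usize; 3] = [1, 2, 5]; // hypothetical counts, summing to L

struct Tables {
    /// decode[x - L] = (symbol, sub-state in [FREQ[s], 2 * FREQ[s]))
    decode: Vec<(usize, usize)>,
    /// encode[s][sub - FREQ[s]] = state in [L, 2L)
    encode: Vec<Vec<usize>>,
}

fn build_tables() -> Tables {
    // Spread the symbols over the L slots. A step of 5 is coprime with 8,
    // so every slot is visited exactly once.
    let mut spread = [0usize; L];
    let mut pos = 0;
    for (sym, &f) in FREQ.iter().enumerate() {
        for _ in 0..f {
            spread[pos] = sym;
            pos = (pos + 5) % L;
        }
    }
    // Number the occurrences of each symbol FREQ[s], FREQ[s] + 1, ...
    let mut next = FREQ;
    let mut decode = Vec::with_capacity(L);
    let mut encode: Vec<Vec<usize>> = FREQ.iter().map(|&f| vec![0; f]).collect();
    for (slot, &sym) in spread.iter().enumerate() {
        let sub = next[sym];
        next[sym] += 1;
        decode.push((sym, sub));
        encode[sym][sub - FREQ[sym]] = L + slot;
    }
    Tables { decode, encode }
}

fn tans_encode(t: &Tables, state: &mut usize, sym: usize, bits: &mut Vec<bool>) {
    let mut x = *state;
    // Emit low bits until x falls in the sub-state range of `sym`.
    while x >= 2 * FREQ[sym] {
        bits.push(x & 1 == 1);
        x >>= 1;
    }
    *state = t.encode[sym][x - FREQ[sym]];
}

fn tans_decode(t: &Tables, state: &mut usize, bits: &mut Vec<bool>) -> usize {
    let (sym, mut x) = t.decode[*state - L];
    // Read bits back (last written first) until x is a full state again.
    while x < L {
        x = (x << 1) | (bits.pop().expect("bitstream underflow") as usize);
    }
    *state = x;
    sym
}

#[test]
fn test_tiny_tans_round_trip() {
    let t = build_tables();
    let input = [2, 2, 1, 2, 0, 2, 2, 1, 2, 2];
    let (mut state, mut bits) = (L, Vec::new());
    for &sym in &input {
        tans_encode(&t, &mut state, sym, &mut bits);
    }
    // Decoding runs backwards and ends in the initial state.
    let mut output: Vec<usize> =
        input.iter().map(|_| tans_decode(&t, &mut state, &mut bits)).collect();
    output.reverse();
    assert_eq!(output, input);
    assert_eq!(state, L);
    assert!(bits.is_empty());
}
```

The frequent symbol owns five of the eight states, so encoding it often emits no bit or a single bit, while the rare symbol always emits three.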

This unit test is a good starting point for exploring the implementation of the encoder.

#[test]
fn test_round_trip_simple_encoder() {
    let text = "entropy encoding is typically the last stage of a compression algorithm";
    let input: Vec<u8> = text.as_bytes().to_vec();

    // Define an encoder with 8-bit symbols and 12-bit states.
    let mut enc = Simple::<256, 4096>::new();
    // Initialize the encoder based on the statistical properties of the input.
    enc.from_data(&input);
    // Encode the text.
    enc.encode_data(&input);
    // Print the compressed binary representation.
    enc.bitvector.dump();
    // Decode the data.
    let out = enc.decode_data();

    println!("Decoded {:?}", out);
    println!("Input length = {}", 8 * input.len());
    println!("Compressed length = {}", enc.bitvector.len());
    assert_eq!(out, input);
}

Program output:

running 1 test
[00000001111001111101010000110101
.10010110000100101010001000110110
.11011001000101000101110010000011
....
.11001100110]
Decoded [101, 110, 116, 114, 111, 112, 121, 32, 101, 110,  .... ]
Input length = 568
Compressed length = 427
test simple::test_round_trip_simple_encoder ... ok
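
One reference point for the compressed length is the order-0 Shannon bound of the input, meaning the total of -log2(P(sym)) over all bytes when P is taken from the input's own byte counts. The helper below is a sketch for computing it and is not part of the repository; its name is an assumption. How close a tANS encoder gets to this bound depends on the table size and on how the frequencies are normalized.

```rust
/// Order-0 Shannon bound, in bits, for coding `data` with its own byte statistics.
fn shannon_bound_bits(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let c = c as f64;
            // Each of the `c` occurrences costs -log2(c / n) bits.
            -c * (c / n).log2()
        })
        .sum()
}

#[test]
fn test_shannon_bound() {
    // A single repeated symbol carries no information.
    assert!(shannon_bound_bits(b"aaaa").abs() < 1e-9);
    // Two equally likely symbols cost one bit each.
    assert!((shannon_bound_bits(b"aabb") - 4.0).abs() < 1e-9);
    // The bound for the test sentence is below its raw size of 8 bits per byte.
    let text = "entropy encoding is typically the last stage of a compression algorithm";
    assert!(shannon_bound_bits(text.as_bytes()) < 8.0 * text.len() as f64);
}
```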

Here are a few useful links that I used while learning the topic:

Yann's blog post describing FSE: Link
Python notebook that implements FSE: Link
Excellent blog post that explains the encoding technique: Link
Blog posts by Charles Bloom: Link
Academic paper by Jarek Duda: Link

This chart shows the flow of information in the FSE encoder:
