Front-coding string dictionary in Rust

Shunsuke Kanda

Last update: Jul 14, 2022

Overview

This is a Rust implementation of the (plain) front-coding string dictionary described in Martínez-Prieto et al., Practical compressed string dictionaries, Information Systems, 2016.

Features

Dictionary encoding. Fcsd provides a bijective mapping between strings and integer IDs. This is known as dictionary encoding and is useful for text compression in many applications.
Simple and fast compression. Fcsd stores a set of strings in compressed space using front-coding, a differential compression technique for strings that allows fast decompression.
Random access. Fcsd groups the strings into buckets, which makes it possible to decompress an arbitrary string directly and to binary-search for a string.
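
To make the features above concrete, here is a minimal sketch of plain front-coding with buckets, written against the standard library only. It is not fcsd's implementation or storage layout; the names Bucket, encode, decode and locate are hypothetical and exist only for this illustration. Each bucket stores its first key verbatim and every later key as a pair (length of the prefix shared with the previous key, remaining suffix). The sketch assumes sorted, unique keys and a non-zero bucket size.

```rust
/// One bucket: the first key stored verbatim, followed by
/// (shared-prefix length, remaining suffix) pairs for the other keys.
struct Bucket {
    header: Vec<u8>,
    rest: Vec<(usize, Vec<u8>)>,
}

/// Length of the longest common prefix of `a` and `b`.
fn lcp(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Front-codes sorted, unique keys into buckets of `bucket_size` keys each.
/// Panics if `bucket_size` is zero.
fn encode(keys: &[&[u8]], bucket_size: usize) -> Vec<Bucket> {
    keys.chunks(bucket_size)
        .map(|chunk| Bucket {
            header: chunk[0].to_vec(),
            rest: chunk
                .windows(2)
                .map(|w| {
                    let l = lcp(w[0], w[1]);
                    (l, w[1][l..].to_vec())
                })
                .collect(),
        })
        .collect()
}

/// Decodes the key with the given ID by scanning only its own bucket.
fn decode(buckets: &[Bucket], bucket_size: usize, id: usize) -> Option<Vec<u8>> {
    let bucket = buckets.get(id / bucket_size)?;
    let pos = id % bucket_size;
    if pos > bucket.rest.len() {
        return None;
    }
    let mut key = bucket.header.clone();
    for (l, suffix) in &bucket.rest[..pos] {
        key.truncate(*l);
        key.extend_from_slice(suffix);
    }
    Some(key)
}

/// Finds the ID of `key`: binary search over the bucket headers,
/// then a sequential scan of the single candidate bucket.
fn locate(buckets: &[Bucket], bucket_size: usize, key: &[u8]) -> Option<usize> {
    // Index of the last bucket whose header is <= key, if any.
    let bi = buckets
        .partition_point(|b| b.header.as_slice() <= key)
        .checked_sub(1)?;
    let bucket = &buckets[bi];
    let mut cur = bucket.header.clone();
    if cur.as_slice() == key {
        return Some(bi * bucket_size);
    }
    for (i, (l, suffix)) in bucket.rest.iter().enumerate() {
        cur.truncate(*l);
        cur.extend_from_slice(suffix);
        if cur.as_slice() == key {
            return Some(bi * bucket_size + i + 1);
        }
    }
    None
}
```

With the ten keys from the example below and a bucket size of 4, the sketch produces three buckets headed by "deal", "ideology" and "tie". Decoding ID 7 touches only the second bucket, and looking up "techno" scans that same bucket before reporting that the key is absent. Smaller buckets shorten these scans at the cost of storing more keys uncompressed.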

Example

use fcsd::*;

fn main() {
    // Sorted unique input key strings.
    let keys = [
        "deal",       // 0
        "idea",       // 1
        "ideal",      // 2
        "ideas",      // 3
        "ideology",   // 4
        "tea",        // 5
        "techie",     // 6
        "technology", // 7
        "tie",        // 8
        "trie",       // 9
    ];

    // Builds the FC string dictionary with bucket size 4.
    // Note that the bucket size needs to be a power of two.
    let dict = {
        let mut builder = FcBuilder::new(4).unwrap();
        for &key in &keys {
            builder.add(key.as_bytes()).unwrap();
        }
        FcDict::from_builder(builder)
    };

    // Locates the IDs associated with given keys.
    {
        let mut locator = dict.locator();
        assert_eq!(locator.run(keys[1].as_bytes()).unwrap(), 1);
        assert_eq!(locator.run(keys[7].as_bytes()).unwrap(), 7);
        assert!(locator.run("techno".as_bytes()).is_none());
    }

    // Decodes the key strings associated with given IDs.
    {
        let mut decoder = dict.decoder();
        assert_eq!(&decoder.run(4).unwrap(), keys[4].as_bytes());
        assert_eq!(&decoder.run(9).unwrap(), keys[9].as_bytes());
    }

    // Enumerates the stored keys and IDs in lex order.
    {
        let mut iterator = dict.iter();
        while let Some((id, dec)) = iterator.next() {
            assert_eq!(keys[id].as_bytes(), &dec);
        }
    }

    // Enumerates the stored keys and IDs, starting with prefix "idea", in lex order.
    {
        let mut iterator = dict.prefix_iter("idea".as_bytes());
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(1, id);
        assert_eq!("idea".as_bytes(), &dec);
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(2, id);
        assert_eq!("ideal".as_bytes(), &dec);
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(3, id);
        assert_eq!("ideas".as_bytes(), &dec);
        assert!(iterator.next().is_none());
    }

    // Serialization / Deserialization
    {
        let mut bytes = Vec::<u8>::new();
        dict.serialize_into(&mut bytes).unwrap();
        assert_eq!(bytes.len(), dict.serialized_size_in_bytes());

        let other = FcDict::deserialize_from(&bytes[..]).unwrap();
        assert_eq!(bytes.len(), other.serialized_size_in_bytes());
    }
}

Note

Input keys must not contain the \0 character, because it is used as the string delimiter.
A bucket size of 8 is recommended by Martínez-Prieto et al. as a good space-time tradeoff.
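
The input constraints mentioned in this README (sorted unique keys, no \0 byte, power-of-two bucket size) can be checked up front. The following is a sketch of such a check using only the standard library; check_input is a hypothetical helper and not part of fcsd, and whether the builder performs the same checks itself is not stated here.

```rust
/// Hypothetical pre-flight check (not part of fcsd) for the constraints
/// listed in this README. Returns a description of the first violation.
fn check_input(keys: &[&[u8]], bucket_size: usize) -> Result<(), String> {
    // The bucket size must be a power of two (zero is rejected as well).
    if !bucket_size.is_power_of_two() {
        return Err(format!("bucket size {} is not a power of two", bucket_size));
    }
    for (i, key) in keys.iter().enumerate() {
        // The \0 byte is reserved as the string delimiter.
        if key.contains(&0) {
            return Err(format!("key {} contains a \\0 byte", i));
        }
        // Keys must be sorted and unique, i.e. strictly increasing.
        if i > 0 && keys[i - 1] >= *key {
            return Err(format!("key {} is not strictly greater than its predecessor", i));
        }
    }
    Ok(())
}
```

Running a check like this before feeding keys to the builder turns a violated assumption into a readable message instead of a failed unwrap.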

Todo

Add benchmarking code.
Add RePair-compressed variants.

Licensing

This library is free software provided under the MIT license.
