Front-coding string dictionary in Rust

Related tags

Text processing fcsd
Overview

Front-coding string dictionary in Rust

Documentation Crates.io License: MIT

This is a Rust library of the (plain) front-coding string dictionary described in Martínez-Prieto et al., Practical compressed string dictionaries, INFOSYS 2016.

Features

  • Dictionary encoding. Fcsd provides a bijective mapping between strings and integer IDs. It is so-called dictionary encoding and useful for text compression in many applications.
  • Simple and fast compression. Fcsd maintains a set of strings in a compressed space through front-coding, a differential compression technique for strings, allowing for fast decompression operations.
  • Random access. Fcsd maintains strings through a bucketization technique enabling to directly decompress arbitrary strings and perform binary search for strings.

Example

use fcsd::*;

fn main() {
    // Sorted unique input key strings.
    let keys = [
        "deal",       // 0
        "idea",       // 1
        "ideal",      // 2
        "ideas",      // 3
        "ideology",   // 4
        "tea",        // 5
        "techie",     // 6
        "technology", // 7
        "tie",        // 8
        "trie",       // 9
    ];

    // Builds the FC string dictionary with bucket size 4.
    // Note that the bucket size needs to be a power of two.
    let dict = {
        let mut builder = FcBuilder::new(4).unwrap();
        for &key in &keys {
            builder.add(key.as_bytes()).unwrap();
        }
        FcDict::from_builder(builder)
    };

    // Locates the IDs associated with given keys.
    {
        let mut locator = dict.locator();
        assert_eq!(locator.run(keys[1].as_bytes()).unwrap(), 1);
        assert_eq!(locator.run(keys[7].as_bytes()).unwrap(), 7);
        assert!(locator.run("techno".as_bytes()).is_none());
    }

    // Decodes the key strings associated with given IDs.
    {
        let mut decoder = dict.decoder();
        assert_eq!(&decoder.run(4).unwrap(), keys[4].as_bytes());
        assert_eq!(&decoder.run(9).unwrap(), keys[9].as_bytes());
    }

    // Enumerates the stored keys and IDs in lex order.
    {
        let mut iterator = dict.iter();
        while let Some((id, dec)) = iterator.next() {
            assert_eq!(keys[id].as_bytes(), &dec);
        }
    }

    // Enumerates the stored keys and IDs, starting with prefix "idea", in lex order.
    {
        let mut iterator = dict.prefix_iter("idea".as_bytes());
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(1, id);
        assert_eq!("idea".as_bytes(), &dec);
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(2, id);
        assert_eq!("ideal".as_bytes(), &dec);
        let (id, dec) = iterator.next().unwrap();
        assert_eq!(3, id);
        assert_eq!("ideas".as_bytes(), &dec);
        assert!(iterator.next().is_none());
    }

    // Serialization / Deserialization
    {
        let mut bytes = Vec::<u8>::new();
        dict.serialize_into(&mut bytes).unwrap();
        assert_eq!(bytes.len(), dict.serialized_size_in_bytes());

        let other = FcDict::deserialize_from(&bytes[..]).unwrap();
        assert_eq!(bytes.len(), other.serialized_size_in_bytes());
    }
}

Note

  • Input keys must not contain \0 character because the character is used for the string delimiter.
  • The bucket size of 8 is recommended in space-time tradeoff by Martínez-Prieto's paper.

Todo

  • Add benchmarking codes.
  • Add RePair compressed veriants.

Licensing

This library is free software provided under MIT.

You might also like...
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/

Whatlang Natural language detection for Rust with focus on simplicity and performance. Content Features Get started Documentation Supported languages

Multilingual implementation of RAKE algorithm for Rust

RAKE.rs The library provides a multilingual implementation of Rapid Automatic Keyword Extraction (RAKE) algorithm for Rust. How to Use Append rake to

A Rust library for generically joining iterables with a separator

joinery A Rust library for generically joining iterables with a separator. Provides the tragically missing string join functionality to rust. extern c

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)

rust-bert Rust native Transformer-based models implementation. Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing

👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike
👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike

Table of Contents What does this library do? Why does this library exist? Which languages are supported? How good is it? Why is it better than other l

Snips NLU rust implementation

Snips NLU Rust Installation Add it to your Cargo.toml: [dependencies] snips-nlu-lib = { git = "https://github.com/snipsco/snips-nlu-rs", branch = "mas

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

nlprule A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based app

A fast implementation of Aho-Corasick in Rust.

aho-corasick A library for finding occurrences of many patterns at once with SIMD acceleration in some cases. This library provides multiple pattern s

Natural Language Processing for Rust

rs-natural Natural language processing library written in Rust. Still very much a work in progress. Basically an experiment, but hey maybe something c

Comments
  • Add some Rusty functions

    Add some Rusty functions

    For convenience, please add the following functions:

    FcDict::iter(&self) -> FcIterator
    FcDict::prefix_iter(&self, key: &'a [u8]) -> FcPrefixIterator
    FcDict::locater(&self) -> FcLocater
    
    opened by vbkaisetsu 1
Owner
Shunsuke Kanda
Trie Pokémon
Shunsuke Kanda
Fast PDF password cracking utility equipped with commonly encountered password format builders and dictionary attacks.

PDFRip Fast PDF password cracking utility equipped with commonly encountered password format builders and dictionary attacks. ?? Table of Contents Int

Mufeed VH 226 Jan 4, 2023
Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

Daniel Liu 75 Jan 8, 2023
A lightweight and snappy crate to remove emojis from a string.

A lightweight and snappy crate to remove emojis from a string.

Tejas Ravishankar 8 Jul 19, 2022
Rust-nlp is a library to use Natural Language Processing algorithm with RUST

nlp Rust-nlp Implemented algorithm Distance Levenshtein (Explanation) Jaro / Jaro-Winkler (Explanation) Phonetics Soundex (Explanation) Metaphone (Exp

Simon Paitrault 34 Dec 20, 2022
Fast suffix arrays for Rust (with Unicode support).

suffix Fast linear time & space suffix arrays for Rust. Supports Unicode! Dual-licensed under MIT or the UNLICENSE. Documentation https://docs.rs/suff

Andrew Gallant 207 Dec 26, 2022
Elastic tabstops for Rust.

tabwriter is a crate that implements elastic tabstops. It provides both a library for wrapping Rust Writers and a small program that exposes the same

Andrew Gallant 212 Dec 16, 2022
An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

Martin Geisler 322 Dec 26, 2022
⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

Michele Federici 81 Dec 6, 2022
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

regex A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a f

The Rust Programming Language 2.6k Jan 8, 2023