tongrams-rs: Tons of N-grams in Rust

Overview

tongrams-rs: Tons of N-grams in Rust

Documentation Crates.io License: MIT

This is a Rust port of tongrams to index and query large language models in compressed space, in which the data structures are presented in the following papers:

What can do

  • Store N-gram language models with frequency counts.

  • Look up N-grams to get the frequency counts.

Features

  • Compressed language model. tongrams-rs can store large N-gram language models in very compressed space. For example, the word N-gram datasets (N=1..5) in test_data are stored in only 2.6 bytes per gram.

  • Time and memory efficiency. tongrams-rs employs Elias-Fano Trie, which cleverly encodes a trie data structure consisting of N-grams through Elias-Fano codes, enabling fast lookups in compressed space.

  • Pure Rust. tongrams-rs is written only in Rust and can be easily pluged into your Rust codes.

Input data format

The file format of N-gram counts files is the same as that used in tongrams, a modified Google format, where

  • one separate file for each distinct value of N (order) lists one gram per row,
  • each header row indicates the number of N-grams in the file,
  • tokens in a gram are sparated by a space (e.g., the same time), and
  • a gram and the count are sparated by a horizontal tab.

   

    
     
      

       
        
          
          
           
            
              ... 
            
           
          
         
        
       
      
     
    
   

For example,

61516
the // parent	1
the function is	22
the function a	4
the function to	1
the function and	1
...

Command line tools

tools provides some command line tools to enjoy this library. In the following, the example usages are presented using N-gram counts files in test_data copied from tongrams.

1. Sorting

To build the trie index, you need to sort your N-gram counts files. First, prepare unigram counts files sorted by the counts for making a resulting index smaller, as

$ cat test_data/1-grams.sorted
8761
the	3681
is	1869
a	1778
of	1672
to	1638
and	1202
...

By using the unigram file as a vocabulary, the executable sort_grams sorts a N-gram counts file.

Here, we sort an unsorted bigram counts file, as

$ cat test_data/2-grams
38900
ways than	1
may come	1
frequent causes	1
way has	1
in which	14
...

You can sort the bigram file (in a gzip format) and write test_data/2-grams.sorted with the following command:

$ cargo run --release -p tools --bin sort_grams -- -i test_data/2-grams.gz -v test_data/1-grams.sorted.gz -o test_data/2-grams.sorted
Loading the vocabulary: "test_data/1-grams.sorted.gz"
Loading the records: "test_data/2-grams.gz"
Sorting the records
Writing the index into "test_data/2-grams.sorted.gz"

The output file format can be specified with -f, and the default setting is .gz. The resulting file will be

$ cat test_data/2-grams.sorted
38900
the //	1
the function	94
the if	3
the code	126
the compiler	117
...

2. Indexing

The executable index builds a language model from (sorted) N-gram counts files, named -grams.sorted.gz , and writes it into a binary file. The input file format can be specified with -f, and the default setting is .gz.

For example, the following command builds a language model from N-gram counts files (N=1..5) placed in directory test_data and writes it into index.bin.

$ cargo run --release -p tools --bin index -- -n 5 -i test_data -o index.bin
Input files: ["test_data/1-grams.sorted.gz", "test_data/2-grams.sorted.gz", "test_data/3-grams.sorted.gz", "test_data/4-grams.sorted.gz", "test_data/5-grams.sorted.gz"]
Counstructing the index...
Elapsed time: 0.190 [sec]
252550 grams are stored.
Writing the index into "index.bin"...
Index size: 659366 bytes (0.629 MiB)
Bytes per gram: 2.611 bytes

As the standard output shows, the model file takes only 2.6 bytes per gram.

3. Lookup

The executable lookup provides a demo to lookup N-grams, as follows.

take advantage count = 8 > only 64-bit execution count = 1 > Elias Fano Not found > Good bye!">
$ cargo run --release -p tools --bin lookup -- -i index.bin 
Loading the index from "index.bin"...
Performing the lookup...
> take advantage
count = 8
> only 64-bit execution
count = 1
> Elias Fano
Not found
> 
Good bye!

4. Memory statistics

The executable stats shows the breakdowns of memory usages for each component.

$ cargo run --release -p tools --bin stats -- -i index.bin
Loading the index from "index.bin"...
{"arrays":[{"pointers":5927,"token_ids":55186},{"pointers":19745,"token_ids":92416},{"pointers":25853,"token_ids":107094},{"pointers":28135,"token_ids":111994}],"count_ranks":[{"count_ranks":5350},{"count_ranks":12106},{"count_ranks":13976},{"count_ranks":14582},{"count_ranks":14802}],"counts":[{"count":296},{"count":136},{"count":72},{"count":56},{"count":56}],"vocab":{"data":151560}}

Benchmark

At the directory bench, you can measure lookup times using N-gram data in test_data with the following command:

$ RUSTFLAGS="-C target-cpu=native" cargo bench
count_lookup/tongrams/EliasFanoTrieCountLm
                        time:   [3.1818 ms 3.1867 ms 3.1936 ms]

The reported time is the total elapsed time for looking up 5K random grams. The above result was actually obtained on my laptop PC (Intel i7, 16GB RAM), i.e., EliasFanoTrieCountLm can look up a gram in 0.64 micro sec on average.

Todo

  • Add fast elias-fano and pertitioned elias-fano
  • Add minimal perfect hashing
  • Add remapping
  • Support probability scores
  • Make sucds::EliasFano faster

Licensing

This library is free software provided under MIT.

You might also like...
SDL bindings for Rust

Rust-SDL Bindings for SDL in Rust Overview Rust-SDL is a library for talking to SDL from Rust. Low-level C components are wrapped in Rust code to make

SDL2 bindings for Rust

Rust-SDL2 Bindings for SDL2 in Rust Changelog for 0.34.2 Overview Rust-SDL2 is a library for talking to the new SDL2.0 libraries from Rust. Low-level

SFML bindings for Rust

rust-sfml Rust bindings for SFML, the Simple and Fast Multimedia Library. Requirements Linux, Windows, or OS X Rust 1.42 or later SFML 2.5 CSFML 2.5 D

Rust bindings for libtcod 1.6.3 (the Doryen library/roguelike toolkit)

Warning: Not Maintained This project is no longer actively developed or maintained. Please accept our apologies. Open pull requests may still get merg

Victorem - easy UDP game server and client framework for creating simple 2D and 3D online game prototype in Rust.

Victorem Easy UDP game server and client framework for creating simple 2D and 3D online game prototype in Rust. Example Cargo.toml [dependencies] vict

Rust-based replacement for the default Minecraft renderer

wgpu-mc 🚀 A blazing fast alternative renderer for Minecraft Intro WebGPU is a new web specification designed to provide modern graphics and compute c

grr and rust-gpu pbr rendering
grr and rust-gpu pbr rendering

grr-gltf Barebone gltf viewer using grr and rust-gpu. Currently only supports a single gltf model! Assets These files need to be downloaded and placed

A no-frills Tetris implementation written in Rust with the Piston game engine, and Rodio for music.
A no-frills Tetris implementation written in Rust with the Piston game engine, and Rodio for music.

rustris A no-frills Tetris implementation written in Rust with the Piston game engine, and Rodio for music. (C) 2020 Ben Cantrick. This code is distri

Some Rust bindings for Binary Ninja

Binary Ninja Rust Bindings Work in progress Rust bindings geared towards analysis. Features added as they come up. No promises anything works at this

Comments
  • Add sort tool

    Add sort tool

    This PR added a tool to sort ngrams. The current version takes a lot of memory, so it is necessary to make it memory-efficient with secondary memory (in future).

    And, I modified PathBuf to AsRef<Path>.

    opened by kampersanda 0
  • Enable CPU intrinsics

    Enable CPU intrinsics

    This PR enabled CPU intrinsics.

    $ RUSTFLAGS="-C target-cpu=native" cargo bench
    count_lookup/tongrams/EliasFanoTrieCountLm                                                                            
                            time:   [3.1818 ms 3.1867 ms 3.1936 ms]
                            change: [-18.659% -18.380% -18.117%] (p = 0.00 < 0.05)
                            Performance has improved.
    
    opened by kampersanda 0
  • N-gram probabilities support

    N-gram probabilities support

    Hi,

    Thanks so much for this code.

    Do you have plans to add functionality for N-gram probability lookup? Would be a great addition to the repo and I'm willing to help out with it.

    Thanks

    opened by tobygodwin 1
Owner
Shunsuke Kanda
TRIE Pokémon
Shunsuke Kanda
Rust-and-opengl-lessons - Collection of example code for learning OpenGL in Rust

rust-and-opengl-lessons Project requires Rust 1.31 Collection of example code for learning OpenGL in Rust 00 - Setup 01 - Window 02 - OpenGL Context 0

Nerijus Arlauskas 348 Dec 11, 2022
Simple retro game made using Rust bracket-lib by following "Herbert Wolverson's Hands on Rust" book.

Flappy Dragon Code from This program is a result of a tutorial i followed from Herbert Wolverson's Hands-on Rust Effective Learning through 2D Game De

Praneeth Chinthaka Ranasinghe 1 Feb 7, 2022
A rust chess implementation using a neural network scoring function built on huggingface/candle + rust + wasm

Rusty Chess What is it? Rusty Chess aims to be a high quality embeddable chess engine that runs entirely locally in the browser (no backend required).

Gareth 3 Nov 3, 2023
A Rust wrapper and bindings of Allegro 5 game programming library

RustAllegro A thin Rust wrapper of Allegro 5. Game loop example extern crate allegro; extern crate allegro_font; use allegro::*; use allegro_font::*;

null 80 Dec 31, 2022
High performance Rust ECS library

Legion aims to be a feature rich high performance Entity component system (ECS) library for Rust game projects with minimal boilerplate. Getting Start

Amethyst Engine 1.4k Jan 5, 2023
A refreshingly simple data-driven game engine built in Rust

What is Bevy? Bevy is a refreshingly simple data-driven game engine built in Rust. It is free and open-source forever! WARNING Bevy is still in the ve

Bevy Engine 21.1k Jan 4, 2023
Rust library to create a Good Game Easily

ggez What is this? ggez is a Rust library to create a Good Game Easily. The current version is 0.6.0-rc0. This is a RELEASE CANDIDATE version, which m

null 3.6k Jan 7, 2023
RTS game/engine in Rust and WebGPU

What is this? A real time strategy game/engine written with Rust and WebGPU. Eventually it will be able to run in a web browser thanks to WebGPU. This

Thomas SIMON 258 Dec 25, 2022
unrust - A pure rust based (webgl 2.0 / native) game engine

unrust A pure rust based (webgl 2.0 / native) game engine Current Version : 0.1.1 This project is under heavily development, all api are very unstable

null 368 Jan 3, 2023
Rust bindings for GDNative

GDNative bindings for Rust Rust bindings to the Godot game engine. Website | User Guide | API Documentation Stability The bindings cover most of the e

null 3.2k Jan 9, 2023