finalfusion embeddings in Rust

Introduction

finalfusion is a crate for reading, writing, and using embeddings in Rust. finalfusion primarily works with its own format, which supports a large variety of features. The fastText, word2vec, and GloVe file formats are also supported.

finalfusion is API stable since 0.11.0. However, we cannot tag version 1 yet, because several dependencies that are exposed through the API have not reached version 1 (particularly ndarray and rand). Future 0.x releases of finalfusion will be used to accommodate updates of these dependencies.

Heads-up: there is a small API change between finalfusion 0.11 and 0.12. The Error type has been moved from finalfusion::io to finalfusion::error. The separate ErrorKind enum has been merged with Error. Error is now marked as non-exhaustive, so that new error variants can be added in the future without changing the API.
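
For code written against 0.11, the upgrade amounts to updating the import path. A minimal sketch, assuming the thiserror-derived Display implementation mentioned in the 0.12 release notes:

use finalfusion::error::Error; // 0.11: use finalfusion::io::{Error, ErrorKind};

fn describe(err: &Error) -> String {
    // Error is non-exhaustive, so matches on it need a wildcard arm; here we
    // simply rely on the Display implementation instead.
    err.to_string()
}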

Usage

To make finalfusion available in your crate, simply place the following in your Cargo.toml:

[dependencies]
finalfusion = "0.12"

Loading and querying embeddings is as simple as:

use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() {
    let mut reader = BufReader::new(File::open("embeddings.fifu").unwrap());
    let embeds = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader).unwrap();
    embeds.embedding("Query").unwrap();
}

Features

finalfusion supports a variety of formats:

  • Vocabulary
    • Subwords
    • No subwords
  • Storage
    • Array
    • Memory-mapped
    • Quantized
  • Format
    • finalfusion
    • fastText
    • word2vec
    • GloVe

Moreover, finalfusion provides:

  • Similarity queries
  • Analogy queries
  • Quantizing embeddings through reductive
  • Conversion to the following formats:
    • finalfusion
    • word2vec
    • GloVe

For more information, please consult the API documentation.
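
For example, similarity and analogy queries follow the same pattern as embedding lookups. The sketch below assumes the WordSimilarity and Analogy traits and the word/cosine_similarity accessors of WordSimilarityResult mentioned in the release notes; consult the API documentation for the exact signatures in your version:

use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;
use finalfusion::similarity::{Analogy, WordSimilarity};

fn main() {
    let mut reader = BufReader::new(File::open("embeddings.fifu").unwrap());
    let embeds = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader).unwrap();

    // The ten words most similar to the query word.
    for similar in embeds.word_similarity("Berlin", 10).unwrap() {
        println!("{}\t{}", similar.word(), similar.cosine_similarity());
    }

    // Analogy query: Berlin is to Germany as ? is to France.
    for similar in embeds.analogy(["Berlin", "Germany", "France"], 10).unwrap() {
        println!("{}\t{}", similar.word(), similar.cosine_similarity());
    }
}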

Getting embeddings

Embeddings trained with finalfrontier starting with version 0.4 are in finalfusion format and compatible with this crate. A growing set of pretrained embeddings is offered on our website, and we have converted the fastText Wikipedia and Common Crawl embeddings to finalfusion. More information can be found at https://finalfusion.github.io.

Which type of storage should I use?

Quantized embeddings

Quantized embeddings store embeddings as discrete representations. Imagine that for a given embedding space, you find 256 prototypical embeddings. Each embedding could then be stored as a 1-byte pointer to one of these prototypical embeddings. Of course, with only 256 possible representations, this quantized embedding space would be very coarse-grained.

Product quantizers (PQ) solve this problem by splitting each embedding evenly into q subvectors and finding prototypical vectors for each set of subvectors. If we use 256 prototypical representations for each subspace, 256^q different word embeddings can be represented. For instance, if q = 150, we could represent 256^150 different embeddings. Each embedding would then be stored as 150 byte-sized pointers.

Optimized product quantizers (OPQ) additionally apply a linear map to the embedding space to distribute variance across embedding dimensions.

By quantizing an embedding matrix, its size can be reduced both on disk and in memory.
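
To make the idea concrete, here is a toy sketch of product-quantized storage and reconstruction. It is not the finalfusion API; PqStorage and its tiny codebooks are purely illustrative:

// codebooks[s] holds the prototype subvectors of subspace s; an embedding is
// stored as one byte-sized code (prototype index) per subspace.
struct PqStorage {
    codebooks: Vec<Vec<Vec<f32>>>, // [subspace][code][component]
    codes: Vec<Vec<u8>>,           // [word][subspace] -> code
}

impl PqStorage {
    // Reconstruct the approximate embedding of `row` by concatenating the
    // prototype subvector that its code selects in every subspace.
    fn embedding(&self, row: usize) -> Vec<f32> {
        self.codes[row]
            .iter()
            .enumerate()
            .flat_map(|(subspace, &code)| {
                self.codebooks[subspace][code as usize].iter().copied()
            })
            .collect()
    }
}

fn main() {
    // Two subspaces with two prototypes each; a real quantizer would use 256.
    let storage = PqStorage {
        codebooks: vec![
            vec![vec![0.0, 0.0], vec![1.0, 1.0]],
            vec![vec![0.5, 0.5], vec![2.0, 2.0]],
        ],
        codes: vec![vec![1, 0]], // word 0 selects prototypes 1 and 0
    };
    assert_eq!(storage.embedding(0), vec![1.0, 1.0, 0.5, 0.5]);
}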

Memory mapped embeddings

Normally, we read embeddings into memory. However, as an alternative the embeddings can be memory mapped. Memory mapping makes the on-disk embedding matrix available as pages in virtual memory. The operating system will then (transparently) load these pages into physical memory as necessary.

Memory mapping speeds up the initial loading time of word embeddings, since only the vocabulary needs to be read. The operating system then loads (parts of) the embedding matrix on an as-needed basis. It can additionally free this memory again when no embeddings are being looked up and other processes require memory.
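
Memory-mapped loading looks just like regular loading, except for the constructor. A sketch, assuming the default memmap feature and the mmap_embeddings constructor exposed through the prelude:

use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() {
    let mut reader = BufReader::new(File::open("embeddings.fifu").unwrap());
    // Map the embedding matrix into virtual memory instead of reading it.
    let embeds = Embeddings::<VocabWrap, StorageWrap>::mmap_embeddings(&mut reader).unwrap();
    embeds.embedding("Query").unwrap();
}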

Empirical comparison

The following empirical comparison of embedding types uses an embedding matrix with 2,807,440 embeddings (710,288 word embeddings, 2,097,152 subword embeddings) of dimensionality 300. The embedding lookup timings were done on an Intel Core i5-8259U CPU, 2.30 GHz.

Known lookup and Unknown lookup time lookups of words that are inside and outside the vocabulary, respectively. Lookup uses a mixture of known and unknown words.

Storage      Lookup   Known lookup   Unknown lookup   Memory     Disk
array        449 ns   232 ns         18 μs            3213 MiB   3213 MiB
array mmap   833 ns   494 ns         23 μs            Variable   3213 MiB
opq          40 μs    21 μs          962 μs           402 MiB    402 MiB
opq mmap     41 μs    21 μs          960 μs           Variable   402 MiB

Note: two units are used: nanoseconds (ns) and microseconds (μs).

Using a BLAS or LAPACK library

If you are using finalfusion in a binary crate, you can compile ndarray with BLAS support to speed up certain functionality in finalfusion-rust. In order to do so, enable the ndarray/blas feature and add one of the following crates as a dependency to select a BLAS/LAPACK implementation:

  • netlib-src: Use reference BLAS/LAPACK (slow, not recommended)
  • openblas-src: Use OpenBLAS
  • intel-mkl-src: Use Intel Math Kernel Library

If you want to quantize an embedding matrix using optimized product quantization, you must enable the reductive/opq-train feature in addition to adding a BLAS/LAPACK implementation.

The Cargo.toml file of finalfusion-utils can be used as an example of how to use BLAS in a binary crate.
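
As a rough sketch, the dependency section of such a binary crate could look as follows when using OpenBLAS; the version numbers are illustrative, and the finalfusion-utils manifest remains the tested reference:

[dependencies]
finalfusion = "0.12"
ndarray = { version = "0.13", features = ["blas"] }
openblas-src = "0.7"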

Example: embedding lookups in quantized matrices

Embedding lookups in embedding matrices that were quantized using the optimized product quantizer can be sped up considerably with a good BLAS implementation. The following table compares lookup times on an Intel Core i5-8259U CPU, 2.30 GHz, with finalfusion compiled with and without MKL/OpenBLAS support:

Storage               Lookup   Known lookup   Unknown lookup
opq                   40 μs    21 μs          962 μs
opq mmap              41 μs    21 μs          960 μs
opq (MKL)             14 μs    7 μs           309 μs
opq mmap (MKL)        14 μs    7 μs           309 μs
opq (OpenBLAS)        15 μs    7 μs           336 μs
opq mmap (OpenBLAS)   15 μs    7 μs           342 μs

Where to go from here

Comments
  • Release 0.11

    edit: Copied the list from the other comment to get an indicator for the list ticks.

    Going through the API:

    • [x] prelude.rs: #105
    1. prelude re-exports SimpleVocab and SubwordVocab but not aliases such as FinalfusionSubwordVocab. I think it makes more sense to re-export the aliases and leave SubwordVocab in chunks::vocab.
    2. MMapQuantizedArray is the only storage not re-exported.
    3. NdNorms is the only public chunk not being re-exported.
    • [x] chunks::storage::mod.rs:

    • [x] chunks::storage::array.rs: #93

    1. NdArray derives Debug, but MmapArray does not.
    2. NdArray should derive Clone
    • [x] chunks::storage::wrappers.rs:
    1. StorageViewWrap implies that it wraps a view while it actually wraps a viewable storage. Think about renaming?
    • [x] chunks::mod.rs:
    1. Might be nice to explain the concept; right now the module docs just state "finalfusion chunks".
    • [x] chunks::io.rs: #92 Nothing public but from maintainers perspective:
    1. typeid_impl is lacking docs, choices 1 for u8 and 10 for f32 seem arbitrary - iirc in order to leave room for other int and float types. Could still use some docs to make that clear once this is forgotten.
    • [x] chunks::metadata.rs: #91
    1. Encapsulate the inner Value.
    2. Alternatives to lock in toml as our choice? Upside of toml is that we get easy serialization and heterogeneous collections. Downside is that we always need Values to construct.
    • [x] chunks::norms.rs: #90
    1. Encapsulate NdNorms' inner array
    2. ~~impl Index for NdNorms since Norms seems to be just that but without the [ ]-indexing~~ Index::index returns references. I'd still like to make norm a method directly on NdNorms since NdNorms is entirely useless without importing Norms otherwise.
    • [x] chunks::vocab.rs: #89
    1. Rename vocabtypes as discussed
    2. Think about reorganizing in separate modules, i.e. vocab::mod.rs, vocab::subword.rs, vocab::simple.rs.
    3. [x] Remove Clone from Vocab's requirements and the other places where it pops up because of this requirement. (e.g. Indexer bounds)
    • [x] compat::fasttext

    • [x] compat::{text.rs, word2vec.rs}

    • [x] embeddings.rs

    1. Go through trait-bounds.
    • [x] io.rs

    • [x] lib.rs

    • [x] similarity.rs

    1. Could be nicer to read if put into module with similarity::analogy.rs and similarity::similarity.rs
    • [x] subword.rs
    1. Indexers could live in their own indexer.rs
    • [x] util.rs

    What needs to be done until then?

    • Probably clean up chunks::Vocab::{NgramIndices, SubwordIndices}, consolidate into one trait? Related to that, bracketing words by default forces us to collect the indices in both methods and return them as Vec.
    opened by sebpuetz 25
  • Support alignment of embedding matrices on 16-byte boundaries

    I spent a fair amount of time implementing this properly (everything gated through a special AlignedArray type, which has constructors guaranteeing alignment).

    On CI, I enabled tests using 32-byte alignment. Recent glibc versions allocate on 16-byte boundaries, so tests would never fail on my machine, even when the code was wrong. But even with 32-byte boundaries there is some non-determinism, since there is a small chance that memory that is not explicitly aligned ends up on a boundary by chance.

    Advantages:

    • We have explicit control over alignment.
    • By using feature gating, we can use default (f32) alignment as before.
    • We can give other libraries (Eigen, Tensorflow) the alignment that they need.

    Disadvantages:

    • As discussed, reading text files without dimensions now requires two passes over the file.
    • We have to be very careful with new code to use Vec::with_capacity_aligned if an embedding matrix is initialized from a Vec.
    • Apart from non-determinism, we might miss cases where alignment is not enforced, when the code path is not covered by a unit test (I just did a grep for vec! and Vec::with_capacity).
    opened by danieldk 22
  • Add support for Floret embeddings

    Add support for reading and using Floret embeddings. Floret combines subword embeddings with Bloom embeddings:

    https://github.com/explosion/floret#how-floret-works

    This allows Floret embedding matrices to be compact like Bloom embeddings, while also using subword units.

    Supporting Floret embeddings requires some architectural changes to finalfusion:

    1. ngrams -> index mappings are many-to-many

      finalfusion assumed that each n-gram maps to a single index. In Floret embeddings, an n-gram can map to multiple embeddings. finalfusion is extended in this change to support returning multiple indices for an n-gram.

    2. Floret also hashes the word itself, not just the subwords. This change adds a boolean flag. If it is enabled, the word itself is also hashed, and its index is returned.

    3. Floret permits different begin/end-of-word markers. These are made configurable for subword vocabs.

    opened by danieldk 11
  • Add MmapQuantizedArray

    Does what it says on the tin: memory-maps the quantized embeddings matrix and norms (if present).

    This change does not implement WriteChunk for this data type, which will be added through a separate PR.


    There is still a build error, because tempfile depends on some crate that does not build with Rust 1.31.0. This 'only support/check against the latest Rust stable' attitude in the Rust ecosystem is becoming quite maddening.

    opened by danieldk 10
  • Conversion of fastText embeddings

    We could get a lot of additional pretrained embeddings for 'free' if we convert fastText embeddings from:

    https://fasttext.cc/docs/en/crawl-vectors.html

    Any opinions?

    opened by danieldk 10
  • Add NdNorms chunk

    This chunk stores the norms of in-vocabulary words. If an Embeddings data structure stores NdNorms, the vector norms can be queried with the embedding_with_norm method.

    When converting from word2vec/text in ff-convert, always normalize embeddings and store norms.

    Some random points:

    • I still need to add some more unit tests.
    • The norms chunk is currently not a generic type of Embedding. This is to keep the type signature simple while we still can. This may change later. It will definitely be necessary when we want to make it memory mappable.
    • The norms chunk is optional, for compatibility with existing finalfusion embeddings. Since we can assume the (known) word embeddings to be normalized, the norm 1 is returned when the chunk is absent.
    • ff-convert from word2vec and text now always normalizes the embeddings and creates a norms chunk.
    • I am considering multiplying the unit vectors by their norms when converting to word2vec or text embeddings, to restore the raw, unnormalized embeddings. However, this is currently not done.
    • I tried to make the norms mandatory (normalizing the embeddings and computing norms when the chunk is not present). However, this poses problems, because mmaped and quantized embeddings cannot be updated.
    • The traits for reading word2vec/text embeddings still have an argument for normalization. These are still needed for roundtrip unit tests. I might make a crate-private trait for unnormalized reading and call this from the public trait that always normalizes. This would allow us to keep the roundtrip unit tests, while always normalizing in the public interface. (Done now for ReadWord2Vec)
    opened by danieldk 8
  • Add Similarity query for embeddings.

    Add a method to query for the most similar words given a query vector.

    Thought this could be nice to have, since analogy queries are essentially doing the same.

    opened by sebpuetz 7
  • Conversion to ExplicitSubwordVocab and index storage.

    Add conversion from BucketVocab to ExplicitSubwordVocab. This requires storage of indices on disk; the read and write methods now store indices, too.

    Depends on #89, implements option 2 described in https://github.com/finalfusion/finalfusion-rust/issues/87#issuecomment-543739524

    opened by sebpuetz 6
  • Modernize and improve error handling

    • Merge the Error and ErrorKind enums.
    • Move the Error enum to the error module.
    • Derive trait implementations using the thiserror crate.
    • Make the Error enum non-exhaustive
    • Replace the ChunkIdentifier::try_from method by an implementation of the TryFrom trait.
    opened by danieldk 4
  • What needs to be done before finalfusion 1?

    I think that for downstream users it would be great if we had a first release. Due to semver, once we release 1.0.0, we cannot break APIs anymore in 1.x.y. Basically, once someone puts

    finalfusion = "1"
    

    in their Cargo.toml, it should always work for 1.x.y. Of course, we can extend the API.

    This issue is for discussing what still needs to be done before 1.0.0 and which sore thumbs are sticking out. Also, we should probably look into whether there are any public APIs that need to be private or crate-private.

    opened by danieldk 4
  • Relicensing to MIT or Apache License version 2.0

    finalfusion-rust is currently licensed under the following licenses (user's choice):

    • Apache License Version 2.0
    • Blue Oak Model License Version 1.0.0

    We are trying to relicense the finalfusion ecosystem to something that is more canonical for the Rust ecosystem, namely (user's choice):

    • Apache License Version 2.0
    • MIT License

    I, @sebpuetz, and SfS ISCL have agreed to relicense all finalfusion projects in this way. However, finalfusion-rust has two additional contributors:

    • @RealNicolasBourbaki
    • @djc

    If you are ok with this relicensing, could you please reply to this issue confirming this? Thanks a lot!

    opened by danieldk 3
  • Analogy interface is too inflexible

    While reimplementing finalfusion-inspector in Rust, I bumped into a small annoyance. The analogy method takes an array of &str:

    query: [&str; 3]

    However, oftentimes you have a [String; 3]. We should relax this type to:

    query: [impl AsRef<str>; 3]

    This should not break the API, since it allows a superset of types of the original signature.
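
    A minimal sketch of why the relaxed bound accepts a superset of types, written with an equivalent generic bound (print_query is a hypothetical helper, not part of the API):

    fn print_query<S: AsRef<str>>(query: [S; 3]) {
        for word in &query {
            println!("{}", word.as_ref());
        }
    }

    fn main() {
        // Both &str and String arrays satisfy the relaxed bound.
        print_query(["Berlin", "Germany", "France"]);
        print_query(["Berlin".to_string(), "Germany".to_string(), "France".to_string()]);
    }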

    opened by danieldk 0
  • Pretrained embedding fetcher

    I think it would be nice to have a small utility data structure to fetch pretrained embeddings. I don't think this needs to be part of the finalfusion crate, since it is not really core functionality. The basic idea is:

    • We'd have a repository finalfusion-fetcher with some metadata file (probably JSON), mapping embedding file identifiers to URLs. E.g. fasttext.wiki.nl.fifu could map to http://www.sfs.uni-tuebingen.de/a3-public-data/finalfusion-fasttext/wiki/wiki.nl.fifu

    • A small crate (possibly in the same repo) would provide a data structure Fetcher with a constructor that retrieves the metadata and returns a fetcher:

      let fetcher = Fetcher::fetch_metadata().unwrap();
      

      A user could then open embeddings:

      let dutch_embeddings = fetcher.open("fasttext.wiki.nl.fifu").unwrap();
      

      This method would check whether the embeddings are already available. If not, it would fetch them and store them in a standard XDG location. Then it would open the embeddings stored in this location.

      Similarly, Fetcher::mmap could be used to memory-map an embedding after downloading.

    After this is implemented, the functionality could also be exposed in finalfusion-python.

    opened by danieldk 1
  • Add support for embedding pruning

    Add support for pruning embeddings, where N embeddings are retained. Words for which embeddings are removed are mapped to their nearest neighbor.

    This should provide more or less the same functionality as pruning in spaCy:

    https://spacy.io/api/vocab#prune_vectors

    I encourage some investigation here. Some ideas:

    1. The most basic version could simply retain the embeddings of the N most frequent words and map all the remaining words to the nearest neighbor in the N embeddings that are retained.

    2. Select vectors such that the similarities to the pruned vectors are maximized. The challenge here is making it tractable.

    3. An approach similar to quantization, where k-means clustering is performed with N clusters. The embedding matrix is then replaced by the cluster centroid matrix. Each word maps to the cluster it is in. (This could reuse the KMeans stuff from reductive, which is already a dependency of finalfusion).

    I would focus on (1) and (3) first.
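
    A toy sketch of idea (1): retain the first N rows (the most frequent words) and map every pruned word to its most similar retained row. It assumes unit-normalized embeddings, so the dot product is the cosine similarity; prune_mapping and dot are hypothetical helpers:

    fn prune_mapping(embeddings: &[Vec<f32>], n: usize) -> Vec<usize> {
        (0..embeddings.len())
            .map(|word| {
                if word < n {
                    word // retained words keep their own row
                } else {
                    // pruned words point to their most similar retained row
                    (0..n)
                        .max_by(|&a, &b| {
                            dot(&embeddings[word], &embeddings[a])
                                .partial_cmp(&dot(&embeddings[word], &embeddings[b]))
                                .unwrap()
                        })
                        .unwrap()
                }
            })
            .collect()
    }

    fn dot(u: &[f32], v: &[f32]) -> f32 {
        u.iter().zip(v).map(|(a, b)| a * b).sum()
    }

    fn main() {
        let embeddings = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![0.6, 0.8]];
        // Retain two rows; the third word maps to row 1, its nearest neighbor.
        assert_eq!(prune_mapping(&embeddings, 2), vec![0, 1, 1]);
    }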

    Benefits:

    • Compresses the embedding matrix.
    • Faster than quantized embedding matrices, because simple lookups are used.
    • Could later be applied to @sebpuetz's non-hashed subword n-grams as well.
    • Could perhaps be combined with quantization for even better compression.
    feature 
    opened by danieldk 1
Releases(0.17.2)
  • 0.17.2(Dec 12, 2021)

    • Add WriteEmbeddings::write_embeddings_len. This method returns the serialized length of embeddings in finalfusion format, without performing any serialization.
    • Add WriteChunk::chunk_len. This method returns the serialized length of a finalfusion chunk, without performing any serialization.
    • Switch the license to Apache License 2.0 or MIT
  • 0.17.1(Dec 4, 2021)

    • Add support for reading, writing, and using Floret embeddings.
    • Add a finalfusion chunk type for Floret-like vocabularies.
    • Add support for batched embedding lookups (embedding_batch and embedding_batch_into)
    • Improve error handling:
      • Mark wrapped errors using #[source] to get better chains of error messages.
      • Split Error::Io into Error::Read and Error::Write.
      • Rename some Error variants.
  • 0.14.0(Aug 15, 2020)

    • Add conversion from bucketed subword to explicit subword embeddings.
    • Hide WordSimilarityResult fields. Use the cosine_similarity and word methods instead.
  • 0.12.2(Jun 9, 2020)

    • Make lookups of unknown words in OPQ-quantized embedding matrices 2.6x faster (resulting in ~1.6x faster all-round lookups).
    • Add the Reconstruct trait as a counterpart to Quantize. This trait can be used to reconstruct quantized embedding matrices. Using this trait is also much faster than reconstructing individual embeddings.
    • Add more I/O checks to ensure that the embedding matrix can actually be represented in the native usize.
  • 0.12.0(Jun 9, 2020)

    Modernize and improve error handling

    • Merge the Error and ErrorKind enums.
    • Move the Error enum to the error module.
    • Derive trait implementations using the thiserror crate.
    • Make the Error enum non-exhaustive
    • Replace the ChunkIdentifier::try_from method by an implementation of the TryFrom trait.

    This release also feature-gates the memmap dependency (the memmap feature is enabled by default).

  • 0.11.0(Oct 26, 2019)

    • Add ExplicitVocab, a subword vocabulary that stores n-grams explicitly.
    • Add the Embedding::into method. This method realizes an embedding into a user-provided array.
    • Support big-endian architectures.
    • Add WordIndex::word and WordIndex::subword methods. These will return an Option with the word index or subword indices, as applicable.
    • Expose the quantizer in (Mmap)QuantizedArray through the quantizer method.
    • Add benchmarks for array and quantized embeddings.
    • Split WordSimilarity into WordSimilarity and WordSimilarityBy; EmbeddingSimilarity into EmbeddingSimilarity and EmbeddingSimilarityBy.
    • Rename FinalfusionSubwordVocab to BucketSubwordVocab.
    • Expose fewer types through the prelude.
    • Hide the chunks module. E.g. chunks::storage becomes storage.
  • 0.10.0(Sep 16, 2019)

    This is a small release that updates the reductive dependency to 0.3, which has a crucial bug fix for training product quantizers in multiple attempts. However, reductive 0.3 also requires rand 0.7, resulting in a changed API. Therefore, we have to bump the leading version number from 0.9 to 0.10.

  • 0.9.0(Sep 4, 2019)

    • Add the MmapQuantizedArray storage type.
    • Rename Vocab::len to Vocab::words_len.
    • Add Vocab::vocab_len to get the vocabulary size including subword indices.
  • 0.8.1(Aug 13, 2019)

    • Improve reading of embeddings that contain unicode whitespace in tokens.
    • Add lossy variants of the text/word2vec/fasttext reading methods. The lossy variants read tokens with invalid UTF-8 byte sequences.
  • 0.8.0(Aug 4, 2019)

    • Add support for reading fastText embeddings through the ReadFastText trait.
    • Generalization of SubwordVocab for different indexing schemes.
    • Add an iterator over both n-grams and their subword indices (NGramsIndicesIter).
    • Support similarity queries for embeddings (EmbeddingSimilarity). The word similarity trait was renamed from Similarity to WordSimilarity.
    • Better and more informative errors on I/O errors.
    • Use library-specific error types.
    • The words for analogy queries are now accepted as an array of length 3.
    • Make the chunk modules (norms, metadata, vocab, and storage) submodules of the chunks module.