Context-sensitive word embeddings with subwords. In Rust.

Overview

finalfrontier

Introduction

finalfrontier is a Rust program for training word embeddings. finalfrontier currently has the following features:

  • Models:
    • skip-gram (Mikolov et al., 2013)
    • structured skip-gram (Ling et al., 2015)
    • directional skip-gram (Song et al., 2018)
    • dependency (Levy and Goldberg, 2014)
  • Output formats:
    • finalfusion
    • fastText
    • word2vec binary
    • word2vec text
    • GloVe text
  • Noise contrastive estimation (Gutmann and Hyvärinen, 2012)
  • Subword representations (Bojanowski et al., 2016)
  • Hogwild SGD (Recht et al., 2011)
  • Quantized embeddings through the finalfusion quantize command.

The trained embeddings can be stored in the versatile finalfusion format, which can be read and used with the finalfusion crate and the finalfusion Python module.
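
For illustration, loading such a model with the finalfusion crate looks roughly like this (a sketch; the wrapper types, file name, and error handling are assumptions and may differ between finalfusion versions):

    use std::fs::File;
    use std::io::BufReader;

    use finalfusion::prelude::*;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Read a model trained by finalfrontier ("model.fifu" is a placeholder).
        let mut reader = BufReader::new(File::open("model.fifu")?);
        let embeddings =
            Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)?;

        // Look up an embedding; unknown words fall back to subword n-grams.
        if let Some(embedding) = embeddings.embedding("berlin") {
            println!("{} dimensions", embedding.len());
        }

        Ok(())
    }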

The minimum required Rust version is currently 1.40.

Comments
  • Support explicitly stored ngrams

    Entails changes to finalfusion-rust for serialization/usage.

    • [x] Add config
    • [x] Restructure vocab into module (vocab.rs for all variants is quite unwieldy)
    • [x] Extend vocab module
      • Use the indexer approach from finalfusion-rust, parameterize SubwordVocab with Indexer, and separate ngram and word indices in the index through an enum (might get ugly, considering how many type parameters we already have for the trainer structs)
      • Implement independent vocab type
    • [x] Add support in binaries (depends on support in finalfusion-rust)
    • [x] Update finalfusion-rust dependency once it supports NGramVocabs
    • [x] Replace finalfusion dependency with release.
    • [x] fix #72
    feature 
    opened by sebpuetz 90
  • Some general vocab questions

    Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to Rust and very new to the finalfrontier codebase. I've run into a few issues right away that I need some help with (and am likely to have many more).

    For DNA we use kmers (like ngrams, since DNA is essentially one very large continuous string) with a sliding-window approach, and I've written code to count a large corpus in about 3 hours (300 million unique kmers, 50Gb compressed -- almost as fast as fastText, without the extra front-end processing or the extra few hundred Gb of data). I'd like to get this into a SubwordVocab.

    1. Would it be possible to add a function supplementing count (vocab/mod.rs) that accepts a known value? Since the corpus is already processed, this would speed things up compared to calling count() multiple times. I can create a PR if that would help.

    2. Is it possible to add a way to skip bracketing during ngram creation (bracketing is sketched below)? Happy to create a PR for this as well.

    3. Is it possible to specify a set of specific ngram lengths instead of a range, e.g. 9 and 11 rather than 9, 10, and 11?

    4. I am storing everything as Vec but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my kmers over to String?

    Or, should I focus on creating a different vocab implementation instead, so as not to mess up anything you have already?
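
    For context, bracketing here refers to wrapping each word in < and > before n-grams are extracted, in the style of Bojanowski et al. A minimal sketch of that step, not finalfrontier's actual implementation:

    fn ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
        // Bracketing: mark the word boundaries before taking n-grams.
        let chars: Vec<char> = format!("<{}>", word).chars().collect();
        let mut grams = Vec::new();
        for n in min_n.max(1)..=max_n {
            // windows(n) yields nothing when n exceeds the bracketed length.
            for window in chars.windows(n) {
                grams.push(window.iter().collect::<String>());
            }
        }
        grams
    }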

    Any and all help is greatly appreciated!

    Cheers, --Joseph

    opened by jguhlin 28
  • Release 0.6

    I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.

    That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.

    Is there anything else that we want to add before branching 0.6?

    opened by danieldk 12
  • Add projectivization switch.

    Adds option to projectivize dependency graphs before training.

    This is all a bit messier than I hoped. With how the binaries are structured at this point, I couldn't come up with a nice abstraction. For now we could stick with some wrapper struct holding an optional projectivizer:

    struct SentenceIter<P, R> {
        inner: Reader<R>,
        projectivizer: Option<P>,
    }
    

    and implement Iterator<Item = Sentence> for it.
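
    A minimal sketch of that impl, with hypothetical stand-ins for the sentence, reader, and projectivizer types (the real ones live in conllx and finalfrontier):

    // Hypothetical stand-ins for the real conllx/finalfrontier types.
    struct Sentence;

    struct Reader<R>(R);

    impl<R: Iterator<Item = Sentence>> Reader<R> {
        fn read_sentence(&mut self) -> Option<Sentence> {
            self.0.next()
        }
    }

    trait Projectivize {
        fn projectivize(&self, sentence: &mut Sentence);
    }

    struct SentenceIter<P, R> {
        inner: Reader<R>,
        projectivizer: Option<P>,
    }

    impl<P: Projectivize, R: Iterator<Item = Sentence>> Iterator for SentenceIter<P, R> {
        type Item = Sentence;

        fn next(&mut self) -> Option<Sentence> {
            let mut sentence = self.inner.read_sentence()?;
            // Projectivize only when the new switch was passed.
            if let Some(projectivizer) = &self.projectivizer {
                projectivizer.projectivize(&mut sentence);
            }
            Some(sentence)
        }
    }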

    There's some ugliness in the do_work method because HeadProjectivizer doesn't derive Copy + Clone.

    opened by sebpuetz 9
  • Merge finalfrontier into a single crate

    Mostly moving files around. I put most of the finalfrontier-utils library stuff into the io module. The remainder is now in the app module.

    We have a lot of crate-level symbols that should probably be removed over time (instead exposing just the submodules).

    opened by danieldk 6
  • Make SkipGramTrainer generic over Vocab.

    This commit makes SkipGramTrainer generic over the vocabulary.

    Mostly replacing SubwordVocab with V and adding trait bounds for V.

    Content-wise, the biggest changes are in vocab:

    The AssociatedIndices trait defines a method to return a slice of indices associated with some lookup idx. SimpleVocab returns Some(&[]), regardless of the input, the implementation on SubwordVocab is just a copy of the subword_indices_idx method.
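
    A minimal sketch of the trait as described (names follow the PR text; the index type is an assumption):

    trait AssociatedIndices {
        /// Indices associated with the word at `idx` (e.g. subword n-grams);
        /// `None` means the index is out of bounds.
        fn associated_indices(&self, idx: usize) -> Option<&[u64]>;
    }

    struct SimpleVocab;

    impl AssociatedIndices for SimpleVocab {
        fn associated_indices(&self, _idx: usize) -> Option<&[u64]> {
            // Simple vocabularies never carry subword indices.
            Some(&[])
        }
    }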

    Trainer delegates the n_input_types method to Vocab, since we can't calculate this number in Trainer without knowing the vocabulary type. This also adds flexibility in case an additional input vocabulary is ever involved.

    The next PR should be restricted to adding a flag to ff-train and making some of the methods generic.

    opened by sebpuetz 6
  • Changes to separate different types of hyperparameters.

    This PR looks bigger than it is. Mostly hyperparameters are moved around. Since changes to the config mean changes throughout the entire crate, this commit changes quite a number of files.

    To allow more flexibility with different configurations, the Config is split up into several specialized structs holding only the hyperparameters relevant to each component.

    The CommonConfig struct holds those hyperparameters that (at this point?) are shared by all training variants, including e.g. dims, negative_samples and loss_type.

    Additionally, there are two config structs for vocabularies: SimpleVocabConfig and SubwordVocabConfig, holding the relevant hyperparameters.

    Finally, there is the SkipgramConfig, which holds settings relevant to skipgram-like training routines, i.e. the context_window and the model_type determining whether vanilla skipgram or structgram is used.

    Since different Trainer instantiations can and will have different sets of configurations, changes were necessary to the WriteModelBinary impl on Trainmodel, too. Instead of converting the Config of the Trainmodel to a toml::Value, Trainer now defines a metadata method

    /// Get this Trainer's hyperparameters as `toml::Value`.
    fn to_metadata(&self) -> Result<Value, Error>;
    

    leaving it to the Trainer implementations to take care of the conversion.

    The SkipgramTrainer takes care of this conversion through an additional struct:

    #[derive(Clone, Copy, Debug, Serialize)]
    struct SkipgramMetadata<V>
    where
        V: Serialize,
    {
        common_config: CommonConfig,
        #[serde(rename = "model_config")]
        skipgram_config: SkipGramConfig,
        vocab_config: V,
    }
    

    which is converted to a toml::Value.
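
    A sketch of how that conversion could look end to end; the config fields and values below are hypothetical stand-ins, and the real structs carry more hyperparameters:

    use serde::Serialize;
    use toml::Value;

    // Hypothetical, trimmed-down config structs for illustration.
    #[derive(Clone, Copy, Debug, Serialize)]
    struct CommonConfig {
        dims: u32,
        negative_samples: u32,
    }

    #[derive(Clone, Copy, Debug, Serialize)]
    struct SkipGramConfig {
        context_window: u32,
    }

    #[derive(Clone, Copy, Debug, Serialize)]
    struct SubwordVocabConfig {
        min_count: u32,
    }

    #[derive(Clone, Copy, Debug, Serialize)]
    struct SkipgramMetadata<V>
    where
        V: Serialize,
    {
        common_config: CommonConfig,
        #[serde(rename = "model_config")]
        skipgram_config: SkipGramConfig,
        vocab_config: V,
    }

    fn main() -> Result<(), toml::ser::Error> {
        let metadata = SkipgramMetadata {
            common_config: CommonConfig { dims: 300, negative_samples: 5 },
            skipgram_config: SkipGramConfig { context_window: 10 },
            vocab_config: SubwordVocabConfig { min_count: 30 },
        };
        // Value::try_from serializes any Serialize type into a toml::Value.
        let value = Value::try_from(metadata)?;
        println!("{}", value);
        Ok(())
    }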

    This framework is compatible with training dependency embeddings without requiring a lot of additional changes. The next steps in my mind are to make the SkipgramTrainer's input vocabulary generic, then adjust ff-train accordingly and finally introduce the DepembedsTrainer in a third PR. After the DepembedsTrainer is there, only minor changes would be left to ff-train (or a decision to split things into different crates/binaries).

    opened by sebpuetz 5
  • Dealing with different set of command-line options

    I have implemented support for training floret embeddings, but the command line gets a bit unwieldy. Floret is quite a bit different from what we have so far:

    • We need an option to set the number of hashes.
    • We need an option to set the seed for murmur3.
    • Upstream floret doesn't use a matrix size that is a power of 2; it would be nice to provide the same freedom in finalfrontier.
    • Most output formats do not really make sense for floret, e.g. the word2vec and text formats are useless, since floret does not use word embeddings.

    I see two ways forward:

    1. We add the necessary options and validations to ensure that no incompatible set of options is used.
    2. We add another level of subcommands, with only the relevant set of options, e.g.: finalfrontier skipgram floret, finalfrontier skipgram fasttext, finalfrontier skipgram buckets, finalfrontier skipgram explicit and the same for deps.

    For (2), I am not sure if this is the best partitioning.
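
    For illustration, option (2) could look like the following with clap's derive API (a sketch assuming clap 4; the flag names --hashes and --seed are hypothetical):

    use clap::{Args, Parser, Subcommand};

    #[derive(Parser)]
    #[command(name = "finalfrontier")]
    struct Cli {
        #[command(subcommand)]
        model: Model,
    }

    #[derive(Subcommand)]
    enum Model {
        /// Train skipgram-like embeddings.
        Skipgram {
            #[command(subcommand)]
            subwords: Subwords,
        },
    }

    #[derive(Subcommand)]
    enum Subwords {
        /// Floret-style embeddings; only floret output makes sense here.
        Floret(FloretArgs),
        /// fastText-compatible bucketed subwords.
        Fasttext,
        /// FNV-bucketed subwords.
        Buckets,
        /// Explicitly stored n-grams.
        Explicit,
    }

    #[derive(Args)]
    struct FloretArgs {
        /// Number of hashes per n-gram (hypothetical flag).
        #[arg(long, default_value_t = 2)]
        hashes: u32,
        /// Seed for the murmur3 hash (hypothetical flag).
        #[arg(long, default_value_t = 0)]
        seed: u32,
    }

    fn main() {
        let cli = Cli::parse();
        match cli.model {
            Model::Skipgram { subwords } => match subwords {
                Subwords::Floret(args) => println!("floret: {} hashes", args.hashes),
                _ => println!("other subword mode"),
            },
        }
    }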

    opened by danieldk 4
  • Add support for vocab size target

    1. Add an option vocab_size allowing people to set a target vocabulary size; "mincount" is then sized so that the vocabulary size stays below the target (see the sketch after this list).
    2. Add enum VocabCutoff with two variants: TargetVocabSize and MinCount
    3. Refactor From<VocabBuilder<SimpleVocabConfig, T>> for SimpleVocab and From<VocabBuilder<SubwordVocabConfig, T>> for SubwordVocab, so that vocabs can be built from the corresponding VocabConfig with either TargetVocabSize or MinCount
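
    A minimal sketch of the idea behind (1), deriving a minimum count from a target vocabulary size (illustration only, not the PR's actual code):

    fn min_count_for_target(mut counts: Vec<usize>, target_size: usize) -> usize {
        counts.sort_unstable_by(|a, b| b.cmp(a)); // most frequent first
        if counts.len() <= target_size {
            return 1; // everything already fits
        }
        // Keep only types that occur strictly more often than the first
        // type that falls outside the target size.
        counts[target_size] + 1
    }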

    Based on: #58

    opened by RealNicolasBourbaki 4
  • Add flag to dump the context matrix

    I see two ways:

    1. Extend finalfusion to handle files with both input and output matrices
    2. Dump the output matrix + vocab in a separate finalfusion model.

    Re 1: more work, and it might make the APIs (more) complex. Re 2: a hackier solution; output types need to implement to_string(), and lookup is consequently also done through stringly-typed keys.

    feature question 
    opened by sebpuetz 4
  • Add AppStructs for Depembeds and SkipGram.

    Rather lightweight changes, with lots of lines.

    While implementing the ff-deps binary, I realized I didn't really like using the same AppBuilder for both training types, so I defined SkipGramApp and DepembedsApp structs that handle all clap-related business and construct all the required configs. The required configs and values can be accessed through getter methods.

    Next step after this PR is the actual binary, then clean-up (there were some incorrect docs somewhere in the dep module), and we should be ready to release.

    opened by sebpuetz 4
  • Bump regex from 1.5.4 to 1.5.6

    Bumps regex from 1.5.4 to 1.5.6.

    Changelog

    Sourced from regex's changelog.

    1.5.6 (2022-05-20)

    This release includes a few bug fixes, including a bug that produced incorrect matches when a non-greedy ? operator was used.

    1.5.5 (2022-03-08)

    This release fixes a security bug in the regex compiler. This bug permits a vector for a denial-of-service attack in cases where the regex being compiled is untrusted. There are no known problems where the regex itself is trusted, including in cases of untrusted haystacks.

    dependencies 
    opened by dependabot[bot] 0
  • Bump crossbeam-utils from 0.8.5 to 0.8.8

    Bumps crossbeam-utils from 0.8.5 to 0.8.8.

    Release notes

    Sourced from crossbeam-utils's releases.

    crossbeam-utils 0.8.8

    • Fix a bug when unstable loom support is enabled. (#787)

    crossbeam-utils 0.8.7

    • Add AtomicCell<{i*,u*}>::{fetch_max,fetch_min}. (#785)
    • Add AtomicCell<{i*,u*,bool}>::fetch_nand. (#785)
    • Fix unsoundness of AtomicCell<{i,u}64> arithmetics on 32-bit targets that support Atomic{I,U}64 (#781)

    crossbeam-utils 0.8.6

    • Re-add AtomicCell<{i,u}64>::{fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor} that were accidentally removed in 0.8.0 on targets that do not support Atomic{I,U}64. (#767)
    • Re-add AtomicCell<{i,u}128>::{fetch_add,fetch_sub,fetch_and,fetch_or,fetch_xor} that were accidentally removed in 0.8.0. (#767)

    dependencies 
    opened by dependabot[bot] 0
  • Thank you

    Hi @danieldk @sebpuetz ,

    I just want to say thanks. You saved me from a text-classifier deadline where flair/BERT embeddings would have been too slow, and I was unable to find any magical invocation (tried version 3.8, version 4, other parameters) to force gensim to train a working word2vec model; it would simply not converge and got worse with each epoch. You rock, and finalfrontier from my POV looks like the only game in town (spaCy was lackluster even with floret)!!!

    opened by bratao 0
  • Implement ASAG and AsySVRG

    As you are using Hogwild! for the multicore SGD implementation, it would perhaps be interesting to investigate whether you can speed up the optimization with

    • AsySVRG (https://arxiv.org/abs/1508.05711)
    • ASAG (https://arxiv.org/abs/1606.04809)
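
    For context, a minimal sketch of the Hogwild-style update these methods would compete with: threads write shared parameters without locks and tolerate lost updates (illustration only, not finalfrontier's actual hogwild implementation):

    use std::sync::atomic::{AtomicU32, Ordering};
    use std::sync::Arc;
    use std::thread;

    // Parameters live in shared memory as raw f32 bits; threads update
    // them without locks. Races can lose updates, which Hogwild tolerates.
    fn hogwild_update(params: &[AtomicU32], idx: usize, grad: f32, lr: f32) {
        let old = f32::from_bits(params[idx].load(Ordering::Relaxed));
        params[idx].store((old - lr * grad).to_bits(), Ordering::Relaxed);
    }

    fn main() {
        let params: Arc<Vec<AtomicU32>> =
            Arc::new((0..4).map(|_| AtomicU32::new(0f32.to_bits())).collect());
        let handles: Vec<_> = (0..4)
            .map(|_| {
                let params = Arc::clone(&params);
                thread::spawn(move || {
                    for i in 0..1000usize {
                        // Toy gradient; a real trainer derives it from the loss.
                        hogwild_update(&params, i % 4, 0.01, 0.05);
                    }
                })
            })
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
        println!("{}", f32::from_bits(params[0].load(Ordering::Relaxed)));
    }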

    ps nice project you have there :+1:

    opened by bytesnake 1
  • Optimization opportunities

    The vast majority of time during training is spent in the dot products and scaled additions. We have been doing unaligned loads so far. I made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code seems OK, and the compiler even performs some loop unrolling.

    Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
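
    For illustration, the shape of such an SSE dot-product kernel (an editorial sketch, not the crate's actual code: _mm_loadu_ps is the unaligned load, and with 16-byte-aligned embeddings it could be swapped for the aligned _mm_load_ps; the real kernels also handle length remainders and runtime feature selection):

    #[cfg(target_arch = "x86_64")]
    unsafe fn dot_sse(a: &[f32], b: &[f32]) -> f32 {
        use std::arch::x86_64::*;
        assert_eq!(a.len(), b.len());
        assert_eq!(a.len() % 4, 0, "sketch assumes a multiple of 4");
        let mut sums = _mm_setzero_ps();
        for i in (0..a.len()).step_by(4) {
            let va = _mm_loadu_ps(a.as_ptr().add(i));
            let vb = _mm_loadu_ps(b.as_ptr().add(i));
            sums = _mm_add_ps(sums, _mm_mul_ps(va, vb));
        }
        // Horizontal sum of the four lanes.
        let mut lanes = [0f32; 4];
        _mm_storeu_ps(lanes.as_mut_ptr(), sums);
        lanes.iter().sum()
    }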

    I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.

    1. Some papers that replace the core word2vec computations by kernels sample a set of negatives per sentence, rather than per token. In the base case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.

    2. The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash-code frequency. However, the most obvious implementation would add an indirection (hash code -> index).

    @sebpuetz

    opened by danieldk 1
Releases(0.9.4)
  • 0.9.4(Jul 27, 2020)

  • 0.9.1(Jul 2, 2020)

    Do you use fastText, but would you also like to get your hands on structured skipgram, directional skipgram, or dependency embeddings models? This is now possible, since finalfrontier 0.9.1 adds support for saving trained embeddings in the fastText format :tada:.

    With the new --output flag, you can save embeddings in other formats in addition to finalfusion. Options are: fasttext, word2vec binary, text, or textdims.

    Source code(tar.gz)
    Source code(zip)
  • 0.9.0(Jun 24, 2020)

    • Add support for training with a target vocabulary size. This is an alternative to setting a minimum token count and will attempt to create a vocabulary of the given size. Target vocabulary sizes are enabled through the --context-target-size, --target-size, and --ngram-target-size options. (@sebpuetz)
    • SIMD code paths are now dynamically selected at run-time. It is thus not necessary anymore to compile finalfrontier with specific target features to use code paths for newer SIMD instruction sets. (@danieldk)
    • Add dot product implementation using FMA (fused multiply-add). (@danieldk)
    • Enable training with the fastText indexer. With future changes in finalfusion-rust and finalfusion-convert, this will allow you to create fastText embeddings with finalfrontier! (@sebpuetz)
    Source code(tar.gz)
    Source code(zip)
    finalfrontier-0.9.0-x86_64-unknown-linux-musl.tar.gz(874.25 KB)
  • 0.8.0(Jun 23, 2020)

  • 0.7.0(Nov 8, 2019)

    • The most user-visible change is that ff-train-deps and ff-train-skipgram have been merged into one command, finalfrontier. Dependency and skipgram embeddings can be trained with respectively finalfrontier deps and finalfrontier skipgram.

    • Support for training explicit subwords has been added.

      Thus far, finalfrontier has followed the same subword approach as fastText: each subword (n-gram) is mapped to an embedding using the FNV-1 hash function. This approach bounds the number of embeddings when the corpus contains a large number of distinct n-grams, at the cost of hash collisions (bucketed lookup is sketched after this list). With the --subwords ngrams option, finalfrontier uses an (explicit) n-gram vocabulary instead.

    • The hogwild and finalfrontier-utils crates have been merged into the finalfrontier crate. Consequently, finalfrontier now consists of a single crate.

    • When the number of threads is not specified, finalfrontier has traditionally used half the logical CPUs. This has been refined to use half the number of logical CPUs, capped at 20 threads: using more than 20 threads can slow convergence drastically on typical corpora.
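
    For context, the bucketed (hash-based) subword lookup mentioned above works roughly as follows (a sketch; the constants are the standard 64-bit FNV-1 parameters, but finalfrontier's exact scheme may differ):

    // Each n-gram is hashed with FNV-1 and mapped into 2^buckets_exp
    // buckets; distinct n-grams can collide in the same bucket.
    fn fnv1(bytes: &[u8]) -> u64 {
        let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
        for &byte in bytes {
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
            hash ^= u64::from(byte);
        }
        hash
    }

    fn bucket(ngram: &str, buckets_exp: u32) -> u64 {
        fnv1(ngram.as_bytes()) % (1u64 << buckets_exp)
    }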

    Source code(tar.gz)
    Source code(zip)
  • 0.6.0(Jun 14, 2019)

    This release has the following changes:

    • Add support for the directional skip-gram model (Song et al., 2018).
    • Store norms in a finalfusion chunk, making it possible to retrieve the unnormalized embeddings.
    • Better defaults for skip-gram models: context size 5 -> 10, dimensions 100 -> 300, epochs 5 -> 15.
    • Improved command-line option handling.
    Source code(tar.gz)
    Source code(zip)
    finalfrontier-0.6.0-x86_64-apple-darwin.tar.gz(1.70 MB)
    finalfrontier-0.6.0-x86_64-unknown-linux-gnu.tar.gz(1.94 MB)
    finalfrontier-0.6.0-x86_64-unknown-linux-musl.tar.gz(1.87 MB)
  • 0.5.0(Apr 25, 2019)

    The addition in this release is support for dependencies as context. This makes it possible to train dependency embeddings as described by Levy & Goldberg, 2014. The dependency embedding model can be tuned in fine-grained detail (such as the depth of the relations).

    • Add dependency relations.
    • Refactor training to make it easier to add different context types.
    • Precompiled releases, including a MUSL target.
    • Migration to Rust 2018.
    • ff-train has been renamed to ff-train-skipgram.
    Source code(tar.gz)
    Source code(zip)
    finalfrontier-0.5.0-x86_64-apple-darwin.tar.gz(1.62 MB)
    finalfrontier-0.5.0-x86_64-unknown-linux-gnu.tar.gz(1.86 MB)
    finalfrontier-0.5.0-x86_64-unknown-linux-musl.tar.gz(1.79 MB)
  • 0.4.1(Apr 3, 2019)

  • v0.4.0(Mar 11, 2019)

    The most important change in this release is that finalfrontier stores trained embeddings in the finalfusion format, which is implemented by rust2vec and finalfusion-python. This format is more generic than the old finalfrontier format and easier to implement readers for.

    As a result of these changes, finalfrontier is now only for training embeddings. To actually use the embeddings in your own program, use rust2vec.

    Summary of changes since v0.2.0:

    • Store trained embeddings in finalfusion format.
    • Remove ff-similarity, ff-convert, and ff-compute-accuracy. This functionality is provided by rust2vec.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Feb 4, 2019)

    • Update to ndarray 0.12
    • ff-train: show a progress bar while reading the vocabulary.
    • Fix a bug in negative sampling for the structured skip-gram model.
    • Remove dependency between cloned RNGs.
    • Fix an off-by-one error in the loss computation, which resulted in reported losses that were too low.
    • Add ff-compute-accuracy for evaluating analogies.
    • Do not unnecessarily sort the vocabulary upon deserialization.
    • ff-train: add the zipf option to specify the exponent of the Zipf distribution.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Sep 30, 2018)

    • Support memory-mapped embedding matrices. This makes loading of the embedding matrices instantaneous and reduces memory use. The use of memory-mapped matrices comes at a small cost in efficiency.
    • Normalize stored embeddings by their l2 norms. This avoids normalization when loading a model and simplifies the functionality for similarity/analogy queries. The l2 norms are stored, so the word vectors can be restored to their original magnitudes by multiplying them by their l2 norms (see the sketch below).
    • Add subword representation embeddings to known words in stored models. This speeds up the retrieval of known word embeddings. There is no loss of information: the original word embeddings can be reconstructed by subtracting their subword embeddings.
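
    A minimal sketch of the norm bookkeeping (illustration using ndarray, not the stored finalfusion layout):

    use ndarray::Array1;

    // A stored unit-length embedding can be restored to its original
    // magnitude by scaling with its stored l2 norm.
    fn restore(unit: &Array1<f32>, norm: f32) -> Array1<f32> {
        unit * norm
    }

    fn main() {
        let raw = Array1::from(vec![3.0_f32, 4.0]);
        let norm = raw.dot(&raw).sqrt(); // 5.0
        let unit = &raw / norm; // what gets stored
        let restored = restore(&unit, norm); // ≈ raw, up to rounding
        println!("{} {}", unit, restored);
    }
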
    Source code(tar.gz)
    Source code(zip)