A naive (read: slow) implementation of Word2Vec. Uses BLAS behind the scenes for speed.

Overview

SloWord2Vec

This is a naive implementation of Word2Vec, written in Rust.

The goal is to learn the basic principles and formulas behind Word2Vec. BTW, it's slow ;)

Getting it

SloWord2Vec is available both as a library and as a binary.

Binary

A naive Word2Vec implementation

USAGE:
    sloword2vec [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    add-subtract    Given a number of words to add and to subtract, returns a list of words in that area.
    help            Prints this message or the help of the given subcommand(s)
    similar         Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.
    train           Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

Training

Given a corpus and a path to save a trained model, trains Word2Vec encodings for the vocabulary in the corpus and saves it.

USAGE:
    sloword2vec train [OPTIONS] --corpus <corpus> --path <path>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --acceptable-error <acceptable-error>              Acceptable error threshold under which training will end. [default: 0.1]
    -R, --context-radius <context-radius>                  The context radius (how many word surrounding a centre word to take into account per training sample). [default: 5]
    -C, --corpus <corpus>                                  Where the corpus file is.
    -D, --dimensions <dimensions>                          Number of dimensions to use for encoding a word as a vector. [default: 100]
    -I, --iterations <iterations>                          Max number of training iterations. [default: 500]
    -L, --learning-rate <learning-rate>                    Learning rate. [default: 0.001]
    -M, --min-error-improvement <min-error-improvement>    Minimum improvement in average error magnitude in a single training iteration (over all words) to keep on training [default: 0.0001]
    -O, --min-word-occurences <min-word-occurences>        Minimum number of occurences in the corpus a word needs to have in order to be included in the trained vocabulary. [default: 2]
    -P, --path <path>                                      Where to store the model when training is done.
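
To make the training options above more concrete, here is a minimal sketch of how a minimum occurrence count, a context radius and the three stopping options could fit together. It is an illustration only, not the crate's actual code: the function names `build_vocab`, `training_pairs` and `should_stop` are hypothetical, and the exact way SloWord2Vec combines these options is an assumption based on the help text.

```rust
use std::collections::HashMap;

/// Keep only words seen at least `min_occurrences` times (cf. --min-word-occurences).
/// The result is sorted so that the ordering is deterministic.
fn build_vocab(tokens: &[&str], min_occurrences: usize) -> Vec<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &token in tokens {
        *counts.entry(token).or_insert(0) += 1;
    }
    let mut vocab: Vec<String> = counts
        .into_iter()
        .filter(|&(_, n)| n >= min_occurrences)
        .map(|(word, _)| word.to_string())
        .collect();
    vocab.sort();
    vocab
}

/// Pair each centre word with every word at most `radius` positions away
/// (cf. --context-radius).
fn training_pairs<'a>(tokens: &[&'a str], radius: usize) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    for (i, &centre) in tokens.iter().enumerate() {
        let start = i.saturating_sub(radius);
        let end = (i + radius + 1).min(tokens.len());
        for j in start..end {
            if j != i {
                pairs.push((centre, tokens[j]));
            }
        }
    }
    pairs
}

/// Assumed stopping rule: stop when the iteration budget is used up, the error
/// is under the acceptable threshold, or the error improved by too little.
fn should_stop(
    iteration: usize,
    max_iterations: usize,
    previous_error: f64,
    error: f64,
    acceptable_error: f64,
    min_error_improvement: f64,
) -> bool {
    iteration >= max_iterations
        || error < acceptable_error
        || previous_error - error < min_error_improvement
}
```

For example, `training_pairs(&["a", "b", "c"], 1)` pairs "b" with both of its neighbours, while "a" and "c" each get a single context word.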

Similarity

Given a path to a saved Word2Vec model and a target word, finds words in the model's vocab that are similar.

USAGE:
    sloword2vec similar --limit <limit> --path <path> --word <word>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -L, --limit <limit>    Max number of similar entries to show. [default: 20]
    -P, --path <path>      Where to store the model when training is done.
    -W, --word <word>      Word to find similar terms for.
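
As a rough idea of what a similarity lookup involves, the sketch below ranks words by cosine similarity to the target word's vector and keeps the top `limit` entries. Cosine similarity is an assumption here (the README does not say which measure SloWord2Vec uses), and `cosine_similarity`, `most_similar` and the in-memory model layout are hypothetical names chosen for illustration.

```rust
/// Cosine of the angle between two vectors; 0.0 if either has zero length.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Return up to `limit` words from `model`, most similar to `word` first.
/// The model is a hypothetical list of (word, vector) entries.
fn most_similar<'a>(
    model: &'a [(String, Vec<f64>)],
    word: &str,
    limit: usize,
) -> Vec<(&'a str, f64)> {
    // An unknown target word yields no results.
    let target = match model.iter().find(|(w, _)| w == word) {
        Some((_, vector)) => vector,
        None => return Vec::new(),
    };
    let mut scored: Vec<(&str, f64)> = model
        .iter()
        .filter(|(w, _)| w != word)
        .map(|(w, vector)| (w.as_str(), cosine_similarity(target, vector)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```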

Add subtract

The classic demo of Word2Vec.

Given a number of words to add and to subtract, returns a list of words in that area.

USAGE:
    sloword2vec add-subtract --add <add>... --limit <limit> --path <path> --subtract <subtract>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -A, --add <add>...              Words to add encodings for
    -L, --limit <limit>             Max number of similar entries to show. [default: 20]
    -P, --path <path>               Where to store the model when training is done.
    -S, --subtract <subtract>...    Words to subtract encodings for
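
The idea behind this subcommand can be sketched as follows: sum the vectors of the words to add, subtract the vectors of the words to subtract, then list the words whose vectors lie closest to the result. This is an illustrative sketch under assumptions, not the crate's implementation: `cosine`, `add_subtract` and the `HashMap` model are hypothetical, and both the use of cosine similarity and the exclusion of the input words from the results are guesses.

```rust
use std::collections::HashMap;

/// Cosine of the angle between two vectors; 0.0 if either has zero length.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Build `sum(add) - sum(subtract)` and return the `limit` nearest words.
/// Words missing from the model are silently skipped.
fn add_subtract(
    model: &HashMap<String, Vec<f64>>,
    add: &[&str],
    subtract: &[&str],
    limit: usize,
) -> Vec<(String, f64)> {
    let dimensions = model.values().next().map_or(0, |v| v.len());
    let mut target = vec![0.0; dimensions];
    for word in add {
        if let Some(vector) = model.get(*word) {
            for (t, x) in target.iter_mut().zip(vector) {
                *t += x;
            }
        }
    }
    for word in subtract {
        if let Some(vector) = model.get(*word) {
            for (t, x) in target.iter_mut().zip(vector) {
                *t -= x;
            }
        }
    }
    let mut scored: Vec<(String, f64)> = model
        .iter()
        // Skip the query words themselves so they do not dominate the list.
        .filter(|(w, _)| !add.contains(&w.as_str()) && !subtract.contains(&w.as_str()))
        .map(|(w, vector)| (w.clone(), cosine(&target, vector)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

With a toy model where the dimensions loosely mean (royal, male, female), adding "king" and "woman" and subtracting "man" lands exactly on the vector for "queen".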

Details

This is pretty much the most naive implementation of Word2Vec possible; the only special thing is the use of matrix/vector maths to speed things up.

The linear algebra library behind this lib is ndarray, with OpenBLAS enabled (Fortran and transparent multithreading FTW!).
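
To show the kind of matrix/vector maths involved, here is a minimal sketch of a skip-gram style forward pass: look up the centre word's row in an input weight matrix, multiply it by an output weight matrix, and turn the scores into probabilities with a softmax. It uses plain `Vec`s and loops rather than ndarray and BLAS so that it stays self-contained, and it is an assumption about the general technique described in the reference material, not a copy of this crate's code. The names `mat_vec`, `softmax` and `predict_context` are hypothetical.

```rust
/// Multiply a row-major matrix by a vector, giving one dot product per row.
fn mat_vec(matrix: &[Vec<f64>], vector: &[f64]) -> Vec<f64> {
    matrix
        .iter()
        .map(|row| row.iter().zip(vector).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}

/// Numerically stable softmax: subtract the max score before exponentiating.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}

/// Probability of each vocabulary word appearing in the context of the word
/// at index `centre`. `input_weights` has one row per word (its encoding);
/// `output_weights` has one row per word to score against that encoding.
fn predict_context(
    input_weights: &[Vec<f64>],
    output_weights: &[Vec<f64>],
    centre: usize,
) -> Vec<f64> {
    let hidden = &input_weights[centre];
    softmax(&mat_vec(output_weights, hidden))
}
```

In a BLAS-backed version, the per-row dot products in `mat_vec` are the part that would be handed off to a single matrix-vector call.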

Reference material

  1. Word2Vec Parameter learning explained paper
  2. Word2Vec Skip-gram model tutorial article