Rust interface to word2vec.

Overview

word2vec

Rust interface to word2vec word vectors.

This crate reads trained word vector files produced by word2vec. It does not provide model training, so it requires an already trained model.

Documentation

Documentation is available at https://github.com/DimaKudosh/word2vec/wiki

Example

Add this to your Cargo.toml:

[dependencies]
# …
word2vec = "0.3.3"

Example for word similarity and word clusters:

extern crate word2vec;

fn main() {
    // Load trained vectors from a word2vec binary file.
    let model = word2vec::wordvectors::WordVector::load_from_binary("vectors.bin")
        .expect("Unable to load word vector model");

    // Query the 10 nearest words to "snow" by cosine similarity.
    println!("{:?}", model.cosine("snow", 10));

    // Analogy query: woman + king - man.
    let positive = vec!["woman", "king"];
    let negative = vec!["man"];
    println!("{:?}", model.analogy(positive, negative, 10));

    // Load word clusters from a classes file.
    let clusters = word2vec::wordclusters::WordClusters::load_from_file("classes.txt")
        .expect("Unable to load word clusters");
    println!("{:?}", clusters.get_cluster("belarus"));
    println!("{:?}", clusters.get_words_on_cluster(6));
}
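
For readers unfamiliar with what the similarity and analogy queries compute, the following is a minimal sketch of the underlying vector arithmetic. It is not the crate's implementation: the function names `cosine_similarity` and `analogy_query` are hypothetical, and averaging the signed input vectors is an assumption borrowed from common word2vec tooling. Both functions return None for degenerate input, so an empty set of vectors never turns into a 0/0 division.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Builds an analogy query vector: positive vectors are added,
/// negative vectors are subtracted, and the sum is divided by the
/// number of inputs. Returns None when no vectors are supplied or
/// when their dimensions disagree.
fn analogy_query(positive: &[&[f32]], negative: &[&[f32]]) -> Option<Vec<f32>> {
    let count = positive.len() + negative.len();
    // With no inputs at all there is nothing to average.
    let dim = positive.first().or(negative.first())?.len();
    let mut query = vec![0.0f32; dim];
    for v in positive {
        if v.len() != dim {
            return None;
        }
        for (q, x) in query.iter_mut().zip(v.iter()) {
            *q += x;
        }
    }
    for v in negative {
        if v.len() != dim {
            return None;
        }
        for (q, x) in query.iter_mut().zip(v.iter()) {
            *q -= x;
        }
    }
    for q in query.iter_mut() {
        *q /= count as f32;
    }
    Some(query)
}
```

A nearest-word search would then score every vocabulary vector with `cosine_similarity` against the query and keep the highest-scoring entries, skipping the input words themselves.
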
Comments
  • Streaming WordVector Loading

    Broke out the parsing part of wordvectors into a separate WordVectorReader, which lets you iterate over very large sets of word vectors without loading them into memory, or stream them into your own data structure without first buffering them in a Vec.

    opened by martindevans 2
  • Recognize binary vector files without line breaks

    The function that loads binary word vector files assumed a newline before each (name, word vector) pair. The Google News corpus doesn't have these, so loading the file failed. Gensim assumes no newlines and strips them if they occur. This crate now does the same.

    If you accept this PR, please make a bug fix release.

    opened by humenda 1
  • Loading From Reader

    Added load_from_reader methods to WordClusters and WordVectors. This allows loading word vectors from sources other than plain files, such as a TCP connection or a compressed file.

    opened by martindevans 0
  • analogy returns NaNs when none of its inputs exist in its corpus

    analogy has a return type of Option<_> and takes care to return None if it's passed two empty vectors. However, it doesn't bother to check whether vectors.is_empty() and instead ends up returning the top entries from its corpus with scores of NaN as a result.

    Fixing this should be as simple as changing this line to if exclude.is_empty() || vectors.is_empty() {.

    opened by altayhunter 0
  • SIMD and other optimizations

    I created a branch to try out SIMD optimizations, and it made a noticeable difference. I added some benchmarks to check the differences: run cargo bench vs. cargo bench --features=simd on that branch. There are some future considerations regarding SIMD, such as core::simd being in the making. Also, for now, the vector size needs to be divisible by 4. The simd branch contains many other changes and optimizations, such as using a HashMap for the vocabulary, which makes the code easier to understand than a Vec<(String, Vec<f32>)>. It also introduces an API break: Self::Item changes from String to &String in impl<'a> Iterator for Words<'a>, for the sake of readability and performance in the implementation.

    That's a lot of changes and I'd be open to submitting them in smaller PRs.

    The full changes are here: https://github.com/jayay/word2vec/compare/master..simd

    opened by jayay 0
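
Several of the comments above concern the loader: streaming entries instead of buffering them, reading from any reader rather than only a file, and tolerating files with or without line breaks between entries. The sketch below illustrates how these ideas can fit together. It is not the crate's WordVectorReader; the names `VectorReader` and `read_entry` are hypothetical. It assumes the usual word2vec binary layout: an ASCII header line "vocab_size dim", then for each entry the word terminated by a space, followed by dim little-endian f32 values and an optional newline.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical streaming reader over word2vec binary data.
pub struct VectorReader<R: BufRead> {
    reader: R,
    remaining: usize, // entries still to be read
    dim: usize,       // number of f32 values per entry
}

impl<R: BufRead> VectorReader<R> {
    /// Parses the "vocab_size dim" header line and prepares for iteration.
    pub fn new(mut reader: R) -> io::Result<Self> {
        let mut header = String::new();
        reader.read_line(&mut header)?;
        let mut parts = header.split_whitespace();
        let mut next_number = || {
            parts
                .next()
                .and_then(|p| p.parse::<usize>().ok())
                .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "bad header"))
        };
        let remaining = next_number()?;
        let dim = next_number()?;
        Ok(VectorReader { reader, remaining, dim })
    }

    /// Reads one (word, vector) pair from the underlying reader.
    fn read_entry(&mut self) -> io::Result<(String, Vec<f32>)> {
        let mut word = Vec::new();
        self.reader.read_until(b' ', &mut word)?;
        // read_until keeps the delimiter; if it is missing we hit EOF early.
        if word.pop() != Some(b' ') {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated word"));
        }
        // Some files put a newline after each vector, others do not:
        // skip any leading newline bytes so both variants parse.
        let start = word.iter().position(|&b| b != b'\n').unwrap_or(word.len());
        let word = String::from_utf8_lossy(&word[start..]).into_owned();

        let mut bytes = vec![0u8; self.dim * 4];
        self.reader.read_exact(&mut bytes)?;
        let vector = bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok((word, vector))
    }
}

impl<R: BufRead> Iterator for VectorReader<R> {
    type Item = io::Result<(String, Vec<f32>)>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        let entry = self.read_entry();
        // Stop after the first error instead of parsing misaligned bytes.
        self.remaining = if entry.is_ok() { self.remaining - 1 } else { 0 };
        Some(entry)
    }
}
```

Because the type is generic over BufRead, the same code can be fed a std::io::BufReader wrapped around a file, a TCP stream, or a decompressor, and each entry can be moved directly into the caller's own data structure as it is read.
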
Owner
Dima Kudosh
A naive (read: slow) implementation of Word2Vec. Uses BLAS behind the scenes for speed.

SloWord2Vec This is a naive implementation of Word2Vec implemented in Rust. The goal is to learn the basic principles and formulas behind Word2Vec. BT

Lloyd 2 Jul 5, 2018
Rust-nlp is a library to use Natural Language Processing algorithm with RUST

nlp Rust-nlp Implemented algorithm Distance Levenshtein (Explanation) Jaro / Jaro-Winkler (Explanation) Phonetics Soundex (Explanation) Metaphone (Exp

Simon Paitrault 34 Dec 20, 2022
Fast suffix arrays for Rust (with Unicode support).

suffix Fast linear time & space suffix arrays for Rust. Supports Unicode! Dual-licensed under MIT or the UNLICENSE. Documentation https://docs.rs/suff

Andrew Gallant 207 Dec 26, 2022
Elastic tabstops for Rust.

tabwriter is a crate that implements elastic tabstops. It provides both a library for wrapping Rust Writers and a small program that exposes the same

Andrew Gallant 212 Dec 16, 2022
An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

Martin Geisler 322 Dec 26, 2022
⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

Michele Federici 81 Dec 6, 2022
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

regex A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a f

The Rust Programming Language 2.6k Jan 8, 2023
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/

Whatlang Natural language detection for Rust with focus on simplicity and performance. Content Features Get started Documentation Supported languages

Sergey Potapov 805 Dec 28, 2022
Multilingual implementation of RAKE algorithm for Rust

RAKE.rs The library provides a multilingual implementation of Rapid Automatic Keyword Extraction (RAKE) algorithm for Rust. How to Use Append rake to

Navid 26 Dec 16, 2022
A Rust library for generically joining iterables with a separator

joinery A Rust library for generically joining iterables with a separator. Provides the tragically missing string join functionality to rust. extern c

Nathan West 72 Dec 16, 2022
Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

Daniel Liu 75 Jan 8, 2023
Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)

rust-bert Rust native Transformer-based models implementation. Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing

null 1.3k Jan 8, 2023
👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike

Table of Contents What does this library do? Why does this library exist? Which languages are supported? How good is it? Why is it better than other l

Peter M. Stahl 569 Jan 3, 2023
Snips NLU rust implementation

Snips NLU Rust Installation Add it to your Cargo.toml: [dependencies] snips-nlu-lib = { git = "https://github.com/snipsco/snips-nlu-rs", branch = "mas

Snips 327 Dec 26, 2022
A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

nlprule A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based app

Benjamin Minixhofer 496 Jan 8, 2023
A fast implementation of Aho-Corasick in Rust.

aho-corasick A library for finding occurrences of many patterns at once with SIMD acceleration in some cases. This library provides multiple pattern s

Andrew Gallant 662 Dec 31, 2022
Natural Language Processing for Rust

rs-natural Natural language processing library written in Rust. Still very much a work in progress. Basically an experiment, but hey maybe something c

Chris Tramel 211 Dec 28, 2022
finalfusion embeddings in Rust

Introduction finalfusion is a crate for reading, writing, and using embeddings in Rust. finalfusion primarily works with its own format which supports

finalfusion 55 Jan 2, 2023
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

rust-tokenizers Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigra

null 165 Jan 1, 2023