Rust port of sentence-transformers (https://github.com/UKPLab/sentence-transformers)

Overview

Rust SBert

Rust port of sentence-transformers using rust-bert and tch-rs.

Supports both rust-tokenizers and Hugging Face's tokenizers.

Supported models

  • distiluse-base-multilingual-cased: Supported languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish. Performance on the extended STS2017: 80.1

  • DistilRoBERTa-based classifiers

Usage

Example

The API is designed to be easy to use and lets you create quality multilingual sentence embeddings in a straightforward way.

Load an SBert model and its weights by specifying the model directory:

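use std::env;
use std::path::PathBuf;

// Build the path to the model directory, relative to the current working directory
let mut home: PathBuf = env::current_dir().unwrap();
home.push("path-to-model");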

Pick the model variant that matches the tokenizer you want to use:

// To use Hugging Face tokenizer
let sbert_model = SBertHF::new(home.to_str().unwrap());

// To use Rust-tokenizers
let sbert_model = SBertRT::new(home.to_str().unwrap());

Now, you can encode your sentences:

let texts = ["You can encode",
             "As many sentences",
             "As you want",
             "Enjoy ;)"];

let batch_size = 64;

let output = sbert_model.forward(texts.to_vec(), batch_size).unwrap();

The batch_size parameter can be set to None to let the model use its default value.
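One common way for a Rust API to accept either a plain number or None for the same argument is an `Into<Option<usize>>` bound. The sketch below illustrates that pattern and how a batch size splits the input. The names `resolve_batch_size`, `split_into_batches` and `DEFAULT_BATCH_SIZE` are hypothetical and are not part of rust-sbert; the library's actual default value and internals may differ.

```rust
/// Hypothetical default, chosen only for this illustration.
const DEFAULT_BATCH_SIZE: usize = 64;

/// Accepts `32`, `Some(32)` or `None` thanks to the `Into<Option<usize>>` bound.
/// Falls back to the (assumed) default when no value is given.
fn resolve_batch_size<B: Into<Option<usize>>>(batch_size: B) -> usize {
    // `max(1)` guards against a batch size of zero, which `chunks` rejects.
    batch_size.into().unwrap_or(DEFAULT_BATCH_SIZE).max(1)
}

/// Splits the input into consecutive batches of at most `batch_size` items.
fn split_into_batches<T, B: Into<Option<usize>>>(items: &[T], batch_size: B) -> Vec<&[T]> {
    items.chunks(resolve_batch_size(batch_size)).collect()
}
```

With this sketch, `split_into_batches(&texts, 2)` yields batches of two sentences each, while `split_into_batches(&texts, None)` falls back to the assumed default.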

You can then use the output sentence embeddings in any application you want.
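A typical downstream use is comparing two sentences by the cosine similarity of their embeddings. The function below is a minimal sketch that assumes each embedding is available as a slice of `f32` values; `cosine_similarity` is a hypothetical helper and is not provided by rust-sbert.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns `None` if the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    // Dot product of the two vectors.
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    // Euclidean norms of each vector.
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Values close to 1 indicate embeddings pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal embeddings.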

Convert models from Python to Rust

First, get a model provided by UKPLab (all models are here):

mkdir -p models/distiluse-base-multilingual-cased

wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/distiluse-base-multilingual-cased.zip

unzip models/distiluse-base-multilingual-cased.zip -d models/distiluse-base-multilingual-cased

Then, convert the model to a suitable format (requires PyTorch):

python utils/prepare_distilbert.py models/distiluse-base-multilingual-cased

A dockerized environment is also available for running the conversion script:

docker build -t tch-converter -f utils/Dockerfile .

docker run \
  -v $(pwd)/models/distiluse-base-multilingual-cased:/model \
  tch-converter:latest \
  python prepare_distilbert.py /model

Finally, set "output_attentions": true in distiluse-base-multilingual-cased/0_distilbert/config.json.

Comments
  • Any plans to port utils like semantic_search?

    Hi @cpcdoy,

    I was wondering whether there are any plans to also port some of the auxiliary code around the models from sentence-transformers, like semantic_search. This is probably one of the main use cases of both libraries, so many people might be interested in it in the future.

    edit: I saw a contributor also works on https://github.com/lerouxrgd/ngt-rs which should do the trick. @lerouxrgd, an example of combining rust-sbert and ngt-rs would be really really useful for this!

    opened by paulbricman 3
  • Adds missing TruncationDirection for hf_tokenizers

    Missing field was caused by missing field in TruncationParams in hf_tokenizers and has since been fixed.

    Related hf_tokenizers commit: https://github.com/huggingface/tokenizers/commit/4122a33f095e7da48217e19bf9184fefd0506a8e

    opened by Luxbit 2
  • Update style, dependencies, tests

    1. Ran Prettier on all files to enforce style.
    2. Updated dependencies to match the torch environment used by the latest rust-bert 0.15.1 (i.e. the same libtorch version, tch-rs, rust-bert, etc.). Figured rust-sbert might want to stay in sync with rust-bert, more or less. This also makes it feasible to use both in the same project without different torch environments.
    3. When running the distiluse-... tests using the model downloaded using the instructions specified in the README, the test results would be slightly different, although really close. Maybe the tests were defined for a different version of that model? I tweaked them to match the model specified in the README.

    :warning: I couldn't find the distilroberta-toxicity model used by the other tests in the online model directory, so I just ignored those.

    opened by paulbricman 2
  • Update dependencies

    • Update dependencies
      • prost: 0.6 -> 0.9
      • rust-bert: (git) -> 0.17
      • rust_tokenizers: 6.0 -> 7.0
      • strum: 0.20 -> 0.23
      • tch: 0.3 -> 0.6
      • tokenizers: 0.10 -> 0.11
      • torch-sys: 0.3 -> 0.6
    • Update expected test results based on latest models
      • distiluse-base-multilingual-cased: Tested
      • distilroberta_toxicity: Untested
    opened by kerryeon 1
  • Use rayon for multithreading and simplify code

    • Use rayon for multi-threading (instead of crossbeam scope threads)
    • Remove Arc/Mutex based Safe* wrappers
    • Ensure tests/benches consistency with seeded random input
    • Add Dockerfile and instructions for model preparation
    opened by lerouxrgd 0
  • Recommendations for further usage

    Hi,

    First of all, thanks for this amazing port. I finally managed to get it working, even under Windows.

    Are there any recommendations for further use of the computed vectors with regard to similarity measurements?

    My plan is to embed a given search query as well and then calculate the cosine similarity between the search query and each entry of the processed "dataset".

    Is there anything against this approach? Are there any more advanced libraries that you know of?

    Thanks in advance for the support. I would be open to contributing in any case.

    Kind regards Julian

    opened by JulianGerhard21 0
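The approach described in the question above (embed the query, then score every dataset entry against it) amounts to a brute-force search. The following is a minimal sketch under the assumption that all embeddings are `f32` vectors of the same dimension; `normalize` and `rank_by_similarity` are hypothetical helpers, not part of rust-sbert. For large datasets an approximate nearest-neighbor index, such as the ngt-rs crate mentioned in an earlier comment, may scale better.

```rust
/// Scales a vector to unit length; a zero vector is returned unchanged.
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Returns (index, score) pairs for the `top_k` corpus entries most similar
/// to the query, best match first.
fn rank_by_similarity(query: &[f32], corpus: &[Vec<f32>], top_k: usize) -> Vec<(usize, f32)> {
    let q = normalize(query);
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, entry)| {
            let e = normalize(entry);
            // The dot product of two unit vectors equals their cosine similarity.
            let score = q.iter().zip(&e).map(|(x, y)| x * y).sum::<f32>();
            (i, score)
        })
        .collect();
    // Sort by descending score; `total_cmp` gives a total order even for NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);
    scored
}
```

The returned indices refer to positions in `corpus`, so they can be mapped back to the original sentences.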