Overview

rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models. These tokenizers are used in the rust-bert crate. The crate includes tokenizers for a broad range of state-of-the-art transformer architectures, including:

  • Sentence Piece (unigram model)
  • BERT
  • ALBERT
  • DistilBERT
  • RoBERTa
  • GPT
  • GPT2
  • ProphetNet
  • CTRL

The WordPiece-based tokenizers come in both single-threaded and multi-threaded variants. The Byte-Pair-Encoding tokenizers rely on a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the files they need (vocabulary or merge files); these can be found in the Transformers library.
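
As a rough illustration of the multi-threaded path, the sketch below tokenizes a small batch through a MultiThreadedTokenizer-style trait. Constructor and trait names follow the usage example further down, but exact import paths and signatures vary between crate versions, so treat this as a sketch rather than a drop-in snippet.

use std::sync::Arc;

// Import paths are version-dependent: older releases re-export these at the crate
// root, newer ones group them under rust_tokenizers::tokenizer and rust_tokenizers::vocab.
use rust_tokenizers::{BertTokenizer, BertVocab, MultiThreadedTokenizer};

fn main() {
    // Vocabulary file downloaded manually, e.g. bert-base-uncased-vocab.txt.
    let vocab = Arc::new(BertVocab::from_file("bert-base-uncased-vocab.txt"));
    let tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab);

    // The multi-threaded trait processes the whole batch in parallel and returns
    // one Vec<String> of tokens per input sentence.
    let batch = ["First sample sentence.", "Second sample sentence."];
    let tokens = MultiThreadedTokenizer::tokenize_list(&tokenizer, &batch);
    println!("{:?}", tokens);
}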

The SentencePiece unigram model loads the same .model protobuf files as the C++ library.
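
For example, loading such a .model file might look like the sketch below. SentencePieceTokenizer::from_file is the constructor in recent releases, but its exact signature (and whether it returns a Result) differs between versions, so this is only a sketch.

// Import paths are version-dependent; recent releases expose the tokenizer under
// rust_tokenizers::tokenizer::SentencePieceTokenizer.
use rust_tokenizers::{SentencePieceTokenizer, Tokenizer};

fn main() {
    // "spiece.model" is a placeholder for a SentencePiece .model protobuf file;
    // recent releases return a Result here, older ones returned the tokenizer directly.
    let tokenizer = SentencePieceTokenizer::from_file("spiece.model", false)
        .expect("failed to load the SentencePiece model");

    let tokens = tokenizer.tokenize("This is a sample sentence to be tokenized");
    println!("{:?}", tokens);
}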

Usage example (Rust)

use std::sync::Arc;

// NOTE: import paths vary between crate versions; these names may live under
// rust_tokenizers::preprocessing::* in older releases or under
// rust_tokenizers::{tokenizer, vocab, adapters} in newer ones.
use rust_tokenizers::{BertTokenizer, BertVocab, Example, Tokenizer, TruncationStrategy};

// vocab_path points to a vocabulary file downloaded manually (e.g. bert-base-uncased-vocab.txt).
let vocab = Arc::new(BertVocab::from_file(&vocab_path));

let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab.clone());

// Encode a single sentence (no sentence pair), truncating to at most 128 tokens
// with the LongestFirst strategy and no stride overlap.
println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1,
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));

Python bindings set-up

Rust-tokenizer requires a Rust nightly toolchain in order to use the Python API. Building from source involves the following steps:

  1. Install Rust and use the nightly toolchain
  2. Run python setup.py install in the /python-bindings directory. This will compile the Rust library and install the Python API
  3. Usage examples are available in the /tests folder, including benchmark and integration tests

The library is fully unit tested at the Rust level.

Usage example (Python)

import torch
from rust_transformers import PyBertTokenizer
from transformers.modeling_bert import BertForSequenceClassification

rust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()
model = model.eval()

sentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because 
              he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck 
              about in the water having a good time. But conversely, the dolphins had always believed that they were far 
              more intelligent than man—for precisely the same reasons.'''

features = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)
input_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()

with torch.no_grad():
    output = model(input_ids)[0].cpu().numpy()
Comments
  • Using tokenizers within threads is a little cumbersome

    Hey @guillaume-be, I'm having a little trouble using rust-tokenizers within a multi-process data loading pipeline. In particular, with the RobertaTokenizer, the issue is that it's not Send or Sync due to the use of Rc and RefCell, so currently what I have to do is instantiate the tokenizer within each worker thread after they have been spawned (see here, for example).

    This is a little cumbersome, and also isn't ideal because the cache can't be shared between threads.

    Is there any reason not to use something like a RwLock instead?
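
    For illustration only (this is not the crate's internal structure), the sketch below contrasts the two options: a cache behind Rc<RefCell<...>> makes its owner neither Send nor Sync, while an Arc<RwLock<...>> cache can be shared by several threads at the cost of taking a lock.

        use std::collections::HashMap;
        use std::sync::{Arc, RwLock};
        use std::thread;

        // A BPE-style cache behind a read-write lock: concurrent reads are cheap,
        // writes take the lock exclusively. An Rc<RefCell<HashMap<..>>> here would
        // make the struct neither Send nor Sync.
        struct SharedCacheTokenizer {
            cache: Arc<RwLock<HashMap<String, Vec<String>>>>,
        }

        impl SharedCacheTokenizer {
            fn tokenize(&self, text: &str) -> Vec<String> {
                {
                    let cache = self.cache.read().unwrap();
                    if let Some(hit) = cache.get(text) {
                        return hit.clone();
                    }
                }
                // Placeholder "tokenization": split on whitespace.
                let tokens: Vec<String> = text.split_whitespace().map(String::from).collect();
                self.cache.write().unwrap().insert(text.to_owned(), tokens.clone());
                tokens
            }
        }

        fn main() {
            let tokenizer = Arc::new(SharedCacheTokenizer {
                cache: Arc::new(RwLock::new(HashMap::new())),
            });
            let handles: Vec<_> = (0..4)
                .map(|i| {
                    let t = Arc::clone(&tokenizer);
                    thread::spawn(move || t.tokenize(&format!("worker {} shares one cache", i)))
                })
                .collect();
            for handle in handles {
                println!("{:?}", handle.join().unwrap());
            }
        }

    The alternative (the current workaround) is to instantiate one tokenizer per worker thread, at the cost of duplicated caches.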

    opened by epwalsh 10
  • Character offset information

    It would help if the tokenisers could maintain and output character offset information, with respect to the original input. This would help greatly to align information back together at the end of an NLP pipeline.

    Possible points for discussion:

    • Do we want UTF-8 byte offsets, or also unicode grapheme offsets? (with https://github.com/unicode-rs/unicode-segmentation for instance)

    (this continues part of the discussion from guillaume-be/rust-bert#29)
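
    For context on that question, a small standalone sketch (using the unicode-segmentation crate linked above) showing how UTF-8 byte offsets and grapheme-cluster offsets differ for the same text:

        // Cargo.toml: unicode-segmentation = "1"
        use unicode_segmentation::UnicodeSegmentation;

        fn main() {
            let text = "café 👍🏽 ok";

            // Byte offsets: each char starts at a UTF-8 byte index.
            for (byte_idx, ch) in text.char_indices() {
                println!("byte {:>2}: {:?}", byte_idx, ch);
            }

            // Grapheme offsets: user-perceived characters, e.g. "👍🏽" is a single
            // grapheme cluster made of two code points.
            for (byte_idx, grapheme) in text.grapheme_indices(true) {
                println!("grapheme at byte {:>2}: {:?}", byte_idx, grapheme);
            }
        }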

    opened by proycon 10
  • Implemented offset information and major refactoring of tokenisation functions.

    This (draft) PR is the result of #14 , it implements offset support for all the tokenisers.

    As a lot of the original logic consisted of simple string replacements, losing the alignment with the original text, various internal functions had to be modified or rewritten completely. This code therefore constitutes a major refactoring. All logic is rewritten to use new Token (or TokenRef, holding &str instead of String) structures. Each token holds an offset and a mask. The mask tells something about the role of the token (is it a special token? out of vocabulary? is it a continuation subtoken?).

    The core of the tokenisers is formed by the following new shared functions in tokenization_utils:

    • split_on_char() - split on character (takes a testing function as argument)
    • split_on_substr() - split on substring (takes a testing function as argument)
    • split_on_regex() - split on regular expression (takes a regex as argument)
    • split_on_bpe_pairs() - split on bpe pairs, this abstracts over parts of the GPT2/Roberta/ctrl tokenizers (takes a bpe function as argument).

    The various higher-order functions all use these:

    • split_on_regex_with_lookahead() - abstracts a part of the GPT2/Roberta tokenizers
    • split_on_punct() - split on punctuation
    • whitespace_tokenize() - split on whitespace
    • tokenize_cjk_chars() - split on CJK characters

    All these functions take a single Token or TokenRef (which may correspond to an entire text/sentence/whatever!) and return either Vec<Token> or Vec<TokenRef> (depending on whether I could get away with not doing any copies). The general idea is that tokens are being constantly decomposed into smaller subtokens by chaining the functions in an appropriate order.

    Each of the tokenisers has a tokenize_into_tokens() -> Vec<Token> method coming from the Tokenizer trait. tokenize_with_offsets() is a small wrapper around that which decomposes the Vec<Token> into a 3-tuple (Vec<String>,Vec<Offset>,Vec<Mask>). The original tokenize() function remains for backward compatibility but simply ignores the offset and mask information (so has no performance benefits).
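
    A rough usage sketch of the interface described above (constructor as in the README example; per this PR the offsets variant returns a 3-tuple, while later releases wrap the same data in a struct):

        use std::sync::Arc;
        use rust_tokenizers::{BertTokenizer, BertVocab, Tokenizer};

        fn main() {
            let vocab = Arc::new(BertVocab::from_file("bert-base-uncased-vocab.txt"));
            let tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab);

            // As described in this PR: token strings plus, per token, an offset into the
            // original text and a mask describing its role (special, OOV, continuation, ...).
            let (tokens, offsets, masks) = tokenizer.tokenize_with_offsets("Sample sentence to tokenize");
            println!("{:?}\n{:?}\n{:?}", tokens, offsets, masks);
        }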

    As for performance, though certain things got optimized and will surely help, there is also added complexity that will have some impact. Things can probably still be improved (and cleaned up) too.

    Because of all the extensive changes, the deeper code no longer maps one-to-one to the original Transformers tokenizers (assuming it did in the first place?); still, the behaviour should be almost identical.

    This is a draft pull request at this stage, because there are still some loose ends to fix. The main issue is that I currently have failures on a few tests which I'm not sure about (I'll address it in a new comment because this one is already getting pretty long).

    opened by proycon 7
  • Use AsRef<Path> instead of &str.

    Master PR: #76 Blocked by: #79

    Motivation

    1. Not every FS path is a valid utf-8 string.
    2. AsRef<Path> is strictly more permissive with the same level of correctness: you can pass &str, Path, or PathBuf.
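
    A minimal illustration of the proposed signature change (function names here are purely illustrative):

        use std::fs::File;
        use std::io;
        use std::path::{Path, PathBuf};

        // Before: only &str is accepted, and non-UTF-8 paths cannot be expressed at all.
        fn open_vocab_str(path: &str) -> io::Result<File> {
            File::open(path)
        }

        // After: &str, &Path and PathBuf all work, with the same level of correctness.
        fn open_vocab(path: impl AsRef<Path>) -> io::Result<File> {
            File::open(path)
        }

        fn main() -> io::Result<()> {
            let _ = open_vocab_str("bert-base-uncased-vocab.txt")?;
            let _ = open_vocab("bert-base-uncased-vocab.txt")?;
            let _ = open_vocab(Path::new("bert-base-uncased-vocab.txt"))?;
            let _ = open_vocab(PathBuf::from("bert-base-uncased-vocab.txt"))?;
            Ok(())
        }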

    Implementation

    • [x] Use AsRef<Path> instead of &str.
    • [x] Remove unnecessary type casts and unwraps from tests.

    @veta666, this one is also on you.

    opened by npatsakula 4
  • Special token map extension

    This PR handles the addition of special token maps for all vocabularies and tokenizers.

    • Normalizes the crate's interface for special tokens
    • Replaces the unknown_token field of vocabs with a special_token_map (this common token can still be accessed directly via get_unknown_token())
    • No longer hardcodes vocab special tokens to static strings
    • Refactors some duplicated file-reading code

    Todo:

    • [x] Update vocabs
    • [x] Update tokenizers
    • [x] Add special token map constructors to tokenizers
    • [x] Fix unit tests
    • [ ] Fix integration tests
    opened by guillaume-be 4
  • DRAFT: NLLB tokenizer support.

    Hello, this is my initial NLLB tokenizer support MR!

    Tokenizer config: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/tokenizer_config.json

    Special tokens: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/special_tokens_map.json

    Vocabulary + depth of unknown: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/tokenizer.json

    Unsolved questions:

    1. ~~bos/cls/eos are hardcoded constants, but we have a configuration file for them. Why doesn't rust-tokenizers use it?~~
    2. ~~I copied some code from the M2M module and Python's source, but I do not entirely understand it. Is there any documentation for the vocab alignment and JSON format?~~
    3. tokenizer.json contains not only vocab, but also normalizer, pre_tokenizer, post_processor and meta information for the model. Should we do something with it?
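
    For reference, a sketch (not part of this MR) of reading just the vocabulary section of such a tokenizer.json with serde_json, assuming the unigram layout where model.vocab is a list of [piece, score] pairs; the normalizer, pre_tokenizer and post_processor sections are simply ignored here:

        use std::collections::HashMap;
        use std::fs;

        use serde_json::Value;

        fn main() -> Result<(), Box<dyn std::error::Error>> {
            let raw = fs::read_to_string("tokenizer.json")?;
            let json: Value = serde_json::from_str(&raw)?;

            // Unigram models store the vocabulary as [["piece", score], ...],
            // with the token id given by the position in the list.
            let mut vocab: HashMap<String, (i64, f64)> = HashMap::new();
            if let Some(entries) = json["model"]["vocab"].as_array() {
                for (id, entry) in entries.iter().enumerate() {
                    let piece = entry[0].as_str().unwrap_or_default().to_owned();
                    let score = entry[1].as_f64().unwrap_or_default();
                    vocab.insert(piece, (id as i64, score));
                }
            }
            println!("loaded {} pieces", vocab.len());
            Ok(())
        }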

    Plan

    • [x] Initial implementation.
    • [x] Tests.
    • [ ] Python bindings.
    • [ ] Docs.
    opened by npatsakula 2
  • Can't create `XLMRobertaTokenizer` from xlm-roberta dataset

    I'm not sure if this is where this is supposed to go but, at the highest level, I'm using rust-bert and trying to instantiate an NERModel with seemingly default XLMRoberta configurations like:

        let config = TokenClassificationConfig::new(
            ModelType::XLMRoberta,
            RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
            RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
            RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
            None,  //merges resource only relevant with ModelType::Roberta
            false, //lowercase
            None,  //strip_accents
            None,  //add_prefix_space
            LabelAggregationOption::Mode,
        );
    
        let ner_model = NERModel::new(config).unwrap();
    

    and this is failing with:

    TokenizerError("Error when loading vocabulary file, the file 
    may be corrupted or does not match the expected format: incorrect tag")
    

    That resource seems to be pointing to the sentencepiece.bpe.model file for this model and it seems to be parsed by ModelProto in this repo.

    I'm not sure if:

    1. I'm doing the wrong thing with rust-bert
    2. The file hosted in the xlm-roberta dataset is out of date
    3. This repo is out of date with a new update to the xlm-roberta dataset file

    This issue is really only relevant for the last item (though I wouldn't know where to report the second item). Apologies if it's the first item but I'm wondering if you can offer any insight here!

    Note I'm using the version of rust-bert on the master git branch, not the latest published crate (because I need 0.7 tch support).

    opened by mlodato517 2
  • [Question] How does this library compare to huggingface's tokenizers?

    Hi, thanks for the awesome libraries. I'm new to Rust and the community is enthusiastic. I have some questions about the tokenizer library. How does rust-tokenizers compare to huggingface's tokenizers (performance, tokenized results, etc.)? Can I use them interchangeably with rust-bert?

    opened by liebkne 2
  • Tokenize non-breaking space

    Dearest Maintainer,

    I have been using rust-bert and a panic led me here. My input contains a non-breaking space \u{a0} (https://en.wikipedia.org/wiki/Non-breaking_space). This appears to be valid and the chars are there: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=865b0df5864e4d3b68004df0babd3afe

    My error:

    thread 'main' panicked at 'byte index 11 is not a char boundary; it is inside '\u{a0}' (bytes 10..12) of input.   We', /home/becker/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/str/mod.rs:1920:47

    It comes from:

        14: core::str::traits::<impl core::slice::SliceIndex for core::ops::range::Range>::index::{{closure}}
        15: rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_regex_with_lookahead

    I have looked over the code and I don't understand which index is off. It looks like it calls len_utf8 at a good spot.

    Example code to trigger the issue:

        let result = panic::catch_unwind(|| {
            let mut summarization_model =
                SummarizationModel::new(Default::default()).expect("summarization_model fail");
            // The offending input: text containing non-breaking spaces.
            let input = ["input.\u{a0} \u{a0}We"];
            summarization_model.summarize(&input).join(" ")
        });
    

    In the meantime I am going to preprocess \u{a0} to convert it to a space.
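
    For reference, a minimal standalone sketch of both the failure mode (a byte index that falls inside the two-byte \u{a0}) and that preprocessing workaround:

        fn main() {
            let text = "input.\u{a0} \u{a0}We";

            // '\u{a0}' is two bytes in UTF-8, so not every byte index is a char boundary.
            println!("boundary at 7? {}", text.is_char_boundary(7)); // false: index 7 is inside the NBSP
            // let slice = &text[..7]; // slicing like this is what panics

            // Workaround: normalize non-breaking spaces to regular spaces before tokenizing.
            let cleaned = text.replace('\u{a0}', " ");
            println!("{:?}", cleaned);
        }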

    Any help in understanding this would be much appreciated.

    Thanks again Becker

    opened by sbeckeriv 2
  • Issues with clean_up_tokenization() function?

    I have some issues regarding the clean_up_tokenization() function; it looks like what it basically does is strip some whitespace? I agree that the tokens themselves should not contain leading or trailing whitespace, but I think it could be implemented more efficiently and generically (= language-independent)?

    https://github.com/guillaume-be/rust-tokenizers/blob/master/main/src/preprocessing/tokenizer/base_tokenizer.rs#L130

    .replace(" do not", " don't")

    I don't like this part because it changes the actual text and doesn't just concern whitespace.
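
    As a sketch of the whitespace-only, language-independent alternative suggested here (this is not the crate's current implementation): a single regex pass that only removes the space detokenization inserts before punctuation, with no hard-coded rewrites such as " do not" -> " don't":

        // Cargo.toml: regex = "1"
        use regex::Regex;

        // A purely whitespace-based cleanup: remove the space that detokenization
        // inserts before punctuation, and nothing else.
        fn clean_up_tokenization(text: &str) -> String {
            let before_punct = Regex::new(r"\s+([.,!?;:)\]}])").unwrap();
            before_punct.replace_all(text, "$1").into_owned()
        }

        fn main() {
            assert_eq!(
                clean_up_tokenization("Hello , world ! This works ."),
                "Hello, world! This works."
            );
        }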

    opened by proycon 2
  • Fix panic with unicode chars that are expanded at the end of sentences

    The current tokenizers don't have test cases where unicode chars are expanded in the normalization process with .nfkc(). This PR adds test cases and fixes the problem. Expanded unicode chars in the middle of a sentence were previously wrongly tokenized, and unicode chars expanded at the end of a text fragment led to panics.

    We've filled in the tests with the output of the tokenizers to get the tests to pass, but it's probably worth checking whether they look alright, as we pattern-matched the tests and don't have any insight as to what the output should be.
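
    For context, the kind of expansion this PR deals with: NFKC normalization can map one input char to several output chars, so pre-normalization offsets no longer line up one-to-one with the normalized text. A minimal sketch using the unicode-normalization crate:

        // Cargo.toml: unicode-normalization = "0.1"
        use unicode_normalization::UnicodeNormalization;

        fn main() {
            // One code point in, several code points out: 'ﬁ' -> "fi", '™' -> "TM".
            let input = "ﬁne™";
            let normalized: String = input.nfkc().collect();
            println!(
                "{} chars -> {} chars: {:?}",
                input.chars().count(),
                normalized.chars().count(),
                normalized
            );
        }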

    opened by sftse 1
  • Structural errors.

    Master MR: https://github.com/guillaume-be/rust-tokenizers/pull/76

    • [x] Add structural errors:
      • [x] Save source error type.
      • [x] Line location.
      • [x] Make reasonable error message.
    • [ ] Fix tests ignored in tokenization_utils: they broke because of the line location in the display message.

    @veta666, could you finish this PR?

    opened by npatsakula 1
  • Reading SentencePieceVocab from text file

    I've created a SentencePiece model using Python which results in a .model and a .vocab file. It is not possible to create a SentencePieceVocab from the latter since Python does not seem to use protobuf but rather a plain text file. Here's an excerpt of my file:

    <unk>	0
    <s>	0
    </s>	0
    ▁	-2.29038
    s	-3.10405
    l	-3.41047
    

    I didn't find an option in the Python code for creating a protobuf vocab file so I wrote a parser. Unless I'm mistaken and did something wrong, would you like that code as a PR? I.e. something like:

    impl SentencePieceVocab {
        ...
        
        pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
            ... 
        }
    }
    

    in rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs
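
    For reference, a sketch of what such a text-format parser might look like (the helper name and return type here are only illustrative; the real implementation would build the crate's SentencePieceVocab):

        use std::collections::HashMap;
        use std::fs::File;
        use std::io::{self, BufRead, BufReader};

        // Parse a SentencePiece .vocab text file: one "<piece>\t<log-prob>" entry
        // per line, with the token id given by the line number.
        fn read_vocab_txt(path: &str) -> io::Result<HashMap<String, i64>> {
            let reader = BufReader::new(File::open(path)?);
            let mut vocab = HashMap::new();
            for (id, line) in reader.lines().enumerate() {
                let line = line?;
                let piece = line.split('\t').next().unwrap_or("").to_owned();
                vocab.insert(piece, id as i64);
            }
            Ok(vocab)
        }

        fn main() -> io::Result<()> {
            let vocab = read_vocab_txt("m.vocab")?;
            println!("loaded {} pieces", vocab.len());
            Ok(())
        }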

    opened by MikaelCall 5