Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

Last update: Jan 1, 2023

Overview

rust-tokenizers

Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models. These tokenizers are used in the rust-bert crate. Tokenizers for a broad range of state-of-the-art transformer architectures are included:

Sentence Piece (unigram model)
BERT
ALBERT
DistilBERT
RoBERTa
GPT
GPT2
ProphetNet
CTRL

The WordPiece-based tokenizers support both single-threaded and multi-threaded processing. The Byte-Pair Encoding tokenizers favor the use of a shared cache and are only available as single-threaded tokenizers. Using the tokenizers requires manually downloading the files they need (vocabulary or merges files). These can be found in the Transformers library.

The SentencePiece model loads the same .model proto files as the C++ library.

Usage example (Rust)

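use std::sync::Arc;
// BertTokenizer, Example and TruncationStrategy come from rust_tokenizers;
// the exact module paths depend on the crate version.

// `vocab_path` is the path to a BERT vocabulary file downloaded beforehand.
let vocab = Arc::new(rust_tokenizers::BertVocab::from_file(&vocab_path));

let test_sentence = Example::new_from_string("This is a sample sentence to be tokenized");
let bert_tokenizer: BertTokenizer = BertTokenizer::from_existing_vocab(vocab.clone());

// Encode a single sentence (no second sentence), with a maximum length of 128 and a stride of 0.
println!("{:?}", bert_tokenizer.encode(&test_sentence.sentence_1,
                                       None,
                                       128,
                                       &TruncationStrategy::LongestFirst,
                                       0));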

Python bindings set-up

Rust-tokenizer requires a Rust nightly build in order to use the Python API. Building from source involves the following steps:

Install Rust and use the nightly toolchain
Run python setup.py install in the /python-bindings directory. This will compile the Rust library and install the Python API
Usage examples are available in the /tests folder, including benchmark and integration tests

The library is fully unit tested at the Rust level.

Usage example (Python)

import torch
from rust_transformers import PyBertTokenizer
from transformers.modeling_bert import BertForSequenceClassification

rust_tokenizer = PyBertTokenizer('bert-base-uncased-vocab.txt')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=False).cuda()
model = model.eval()

sentence = '''For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because 
              he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck 
              about in the water having a good time. But conversely, the dolphins had always believed that they were far 
              more intelligent than man—for precisely the same reasons.'''

features = rust_tokenizer.encode(sentence, max_len=128, truncation_strategy='only_first', stride=0)
input_ids = torch.tensor([f.token_ids for f in features], dtype=torch.long).cuda()

with torch.no_grad():
    output = model(input_ids)[0].cpu().numpy()

Comments

Using tokenizers within threads is a little cumbersome

Hey @guillaume-be, I'm having a little trouble using rust-tokenizers within a multi-process data loading pipeline. In particular, with the RobertaTokenizer, the issue is that it's not Send or Sync due to the use of Rc and RefCell, so currently what I have to do is instantiate the tokenizer within each worker thread after they have been spawned (see here, for example).

This is a little cumbersome, and also isn't ideal because the cache can't be shared between threads.

Is there any reason not to use something like a RwLock instead?
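
To make the question concrete, here is a minimal sketch of a cache guarded by a RwLock and shared between threads through an Arc. SharedCache, get_or_insert_with and toy_split are hypothetical names for illustration only; this is not the crate's actual cache, and whether the extra locking pays off for the BPE tokenizers would need to be measured.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for a BPE cache: maps a word to its sub-tokens.
#[derive(Default)]
struct SharedCache {
    inner: RwLock<HashMap<String, Vec<String>>>,
}

impl SharedCache {
    /// Return the cached split, computing and storing it on a miss.
    fn get_or_insert_with<F>(&self, word: &str, compute: F) -> Vec<String>
    where
        F: FnOnce(&str) -> Vec<String>,
    {
        // Fast path: any number of readers can hold the read lock at once.
        if let Some(hit) = self.inner.read().unwrap().get(word) {
            return hit.clone();
        }
        // Slow path: compute without holding a lock, then lock briefly to store.
        let value = compute(word);
        self.inner
            .write()
            .unwrap()
            .entry(word.to_string())
            .or_insert(value)
            .clone()
    }

    /// Number of cached words.
    fn entries(&self) -> usize {
        self.inner.read().unwrap().len()
    }
}

/// Toy "tokenization": one sub-token per character (assumption for the demo).
fn toy_split(word: &str) -> Vec<String> {
    word.chars().map(|c| c.to_string()).collect()
}

fn main() {
    let cache = Arc::new(SharedCache::default());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || cache.get_or_insert_with("hello", toy_split))
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap().len(), 5);
    }
    println!("cached entries: {}", cache.entries());
}
```

Because the RwLock-based type is both Send and Sync, it can be created once and handed to every worker, unlike an Rc<RefCell<...>> cache.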

opened by epwalsh 10
Character offset information
It would help if the tokenisers could maintain and output character offset information, with respect to the original input. This would help greatly to align information back together at the end of an NLP pipeline.

Possible points for discussion:

Are UTF-8 byte offsets enough, or do we also want Unicode grapheme offsets (with https://github.com/unicode-rs/unicode-segmentation for instance)?
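
As a small illustration of why the choice matters, the sketch below prints both the byte offset and the character (Unicode scalar value) offset of each character, using only the standard library. The helper name offsets is hypothetical. Grapheme clusters are not covered here, since the standard library does not segment them; that would need a crate such as the one linked above.

```rust
/// For every character, return (char, byte offset, char offset).
fn offsets(text: &str) -> Vec<(char, usize, usize)> {
    text.char_indices()
        .enumerate()
        .map(|(char_pos, (byte_pos, c))| (c, byte_pos, char_pos))
        .collect()
}

fn main() {
    // U+00E9 takes two bytes in UTF-8, so the two offset kinds diverge after it.
    for (c, byte_pos, char_pos) in offsets("caf\u{e9} au lait") {
        println!("{:?} byte={} char={}", c, byte_pos, char_pos);
    }
}
```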

(this continues part of the discussion from guillaume-be/rust-bert#29)
opened by proycon 10
Implemented offset information and major refactoring of tokenisation functions.
This (draft) PR is the result of #14; it implements offset support for all the tokenisers.

As a lot of the original logic consisted of simple string replacements, losing the alignment with the original text, various internal functions had to be modified or rewritten completely. This code therefore constitutes a major refactoring. All logic is rewritten to use new Token (or TokenRef, holding &str instead of String) structures. Each token holds an offset and a mask. The mask tells something about the role of the token (is it a special token? out of vocabulary? is it a continuation subtoken?).

The core of the tokenisers is formed by the following new shared functions in tokenization_utils:

split_on_char() - split on character (takes a testing function as argument)

split_on_substr() - split on substring (takes a testing function as argument)

split_on_regex() - split on regular expression (takes a regex as argument)

split_on_bpe_pairs() - split on bpe pairs, this abstracts over parts of the GPT2/Roberta/ctrl tokenizers (takes a bpe function as argument).

The various higher-order functions all use these:

split_on_regex_with_lookahead() - abstracts a part of the GPT2/Roberta tokenizers

split_on_punct() - split on punctuation

whitespace_tokenize() - split on whitespace

tokenize_cjk_chars() - split on CJK characters

All these functions take a single Token or TokenRef (which may correspond to an entire text/sentence/whatever!) and return either Vec<Token> or Vec<TokenRef> (depending on whether I could get away with not doing any copies). The general idea is that tokens are being constantly decomposed into smaller subtokens by chaining the functions in an appropriate order.
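
The chaining idea can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the TokenRef here only carries a borrowed slice and a byte offset (no mask), and this split_on_char drops the separator characters, which is an assumption made to keep the example short.

```rust
/// Simplified, hypothetical token: a borrowed slice plus its byte offset in the input.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TokenRef<'a> {
    text: &'a str,
    begin: usize, // byte offset of `text` in the original input
}

/// Split a token wherever `is_separator` returns true, dropping the separators
/// and keeping offsets relative to the original input.
fn split_on_char<'a, F>(token: TokenRef<'a>, is_separator: F) -> Vec<TokenRef<'a>>
where
    F: Fn(char) -> bool,
{
    let mut out = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in token.text.char_indices() {
        if is_separator(c) {
            // Close the sub-token that was open, if any.
            if let Some(s) = start.take() {
                out.push(TokenRef { text: &token.text[s..i], begin: token.begin + s });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    if let Some(s) = start {
        out.push(TokenRef { text: &token.text[s..], begin: token.begin + s });
    }
    out
}

fn main() {
    let input = "state-of-the-art models";
    let whole = TokenRef { text: input, begin: 0 };
    // Chain two passes: whitespace first, then hyphens on each resulting token.
    let tokens: Vec<TokenRef> = split_on_char(whole, char::is_whitespace)
        .into_iter()
        .flat_map(|t| split_on_char(t, |c| c == '-'))
        .collect();
    for t in &tokens {
        // Every offset still points back into the original input.
        assert_eq!(&input[t.begin..t.begin + t.text.len()], t.text);
        println!("{:?}", t);
    }
}
```

Since each pass only narrows slices of the original string, no text is copied and the offsets stay valid through any number of chained splits.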

Each of the tokenisers has a tokenize_into_tokens() -> Vec<Token> method coming from the Tokenizer trait. tokenize_with_offsets() is a small wrapper around that which decomposes the Vec<Token> into a 3-tuple (Vec<String>,Vec<Offset>,Vec<Mask>). The original tokenize() function remains for backward compatibility but simply ignores the offset and mask information (so has no performance benefits).
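
A rough sketch of that wrapper layering, under stated assumptions: the Offset, Mask and Token definitions below are simplified stand-ins (the field and variant names are guesses for illustration), and into_parallel_vecs and texts_only are hypothetical helpers playing the roles described for tokenize_with_offsets() and tokenize().

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Offset {
    begin: usize,
    end: usize,
}

/// Role of a token; the variant names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mask {
    None,
    Continuation,
    Special,
    Unknown,
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    offset: Offset,
    mask: Mask,
}

/// Decompose a token list into three parallel vectors.
fn into_parallel_vecs(tokens: Vec<Token>) -> (Vec<String>, Vec<Offset>, Vec<Mask>) {
    let mut texts = Vec::with_capacity(tokens.len());
    let mut offsets = Vec::with_capacity(tokens.len());
    let mut masks = Vec::with_capacity(tokens.len());
    for token in tokens {
        texts.push(token.text);
        offsets.push(token.offset);
        masks.push(token.mask);
    }
    (texts, offsets, masks)
}

/// Backward-compatible view: keep the strings, discard offsets and masks.
fn texts_only(tokens: Vec<Token>) -> Vec<String> {
    tokens.into_iter().map(|t| t.text).collect()
}

fn main() {
    let tokens = vec![
        Token { text: "token".into(), offset: Offset { begin: 0, end: 5 }, mask: Mask::None },
        Token { text: "##izer".into(), offset: Offset { begin: 5, end: 9 }, mask: Mask::Continuation },
    ];
    let (texts, offsets, masks) = into_parallel_vecs(tokens.clone());
    println!("{:?}\n{:?}\n{:?}", texts, offsets, masks);
    println!("{:?}", texts_only(tokens));
}
```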

As for performance, though certain things got optimized and will surely help, there is also added complexity that will have some impact. Things can probably still be improved (and cleaned up) too.

Because of all the extensive changes, the deeper code no longer maps one-to-one to the original Transformers tokenizers (assuming it did in the first place?). Still, the behaviour should be almost identical.

This is a draft pull request at this stage, because there are still some loose ends to fix. The main issue is that I currently have failures on a few tests which I'm not sure about (I'll address it in a new comment because this one is already getting pretty long).
opened by proycon 7
Use AsRef instead of &str.
Master PR: #76 Blocked by: #79

Motivation

Not every FS path is a valid utf-8 string.

AsRef<Path> accepts strictly more input types at the same level of correctness: you can pass &str, Path and PathBuf.
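
A minimal sketch of the pattern, using only the standard library. read_vocab_lines is a hypothetical loader written for this example, not a function of the crate; the file name in the temporary directory is likewise arbitrary.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical loader: reads one vocabulary entry per line.
fn read_vocab_lines<P: AsRef<Path>>(path: P) -> io::Result<Vec<String>> {
    let content = fs::read_to_string(path.as_ref())?;
    Ok(content.lines().map(str::to_owned).collect())
}

fn main() -> io::Result<()> {
    let path: PathBuf = std::env::temp_dir().join("asref_path_demo_vocab.txt");
    fs::write(&path, "[PAD]\n[UNK]\nhello\n")?;

    // The same function accepts &str, &Path and PathBuf.
    let from_str = read_vocab_lines(path.to_str().expect("temp path is valid UTF-8"))?;
    let from_path = read_vocab_lines(path.as_path())?;
    let from_buf = read_vocab_lines(path.clone())?;
    assert_eq!(from_str, from_path);
    assert_eq!(from_path, from_buf);

    fs::remove_file(&path)?;
    Ok(())
}
```

Only the &str call needs a UTF-8 check on the caller's side; the Path and PathBuf calls work for any path the operating system accepts.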

Implementation

[x] Use AsRef<Path> instead of &str.

[x] Remove unnecessary type casts and unwraps from tests.

@veta666, this one is also on you.
opened by npatsakula 4
Special token map extension
This PR handles the addition of special token maps for all vocabularies and tokenizers.

Normalizes the crate interface to special tokens

Replaces the unknown_token field of vocabs by special_token_map (this common token can still be accessed directly via get_unknown_token())

No longer hardcodes vocabs special tokens to static strings

Refactors some duplicated code to read files
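
The shape of that change might look roughly like the sketch below. It is an assumption-laden illustration: the SpecialTokenMap fields and the ToyVocab type are made up for this example and do not reproduce the crate's definitions; only the get_unknown_token() accessor name is taken from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical special token map; field names are illustrative.
#[derive(Debug, Clone)]
struct SpecialTokenMap {
    unk_token: String,
    pad_token: Option<String>,
    cls_token: Option<String>,
    sep_token: Option<String>,
}

/// Toy vocabulary that reads its special tokens from the map instead of constants.
#[derive(Debug)]
struct ToyVocab {
    values: HashMap<String, i64>,
    special_token_map: SpecialTokenMap,
}

impl ToyVocab {
    /// Build a vocabulary from an ordered token list; fails if the unknown token is absent.
    fn new(tokens: &[&str], special_token_map: SpecialTokenMap) -> Result<Self, String> {
        let values: HashMap<String, i64> = tokens
            .iter()
            .enumerate()
            .map(|(id, token)| (token.to_string(), id as i64))
            .collect();
        if !values.contains_key(&special_token_map.unk_token) {
            return Err(format!(
                "unknown token {:?} missing from vocabulary",
                special_token_map.unk_token
            ));
        }
        Ok(ToyVocab { values, special_token_map })
    }

    /// Direct access to the unknown token, backed by the map.
    fn get_unknown_token(&self) -> &str {
        &self.special_token_map.unk_token
    }

    /// Look up a token id, falling back to the id of the unknown token.
    fn token_to_id(&self, token: &str) -> i64 {
        self.values
            .get(token)
            .copied()
            .unwrap_or_else(|| self.values[self.get_unknown_token()])
    }
}

fn main() {
    let map = SpecialTokenMap {
        unk_token: "<unk>".into(),
        pad_token: Some("<pad>".into()),
        cls_token: None,
        sep_token: None,
    };
    let vocab = ToyVocab::new(&["<unk>", "<pad>", "hello"], map).expect("valid vocabulary");
    println!("{} -> {}", vocab.get_unknown_token(), vocab.token_to_id("never-seen"));
}
```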

Todo:

[x] Update vocabs

[x] Update tokenizers

[x] Add special token map constructors to tokenizers

[x] Fix unit tests

[ ] Fix integration tests
opened by guillaume-be 4
DRAFT: NLLB tokenizer support.
Hello, this is my initial NLLB tokenizer support MR!

Tokenizer config: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/tokenizer_config.json

Special tokens: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/special_tokens_map.json

Vocabulary + depth of unknown: https://huggingface.co/facebook/nllb-200-1.3B/blob/main/tokenizer.json

Unsolved questions:

~~bos/cls/eos are hardcoded constants, but we have a configuration file for them. Why doesn't rust-tokenizers use it?~~

~~I copied some code from the M2M module and the Python source, but I do not entirely understand it. Is there some documentation for vocab alignment and the JSON format?~~

tokenizer.json contains not only vocab, but also normalizer, pre_tokenizer, post_processor and meta information for the model. Should we do something with it?

Plan

[x] Initial implementation.

[x] Tests.

[ ] Python bindings.

[ ] Docs.
opened by npatsakula 2
Can't create `XLMRobertaTokenizer` from xlm-roberta dataset
I'm not sure if this is where this is supposed to go but, at the highest level, I'm using rust-bert and trying to instantiate an NERModel with seemingly default XLMRoberta configurations like:

let config = TokenClassificationConfig::new(
    ModelType::XLMRoberta,
    RemoteResource::from_pretrained(RobertaModelResources::XLM_ROBERTA_NER_EN),
    RemoteResource::from_pretrained(RobertaVocabResources::XLM_ROBERTA_NER_EN),
    RemoteResource::from_pretrained(RobertaConfigResources::XLM_ROBERTA_NER_EN),
    None,  // merges resource only relevant with ModelType::Roberta
    false, // lowercase
    None,  // strip_accents
    None,  // add_prefix_space
    LabelAggregationOption::Mode,
);
let ner_model = NERModel::new(config).unwrap();

and this is failing with:

TokenizerError("Error when loading vocabulary file, the file may be corrupted or does not match the expected format: incorrect tag")

That resource seems to be pointing to the sentencepiece.bpe.model file for this model and it seems to be parsed by ModelProto in this repo.

I'm not sure if:

I'm doing the wrong thing with rust-bert

The file hosted in the xlm-roberta dataset is out of date

This repo is out of date with a new update to the xlm-roberta dataset file

This issue is really only relevant for the last item (though I wouldn't know where to report the second item). Apologies if it's the first item but I'm wondering if you can offer any insight here!

Note I'm using the version of rust-bert on the master git branch, not the latest published crate (because I need 0.7 tch support).
opened by mlodato517 2
[Question] How is this library compared to huggingface's tokenizer?

Hi, thanks for the awesome libraries. I'm new to Rust and the community is enthusiastic. I have some questions about the tokenizer library. How does rust-tokenizers compare to Hugging Face's tokenizers? (Performance, tokenized result, etc.) Can I use them interchangeably with rust-bert?

opened by liebkne 2
Tokenize non-breaking space
Dearest Maintainer,

I have been using rust-bert and my panic leads me here. I have a nonbreaking space \u{a0} https://en.wikipedia.org/wiki/Non-breaking_space . This appears to be valid and the chars are there. https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=865b0df5864e4d3b68004df0babd3afe

My error:

thread 'main' panicked at 'byte index 11 is not a char boundary; it is inside '\u{a0}' (bytes 10..12) of input. We', /home/becker/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/str/mod.rs:1920:47

It comes from:

14: core::str::traits::<impl core::slice::SliceIndex for core::ops::range::Range>::index::{{closure}}
15: rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_regex_with_lookahead

I have looked over the code and I don't understand which index is off. It looks like it calls len_utf8 at a good spot.
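
For reference, the panic message can be reproduced with the standard library alone. The sketch below uses the same input string as the report; previous_char_boundary is a hypothetical helper added for illustration and says nothing about where the faulty index is computed inside split_on_regex_with_lookahead.

```rust
/// Return the largest index <= `index` that lies on a char boundary of `text`.
fn previous_char_boundary(text: &str, index: usize) -> usize {
    let mut i = index.min(text.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(i) {
        i -= 1;
    }
    i
}

fn main() {
    let text = " input.\u{a0} \u{a0}We";
    // The second U+00A0 occupies bytes 10..12, so byte 11 is inside it.
    println!("is_char_boundary(11) = {}", text.is_char_boundary(11));
    // `&text[..11]` would panic; `get` returns None instead.
    println!("get(..11) = {:?}", text.get(..11));
    let safe = previous_char_boundary(text, 11);
    println!("safe prefix up to byte {}: {:?}", safe, &text[..safe]);
}
```

U+00A0 is encoded as two bytes in UTF-8, so any offset computed by counting characters (or by assuming one byte per whitespace character) drifts from the byte index once such a character has been passed.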

example code to trigger the issue.

let result = panic::catch_unwind(|| {
    let mut summarization_model = SummarizationModel::new(Default::default())
        .expect("summarization_model fail");
    let input = [text];
    summarization_model
        .summarize(" input.\u{a0} \u{a0}We")
        .join(" ")
});

In the meantime I am going to preprocess \u{a0} to convert it to a space.

Any help in understanding this would be much appreciated.

Thanks again Becker
opened by sbeckeriv 2
Issues with clean_up_tokenization() function?

I have some issues with the clean_up_tokenization() function. It looks like what it basically does is strip some whitespace? I agree that the tokens themselves should not contain leading or trailing whitespace. I think it could be implemented more efficiently and generically (= language independent)?

https://github.com/guillaume-be/rust-tokenizers/blob/master/main/src/preprocessing/tokenizer/base_tokenizer.rs#L130

.replace(" do not", " don't")

I don't like this part because it changes the actual text and doesn't just concern whitespace.

opened by proycon 2
Fix panic with unicode chars that are expanded at the end of sentences

The current tokenizers don't have test cases where Unicode chars are expanded in the normalization process with .nfkc(). This PR adds test cases and fixes the problem. Expanded Unicode chars in the middle of a sentence were previously tokenized wrongly, and Unicode chars that were expanded at the end of a text fragment led to panics.

We've filled in the tests with the output of the tokenizers to get the tests to pass, but it's probably worth checking whether they look alright, as we pattern-matched the tests and don't have any insight as to what the output should be.

opened by sftse 1
Structural errors.
Master MR: https://github.com/guillaume-be/rust-tokenizers/pull/76

[x] Add structural errors:

[x] Save source error type.

[x] Line location.

[x] Make reasonable error message.

[ ] Fix tests ignored in tokenization_utils: they broke because of the line location in the display message.
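
One way to read the checklist above is sketched below with the standard library only. LineError and parse_score are hypothetical names, and "line location" is interpreted here as a line number in the input file, which is an assumption; the PR may mean something else.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseFloatError;

/// Hypothetical structured error: keeps the location and the source error type.
#[derive(Debug)]
struct LineError {
    line: usize,
    source: ParseFloatError,
}

impl fmt::Display for LineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The location is part of the message, so tests matching on the
        // exact message text have to account for it.
        write!(f, "line {}: invalid score: {}", self.line, self.source)
    }
}

impl Error for LineError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Parse a score, attaching the line number to any failure.
fn parse_score(line: usize, raw: &str) -> Result<f32, LineError> {
    raw.trim()
        .parse::<f32>()
        .map_err(|source| LineError { line, source })
}

fn main() {
    match parse_score(7, "not-a-number") {
        Ok(score) => println!("score = {}", score),
        Err(err) => println!("error: {} (has source: {})", err, err.source().is_some()),
    }
}
```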

@veta666, could you finish this PR?
opened by npatsakula 1
Reading SentencePieceVocab from text file
I've created a SentencePiece model using Python which results in a .model and a .vocab file. It is not possible to create a SentencePieceVocab from the latter since Python does not seem to use protobuf but rather a plain text file. Here's an excerpt of my file:

<unk> 0
<s> 0
</s> 0
▁ -2.29038
s -3.10405
l -3.41047

I didn't find an option in the Python code for creating a protobuf vocab file so I wrote a parser. Unless I'm mistaken and did something wrong, would you like that code as a PR? I.e. something like:

impl SentencePieceVocab {
    ...
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        ...
    }
}

in rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs
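
For discussion, a standalone sketch of what such a parser could do, assuming each line holds a piece followed by whitespace and a score, as in the excerpt above. parse_vocab_text is a hypothetical free function returning plain (piece, score) pairs; it does not build a SentencePieceVocab and uses String errors rather than TokenizerError to stay self-contained.

```rust
/// Parse `.vocab`-style text: one `piece score` pair per line.
fn parse_vocab_text(content: &str) -> Result<Vec<(String, f32)>, String> {
    let mut entries = Vec::new();
    for (idx, line) in content.lines().enumerate() {
        let line = line.trim_end();
        if line.is_empty() {
            continue;
        }
        // Split at the last whitespace, so only the final field is read as the score.
        let (piece, score) = line
            .rsplit_once(char::is_whitespace)
            .ok_or_else(|| format!("line {}: expected `piece score`", idx + 1))?;
        let score: f32 = score
            .parse()
            .map_err(|e| format!("line {}: invalid score: {}", idx + 1, e))?;
        entries.push((piece.to_string(), score));
    }
    Ok(entries)
}

fn main() {
    let sample = "<unk>\t0\n<s>\t0\n</s>\t0\n\u{2581}\t-2.29038\ns\t-3.10405\nl\t-3.41047\n";
    match parse_vocab_text(sample) {
        Ok(entries) => {
            // The position in the file serves as the token id in this sketch.
            for (id, (piece, score)) in entries.iter().enumerate() {
                println!("{} {:?} {}", id, piece, score);
            }
        }
        Err(message) => eprintln!("{}", message),
    }
}
```

Taking the line index as the token id is itself an assumption that would need checking against the ids stored in the matching .model file.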
opened by MikaelCall 5