💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Hugging Face

Last update: Jan 5, 2023

Related tags

Text processing nlp natural-language-processing transformers gpt language-model bert natural-language-understanding

Overview

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

Train new vocabularies and tokenize, using today's most used tokenizers.
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
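The last feature (truncate, pad, add special tokens) can be pictured with a small pure-Python sketch. It does not call the library; the function name `post_process` and its parameters are hypothetical, and the `[CLS]`/`[SEP]`/`[PAD]` layout is just one common convention, assumed here for illustration.

```python
from typing import List


def post_process(tokens: List[str], max_length: int,
                 cls: str = "[CLS]", sep: str = "[SEP]",
                 pad: str = "[PAD]") -> List[str]:
    """Truncate, wrap in special tokens, then pad to exactly max_length.

    Hypothetical helper for illustration only, not the library's API.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    # Reserve two slots for the special tokens, then truncate the rest.
    body = tokens[: max_length - 2]
    framed = [cls] + body + [sep]
    # Pad on the right until the target length is reached.
    return framed + [pad] * (max_length - len(framed))


padded = post_process(["hello", "world"], max_length=6)
truncated = post_process(["a", "b", "c", "d", "e", "f"], max_length=6)
print(padded)
print(truncated)
```

Both results have the same length, which is the property a batch of model inputs needs.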

Bindings

We provide bindings to the following languages (more to come!):

Rust (Original implementation)
Python
Node.js

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
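To give a feel for what a whitespace-style pre-tokenizer produces, here is a minimal pure-Python sketch. It assumes the split can be described by the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), and it keeps character offsets so each piece can be mapped back to the original text. The helper name `pre_tokenize` is hypothetical and the sketch does not use the library.

```python
import re
from typing import List, Tuple

# Assumption for this sketch: words are runs of word characters, and any
# other non-whitespace characters are grouped into runs of their own.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Return (piece, (start, end)) pairs; offsets index into `text`."""
    return [(m.group(), m.span()) for m in PATTERN.finditer(text)]


text = "Hello, y'all!"
pieces = pre_tokenize(text)
for piece, (start, end) in pieces:
    # The offsets always recover the exact slice of the original string.
    print(piece, (start, end), text[start:end])
```

Keeping the offsets next to each piece is what makes it possible to point from a token back to the span of input it came from.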

Then training your tokenizer on a set of files takes just two lines of code:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
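For readers curious about what BPE training does conceptually, the following toy sketch learns merge rules from a list of words in plain Python. It is a simplified illustration, not the library's Rust implementation: the function name `learn_bpe`, the tie-breaking rule, and the tiny corpus are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption
        # made here so the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(corpus, num_merges=3)
print(merges)
```

Each learned pair becomes a new vocabulary entry, so frequent character sequences end up as single tokens.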

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

Check the Python documentation or the Python quicktour to learn more!

A fast implementation of Aho-Corasick in Rust.

aho-corasick A library for finding occurrences of many patterns at once with SIMD acceleration in some cases. This library provides multiple pattern s

662 Dec 31, 2022

Blazingly fast framework for in-process microservices on top of Tower ecosystem

norpc = not remote procedure call Motivation Developing an async application is often a very difficult task but building an async application as a set

15 Dec 12, 2022

Ultra-fast, spookily accurate text summarizer that works on any language

pithy 0.1.0 - an absurdly fast, strangely accurate, summariser Quick example: pithy -f your_file_here.txt --sentences 4 --help: Print this help messa

13 Oct 31, 2022

A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Vidyut मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥ Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard co

14 Dec 30, 2022

Checks all your documentation for spelling and grammar mistakes with hunspell and a nlprule based checker for grammar

cargo-spellcheck Check your spelling with hunspell and/or nlprule. Use Cases Run cargo spellcheck --fix or cargo spellcheck fix to fix all your docume

274 Nov 5, 2022

A Markdown to HTML compiler and Syntax Highlighter, built using Rust's pulldown-cmark and tree-sitter-highlight crates.

A blazingly fast (possibly the fastest) Markdown to HTML parser and syntax highlighter, built using Rust's pulldown-cmark and tree-sitter-highlight crates, running natively through Node's Foreign Function Interface.

48 Nov 11, 2022

Text calculator with support for units and conversion

cpc calculation + conversion cpc parses and evaluates strings of math, with support for units and conversion. 128-bit decimal floating points are used

82 Jan 4, 2023

A command-line tool and library for generating regular expressions from user-provided test cases

Table of Contents What does this tool do? Do I still need to learn to write regexes then? Current features How to install? 4.1 The command-line tool 4

5.8k Dec 30, 2022

Find and replace text in source files

Ruplacer Find and replace text in source files: $ ruplacer old new src/ Patching src/a_dir/sub/foo.txt -- old is everywhere, old is old ++ new is ever

331 Dec 28, 2022

Comments

Support for incremental decoding

I would like to be able to decode a sequence of token ids incrementally in a decoder-agnostic manner. I haven't found a straightforward way to do this with the current API - the first token is treated differently by some decoders which means that in general

decode([1,2,3]) != decode([1]) + decode([2]) + decode([3])

It would be really nice to have some kind of "continuation" flag to indicate that the result is intended to be appended to an already-decoded prefix, so that you could have

decode([1,2,3]) == decode([1]) + decode'([2]) + decode'([3])

It would also be nice to have a variant of this that takes either a single u32 id or a string token rather than a vec, for related reasons (the latter could be used with id_to_token).

I'd love to know if there is another way to achieve this than my current ugly workaround :)

Current workaround

pub(crate) struct Decoder {
    pub(crate) tokenizer: Tokenizer,
    prefix_id: u32,
    prefix: String,
}

impl Decoder {
    pub(crate) fn new(tokenizer: Tokenizer) -> Decoder {
        let prefix_id = tokenizer.token_to_id("A").unwrap();
        Decoder {
            prefix_id,
            prefix: tokenizer.decode(vec![prefix_id], false).unwrap(),
            tokenizer,
        }
    }

    /// Decode continuation tokens to be added to some existing text
    pub(crate) fn decode_continuation(&self, mut ids: Vec<u32>) -> tokenizers::Result<String> {
        // How we handle this depends on the specific decoder's behaviour,
        // see each one's implementation of decode_chain in the tokenizers library.
        match self.tokenizer.get_decoder() {
            Some(ByteLevel(_)) => {
                // Lossless - call standard decode function
                self.tokenizer.decode(ids, true)
            },
            Some(Metaspace(_)) | Some(WordPiece(_)) | Some(BPE(_)) => {
                // For these, the first token in the sequence is treated differently,
                // so we add and then strip a placeholder token.
                ids.insert(0, self.prefix_id);
                let result = self.tokenizer.decode(ids, true)?;
                Ok(result.strip_prefix(&self.prefix).ok_or(DecodingError)?.to_string())
            },
            None => {
                // Just prepend a space
                Ok(format!(" {}", self.tokenizer.decode(ids, true)?))
            },
            _ => Err(UnsupportedTokenizerError.into())
        }
    }
}
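The placeholder-prefix idea in the workaround above can be illustrated with a toy decoder in Python. This is only a sketch under stated assumptions: `decode` is a made-up WordPiece-style decoder (tokens starting with `##` continue the previous word), `PLACEHOLDER` and `decode_continuation` are hypothetical names, and real decoders in the library behave differently in their details.

```python
from typing import List


def decode(tokens: List[str]) -> str:
    """Toy WordPiece-style decoder: '##' marks a continuation of the previous word."""
    out = ""
    for i, tok in enumerate(tokens):
        if tok.startswith("##"):
            out += tok[2:]
        elif i == 0:
            out += tok  # the first token gets no leading space
        else:
            out += " " + tok
    return out


PLACEHOLDER = "A"  # hypothetical placeholder token, mirroring the workaround


def decode_continuation(tokens: List[str]) -> str:
    """Decode tokens as if they followed already-decoded text."""
    # Prepend a placeholder so the real tokens are never "first", then strip
    # the decoded placeholder from the front of the result.
    full = decode([PLACEHOLDER] + tokens)
    return full[len(decode([PLACEHOLDER])):]


tokens = ["token", "##izer", "works"]
whole = decode(tokens)
naive = "".join(decode([t]) for t in tokens)
incremental = decode(tokens[:1]) + "".join(
    decode_continuation([t]) for t in tokens[1:]
)
print(repr(whole), repr(naive), repr(incremental))
```

Decoding token by token loses the space before "works", while the continuation variant reproduces the full decode.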

opened by njhill 4

Can't import any modules

What it says on the tin. Every module I try importing into a script is spitting out a "module not found" error.

Traceback (most recent call last):
  File "ab2.py", line 3, in <module>
    from tokenizers.tools import BertWordPieceTokenizer
ImportError: cannot import name 'BertWordPieceTokenizer' from 'tokenizers.tools' (/home/../anaconda3/envs/tokenizers/lib/python3.7/site-packages/tokenizers/tools/__init__.py)

Traceback (most recent call last):
  File "ab2.py", line 3, in <module>
    from transformers import BertWordPieceTokenizer
ImportError: cannot import name 'BertWordPieceTokenizer' from 'transformers' (/home/../anaconda3/envs/tokenizers/lib/python3.7/site-packages/transformers/__init__.py)

I've tried:

import BertWordPieceTokenizer
from tokenizers.toold import AutoTokenizer
from tokenizers import BartTokenizer

to illustrate a few examples.

I've installed Tokenizers in an anaconda3 venv via pip, via conda forge, and compiled from source.

I've tried installing Transformers as well and get the same errors. I've tried installing Tokenizers and then installing Transformers and got the same errors.

I've tried installing Transformers and then Tokenizers and gotten the same error.

I've looked through the Tokenizers code and unless I'm missing something (entirely possible) autotokenize isn't even a part of the package? I'll admit I'm not a very experienced programmer but I'll be damned if I can find it.

Help would be appreciated.

System specs are:

Linux mint 21.1 RTX2080 ti i78700k

cudnn 8.1.1 cuda 11.2.0 Tensor rt 7.2.3 Python 3.7 (by the way, figuring out what was needed here, finding the files, and actually installing them was beyond arduous. There has to be a better way. It's the only way I could get anything at all to work though).

opened by kronkinatorix 1

How to decode with the existing tokenizer

I trained the tokenizer following the Hugging Face tutorial:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
tokenizer.save("tokenizer-wiki.json")

But I don't know how to use the existing tokenizer for decoding:

tokenizer = Tokenizer.from_file("tokenizer-wiki.json")
o=tokenizer.encode("sd jk sds  sds")
tokenizer.decode(o.ids)
# s d j k s ds s ds

I know we can recover the original text with o.offsets, but what if we do not know the offsets, for example when decoding the output of a language model or an NMT system?
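One way to see why the spacing is lost, and how it can be kept, is a toy scheme that stores word boundaries inside the tokens themselves. The sketch below is pure Python and only illustrates the idea behind marker-based schemes such as Metaspace; `toy_encode`, `toy_decode`, and the fixed-size split standing in for real subword segmentation are assumptions made for this example, not the library's API.

```python
from typing import List

MARK = "\u2581"  # "▁", used in this sketch to mark the start of each word


def toy_encode(text: str, piece_len: int = 2) -> List[str]:
    """Prefix each word with MARK, then cut it into fixed-size pieces.

    The fixed-size cut is a stand-in for real subword segmentation.
    """
    tokens: List[str] = []
    for word in text.split(" "):
        marked = MARK + word
        tokens.extend(
            marked[i:i + piece_len] for i in range(0, len(marked), piece_len)
        )
    return tokens


def toy_decode(tokens: List[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    return "".join(tokens).replace(MARK, " ")[1:]


text = "sd jk sds  sds"
tokens = toy_encode(text)
roundtrip = toy_decode(tokens)
print(tokens)
print(repr(roundtrip))
```

Because the boundary information travels with the tokens, decoding needs no offsets. The sketch assumes the input never contains the marker character itself.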

opened by ZhiYuanZeng 4

Is there any support for 'google/tapas-mini-finetuned-wtq' tokenizer?

I'm trying to run a tokenizer in Java and eventually compile it to run on Android for an open-domain question answering project. I'm wondering why 'google/tapas-mini-finetuned-wtq' doesn't work with DeepJavaLibrary. For more popular models the tokenizer works. I'm assuming there is no fast tokenizer for TAPAS, so I was wondering if anyone had any advice on how to go about running the TAPAS tokenizer and model on Android/Java?

opened by memetrusidovski 4

Releases(v0.13.2)

v0.13.2(Nov 7, 2022)

Python 3.11 support (Python only modification)
python-v0.13.2(Nov 7, 2022)
[0.13.2]

[#1096] Python 3.11 support

node-v0.13.2(Nov 7, 2022)

Python 3.11 support (Python only modification)
v0.13.1(Oct 6, 2022)
[0.13.1]

[#1072] Fixing Roberta type ids.

python-v0.13.1(Oct 6, 2022)
[0.13.1]

[#1072] Fixing Roberta type ids.

node-v0.13.1(Oct 6, 2022)
[0.13.1]

[#1072] Fixing Roberta type ids.

python-v0.13.0(Sep 21, 2022)
[0.13.0]

[#956] PyO3 version upgrade

[#1055] M1 automated builds

[#1008] Decoder is now a composable trait, while remaining backward compatible

[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
v0.13.0(Sep 19, 2022)
[0.13.0]

[#1009] unstable_wasm feature to support building on Wasm (it's unstable!)

[#1008] Decoder is now a composable trait, while remaining backward compatible

[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
node-v0.13.0(Sep 19, 2022)
[0.13.0]

[#1008] Decoder is now a composable trait, while remaining backward compatible

[#1047, #1051, #1052] Processor is now a composable trait, while remaining backward compatible

python-v0.12.1(Apr 13, 2022)
[0.12.1]

[#938] Reverted breaking change. https://github.com/huggingface/transformers/issues/16520

v0.12.0(Mar 31, 2022)
[0.12.0]

Bump minor version because of a breaking change.

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to rollback on that breaking change, and figure out a different way later to do this modification

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

[#961] Added link for Ruby port of tokenizers

[#960] Feature gate for cli and its clap dependency

python-v0.12.0(Mar 31, 2022)
[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to rollback on that breaking change, and figure out a different way later to do this modification

Bump minor version because of a breaking change.

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

[#962] Fix tests for python 3.10

[#961] Added link for Ruby port of tokenizers

node-v0.12.0(Mar 31, 2022)
[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to rollback on that breaking change, and figure out a different way later to do this modification

Bump minor version because of a breaking change. Using 0.12 to match other bindings.

[#938] Breaking change. Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own. tokenizers should be error free.

[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

[#954] Fixed not being able to save vocabularies with holes in vocab (ConvBert). Yell warnings instead, but stop panicking.

[#961] Added link for Ruby port of tokenizers

v0.11.2(Feb 28, 2022)
[#919] Fixing single_word AddedToken. (regression from 0.11.2)

[#916] Deserializing faster added_tokens by loading them in batch.

python-v0.11.6(Feb 28, 2022)
[#919] Fixing single_word AddedToken. (regression from 0.11.2)

[#916] Deserializing faster added_tokens by loading them in batch.

node-v0.8.3(Feb 28, 2022)

python-v0.11.5(Feb 16, 2022)

[#895] Add wheel support for Python 3.10
v0.11.1(Jan 17, 2022)
[#882] Fixing Punctuation deserialize without argument.

[#868] Fixing missing direction in TruncationParams

[#860] Adding TruncationSide to TruncationParams

python-v0.11.3(Jan 17, 2022)
[#882] Fixing Punctuation deserialize without argument.

[#868] Fixing missing direction in TruncationParams

[#860] Adding TruncationSide to TruncationParams

node-v0.8.2(Jan 17, 2022)

[#884] Fixing bad deserialization following inclusion of a default for Punctuation
node-v0.8.1(Jan 17, 2022)

Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).
python-v0.11.4(Jan 17, 2022)

[#884] Fixing bad deserialization following inclusion of a default for Punctuation
python-v0.11.2(Jan 4, 2022)

Fixes https://github.com/huggingface/tokenizers/pull/868
python-v0.11.1(Dec 28, 2021)

[#860] Adding TruncationSide to TruncationParams.
python-v0.11.0(Dec 24, 2021)
Fixed

[#585] Conda version should now work on old CentOS

[#844] Fixing interaction between is_pretokenized and trim_offsets.

[#851] Doc links

Added

[#657]: Add SplitDelimiterBehavior customization to Punctuation constructor

[#845]: Documentation for Decoders.

Changed

[#850]: Added a feature gate to enable disabling http features

[#718]: Fix WordLevel tokenizer determinism during training

[#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer

[#770]: Improved documentation for UnigramTrainer

[#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub

[#793]: Saving a pretty JSON file by default when saving a tokenizer

node-v0.8.0(Sep 2, 2021)
BREAKING CHANGES

Many improvements on the Trainer (#519). The files must now be provided first when calling tokenizer.train(files, trainer).

Features

Adding the TemplateProcessing

Add WordLevel and Unigram models (#490)

Add nmtNormalizer and precompiledNormalizer normalizers (#490)

Add templateProcessing post-processor (#490)

Add digitsPreTokenizer pre-tokenizer (#490)

Add support for mapping to sequences (#506)

Add splitPreTokenizer pre-tokenizer (#542)

Add behavior option to the punctuationPreTokenizer (#657)

Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)

Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)

python-v0.10.3(May 24, 2021)
Fixed

[#686]: Fix SPM conversion process for whitespace deduplication

[#707]: Fix stripping strings containing Unicode characters

Added

[#693]: Add a CTC Decoder for Wav2Vec2 models

Removed

[#714]: Removed support for Python 3.5

python-v0.10.2(Apr 5, 2021)
Fixed

[#652]: Fix offsets for Precompiled corner case

[#656]: Fix BPE continuing_subword_prefix

[#674]: Fix Metaspace serialization problems

python-v0.10.1(Feb 4, 2021)
Fixed

[#616]: Fix SentencePiece tokenizers conversion

[#617]: Fix offsets produced by Precompiled Normalizer (used by tokenizers converted from SPM)

[#618]: Fix Normalizer.normalize with PyNormalizedStringRefMut

[#620]: Fix serialization/deserialization for overlapping models

[#621]: Fix ByteLevel instantiation from a previously saved state (using __getstate__())

python-v0.10.0(Jan 12, 2021)
Added

[#508]: Add a Visualizer for notebooks to help understand how the tokenizers work

[#519]: Add a WordLevelTrainer used to train a WordLevel model

[#533]: Add support for conda builds

[#542]: Add Split pre-tokenizer to easily split using a pattern

[#544]: Ability to train from memory. This also improves the integration with datasets

[#590]: Add getters/setters for components on BaseTokenizer

[#574]: Add fuse_unk option to SentencePieceBPETokenizer

Changed

[#509]: Automatically stubbing the .pyi files

[#519]: Each Model can return its associated Trainer with get_trainer()

[#530]: The various attributes on each component can be read and set (e.g. tokenizer.model.dropout = 0.1)

[#538]: The API Reference has been improved and is now up-to-date.

Fixed

[#519]: During training, the Model is now trained in-place. This fixes several bugs that forced users to reload the Model after training.

[#539]: Fix BaseTokenizer enable_truncation docstring


Owner

Hugging Face

Solving NLP, one commit at a time!

GitHub https://huggingface.co/docs/tokenizers

A simple and fast linear algebra library for games and graphics

glam A simple and fast 3D math library for games and graphics. Development status glam is in beta stage. Base functionality has been implemented and t

953 Jan 3, 2023

Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

75 Jan 8, 2023
