Simple NLP in Rust with Python bindings

Overview

vtext


NLP in Rust with Python bindings

This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.

Features

  • Tokenization: Regexp tokenizer, Unicode segmentation + language-specific rules
  • Stemming: Snowball (in Python, 15-20x faster than NLTK)
  • Token counting: converting documents to sparse matrices of token counts for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities

Usage

Usage in Python

vtext requires Python 3.6+ and can be installed with,

pip install vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart" "after", "2:00", "pm", "."]

For more details see the project documentation: vtext.io/doc/latest/index.html

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.2.0"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks illustrate tokenization accuracy (F1 score) on UD treebanks,

lang   dataset   regexp   spacy 2.1   vtext
en     EWT       0.812    0.972       0.966
en     GUM       0.881    0.989       0.996
de     GSD       0.896    0.944       0.964
fr     Sequoia   0.844    0.968       0.971

and the English tokenization speed,

                       regexp   spacy 2.1   vtext
Speed (10⁶ tokens/s)   3.1      0.14        2.1
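
The F1 score here can be read as token-level precision/recall: a predicted token counts as correct when its character span exactly matches a gold token from the treebank. A minimal sketch of the metric (an illustration only, not the actual evaluation script),

# token-level F1: compare predicted token character spans against gold spans
def token_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                        # exact span matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# spans are (start, end) character offsets of each token
print(token_f1({(0, 7), (8, 10), (10, 13)}, {(0, 7), (8, 13)}))  # 0.4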

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

Speed (MB/s)                    scikit-learn 0.20.1   vtext (n_jobs=1)   vtext (n_jobs=4)
CountVectorizer.fit             14                    104                225
CountVectorizer.transform       14                    82                 303
CountVectorizer.fit_transform   14                    70                 NA
HashingVectorizer.transform     19                    89                 309

Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
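
For reference, a comparison of this kind can be sketched as below; this is not the actual benchmark script, and the vtext import path is an assumption,

# rough throughput comparison on the 20 newsgroups dataset
from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer as SkCountVectorizer
from vtext.vectorize import CountVectorizer  # assumed module path

docs = fetch_20newsgroups(subset="train").data
size_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6

for name, vect in [("scikit-learn", SkCountVectorizer()),
                   ("vtext", CountVectorizer())]:
    t0 = time()
    vect.fit_transform(docs)
    print(f"{name} CountVectorizer.fit_transform: {size_mb / (time() - t0):.0f} MB/s")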

License

vtext is released under the Apache License, Version 2.0.

Comments
  • Add sentence splitter

    Add sentence splitter

    It would be useful to add a sentence splitter. Possible options include:

    • Punkt sentence tokenizer from NLTK (needs a pre-trained model)
    • Unicode sentence boundaries from https://github.com/unicode-rs/unicode-segmentation/pull/24 (doesn't need a pre-trained model)
    • investigate the spaCy implementation (likely needs a pre-trained model)
    new feature 
    opened by rth 8
  • Sentence tokenization using Unicode segmentation (Python package)

    Sentence tokenization using Unicode segmentation (Python package)

    First attempt at including the UnicodeSentenceTokenizer in the Python package. I have two issues that I am unsure how to resolve:

    1. After compiling the lib with python3 setup.py develop (which runs without errors), trying to import the package in Python with import _lib gives the following fatal error:
    SystemError: Type does not define the tp_name field.
    thread '<unnamed>' panicked at 'An error occurred while initializing class UnicodeSentenceTokenizer', ...
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    fatal runtime error: failed to initiate panic, error 5
    
    2. In python/src/tokenize_sentence.rs I have resorted to creating a second base class BaseTokenize2 (see line 14), as I am unsure how to import BaseTokenize from python/src/tokenize.rs
    opened by joshlk 3
  • Update to PyO3 0.7

    Update to PyO3 0.7

    This updates to the latest PyO3, which allows using lifetimes in pymethods. As a result, tokenization in Python is a bit faster by avoiding string copies.

    On master,

    python3.7 benchmarks/bench_tokenizers.py
     Tokenizing 19924 documents
             Python re.findall(r'\b\w\w+\b', ...): 2.93s [31.0 MB/s, 2450 kWPS]
                    RegexpTokenizer(r'\b\w\w+\b'): 1.96s [46.5 MB/s, 3671 kWPS]
       UnicodeSegmentTokenizer(word_bounds=False): 2.97s [30.7 MB/s, 2269 kWPS]
        UnicodeSegmentTokenizer(word_bounds=True): 3.58s [25.4 MB/s, 3182 kWPS]
                             VTextTokenizer('en'): 4.11s [22.1 MB/s, 2467 kWPS]
                            CharacterTokenizer(4): 7.73s [11.8 MB/s, 5927 kWPS]
    

    after this PR,

    # Tokenizing 19924 documents
             Python re.findall(r'\b\w\w+\b', ...): 2.92s [31.2 MB/s, 2460 kWPS]
                    RegexpTokenizer(r'\b\w\w+\b'): 1.40s [64.8 MB/s, 5119 kWPS]
       UnicodeSegmentTokenizer(word_bounds=False): 2.48s [36.8 MB/s, 2721 kWPS]
        UnicodeSegmentTokenizer(word_bounds=True): 2.65s [34.3 MB/s, 4292 kWPS]
                             VTextTokenizer('en'): 3.32s [27.4 MB/s, 3053 kWPS]
                            CharacterTokenizer(4): 4.47s [20.4 MB/s, 10252 kWPS]
    
    opened by rth 3
  • Better support of configuration parameters in vectorizers

    Better support of configuration parameters in vectorizers

    Currently CountVectorizer and HashingVectorizer mostly perform bag-of-words token counting, without the possibility of changing the tokenizer or any other parameters.

    While we intentionally won't support all the parameters that the scikit-learn versions do (those meta-estimators do too much), additional parametrization would be preferable.

    • parametrization of the tokenizer will be addressed in #48
    opened by rth 2
  • Add tokenizer trait

    Add tokenizer trait

    This makes it possible to use any object implementing the Tokenizer trait in Vectorizers,

        let tokenizer = VTextTokenizer::new("en");
        let vectorizer = CountVectorizer::new(&tokenizer);
    
    opened by rth 2
  • Support different hash functions in HashingVectorizer

    Support different hash functions in HashingVectorizer

    Currently, we use the MurmurHash3 hash function from the rust-fasthash crate (to be more similar to the scikit-learn implementation). That crate also supports a number of other hash functions:

    City Hash, Farm Hash, Metro Hash, Mum Hash, Sea Hash, Spooky Hash, T1 Hash, xx Hash

    I'm not convinced hashing is currently the performance bottleneck, but in any case using a faster hash function such as xxhash would not hurt.

    This would involve updating the text-vectorize crate and adding a hasher parameter to the HashingVectorizer python estimator.

    Another use case could be to use different hash functions to reduce the effect of collisions (Svenstrup et al. 2017), discussed e.g. in https://stackoverflow.com/q/53767469/1791279
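
    For context, the hashing trick only needs "hash the token, then take the result modulo n_features", so the hasher is naturally a pluggable choice. A pure-Python illustration (not the vtext API),

        import zlib

        def hashed_index(token, n_features=2**20, hasher=zlib.crc32):
            # map a token to a column index; a different hasher gives a
            # different collision pattern for the same n_features
            return hasher(token.encode("utf-8")) % n_features

        print(hashed_index("flights"), hashed_index("depart"))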

    opened by rth 2
  • Rename UnicodeSegmentTokenizer to UnicodeWordTokenizer

    Rename UnicodeSegmentTokenizer to UnicodeWordTokenizer

    UnicodeSegmentTokenizer was meant to be a shorter version of "Unicode segmentation tokenizer", but the name is not very explicit. Besides, UnicodeSentenceTokenizer also uses the unicode-segmentation crate, which adds to the confusion. Maybe UnicodeWordTokenizer would be a better name?

    opened by rth 1
  • BLD Build for the wasm target

    BLD Build for the wasm target

    This aims to build vtext for the wasm32 target.

    Currently a blocker is the incompatible rustc-serialize crate scattered through the dependencies. In particular,

    • ndarray <0.13 (required by sprs: will be fixed in https://github.com/vbarrielle/sprs/pull/175)
    • num-complex < 0.2 (used in sprs: will be fixed in https://github.com/vbarrielle/sprs/pull/164)

    Other issues may be discovered later on.

    opened by rth 1
  • TST add float_cmp crate for tests

    TST add float_cmp crate for tests

    What would you think about adding float-cmp for doing float comparisons in tests?

    Clippy complains about float comparisons, and I've run into it with other projects... This is the first time I've tried out this crate to deal with float comparisons in tests.

    opened by jbowles 1
  • Tokenizers dispatch in vectorizers

    Tokenizers dispatch in vectorizers

    Follow-up on #48; allows selecting the tokenizer used in vectorizers.

    As mentioned in https://github.com/rth/vtext/pull/48#issuecomment-488223434, this will require PyO3 0.7.0, which includes https://github.com/PyO3/pyo3/pull/461 (hopefully released in the coming days), or a way to avoid using explicitly defined lifetimes in python wrapper methods for vectorizers.

    opened by rth 1
  • ENH Avoid copying tokens in tokenizers in Python

    ENH Avoid copying tokens in tokenizers in Python

    Currently, tokenizers return Vec<String>, even though tokens are slices of the input string. Moving to Vec<&str> would remove one memory copy and is likely to help with run time.

    This should be possible with PyO3 0.7.0 (not yet released) that will allow using lifetime specifiers in pymethods.

    python tokenization performance 
    opened by rth 1
  • Bump numpy from 1.17.3 to 1.22.0 in /ci

    Bump numpy from 1.17.3 to 1.22.0 in /ci

    Bumps numpy from 1.17.3 to 1.22.0.

    dependencies 
    opened by dependabot[bot] 0
  • Treebank word tokenizer from NLTK

    Treebank word tokenizer from NLTK

    This adds an NLTKWordTokenizer which implements the default tokenizer from NLTK. The test suite from NLTK passes, however we are not handling one edge case due to the lack of lookahead functionality in the regex crate. I don't think it's worth adding another library as a dependency to address that, and marking it as a known limitation could be a workaround for now. That regexp is an enhancement proposed by NLTK on top of the classical Penn Treebank word tokenizer.

    Currently this returns Vec<String> and I have struggled with making it return an iterator due to lifetime issues so far.

    It's around 3x faster than the NLTK version in Python. It is very English-specific and should probably not be used for other languages.

    TODO:

    • [ ] improve documentation
    opened by rth 0
  • Feature/kskip-ngram

    Feature/kskip-ngram

    Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.

    Provides the same output as the equivalent nltk functions - although nltk does generate duplicates sometimes which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.

    Currently the iterator outputs Vec<Vec<&str>>. I'm unsure if this is desirable or the function should join the items into a string to output Vec<String>. e.g. Currently it does: vec![vec!["One", "Two"], vec!["Two", "Three"] ...] but it might be desirable to have vec!["One Two", "Two Three"].

    The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This, however, considerably complicated the code. I have tried to refactor it to be as readable as possible, but it would be good to get a second pair of eyes on it.

    There is also still the question of how to chain the different components together (which links to #21). Currently the transform method takes an input iterator and provides an iterator as output which can be consumed by the user.
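
    For reference, the underlying idea can be sketched in a few lines of Python; this is only an illustration of the concept, not the Rust implementation in this PR (padding, everygrams and the exact nltk ordering are not reproduced),

        from itertools import combinations

        def kskip_ngrams(tokens, n, k):
            # n-grams with at most k skipped tokens in total: the first token
            # of each window is always kept, and choosing the remaining n-1
            # positions inside a window of n+k tokens bounds the skips by k
            for start in range(len(tokens) - n + 1):
                window = tokens[start:start + n + k]
                for rest in combinations(range(1, len(window)), n - 1):
                    yield tuple(window[i] for i in (0,) + rest)

        print(list(kskip_ngrams(["One", "Two", "Three", "Four"], n=2, k=1)))
        # [('One', 'Two'), ('One', 'Three'), ('Two', 'Three'), ('Two', 'Four'), ('Three', 'Four')]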

    Todo:

    • [x] Add Python interface
    • [x] Benchmark against nltk
    • [ ] Add function for character ngrams
    • [x] Further documentation
    opened by joshlk 10
  • Fine-tune tokenizers

    Fine-tune tokenizers

    It can happen that the tokenization results are unsatisfactory in some way, and the question is what mechanism should be used to customize or improve them. Two options:

    a) add options for these optional improvements in the tokenizer itself; the issue is that some of them might be relevant to multiple tokenizers.
    b) add a new step later in the pipeline; that's probably the best way to allow arbitrary customization, but some steps might be specific to the previous step, and adding them to the library might be confusing.

    There is probably a balance that needs to be found between the two.

    For instance,

    1. PunctuationTokenizer,
      • currently doesn't take into account repeated punctuation
        >>> PunctuationTokenizer().tokenize("test!!!")                                                                                                     
        ['test!', '!', '!']
        
      • will tokenize abbreviations separated by . as separate sentences
        >>> PunctuationTokenizer().tokenize("W.T.O.")
        ['W.', 'T.', 'O.']
        

      both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token); see the sketch at the end of this comment.

    2. UnicodeSentenceTokenizer will not split sentences separated by punctuation without a following space, e.g.,
      >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
      ['One sentence.Another sentence.']
      

      That's a very common occurrence in actual text, and I think a workaround should be found (e.g. using an additional tokenization pass with a regex/punctuation tokenizer).

    Generally it would be good to add some sentence-tokenization evaluation benchmarks to the evaluation/ folder.

    3. UnicodeTokenizer is currently extended in VTextTokenizer (for lack of a better name), with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even using an ML model).
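
    A sketch of the minimal-length workaround mentioned in point 1, as plain Python post-processing independent of which tokenizer produced the input,

        def merge_short_sentences(sentences, min_len=4):
            # append any fragment shorter than min_len characters to the
            # previous sentence instead of keeping it as its own sentence
            merged = []
            for sent in sentences:
                if merged and len(sent) < min_len:
                    merged[-1] += sent
                else:
                    merged.append(sent)
            return merged

        print(merge_short_sentences(['test!', '!', '!']))  # ['test!!!']
        print(merge_short_sentences(['W.', 'T.', 'O.']))   # ['W.T.O.']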
    opened by rth 0
  • Standardize language option

    Standardize language option

    From https://github.com/rth/vtext/pull/78#issuecomment-644009378 by @joshlk

    I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).

    In particular, we should implement this for the Snowball stemmer in python which currently uses the full language names.

    I am also wondering whether, in Rust, we should use String for the language parameter or define an enum, e.g.,

    use vtext::lang;

    let stemmer = SnowballStemmerParams::default().lang(lang::en).build();
    

    The latter is probably simpler, but it makes it a bit harder to extend: e.g. if someone designs a custom estimator for a language not in the list (e.g. some ancient, infrequently used language), they would have to create a new enum.

    Also just to be consistent the parameter name would be "lang" not "language", right?

    opened by rth 0
Owner
Roman Yurchak
Data Scientist & Founder @symerio.