Vaporetto: a fast and lightweight pointwise prediction-based tokenizer

🛥 VAporetto: POintwise pREdicTion based TOkenizer

Overview

This repository includes both a Rust crate that provides Vaporetto's APIs and CLI frontends.

The following examples use KFTT for training and prediction data.

Training

% cargo run --release --bin train -- --model ./kftt.model --tok ./kftt-data-1.0/data/tok/kyoto-train.ja

Prediction

% cargo run --release --bin predict -- --model ./kftt.model < ./kftt-data-1.0/data/orig/kyoto-test.ja > ./tokenized.ja

Conversion from KyTea's Model File

% cargo run --release --bin convert_kytea_model -- --model-in ./jp-0.4.7-5.mod --model-out ./kytea.model

Disclaimer

This software is developed by LegalForce, Inc., but is not an officially supported LegalForce product.

License

Licensed under either of

• Apache License, Version 2.0
• MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • Add wsconst option and remove multithreading support from CLI

    This branch adds the following to the CLIs:

    • Pre-processing (character normalization)
    • Post-processing (the --wsconst option)

    It also removes multithreading support from the predict command.
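    The PR does not spell out the normalization rules, but a typical pre-processing step for Japanese text maps full-width ASCII to half-width. A minimal sketch under that assumption (the function name and exact rules are illustrative, not Vaporetto's actual implementation):

```rust
// Illustrative sketch of the kind of character normalization a pre-processing
// step might perform: full-width ASCII -> half-width, ideographic space ->
// ASCII space. These rules are an assumption for illustration, not
// Vaporetto's actual rule set.
fn normalize(text: &str) -> String {
    text.chars()
        .map(|c| match c as u32 {
            // Full-width '!'..'~' (U+FF01..=U+FF5E) -> half-width ASCII.
            0xFF01..=0xFF5E => char::from_u32(c as u32 - 0xFEE0).unwrap(),
            // Ideographic space (U+3000) -> ASCII space.
            0x3000 => ' ',
            _ => c,
        })
        .collect()
}

fn main() {
    println!("{}", normalize("Ｖａｐｏｒｅｔｔｏ　１２３"));
}
```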

    opened by vbkaisetsu 4
  • Reimplement Vaporetto with support for multiple tags

    Supports multiple tags.

    Currently, a single tag written after a slash following each token is used to train the tag classifier. This branch changes this so that multiple tags, including pronunciations, can be predicted. To specify multiple tags, use multiple slashes as follows:

    この/代名詞/コノ 人/名詞/ヒト は/助詞/ワ 火星/名詞 人/接尾辞/ジン です/助動詞/デス
    

    This feature does not yet support unknown words.
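    The slash-delimited format above (surface/tag1/tag2 …) can be read with a simple split. This is an illustrative sketch, not Vaporetto's actual training-data parser:

```rust
// Illustrative parser for the surface/tag1/tag2 format shown above.
// Not Vaporetto's actual parser (e.g. it does not handle escaped slashes).
fn parse_field(field: &str) -> (&str, Vec<&str>) {
    let mut parts = field.split('/');
    let surface = parts.next().unwrap_or("");
    (surface, parts.collect())
}

fn main() {
    let line = "この/代名詞/コノ 人/名詞/ヒト は/助詞/ワ 火星/名詞 人/接尾辞/ジン です/助動詞/デス";
    for field in line.split_whitespace() {
        let (surface, tags) = parse_field(field);
        println!("{surface}\t{tags:?}");
    }
}
```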

    Changes in Predictor struct

    In the previous version, the predict() function took a Sentence and returned a modified one. In this change, predict() takes a mutable reference to a Sentence instead.

    Changes in Sentence struct

    In the previous version, the to_tokenized_vec() function returned a newly allocated vector of tokens. In this change, that function is removed, and iter_tokens(), which returns an iterator over tokens, is added.

    This branch also includes refactoring and other API changes.
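    The API shape described above can be sketched with toy types: predict() fills token spans through a mutable reference, and iter_tokens() lends an iterator instead of allocating a vector. The real vaporetto types are more involved; only the names mentioned in the text are taken from the crate, and the boundary logic here is a stand-in:

```rust
// Toy sketch of the described API shape; not the real vaporetto crate.
struct Sentence {
    text: String,
    spans: Vec<(usize, usize)>, // (start, end) byte offsets of tokens
}

impl Sentence {
    fn from_raw(text: &str) -> Self {
        Sentence { text: text.to_string(), spans: Vec::new() }
    }

    // Replaces the removed to_tokenized_vec(): no allocation, just views.
    fn iter_tokens(&self) -> impl Iterator<Item = &str> + '_ {
        self.spans.iter().map(move |&(s, e)| &self.text[s..e])
    }
}

struct Predictor;

impl Predictor {
    // Takes &mut Sentence instead of consuming and returning one.
    // Stand-in boundary logic: split on ASCII spaces.
    fn predict(&self, sentence: &mut Sentence) {
        sentence.spans.clear();
        let mut start = 0;
        for part in sentence.text.split(' ') {
            sentence.spans.push((start, start + part.len()));
            start += part.len() + 1;
        }
    }
}

fn main() {
    let mut s = Sentence::from_raw("this is a pen");
    Predictor.predict(&mut s);
    let tokens: Vec<&str> = s.iter_tokens().collect();
    println!("{tokens:?}");
}
```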

    opened by vbkaisetsu 3
  • Support serialization with bincode v2

    This branch uses bincode for the serialization and deserialization of Model and Predictor.

    The deserialization feature of Predictor is unsafe, and users should not load serialized data provided by third-party distributors; for this reason, the predict command does not support loading serialized Predictors.

    I will add an example usage in another branch.

    opened by vbkaisetsu 2
  • Add demo page

    This branch removes the previous wasm example and adds a demo page.

    The demo page has already been deployed manually. This branch automatically deploys the demo when the main branch is updated.

    opened by vbkaisetsu 1
  • Update tantivy requirement from 0.17 to 0.18


    Updates the requirements on tantivy to permit the latest version.

    Changelog

    Sourced from tantivy's changelog.

    Tantivy 0.18

    • For date values chrono has been replaced with time (@​uklotzde) #1304 :
      • The time crate is re-exported as tantivy::time instead of tantivy::chrono.
      • The type alias tantivy::DateTime has been removed.
      • Value::Date wraps time::PrimitiveDateTime without time zone information.
      • Internally date/time values are stored as seconds since UNIX epoch in UTC.
      • Converting a time::OffsetDateTime to Value::Date implicitly converts the value into UTC. If this is not desired do the time zone conversion yourself and use time::PrimitiveDateTime directly instead.
    • Add histogram aggregation (@​PSeitz)
    • Add support for fastfield on text fields (@​PSeitz)
    • Add terms aggregation (@​PSeitz)
    • Add support for zstd compression (@​kryesh)

    Tantivy 0.17

    • LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@​shikhar @​fulmicoton) #115
    • Adds a searcher Warmer API (@​shikhar @​fulmicoton)
    • Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211
    • Facets are necessarily indexed. Existing index with indexed facets should work out of the box. Index without facets that are marked with index: false should be broken (but they were already broken in a sense). (@​fulmicoton) #1195 .
    • Bugfix that could, in theory, impact durability on some filesystems #1224
    • Schema now offers not indexing fieldnorms (@​lpouget) #922
    • Reduce the number of fsync calls #1225
    • Fix opening bytes index with dynamic codec (@​PSeitz) #1278
    • Added an aggregation collector for range, average and stats compatible with Elasticsearch. (@​PSeitz)
    • Added a JSON schema type @​fulmicoton #1251
    • Added support for slop in phrase queries @​halvorboe #1068

    Tantivy 0.16.2

    • Bugfix in FuzzyTermQuery. (transposition_cost_one was not doing anything)

    Tantivy 0.16.1

    • Major Bugfix on multivalued fastfield. #1151
    • Demux operation (@​PSeitz)

    Tantivy 0.16.0

    Tantivy 0.15.3

    Tantivy 0.15.2

    ... (truncated)

    Commits

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 1
  • Add --buffered-out option to predict command

    This branch adds a --buffered-out option to the predict command.

    When this option is enabled, stdout is wrapped in a BufWriter, which improves tokenization speed.

    On the other hand, when this option is enabled, results are not flushed line by line.

    $ cargo run --release -p predict -- --buffered-out --model model-tags.zst < ./inputdata > /dev/null
    Loading model file...
    Start tokenization
    Elapsed: 0.204146218 [sec]
    
    $ cargo run --release -p predict -- --model model-tags.zst < ./inputdata > /dev/null
    Loading model file...
    Start tokenization
    Elapsed: 0.230809509 [sec]
    
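    Conceptually, the option batches many small writes into fewer system calls. A minimal sketch of that trade-off (the helper name is illustrative; this is not the predict command's actual code):

```rust
use std::io::{BufWriter, Write};

// Sketch of what --buffered-out does conceptually: wrapping the output in a
// BufWriter batches many small writes into fewer system calls, so writing is
// faster but output no longer appears line by line.
fn write_lines<W: Write>(out: W, lines: &[&str]) -> std::io::Result<()> {
    let mut out = BufWriter::new(out);
    for line in lines {
        writeln!(out, "{line}")?; // stays in the buffer...
    }
    out.flush() // ...until flushed (or the BufWriter is dropped)
}

fn main() -> std::io::Result<()> {
    let stdout = std::io::stdout();
    write_lines(stdout.lock(), &["東京 都 に 行く", "これ は ペン です"])
}
```

    Because the generic writer accepts anything implementing Write, the same batching behavior can be observed by writing into a Vec<u8>.
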
    opened by vbkaisetsu 1
  • error: The following required arguments were not provided: --model-out <model-out>

    README.md says:

    %  cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
    
    

    but this happens:

    # cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
        Updating crates.io index
      Downloaded cc v1.0.72
      Downloaded structopt-derive v0.4.18
      Downloaded quote v1.0.15
      Downloaded proc-macro2 v1.0.36
      Downloaded proc-macro-error v1.0.4
      Downloaded syn v1.0.86
      Downloaded bincode v1.3.3
      Downloaded anyhow v1.0.53
      Downloaded bitflags v1.3.2
      Downloaded ansi_term v0.12.1
      Downloaded unicode-segmentation v1.9.0
      Downloaded vec_map v0.8.2
      Downloaded jobserver v0.1.24
      Downloaded zstd v0.9.2+zstd.1.5.1
      Downloaded textwrap v0.11.0
      Downloaded heck v0.3.3
      Downloaded zstd-sys v1.6.2+zstd.1.5.1
      Downloaded libc v0.2.117
      Downloaded lazy_static v1.4.0
      Downloaded clap v2.34.0
      Downloaded structopt v0.3.26
      Downloaded zstd-safe v4.1.3+zstd.1.5.1
      Downloaded unicode-width v0.1.9
      Downloaded strsim v0.8.0
      Downloaded serde_derive v1.0.136
      Downloaded proc-macro-error-attr v1.0.4
      Downloaded version_check v0.9.4
      Downloaded unicode-xid v0.2.2
      Downloaded serde v1.0.136
      Downloaded atty v0.2.14
      Downloaded byteorder v1.4.3
      Downloaded daachorse v0.2.1
      Downloaded 32 crates (2.5 MB) in 1.12s
       Compiling libc v0.2.117
       Compiling proc-macro2 v1.0.36
       Compiling unicode-xid v0.2.2
       Compiling syn v1.0.86
       Compiling version_check v0.9.4
       Compiling serde_derive v1.0.136
       Compiling serde v1.0.136
       Compiling anyhow v1.0.53
       Compiling zstd-safe v4.1.3+zstd.1.5.1
       Compiling unicode-segmentation v1.9.0
       Compiling unicode-width v0.1.9
       Compiling bitflags v1.3.2
       Compiling byteorder v1.4.3
       Compiling strsim v0.8.0
       Compiling ansi_term v0.12.1
       Compiling vec_map v0.8.2
       Compiling lazy_static v1.4.0
       Compiling textwrap v0.11.0
       Compiling daachorse v0.2.1
       Compiling heck v0.3.3
       Compiling proc-macro-error-attr v1.0.4
       Compiling proc-macro-error v1.0.4
       Compiling quote v1.0.15
       Compiling atty v0.2.14
       Compiling jobserver v0.1.24
       Compiling clap v2.34.0
       Compiling cc v1.0.72
       Compiling zstd-sys v1.6.2+zstd.1.5.1
       Compiling structopt-derive v0.4.18
       Compiling structopt v0.3.26
       Compiling zstd v0.9.2+zstd.1.5.1
       Compiling bincode v1.3.3
       Compiling vaporetto v0.2.0 (/work/vae_experiments/vaporetto/vaporetto)
       Compiling convert_kytea_model v0.1.0 (/work/vae_experiments/vaporetto/convert_kytea_model)
        Finished release [optimized] target(s) in 1m 14s
         Running `target/release/convert_kytea_model --model-in jp-0.4.7-5-tokenize.model.zstd`
    error: The following required arguments were not provided:
        --model-out <model-out>
    
    USAGE:
        convert_kytea_model --model-in <model-in> --model-out <model-out>
    

    I think the correct command is:

    % cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5.mod --model-out jp-0.4.7-5-tokenize.model.zstd
    
    opened by ghost 1
  • Add a rule for GitHub Actions

    This branch adds a rule for GitHub Actions.

    Examples:

    • Failed: https://github.com/vbkaisetsu/vaporetto/runs/4058966169?check_suite_focus=true
    • Succeeded: https://github.com/vbkaisetsu/vaporetto/runs/4058989305?check_suite_focus=true
    opened by vbkaisetsu 1
  • Use normal arrays instead of FSTs for holding dictionaries

    Currently, Vaporetto uses FSTs to hold dictionaries, but an FST is more powerful than necessary, and not fast enough, for simply holding a dictionary. Therefore, this branch switches to normal arrays.

    In addition, this branch adds zstd compression to the CLI frontends. Compression is not a core feature of Vaporetto, so it is not included in the vaporetto crate.

    This change affects the data structure of the model file, so previous model data is not compatible with this branch. Moreover, a model file is large binary data that is inappropriate to manage in the source code repository, so I have removed the model data for now. I will release model data in other ways.

    Note:

    | Method               | Size (bytes) |
    | -------------------- | -----------: |
    | FST                  |   22,457,279 |
    | FST + zstd           |    6,224,462 |
    | Normal arrays        |   32,678,923 |
    | Normal arrays + zstd |    4,554,971 |

    Base model file: jp-0.4.7-5.mod (KyTea's model file)
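    The speed side of the trade-off can be illustrated with membership queries over a plain sorted array: each probe is a single slice comparison, with no automaton traversal as in an FST. This sketch is not Vaporetto's actual dictionary layout:

```rust
// Illustrative membership lookup in a plain sorted array via binary search.
// Not Vaporetto's actual dictionary structure.
fn contains(sorted_dict: &[&str], word: &str) -> bool {
    sorted_dict.binary_search(&word).is_ok()
}

fn main() {
    // Entries must be sorted (byte order) for binary_search to be correct.
    let dict = ["京都", "大阪", "東京"];
    debug_assert!(dict.windows(2).all(|w| w[0] <= w[1]));
    println!("{}", contains(&dict, "大阪"));
    println!("{}", contains(&dict, "奈良"));
}
```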

    opened by vbkaisetsu 1
  • Update tantivy requirement from 0.18 to 0.19

    Updates the requirements on tantivy to permit the latest version.

    Changelog

    Sourced from tantivy's changelog.

    Tantivy 0.19

    Bugfixes

    Features/Improvements


    ... (truncated)

    Commits

    dependencies 
    opened by dependabot[bot] 0
  • Update zstd requirement from 0.11 to 0.12

    Updates the requirements on zstd to permit the latest version.

    Commits

    dependencies 
    opened by dependabot[bot] 0
Releases: v0.5.1