Vaporetto: a fast and lightweight pointwise prediction based tokenizer

Last update: Dec 22, 2022

Related tags

Text processing nlp rust library japanese tokenizer analyzer morphological

Overview

🛥 VAporetto: POintwise pREdicTion based TOkenizer

Vaporetto is a fast and lightweight pointwise prediction based tokenizer.

This repository includes both a Rust crate that provides APIs for Vaporetto and CLI frontends.

The following examples use KFTT for training and prediction data.

Training

% cargo run --release --bin train -- --model ./kftt.model --tok ./kftt-data-1.0/data/tok/kyoto-train.ja

Prediction

% cargo run --release --bin predict -- --model ./kftt.model < ./kftt-data-1.0/data/orig/kyoto-test.ja > ./tokenized.ja

Conversion from KyTea's Model File

% cargo run --release --bin convert_kytea_model -- --model-in ./jp-0.4.7-5.mod --model-out ./kytea.model

Disclaimer

This software is developed by LegalForce, Inc., but it is not an officially supported LegalForce product.

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments

Add wsconst option and remove multithreading support from CLI
This branch adds pre-processing (character normalization) and post-processing (the --wsconst option) to the CLIs, and removes multithreading support from the predict command.
opened by vbkaisetsu 4
Reimplement Vaporetto with supporting multiple tags
Supports multiple tags

Currently, each token can be followed by a slash and a single tag, and that tag is used to train the tag classifier for the token. This branch changes this so that multiple tags, including pronunciation, can be predicted. To specify multiple tags, use multiple slashes as follows:

この/代名詞/コノ 人/名詞/ヒト は/助詞/ワ 火星/名詞 人/接尾辞/ジン です/助動詞/デス

This feature does not support unknown words yet.
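The slash-separated format can be illustrated with a small parser. This is only a sketch under stated assumptions: tokens are separated by whitespace, no escaping of '/' is handled, and the `TaggedToken` type and `parse_tagged_line` function are hypothetical names that are not part of Vaporetto's API.

```rust
/// One token of a tagged line: a surface form plus zero or more tags.
/// Hypothetical type for illustration; not Vaporetto's own data structure.
#[derive(Debug, PartialEq)]
pub struct TaggedToken {
    pub surface: String,
    pub tags: Vec<String>,
}

/// Splits a line such as "火星/名詞 人/接尾辞/ジン" into tokens.
/// Assumptions: tokens are separated by whitespace, and '/' never occurs
/// inside a surface form or a tag (escaping is not handled here).
pub fn parse_tagged_line(line: &str) -> Vec<TaggedToken> {
    line.split_whitespace()
        .map(|tok| {
            let mut parts = tok.split('/');
            // `split` always yields at least one item, so the fallback is never used.
            let surface = parts.next().unwrap_or("").to_string();
            // Every remaining slash-separated field is treated as one tag.
            let tags = parts.map(str::to_string).collect();
            TaggedToken { surface, tags }
        })
        .collect()
}
```

In this sketch a token with one slash yields one tag and a token with two slashes yields two, so tokens in the same line may carry different numbers of tags.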

Changes in Predictor struct

In the previous version, the predict() function took a Sentence and returned a modified one. With this change, predict() takes a mutable reference to a Sentence instead.

Changes in Sentence struct

In the previous version, the to_tokenized_vec() function returned a newly allocated vector containing tokens. With this change, that function is removed and iter_tokens(), which returns an iterator over tokens, is added.
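The two API changes follow a common Rust pattern, sketched below with toy stand-ins. The `Sentence` and `Predictor` types here are hypothetical and much simpler than Vaporetto's real ones (the toy rule just splits after commas); only the method shapes, `predict(&mut Sentence)` and `iter_tokens()`, mirror the description above.

```rust
/// Toy stand-in for a sentence; NOT Vaporetto's real `Sentence`.
pub struct Sentence {
    text: String,
    /// Byte offsets at which each token ends.
    ends: Vec<usize>,
}

impl Sentence {
    pub fn new(text: &str) -> Self {
        // Before prediction, the whole text is treated as a single token.
        let ends = if text.is_empty() { vec![] } else { vec![text.len()] };
        Self { text: text.to_string(), ends }
    }

    /// Borrowing iterator over tokens; no vector of tokens is allocated.
    pub fn iter_tokens(&self) -> impl Iterator<Item = &str> + '_ {
        let text: &str = &self.text;
        let mut start = 0;
        self.ends.iter().map(move |&end| {
            let tok = &text[start..end];
            start = end;
            tok
        })
    }
}

/// Toy predictor that places a boundary after every ASCII comma.
pub struct Predictor;

impl Predictor {
    /// New style: update the sentence in place through `&mut Sentence`,
    /// so the caller keeps ownership and can reuse the same value.
    pub fn predict(&self, s: &mut Sentence) {
        s.ends.clear();
        for (i, c) in s.text.char_indices() {
            let end = i + c.len_utf8();
            if c == ',' || end == s.text.len() {
                s.ends.push(end);
            }
        }
    }

    /// Old style, shown for contrast: take the sentence by value and return it.
    pub fn predict_owned(&self, mut s: Sentence) -> Sentence {
        self.predict(&mut s);
        s
    }
}
```

A caller that still needs a vector can write `s.iter_tokens().collect::<Vec<_>>()`, while callers that only stream tokens avoid the allocation.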

This branch also includes refactoring and other API changes.
opened by vbkaisetsu 3
Support serialization with bincode v2

This branch uses bincode for serialization and deserialization of Model and Predictor.

Deserializing a Predictor is unsafe, and users should not load serialized data provided by third-party distributors, so the predict command does not support loading serialized Predictors.

I will add an example usage in another branch.

opened by vbkaisetsu 2
Add demo page

This branch removes the previous wasm example and adds a demo page.

The demo page has already been deployed manually. This branch deploys the demo automatically whenever the main branch is updated.

opened by vbkaisetsu 1
Update tantivy requirement from 0.17 to 0.18
Updates the requirements on tantivy to permit the latest version.

Changelog

Sourced from tantivy's changelog.

Tantivy 0.18

For date values chrono has been replaced with time (@uklotzde) #1304 :

The time crate is re-exported as tantivy::time instead of tantivy::chrono.

The type alias tantivy::DateTime has been removed.

Value::Date wraps time::PrimitiveDateTime without time zone information.

Internally date/time values are stored as seconds since UNIX epoch in UTC.

Converting a time::OffsetDateTime to Value::Date implicitly converts the value into UTC. If this is not desired do the time zone conversion yourself and use time::PrimitiveDateTime directly instead.

Add histogram aggregation (@PSeitz)

Add support for fastfield on text fields (@PSeitz)

Add terms aggregation (@PSeitz)

Add support for zstd compression (@kryesh)

Tantivy 0.17

LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) #115

Adds a searcher Warmer API (@shikhar @fulmicoton)

Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211

Facets are necessarily indexed. Existing index with indexed facets should work out of the box. Index without facets that are marked with index: false should be broken (but they were already broken in a sense). (@fulmicoton) #1195 .

Bugfix that could in theory impact durability on some filesystems #1224

Schema now offers not indexing fieldnorms (@lpouget) #922

Reduce the number of fsync calls #1225

Fix opening bytes index with dynamic codec (@PSeitz) #1278

Added an aggregation collector for range, average and stats compatible with Elasticsearch. (@PSeitz)

Added a JSON schema type @fulmicoton #1251

Added support for slop in phrase queries @halvorboe #1068

Tantivy 0.16.2

Bugfix in FuzzyTermQuery. (transposition_cost_one was not doing anything)

Tantivy 0.16.1

Major Bugfix on multivalued fastfield. #1151

Demux operation (@PSeitz)

Tantivy 0.16.0

Bugfix in the filesum check. (@evanxg852000) #1127

Bugfix in positions when the index is sorted by a field. (@appaquet) #1125

Tantivy 0.15.3

Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101

Tantivy 0.15.2

... (truncated)

dependencies
opened by dependabot[bot] 1

Add --buffered-out option to predict command

This branch adds --buffered-out option to the predict command.

When this option is enabled, stdout is wrapped in a BufWriter, which improves tokenization speed: in the runs below, elapsed time drops from about 0.231 s to about 0.204 s.

The trade-off is that results are no longer flushed line by line.

$ cargo run --release -p predict -- --buffered-out --model model-tags.zst < ./inputdata > /dev/null
Loading model file...
Start tokenization
Elapsed: 0.204146218 [sec]

$ cargo run --release -p predict -- --model model-tags.zst < ./inputdata > /dev/null
Loading model file...
Start tokenization
Elapsed: 0.230809509 [sec]
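The difference between the two modes can be sketched with the standard library alone. This is a minimal illustration, not the predict command's actual code; the `write_lines` function and its `buffered` flag are hypothetical names chosen for the example.

```rust
use std::io::{self, BufWriter, Write};

/// Writes each line followed by '\n' to `out`.
/// Hypothetical helper for illustration; not taken from the predict command.
pub fn write_lines<W: Write>(out: W, lines: &[&str], buffered: bool) -> io::Result<()> {
    if buffered {
        // Buffered mode: output accumulates in memory and reaches `out`
        // when the buffer fills up or on the final flush.
        let mut w = BufWriter::new(out);
        for line in lines {
            writeln!(w, "{}", line)?;
        }
        w.flush()
    } else {
        // Unbuffered mode: flush after every line so each result is
        // visible immediately, at the cost of more write calls.
        let mut w = out;
        for line in lines {
            writeln!(w, "{}", line)?;
            w.flush()?;
        }
        Ok(())
    }
}
```

Passing `std::io::stdout().lock()` as `out` would reproduce the two behaviours on a terminal or pipe; both modes emit the same bytes and differ only in when those bytes become visible to the reader.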

opened by vbkaisetsu 1

error: The following required arguments were not provided: --model-out

README.md says:

%  cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd

but this happens:

# cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
    Updating crates.io index
  Downloaded cc v1.0.72
  Downloaded structopt-derive v0.4.18
  Downloaded quote v1.0.15
  Downloaded proc-macro2 v1.0.36
  Downloaded proc-macro-error v1.0.4
  Downloaded syn v1.0.86
  Downloaded bincode v1.3.3
  Downloaded anyhow v1.0.53
  Downloaded bitflags v1.3.2
  Downloaded ansi_term v0.12.1
  Downloaded unicode-segmentation v1.9.0
  Downloaded vec_map v0.8.2
  Downloaded jobserver v0.1.24
  Downloaded zstd v0.9.2+zstd.1.5.1
  Downloaded textwrap v0.11.0
  Downloaded heck v0.3.3
  Downloaded zstd-sys v1.6.2+zstd.1.5.1
  Downloaded libc v0.2.117
  Downloaded lazy_static v1.4.0
  Downloaded clap v2.34.0
  Downloaded structopt v0.3.26
  Downloaded zstd-safe v4.1.3+zstd.1.5.1
  Downloaded unicode-width v0.1.9
  Downloaded strsim v0.8.0
  Downloaded serde_derive v1.0.136
  Downloaded proc-macro-error-attr v1.0.4
  Downloaded version_check v0.9.4
  Downloaded unicode-xid v0.2.2
  Downloaded serde v1.0.136
  Downloaded atty v0.2.14
  Downloaded byteorder v1.4.3
  Downloaded daachorse v0.2.1
  Downloaded 32 crates (2.5 MB) in 1.12s
   Compiling libc v0.2.117
   Compiling proc-macro2 v1.0.36
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.86
   Compiling version_check v0.9.4
   Compiling serde_derive v1.0.136
   Compiling serde v1.0.136
   Compiling anyhow v1.0.53
   Compiling zstd-safe v4.1.3+zstd.1.5.1
   Compiling unicode-segmentation v1.9.0
   Compiling unicode-width v0.1.9
   Compiling bitflags v1.3.2
   Compiling byteorder v1.4.3
   Compiling strsim v0.8.0
   Compiling ansi_term v0.12.1
   Compiling vec_map v0.8.2
   Compiling lazy_static v1.4.0
   Compiling textwrap v0.11.0
   Compiling daachorse v0.2.1
   Compiling heck v0.3.3
   Compiling proc-macro-error-attr v1.0.4
   Compiling proc-macro-error v1.0.4
   Compiling quote v1.0.15
   Compiling atty v0.2.14
   Compiling jobserver v0.1.24
   Compiling clap v2.34.0
   Compiling cc v1.0.72
   Compiling zstd-sys v1.6.2+zstd.1.5.1
   Compiling structopt-derive v0.4.18
   Compiling structopt v0.3.26
   Compiling zstd v0.9.2+zstd.1.5.1
   Compiling bincode v1.3.3
   Compiling vaporetto v0.2.0 (/work/vae_experiments/vaporetto/vaporetto)
   Compiling convert_kytea_model v0.1.0 (/work/vae_experiments/vaporetto/convert_kytea_model)
    Finished release [optimized] target(s) in 1m 14s
     Running `target/release/convert_kytea_model --model-in jp-0.4.7-5-tokenize.model.zstd`
error: The following required arguments were not provided:
    --model-out <model-out>

USAGE:
    convert_kytea_model --model-in <model-in> --model-out <model-out>

I think the correct command is:

% cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5.mod --model-out jp-0.4.7-5-tokenize.model.zstd

opened by ghost 1

Add a rule for GitHub Actions
This branch adds a rule for GitHub Actions.

Examples:

Failed: https://github.com/vbkaisetsu/vaporetto/runs/4058966169?check_suite_focus=true

Succeeded: https://github.com/vbkaisetsu/vaporetto/runs/4058989305?check_suite_focus=true
opened by vbkaisetsu 1
Use normal arrays instead of FSTs for holding dictionaries

Currently, Vaporetto uses FSTs to hold dictionaries, but an FST offers more functionality than is needed for simply holding a dictionary and is not fast enough. Therefore, this branch uses normal arrays to hold dictionaries instead.

In addition, this branch adds zstd compression to the CLI frontends. Compression is not a core feature of Vaporetto, so it is not included in the vaporetto crate.

This change affects the data structure of the model file, so model data from previous versions is not compatible with this branch. A model file is also large binary data that is inappropriate to manage in a source code repository, so I have removed the model data for now. I will release model data in other ways.

Note:

| Method | Size (bytes) |
| ----- | -----:|
| FST | 22,457,279 |
| FST + zstd | 6,224,462 |
| Normal arrays | 32,678,923 |
| Normal arrays + zstd | 4,554,971 |

Base model file: jp-0.4.7-5.mod (KyTea's model file)
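To make the trade-off in the table concrete, the ratios can be worked out directly. The short sketch below only restates the four sizes from the table and divides them; the constant and function names are illustrative and do not come from the repository.

```rust
/// Model sizes in bytes, copied from the table above.
pub const FST: u64 = 22_457_279;
pub const FST_ZSTD: u64 = 6_224_462;
pub const ARRAYS: u64 = 32_678_923;
pub const ARRAYS_ZSTD: u64 = 4_554_971;

/// Returns `a / b` as a floating-point ratio.
pub fn ratio(a: u64, b: u64) -> f64 {
    a as f64 / b as f64
}

/// Size of the array-based model relative to the FST-based one, uncompressed.
pub fn arrays_vs_fst_raw() -> f64 {
    ratio(ARRAYS, FST)
}

/// Size of the array-based model relative to the FST-based one, after zstd.
pub fn arrays_vs_fst_zstd() -> f64 {
    ratio(ARRAYS_ZSTD, FST_ZSTD)
}
```

With these figures, the array-based model is about 1.46 times the size of the FST-based one before compression but only about 0.73 times its size after zstd, which suggests that the plain arrays compress considerably better.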

opened by vbkaisetsu 1
Update tantivy requirement from 0.18 to 0.19
Updates the requirements on tantivy to permit the latest version.

Changelog

Sourced from tantivy's changelog.

Tantivy 0.19

Bugfixes

Fix missing fieldnorms for u64, i64, f64, bool, bytes and date #1620 (@PSeitz)

Fix interpolation overflow in linear interpolation fastfield codec #1480 (@PSeitz @fulmicoton)

Features/Improvements

Add support for IN in queryparser , e.g. field: IN [val1 val2 val3] #1683 (@trinity-1686a)

Skip score calculation, when no scoring is required #1646 (@PSeitz)

Limit fast fields to u32 (get_val(u32)) #1644 (@PSeitz)

The DateTime type has been updated to hold timestamps with microseconds precision. DateOptions and DatePrecision have been added to configure Date fields. The precision is used to hint on fast values compression. Otherwise, seconds precision is used everywhere else (i.e terms, indexing) #1396 (@evanxg852000)

Add IP address field type #1553 (@PSeitz)

Add boolean field type #1382 (@boraarslan)

Remove Searcher pool and make Searcher cloneable. (@PSeitz)

Validate settings on create #1570 (@PSeitz)

Detect and apply gcd on fastfield codecs #1418 (@PSeitz)

Doc store

use separate thread to compress block store #1389 #1510 (@PSeitz @fulmicoton)

Expose doc store cache size #1403 (@PSeitz)

Enable compression levels for doc store #1378 (@PSeitz)

Make block size configurable #1374 (@kryesh)

Make tantivy::TantivyError cloneable #1402 (@PSeitz)

Add support for phrase slop in query language #1393 (@saroh)

Aggregation

Add aggregation support for date type #1693(@PSeitz)

Add support for keyed parameter in range and histogram aggregations #1424 (@k-yomo)

Add aggregation bucket limit #1363 (@PSeitz)

Faster indexing

#1610 (@PSeitz)

#1594 (@PSeitz)

#1582 (@PSeitz)

#1611 (@PSeitz)

Added a pre-configured stop word filter for various languages #1666 (@adamreichold)

Commits

2c50b02 Fix max bucket limit in histogram (#1703)

509adab Bump version (#1715)

96c93a6 Merge pull request #1700 from quickwit-oss/PSeitz-patch-1

4958243 Move split_full_path to Schema (#1692)

485a8f5 Update CHANGELOG.md

1119e59 prepare fastfield format for null index (#1691)

ee1f2c1 add aggregation support for date type (#1693)

600548f Merge pull request #1694 from quickwit-oss/dependabot/cargo/zstd-0.12

9929c0c Merge pull request #1696 from quickwit-oss/dependabot/cargo/env_logger-0.10.0

f53e656 Update env_logger requirement from 0.9.0 to 0.10.0

dependencies
opened by dependabot[bot] 0
Update zstd requirement from 0.11 to 0.12
Updates the requirements on zstd to permit the latest version.

dependencies
opened by dependabot[bot] 0

Releases(v0.5.1)

v0.5.1(Jun 20, 2022)

Model Files

You can use the following assets: https://github.com/daac-tools/vaporetto/releases/tag/v0.5.0
v0.5.0(Jun 6, 2022)
Model Files

We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.

All of these models are trained with L1-regularization.

See below for license terms of each model.

(NOTE) Some parts of BCCWJ are not included in the training data due to rights restrictions.

Models With Dictionary

We provide models containing UniDic. These models have the highest accuracy in our distributions.

bccwj-suw+unidic+tag.model.zst: contains a tag prediction model. Tags are only trained using BCCWJ.

bccwj-suw+unidic+tag-huge.model.zst: contains a tag prediction model. Tags are trained using BCCWJ and UniDic.

Models Without Dictionary

We also provide models that do not contain UniDic. These models have been trained in four model sizes and two word units.

| | Short unit words (SUW) | Long unit words (LUW) |
| ---- | ---- | ---- |
| Tiny (C=0.003) | bccwj-suw-tiny.model.zst | N/A |
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |

License

The following models are licensed under 3-Clause BSD License.

bccwj-suw+unidic+tag.model.zst

bccwj-suw+unidic+tag-huge.model.zst

The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.

bccwj-suw-small.model.zst

bccwj-suw-middle.model.zst

bccwj-suw-large.model.zst

bccwj-luw-small.model.zst

bccwj-luw-middle.model.zst

bccwj-luw-large.model.zst

bccwj-luw-large.tar.xz(597.11 KB)
bccwj-luw-middle.tar.xz(373.68 KB)
bccwj-luw-small.tar.xz(76.05 KB)
bccwj-suw+unidic+tag-huge.tar.xz(5.77 MB)
bccwj-suw+unidic+tag.tar.xz(2.70 MB)
bccwj-suw-large.tar.xz(371.74 KB)
bccwj-suw-middle.tar.xz(251.32 KB)
bccwj-suw-small.tar.xz(72.75 KB)
bccwj-suw-tiny.tar.xz(8.03 KB)
v0.4.0(Apr 12, 2022)
Model Files

We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.

All of these models are trained with L1-regularization.

See below for license terms of each model.

(NOTE) Some parts of BCCWJ are not included in the training data due to rights restrictions.

Models With Dictionary

We provide two models containing UniDic. These models have the highest accuracy in our distributions.

bccwj-suw+unidic+tag.model.zst: contains a tag prediction model

bccwj-suw+unidic.model.zst: does not contain a tag prediction model

Models Without Dictionary

We also provide models that do not contain UniDic. These models have been trained in four model sizes and two word units.

| | Short unit words (SUW) | Long unit words (LUW) |
| ---- | ---- | ---- |
| Tiny (C=0.003) | bccwj-suw-tiny.model.zst | N/A |
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |

License

The following models are licensed under 3-Clause BSD License.

bccwj-suw+unidic+tag.model.zst

bccwj-suw+unidic.model.zst

The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.

bccwj-suw-small.model.zst

bccwj-suw-middle.model.zst

bccwj-suw-large.model.zst

bccwj-luw-small.model.zst

bccwj-luw-middle.model.zst

bccwj-luw-large.model.zst

bccwj-luw-large.tar.xz(359.37 KB)
bccwj-luw-middle.tar.xz(243.65 KB)
bccwj-luw-small.tar.xz(58.71 KB)
bccwj-suw+unidic+tag.tar.xz(2.65 MB)
bccwj-suw+unidic.tar.xz(1.95 MB)
bccwj-suw-large.tar.xz(372.17 KB)
bccwj-suw-middle.tar.xz(251.72 KB)
bccwj-suw-small.tar.xz(73.07 KB)
bccwj-suw-tiny.tar.xz(8.74 KB)
v0.3.0(Feb 14, 2022)
Model Files

We provide multiple model files for Vaporetto that you can download and use in your work. These models have been trained using BCCWJ and UniDic.

All of these models are trained with L1-regularization.

See below for license terms of each model.

(NOTE) Some parts of BCCWJ are not included in the training data due to rights restrictions.

Models With Dictionary

We provide two models containing UniDic. These models have the highest accuracy in our distributions.

bccwj-suw+unidic+tag.model.zst: contains a tag prediction model

bccwj-suw+unidic.model.zst: does not contain a tag prediction model

Models Without Dictionary

We also provide models that do not contain UniDic. These models have been trained over three model sizes and two word units.

| | Short unit words (SUW) | Long unit words (LUW) |
| ---- | ---- | ---- |
| Small (C=0.1) | bccwj-suw-small.model.zst | bccwj-luw-small.model.zst |
| Middle (C=0.5) | bccwj-suw-middle.model.zst | bccwj-luw-middle.model.zst |
| Large (C=1.0) | bccwj-suw-large.model.zst | bccwj-luw-large.model.zst |

License

The following models are licensed under 3-Clause BSD License.

bccwj-suw+unidic+tag.model.zst

bccwj-suw+unidic.model.zst

The following models are licensed under either of Apache License (Version 2.0) or MIT License at your option.

bccwj-suw-small.model.zst

bccwj-suw-middle.model.zst

bccwj-suw-large.model.zst

bccwj-luw-small.model.zst

bccwj-luw-middle.model.zst

bccwj-luw-large.model.zst

bccwj-luw-large.tar.xz(640.60 KB)
bccwj-luw-middle.tar.xz(399.57 KB)
bccwj-luw-small.tar.xz(82.81 KB)
bccwj-suw+unidic+tag.tar.xz(2.64 MB)
bccwj-suw+unidic.tar.xz(1.84 MB)
bccwj-suw-large.tar.xz(399.82 KB)
bccwj-suw-middle.tar.xz(269.99 KB)
bccwj-suw-small.tar.xz(79.28 KB)
0.2.0(Nov 1, 2021)

0.1.6(Oct 18, 2021)

0.1.5(Sep 30, 2021)

0.1.4(Sep 15, 2021)

0.1.3(Aug 30, 2021)

0.1.2(Aug 29, 2021)
