An official Sudachi clone in Rust (incomplete) 🦀

Overview

2021-07-07 UPDATE: The official Sudachi team will take over this project (cf. 日本語形態素解析器 SudachiPy の現状と今後について - Speaker Deck)

sudachi.rs

sudachi.rs logo

An official Sudachi clone in Rust 🦀

日本語 README

Caution

This is a hobby project to try out Rust, and the implementation is incomplete. One fatal problem is that it throws an error whenever it encounters an out-of-vocabulary word, i.e., when there is no lattice path from the beginning to the end of the input.

$ echo "" | sudachi
あ      感動詞,フィラー,*,*,*,* あー
EOS

$ echo "" | sudachi
thread 'main' panicked at 'EOS isn't connected to BOS', src/lattice.rs:70:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
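
For context, the panic comes from the lattice search: every analysis must find a path of dictionary entries from the beginning-of-sentence (BOS) node to the end-of-sentence (EOS) node, and an unknown word leaves a gap that no edge covers. The following is a minimal illustrative sketch of that failure mode, not the actual sudachi.rs internals:

// Illustrative sketch only; not the real sudachi.rs data structures.
#[derive(Debug)]
struct Node {
    end: usize,                   // byte offset where this node's surface ends
    best_previous: Option<usize>, // best predecessor on a BOS-connected path
}

fn connect_eos(nodes: &[Node], input_len: usize) -> usize {
    // EOS must attach to some node that ends exactly at the end of the
    // input and is itself reachable from BOS. An out-of-vocabulary span
    // breaks the chain, so no such node exists and this panics.
    nodes
        .iter()
        .position(|n| n.end == input_len && n.best_previous.is_some())
        .expect("EOS isn't connected to BOS")
}

fn main() {
    // "あ" is 3 bytes in UTF-8; a single known word covers the whole input.
    let nodes = vec![Node { end: 3, best_previous: Some(0) }];
    println!("EOS connects via node {}", connect_eos(&nodes, 3));
}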

Please also have a look at an alternative implementation by another author: Yasu-umi/sudachiclone-rs.

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会	名詞,固有名詞,一般,*,*,*	選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙	名詞,普通名詞,サ変可能,*,*,*	選挙
管理	名詞,普通名詞,サ変可能,*,*,*	管理
委員	名詞,普通名詞,一般,*,*,*	委員
会	名詞,普通名詞,一般,*,*,*	会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む	動詞,一般,*,*,五段-マ行,終止形-一般	打ち込む
 	空白,*,*,*,*,*
かつ丼	名詞,普通名詞,一般,*,*,*	カツ丼
 	空白,*,*,*,*,*
附属	名詞,普通名詞,サ変可能,*,*,*	付属
 	空白,*,*,*,*,*
vintage	名詞,普通名詞,一般,*,*,*	ビンテージ

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
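
Conceptually, the wakati mode just prints the surface form of each morpheme joined by single spaces, one sentence per line. A minimal sketch of that formatting step, using a hypothetical Morpheme type rather than the actual sudachi.rs API:

// Hypothetical morpheme type for illustration only.
struct Morpheme {
    surface: String,
}

// Wakati output: surface forms joined by single spaces.
fn wakati(morphemes: &[Morpheme]) -> String {
    morphemes
        .iter()
        .map(|m| m.surface.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let morphemes = vec![
        Morpheme { surface: "それ".to_string() },
        Morpheme { surface: "が".to_string() },
        Morpheme { surface: "来".to_string() },
        Morpheme { surface: "た".to_string() },
    ];
    assert_eq!(wakati(&morphemes), "それ が 来 た");
}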

Usage

$ sudachi -h
sudachi 0.1.0
A Japanese tokenizer

USAGE:
    sudachi [FLAGS] [OPTIONS] [file]

FLAGS:
    -d, --debug      Debug mode: Dumps lattice
    -h, --help       Prints help information
    -a, --all        Prints all fields
    -V, --version    Prints version information
    -w, --wakati     Outputs only surface form

OPTIONS:
    -m, --mode <mode>    Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]

ARGS:
    <file>    Input text file: If not present, read from STDIN
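
The help text above is in the style that the structopt/clap crates generate. As a sketch, a CLI with these flags and options could be declared roughly as follows (assuming a structopt dependency; this is not necessarily how sudachi.rs itself defines it):

use std::path::PathBuf;
use structopt::StructOpt;

/// A Japanese tokenizer
#[derive(StructOpt)]
#[structopt(name = "sudachi")]
struct Cli {
    /// Debug mode: Dumps lattice
    #[structopt(short = "d", long = "debug")]
    debug: bool,

    /// Prints all fields
    #[structopt(short = "a", long = "all")]
    all: bool,

    /// Outputs only surface form
    #[structopt(short = "w", long = "wakati")]
    wakati: bool,

    /// Split unit: "A" (short), "B" (middle), or "C" (Named Entity)
    #[structopt(short = "m", long = "mode", default_value = "C")]
    mode: String,

    /// Input text file: If not present, read from STDIN
    #[structopt(parse(from_os_str))]
    file: Option<PathBuf>,
}

fn main() {
    let cli = Cli::from_args();
    println!("mode = {}, wakati = {}", cli.mode, cli.wakati);
}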

Setup

1. Get the source code

$ git clone https://github.com/sorami/sudachi.rs.git

2. Download a Sudachi Dictionary

You can download a dictionary zip file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file at src/resources/system.dic (note that the file must be renamed to system.dic).

Alternatively, you can use the quick shell script included in the source code; it downloads the core dictionary and places it at src/resources/system.dic.

$ ./fetch_dictionary.sh

3. Build and Install

The built executable will contain the dictionary binary.

$ cargo build

or

sudachi.rs/ $ cargo install --path .

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.1.0
A Japanese tokenizer
...
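
Embedding the dictionary so that the built executable contains the dictionary binary can be done with Rust's include_bytes! macro, which inlines a file into the executable at compile time. A sketch under that assumption (the path and the way the bytes are used are illustrative):

// Inline the dictionary into the executable at compile time.
// The path is resolved relative to this source file.
static SYSTEM_DICT: &[u8] = include_bytes!("resources/system.dic");

fn main() {
    // A real loader would parse the header, connection matrix, and lexicon;
    // this only shows that the bytes are available without filesystem access.
    println!("embedded dictionary: {} bytes", SYSTEM_DICT.len());
}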

ToDo

  • Out-of-vocabulary (OOV) handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io

References

Sudachi

Morphological Analyzers in Rust

Logo

Comments
Unify three arrays in Lattice for locality of reference

This PR proposes unifying the three arrays ends, ends_full, and indices in Lattice for locality of reference.

In this modification, I added a new struct PackedNode packing the three structs VNode, Node, and NodeIdx, and implemented the three arrays as one array of PackedNode.

The three arrays are often accessed simultaneously with the same index value, so the modification can improve cache efficiency in tokenization; see the sketch below.

My microbenchmark results (Intel i7, 16 GB RAM) showed that the modification shortened tokenization time by 10%. (My microbenchmark code can be found here.)

opened by kampersanda 6

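The change is the classic struct-of-arrays versus array-of-structs trade-off: when three parallel arrays are always indexed with the same value, packing the elements into one struct keeps each triple on the same cache line. A schematic sketch with made-up field names, not the actual sudachi.rs types:

// Before: three parallel arrays, always indexed with the same value.
#[allow(dead_code)]
struct LatticeSoA {
    ends: Vec<u32>,
    ends_full: Vec<u32>,
    indices: Vec<u32>,
}

// After: one array of packed nodes, so one lookup touches one location.
struct PackedNode {
    end: u32,
    end_full: u32,
    index: u32,
}

struct LatticeAoS {
    nodes: Vec<PackedNode>,
}

fn main() {
    let lattice = LatticeAoS {
        nodes: vec![PackedNode { end: 3, end_full: 3, index: 0 }],
    };
    // All three fields of node 0 now sit adjacent in memory.
    let n = &lattice.nodes[0];
    println!("{} {} {}", n.end, n.end_full, n.index);
}
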
pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split()

I found an input pattern that causes an exception in sudachipy==0.6.0. The reproducing code below is abstracted from the Japanese tokenizer of spaCy v3.2.0.

from sudachipy import dictionary, tokenizer

def get_dtokens(tokenizer, sudachipy_tokens, need_sub_tokens):
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
    dtokens = [
        (
            t.surface(),
            t.part_of_speech()[:4],
            t.part_of_speech()[4:],
            t.dictionary_form(),
            t.normalized_form(),
            t.reading_form(),
            sub_tokens_list[idx] if need_sub_tokens else None,
        ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
    ]
    return dtokens

def get_sub_tokens(tokenizer, sudachipy_tokens):
    sub_tokens_list = []
    for token in sudachipy_tokens:
        sub_a = token.split(tokenizer.SplitMode.A)
        if len(sub_a) == 1:  # no sub tokens
            sub_tokens_list.append(None)
        else:
            sub_b = token.split(tokenizer.SplitMode.B)
            if len(sub_a) == len(sub_b):
                dtokens = get_dtokens(tokenizer, sub_a, False)
                sub_tokens_list.append([dtokens, dtokens])
            else:
                sub_tokens_list.append(
                    [
                        get_dtokens(tokenizer, sub_a, False),
                        get_dtokens(tokenizer, sub_b, False),
                    ]
                )
    return sub_tokens_list

tokenizer = dictionary.Dictionary().create(mode=tokenizer.Tokenizer.SplitMode.C)
sudachipy_tokens = tokenizer.tokenize("T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ランプ類などのT社の外装部品(「エ コロパーツ」)全16品目と大手リサイクル部品流通事業社のNグループ及びB社から供給を受ける国内全メーカーの外装・機能部品で、専用の中古部品eコマースサイトを開設し、自動車保有期間の長期化に伴う低価格修理の需要に応えることにしています。")
get_dtokens(tokenizer, sudachipy_tokens, True)

/home/matsuda/ginza/test.py:21: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_a = token.split(tokenizer.SplitMode.A)
/home/matsuda/ginza/test.py:25: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_b = token.split(tokenizer.SplitMode.B)
thread '<unnamed>' panicked at 'byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]', /github/workspace/sudachi/src/analysis/morpheme.rs:122:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/matsuda/ginza/test.py", line 40, in <module>
    get_dtokens(tokenizer, sudachipy_tokens, True)
  File "/home/matsuda/ginza/test.py", line 4, in get_dtokens
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
  File "/home/matsuda/ginza/test.py", line 27, in get_sub_tokens
    dtokens = get_dtokens(tokenizer, sub_a, False)
  File "/home/matsuda/ginza/test.py", line 5, in get_dtokens
    dtokens = [
  File "/home/matsuda/ginza/test.py", line 14, in <listcomp>
    ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
pyo3_runtime.PanicException: byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]

bug
opened by hiroshi-matsuda-rit 5

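The panic itself is Rust's standard guard against slicing a UTF-8 string inside a multi-byte character: byte-offset slicing like &s[..n] panics unless n falls on a character boundary. A minimal reproduction of the mechanism, independent of Sudachi's actual offset handling:

fn main() {
    let s = "T社はe"; // 'T' is 1 byte; '社' and 'は' are 3 bytes each
    assert!(s.is_char_boundary(1));  // right after 'T'
    assert!(!s.is_char_boundary(2)); // in the middle of '社'

    // Safe alternative: str::get returns None instead of panicking.
    let n = 2;
    match s.get(..n) {
        Some(prefix) => println!("prefix: {}", prefix),
        None => println!("byte index {} is not a char boundary", n),
    }
    // &s[..2] would panic with the same kind of message as in the report above.
}
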
AnalyzedSentence Design

Prereq of #52

We want a pointer-generic version of the return container for analysis results.

The Rust API wants references/mutable references; the Python binding wants Arc. A sketch of one way to express this follows.

opened by eiennohito 4

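One common way to make a result container generic over the pointer type is to parameterize it over std::ops::Deref, so the same struct can hold a borrowed &Sentence for the Rust API and an Arc<Sentence> for the Python binding. A sketch of that pattern (names are illustrative, not the actual sudachi.rs design):

use std::ops::Deref;
use std::sync::Arc;

struct Sentence {
    text: String,
}

// The container works with any pointer-like type that derefs to Sentence.
struct Analyzed<P: Deref<Target = Sentence>> {
    sentence: P,
}

impl<P: Deref<Target = Sentence>> Analyzed<P> {
    fn text(&self) -> &str {
        &self.sentence.text
    }
}

fn main() {
    let owned = Sentence { text: "すもも".to_string() };

    // Rust API style: borrow the sentence.
    let by_ref = Analyzed { sentence: &owned };
    println!("{}", by_ref.text());

    // Python-binding style: share ownership via Arc.
    let by_arc = Analyzed { sentence: Arc::new(Sentence { text: "もも".to_string() }) };
    println!("{}", by_arc.text());
}
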
No license

This project has no license, which means (by default) users do not have the right to use it.

Fix steps:

• [x] Add license file to repo
  • https://docs.github.com/en/github/building-a-strong-community/adding-a-license-to-a-repository
• [x] Add "license" field to Cargo.toml
  • https://doc.rust-lang.org/cargo/reference/manifest.html#the-license-and-license-file-fields

I suggest the Apache 2.0 license, like Sudachi.

opened by tmfink 4

Add sudachipy.pyi

Fixes #147

Add a pyi file for type hints. A single sudachipy.pyi is enough to cover all classes. I used mypy's stubgen CLI tool to generate the base of the pyi file.

Also fixes the type signatures: the annotation for positional-only parameters was wrong (ref: PEP 570).

python
opened by mh-northlander 3

Load sudachidict dictionary

#73

~WIP~: default dictionary path resolution. SudachiPy: dict_type > config.systemDict > sudachidict_core. sudachi.rs: arg > config.systemDict > baked dict. See the sketch below.

opened by mh-northlander 3

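The resolution order described above (command-line argument, then config.systemDict, then the baked-in dictionary) is a plain fallback chain over optional values. An illustrative sketch with hypothetical names:

// Hypothetical inputs for illustration.
fn resolve_dictionary(
    cli_arg: Option<String>,
    config_system_dict: Option<String>,
) -> String {
    // arg > config.systemDict > baked dict
    cli_arg
        .or(config_system_dict)
        .unwrap_or_else(|| "<embedded system.dic>".to_string())
}

fn main() {
    assert_eq!(resolve_dictionary(None, None), "<embedded system.dic>");
    assert_eq!(
        resolve_dictionary(None, Some("sudachidict_core".into())),
        "sudachidict_core"
    );
}
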
Refactor Plugin Loading

Fixes #17 Fixes #19 Fixes #20 Fixes #21 Fixes #24

• Plugin-loading logic is refactored; there is no repeated code now.
• DSO plugins can now be resolved relative to the executable with the syntax $exe/plugin, or relative to the configuration folder with $cfg/plugin.
• It is recommended not to specify platform-dependent name parts for DSO plugins: plugin instead of libplugin.so. The first form resolves correctly on Windows/Linux/macOS.
• Built-in plugins are moved into the main crate; they can be specified in the configuration by their Java names (but that needs checking).

Main design for the rework of the plugin loader

The core logic of plugin loading lives in a PluginLoader struct, which is generic over implementations of the PluginCategory trait. PluginCategory captures the category-specific logic: providing specific types, configuration extraction, and instantiation of bundled plugins.

PluginLoader uses fully qualified trait call syntax. This is required because type inference does not work when T is not present as one of the arguments; see the sketch below.

opened by eiennohito 3

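A sketch of the shape described above: a PluginCategory-style trait carries the category-specific pieces, and the loader must use fully qualified syntax because the type parameter appears in no argument position, so inference has nothing to deduce it from. All names here are illustrative:

// Category-specific logic lives behind a trait.
trait PluginCategory {
    type Plugin;
    fn config_key() -> &'static str;
    fn bundled(name: &str) -> Option<Self::Plugin>;
}

struct OovProviders;

impl PluginCategory for OovProviders {
    type Plugin = String; // stand-in for a real plugin object
    fn config_key() -> &'static str {
        "oovProviderPlugin"
    }
    fn bundled(name: &str) -> Option<String> {
        (name == "simple_oov").then(|| "SimpleOovPlugin".to_string())
    }
}

// T appears only in the return type, so calls must be fully qualified:
// type inference has no argument from which to deduce T.
fn load_bundled<T: PluginCategory>(name: &str) -> Option<T::Plugin> {
    println!("loading from config key {}", <T as PluginCategory>::config_key());
    <T as PluginCategory>::bundled(name)
}

fn main() {
    let plugin = load_bundled::<OovProviders>("simple_oov");
    println!("{:?}", plugin);
}
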
Errors in the README instructions?

1. Shouldn't the following command in the README be pip install --upgrade 'sudachipy>=0.6.2'?

pip install --update 'sudachipy>=0.6.2'

2. The following command does not work; it fails with the error: error: found a virtual manifest at /(omitted)/sudachi.rs/Cargo.toml instead of a package manifest.

cargo install --path .

opened by microwavePC 2

implement analysis memory reuse via output parameters

Fixes #184

Implements out-parameters for Tokenizer.tokenize() and Morpheme.split() in the Python API.

For the memory sharing to actually be useful, I had to refactor the internal MorphemeList to allow multiple references to the input data while keeping a distinct list of morphemes for each. Let's welcome Arc<RefCell<X>> into our codebase. The Python MorphemeListWrapper has also changed. As a side effect, there is no copy in custom pretokenizers (win). I still need to document the semantics of everything, but will do that in a documentation pass.

python
opened by eiennohito 2

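The Arc<RefCell<X>> mentioned above is what lets several morpheme lists point at one shared input while each keeps its own morphemes. A schematic sketch of that sharing (types are simplified; note that Arc<RefCell<..>> is not Sync, which is tolerable behind Python's GIL but would need a Mutex in ordinary multithreaded Rust):

use std::cell::RefCell;
use std::sync::Arc;

// Shared, reusable input buffer; every analysis result points at it.
struct InputBuffer {
    text: String,
}

// Each list keeps its own morpheme boundaries but shares the input.
struct MorphemeList {
    input: Arc<RefCell<InputBuffer>>,
    boundaries: Vec<(usize, usize)>, // byte ranges into the input text
}

fn main() {
    let input = Arc::new(RefCell::new(InputBuffer {
        text: "東京都".to_string(),
    }));

    // A tokenize() result and a split() result share one allocation
    // instead of each copying the input.
    let tokens = MorphemeList {
        input: Arc::clone(&input),
        boundaries: vec![(0, 9)], // 東京都
    };
    let splits = MorphemeList {
        input: Arc::clone(&input),
        boundaries: vec![(0, 6), (6, 9)], // 東京 | 都
    };

    for list in [&tokens, &splits] {
        let buf = list.input.borrow();
        for (b, e) in &list.boundaries {
            println!("{}", &buf.text[*b..*e]);
        }
    }
}
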
Add HuggingFace-compatible PreTokenizer

Depends on PR #176 (includes its comments)

Fixes #38 Fixes #166 Fixes #168

Created as Dictionary.pre_tokenizer([mode]), where the default mode is C.

The implementation can be used from multiple threads, with a single PreTokenizer instance shared between all of them. It creates a thread-local Sudachi tokenizer and uses it to perform the actual tokenization (see the sketch below). The implementation also releases the GIL while doing the analysis, so it should achieve some speedup when used from multiple threads.

python
opened by eiennohito 2

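The per-thread tokenizer described above maps naturally onto Rust's thread_local! storage: one shared pre-tokenizer handle, with each thread lazily constructing its own analyzer on first use. A schematic sketch with a stand-in tokenizer type:

use std::cell::RefCell;
use std::thread;

// Stand-in for a real Sudachi tokenizer (which is not Sync).
struct Tokenizer {
    id: thread::ThreadId,
}

impl Tokenizer {
    fn new() -> Self {
        Tokenizer { id: thread::current().id() }
    }
    fn tokenize(&self, text: &str) -> String {
        format!("{:?} tokenized {:?}", self.id, text)
    }
}

thread_local! {
    // Each thread lazily constructs its own tokenizer on first use.
    static TOKENIZER: RefCell<Option<Tokenizer>> = RefCell::new(None);
}

fn pre_tokenize(text: &str) -> String {
    TOKENIZER.with(|t| {
        let mut t = t.borrow_mut();
        t.get_or_insert_with(Tokenizer::new).tokenize(text)
    })
}

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| thread::spawn(|| println!("{}", pre_tokenize("膠着語"))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
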
Add python api document

#98

Generate the docs:

• install the sudachi.rs python binding: pip install setuptools-rust; python setup.py develop
• install sphinx and the theme: pip install sphinx sphinx-rtd-theme
• run make html in the docs directory
  • the html will be generated under docs/build/html

opened by mh-northlander 2

aarch64 Linux Wheels

I noticed you don't have wheels for aarch64 Linux. Do you have any plans to release such wheels?

GitHub Actions supports this architecture, so it looks like you could probably just make a small change to the current wheel-building actions.

opened by polm 3

Can the default resource files be embedded at compile time?

Hi, thanks for developing this great library. I played with it a bit from R (my repo), but I struggled with the following error when I tried to make a static binary for users who don't have Rust installed. It seems the error occurs because Config::new() cannot find the files under the resources directory.

Io(Os { code: 3, kind: NotFound, message: "指定されたパスが見つかりません。" })

As the files are not very large, is it possible to embed them somehow at compile time, e.g., using the include_str! macro? (See the embedding sketch in the Setup section above for the idea.)

opened by yutannihilation 1

Python Exception Types

I noticed that if the input is too long in Python, an exception is thrown, but it's a plain Exception, not a ValueError or something more specific. I see in the Rust code there is a variety of specific error types.

I'm not familiar with Rust, but surely it's possible to have the Python code throw something more specific, like an InputTooLongException?

opened by polm 1

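For reference, pyo3 (which the Python binding uses) supports defining custom Python exception types from Rust via its create_exception! macro. A sketch of what a more specific error could look like; the name InputTooLongError, the length limit, and the function are illustrative, not the actual SudachiPy code:

use pyo3::create_exception;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Declared as a subclass of ValueError, so existing `except ValueError`
// handlers keep working while `except InputTooLongError` becomes possible.
create_exception!(sudachipy, InputTooLongError, PyValueError);

#[pyfunction]
fn tokenize_checked(text: &str) -> PyResult<usize> {
    const MAX_INPUT_BYTES: usize = 49_149; // illustrative limit
    if text.len() > MAX_INPUT_BYTES {
        return Err(InputTooLongError::new_err(format!(
            "input is {} bytes; the maximum is {}",
            text.len(),
            MAX_INPUT_BYTES
        )));
    }
    Ok(text.len())
}
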