An official Sudachi clone in Rust (incomplete) 🦀

Overview

2021-07-07 UPDATE: The official Sudachi team will take over this project (cf. 日本語形態素解析器 SudachiPy の現状と今後について - Speaker Deck)

sudachi.rs

sudachi.rs logo

An official Sudachi clone in Rust 🦀

日本語 README

Caution

This is a hobby project to try out Rust, and the implementation is incomplete. One fatal problem is that it throws an error whenever it encounters an out-of-vocabulary word, i.e., when there is no lattice path from the beginning to the end of the input.

$ echo "" | sudachi
あ      感動詞,フィラー,*,*,*,* あー
EOS

$ echo "" | sudachi
thread 'main' panicked at 'EOS isn't connected to BOS', src/lattice.rs:70:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
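
For context, the panic comes from the lattice search: every analysis must find a path of dictionary entries from the beginning-of-sentence (BOS) node to the end-of-sentence (EOS) node, and an unknown word leaves a gap that no edge covers. The following is a minimal illustrative sketch of that failure mode, not the actual sudachi.rs internals:

// Illustrative sketch only; not the real sudachi.rs data structures.
#[derive(Debug)]
struct Node {
    end: usize,                   // byte offset where this node's surface ends
    best_previous: Option<usize>, // best predecessor on a BOS-connected path
}

fn connect_eos(nodes: &[Node], input_len: usize) -> usize {
    // EOS must attach to some node that ends exactly at the end of the
    // input and is itself reachable from BOS. An out-of-vocabulary span
    // breaks the chain, so no such node exists and this panics.
    nodes
        .iter()
        .position(|n| n.end == input_len && n.best_previous.is_some())
        .expect("EOS isn't connected to BOS")
}

fn main() {
    // "あ" is 3 bytes in UTF-8; a single known word covers the whole input.
    let nodes = vec![Node { end: 3, best_previous: Some(0) }];
    println!("EOS connects via node {}", connect_eos(&nodes, 3));
}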

Please also have a look at an alternative implementation by another author: Yasu-umi/sudachiclone-rs.

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会	名詞,固有名詞,一般,*,*,*	選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙	名詞,普通名詞,サ変可能,*,*,*	選挙
管理	名詞,普通名詞,サ変可能,*,*,*	管理
委員	名詞,普通名詞,一般,*,*,*	委員
会	名詞,普通名詞,一般,*,*,*	会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む	動詞,一般,*,*,五段-マ行,終止形-一般	打ち込む
 	空白,*,*,*,*,*
かつ丼	名詞,普通名詞,一般,*,*,*	カツ丼
 	空白,*,*,*,*,*
附属	名詞,普通名詞,サ変可能,*,*,*	付属
 	空白,*,*,*,*,*
vintage	名詞,普通名詞,一般,*,*,*	ビンテージ

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。
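
Conceptually, the wakati mode just prints the surface form of each morpheme joined by single spaces, one sentence per line. A minimal sketch of that formatting step, using a hypothetical Morpheme type rather than the actual sudachi.rs API:

// Hypothetical morpheme type for illustration only.
struct Morpheme {
    surface: String,
}

// Wakati output: surface forms joined by single spaces.
fn wakati(morphemes: &[Morpheme]) -> String {
    morphemes
        .iter()
        .map(|m| m.surface.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let morphemes = vec![
        Morpheme { surface: "それ".to_string() },
        Morpheme { surface: "が".to_string() },
        Morpheme { surface: "来".to_string() },
        Morpheme { surface: "た".to_string() },
    ];
    assert_eq!(wakati(&morphemes), "それ が 来 た");
}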

Usage

$ sudachi -h
sudachi 0.1.0
A Japanese tokenizer

USAGE:
    sudachi [FLAGS] [OPTIONS] [file]

FLAGS:
    -d, --debug      Debug mode: Dumps lattice
    -h, --help       Prints help information
    -a, --all        Prints all fields
    -V, --version    Prints version information
    -w, --wakati     Outputs only surface form

OPTIONS:
    -m, --mode <mode>    Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]

ARGS:
    <file>    Input text file: If not present, read from STDIN
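
The help text above is in the style that the structopt/clap crates generate. As a sketch, a CLI with these flags and options could be declared roughly as follows (assuming a structopt dependency; this is not necessarily how sudachi.rs itself defines it):

use std::path::PathBuf;
use structopt::StructOpt;

/// A Japanese tokenizer
#[derive(StructOpt)]
#[structopt(name = "sudachi")]
struct Cli {
    /// Debug mode: Dumps lattice
    #[structopt(short = "d", long = "debug")]
    debug: bool,

    /// Prints all fields
    #[structopt(short = "a", long = "all")]
    all: bool,

    /// Outputs only surface form
    #[structopt(short = "w", long = "wakati")]
    wakati: bool,

    /// Split unit: "A" (short), "B" (middle), or "C" (Named Entity)
    #[structopt(short = "m", long = "mode", default_value = "C")]
    mode: String,

    /// Input text file: If not present, read from STDIN
    #[structopt(parse(from_os_str))]
    file: Option<PathBuf>,
}

fn main() {
    let cli = Cli::from_args();
    println!("mode = {}, wakati = {}", cli.mode, cli.wakati);
}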

Setup

1. Get the source code

$ git clone https://github.com/sorami/sudachi.rs.git

2. Download a Sudachi Dictionary

You can download a dictionary zip file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file at src/resources/system.dic (note that the file must be renamed to system.dic).

Alternatively, you can use the quick shell script included in the source code; it downloads the core dictionary and places it at src/resources/system.dic.

$ ./fetch_dictionary.sh

3. Build and Install

The built executable will contain the dictionary binary.

$ cargo build

or

sudachi.rs/ $ cargo install --path .

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.1.0
A Japanese tokenizer
...
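
Embedding the dictionary so that the built executable contains the dictionary binary can be done with Rust's include_bytes! macro, which inlines a file into the executable at compile time. A sketch under that assumption (the path and the way the bytes are used are illustrative):

// Inline the dictionary into the executable at compile time.
// The path is resolved relative to this source file.
static SYSTEM_DICT: &[u8] = include_bytes!("resources/system.dic");

fn main() {
    // A real loader would parse the header, connection matrix, and lexicon;
    // this only shows that the bytes are available without filesystem access.
    println!("embedded dictionary: {} bytes", SYSTEM_DICT.len());
}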

ToDo

  • Out-of-vocabulary (OOV) handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io

References

Sudachi

Morphological Analyzers in Rust

Logo

Comments
Unify three arrays in Lattice for locality of reference

This PR proposes unifying the three arrays ends, ends_full, and indices in Lattice for locality of reference.

In this modification, I added a new struct PackedNode packing the three structs VNode, Node, and NodeIdx, and implemented the three arrays as one array of PackedNode.

The three arrays are often accessed simultaneously with the same index value, so the modification can improve cache efficiency in tokenization; see the sketch below.

My microbenchmark results (Intel i7, 16 GB RAM) showed that the modification shortened tokenization time by 10%. (My microbenchmark code can be found here.)

opened by kampersanda 6

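The change is the classic struct-of-arrays versus array-of-structs trade-off: when three parallel arrays are always indexed with the same value, packing the elements into one struct keeps each triple on the same cache line. A schematic sketch with made-up field names, not the actual sudachi.rs types:

// Before: three parallel arrays, always indexed with the same value.
#[allow(dead_code)]
struct LatticeSoA {
    ends: Vec<u32>,
    ends_full: Vec<u32>,
    indices: Vec<u32>,
}

// After: one array of packed nodes, so one lookup touches one location.
struct PackedNode {
    end: u32,
    end_full: u32,
    index: u32,
}

struct LatticeAoS {
    nodes: Vec<PackedNode>,
}

fn main() {
    let lattice = LatticeAoS {
        nodes: vec![PackedNode { end: 3, end_full: 3, index: 0 }],
    };
    // All three fields of node 0 now sit adjacent in memory.
    let n = &lattice.nodes[0];
    println!("{} {} {}", n.end, n.end_full, n.index);
}
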
pyo3_runtime.PanicException occurs in Morpheme.surface() after calling Morpheme.split()

I found an input pattern that causes an exception in sudachipy==0.6.0. The reproducing code below is abstracted from the Japanese tokenizer of spaCy v3.2.0.

from sudachipy import dictionary, tokenizer

def get_dtokens(tokenizer, sudachipy_tokens, need_sub_tokens):
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
    dtokens = [
        (
            t.surface(),
            t.part_of_speech()[:4],
            t.part_of_speech()[4:],
            t.dictionary_form(),
            t.normalized_form(),
            t.reading_form(),
            sub_tokens_list[idx] if need_sub_tokens else None,
        ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
    ]
    return dtokens

def get_sub_tokens(tokenizer, sudachipy_tokens):
    sub_tokens_list = []
    for token in sudachipy_tokens:
        sub_a = token.split(tokenizer.SplitMode.A)
        if len(sub_a) == 1:  # no sub tokens
            sub_tokens_list.append(None)
        else:
            sub_b = token.split(tokenizer.SplitMode.B)
            if len(sub_a) == len(sub_b):
                dtokens = get_dtokens(tokenizer, sub_a, False)
                sub_tokens_list.append([dtokens, dtokens])
            else:
                sub_tokens_list.append(
                    [
                        get_dtokens(tokenizer, sub_a, False),
                        get_dtokens(tokenizer, sub_b, False),
                    ]
                )
    return sub_tokens_list

tokenizer = dictionary.Dictionary().create(mode=tokenizer.Tokenizer.SplitMode.C)
sudachipy_tokens = tokenizer.tokenize("T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ランプ類などのT社の外装部品(「エ コロパーツ」)全16品目と大手リサイクル部品流通事業社のNグループ及びB社から供給を受ける国内全メーカーの外装・機能部品で、専用の中古部品eコマースサイトを開設し、自動車保有期間の長期化に伴う低価格修理の需要に応えることにしています。")
get_dtokens(tokenizer, sudachipy_tokens, True)

/home/matsuda/ginza/test.py:21: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_a = token.split(tokenizer.SplitMode.A)
/home/matsuda/ginza/test.py:25: DeprecationWarning: API around this functionality will change. See github issue WorksApplications/SudachiPy#92 for more.
  sub_b = token.split(tokenizer.SplitMode.B)
thread '<unnamed>' panicked at 'byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]', /github/workspace/sudachi/src/analysis/morpheme.rs:122:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/matsuda/ginza/test.py", line 40, in <module>
    get_dtokens(tokenizer, sudachipy_tokens, True)
  File "/home/matsuda/ginza/test.py", line 4, in get_dtokens
    sub_tokens_list = get_sub_tokens(tokenizer, sudachipy_tokens) if need_sub_tokens else None
  File "/home/matsuda/ginza/test.py", line 27, in get_sub_tokens
    dtokens = get_dtokens(tokenizer, sub_a, False)
  File "/home/matsuda/ginza/test.py", line 5, in get_dtokens
    dtokens = [
  File "/home/matsuda/ginza/test.py", line 14, in <listcomp>
    ) for idx, t in enumerate(sudachipy_tokens) if len(t.surface()) > 0
pyo3_runtime.PanicException: byte index 10 is not a char boundary; it is inside 'e' (bytes 9..12) of `T社はeコマース(電子商取引)を活用したリサイクル部品の取扱いを系列の部品販売店で平成13年10月より始めました。取り扱う部品は、ドア、フェンダー、グリル、バンパー、ラン`[...]

bug
opened by hiroshi-matsuda-rit 5

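The panic itself is Rust's standard guard against slicing a UTF-8 string inside a multi-byte character: byte-offset slicing like &s[..n] panics unless n falls on a character boundary. A minimal reproduction of the mechanism, independent of Sudachi's actual offset handling:

fn main() {
    let s = "T社はe"; // 'T' is 1 byte; '社' and 'は' are 3 bytes each
    assert!(s.is_char_boundary(1));  // right after 'T'
    assert!(!s.is_char_boundary(2)); // in the middle of '社'

    // Safe alternative: str::get returns None instead of panicking.
    let n = 2;
    match s.get(..n) {
        Some(prefix) => println!("prefix: {}", prefix),
        None => println!("byte index {} is not a char boundary", n),
    }
    // &s[..2] would panic with the same kind of message as in the report above.
}
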
AnalyzedSentence Design

Prereq of #52

We want a pointer-generic version of the return container for analysis results.

The Rust API wants references/mutable references; the Python binding wants Arc. A sketch of one way to express this follows.

opened by eiennohito 4

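One common way to make a result container generic over the pointer type is to parameterize it over std::ops::Deref, so the same struct can hold a borrowed &Sentence for the Rust API and an Arc<Sentence> for the Python binding. A sketch of that pattern (names are illustrative, not the actual sudachi.rs design):

use std::ops::Deref;
use std::sync::Arc;

struct Sentence {
    text: String,
}

// The container works with any pointer-like type that derefs to Sentence.
struct Analyzed<P: Deref<Target = Sentence>> {
    sentence: P,
}

impl<P: Deref<Target = Sentence>> Analyzed<P> {
    fn text(&self) -> &str {
        &self.sentence.text
    }
}

fn main() {
    let owned = Sentence { text: "すもも".to_string() };

    // Rust API style: borrow the sentence.
    let by_ref = Analyzed { sentence: &owned };
    println!("{}", by_ref.text());

    // Python-binding style: share ownership via Arc.
    let by_arc = Analyzed { sentence: Arc::new(Sentence { text: "もも".to_string() }) };
    println!("{}", by_arc.text());
}
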
No license

This project has no license, which means (by default) users do not have the right to use it.

Fix steps:

• [x] Add license file to repo
  • https://docs.github.com/en/github/building-a-strong-community/adding-a-license-to-a-repository
• [x] Add "license" field to Cargo.toml
  • https://doc.rust-lang.org/cargo/reference/manifest.html#the-license-and-license-file-fields

I suggest the Apache 2.0 license, like Sudachi.

opened by tmfink 4

Add sudachipy.pyi

Fixes #147

Add a pyi file for type hints. A single sudachipy.pyi is enough to cover all classes. I used mypy's stubgen CLI tool to generate the base of the pyi file.

Also fixes the type signatures: the annotation for positional-only parameters was wrong (ref: PEP 570).

python
opened by mh-northlander 3

Load sudachidict dictionary

#73

~WIP~: default dictionary path resolution. SudachiPy: dict_type > config.systemDict > sudachidict_core. sudachi.rs: arg > config.systemDict > baked dict. See the sketch below.

opened by mh-northlander 3

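The resolution order described above (command-line argument, then config.systemDict, then the baked-in dictionary) is a plain fallback chain over optional values. An illustrative sketch with hypothetical names:

// Hypothetical inputs for illustration.
fn resolve_dictionary(
    cli_arg: Option<String>,
    config_system_dict: Option<String>,
) -> String {
    // arg > config.systemDict > baked dict
    cli_arg
        .or(config_system_dict)
        .unwrap_or_else(|| "<embedded system.dic>".to_string())
}

fn main() {
    assert_eq!(resolve_dictionary(None, None), "<embedded system.dic>");
    assert_eq!(
        resolve_dictionary(None, Some("sudachidict_core".into())),
        "sudachidict_core"
    );
}
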
Refactor Plugin Loading

Fixes #17 Fixes #19 Fixes #20 Fixes #21 Fixes #24

• Plugin-loading logic is refactored; there is no repeated code now.
• DSO plugins can now be resolved relative to the executable with the syntax $exe/plugin, or relative to the configuration folder with $cfg/plugin.
• It is recommended not to specify platform-dependent name parts for DSO plugins: plugin instead of libplugin.so. The first form resolves correctly on Windows/Linux/macOS.
• Built-in plugins are moved into the main crate; they can be specified in the configuration by their Java names (but that needs checking).

Main design for the rework of the plugin loader

The core logic of plugin loading lives in a PluginLoader struct, which is generic over implementations of the PluginCategory trait. PluginCategory captures the category-specific logic: providing specific types, configuration extraction, and instantiation of bundled plugins.

PluginLoader uses fully qualified trait call syntax. This is required because type inference does not work when T is not present as one of the arguments; see the sketch below.

opened by eiennohito 3

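A sketch of the shape described above: a PluginCategory-style trait carries the category-specific pieces, and the loader must use fully qualified syntax because the type parameter appears in no argument position, so inference has nothing to deduce it from. All names here are illustrative:

// Category-specific logic lives behind a trait.
trait PluginCategory {
    type Plugin;
    fn config_key() -> &'static str;
    fn bundled(name: &str) -> Option<Self::Plugin>;
}

struct OovProviders;

impl PluginCategory for OovProviders {
    type Plugin = String; // stand-in for a real plugin object
    fn config_key() -> &'static str {
        "oovProviderPlugin"
    }
    fn bundled(name: &str) -> Option<String> {
        (name == "simple_oov").then(|| "SimpleOovPlugin".to_string())
    }
}

// T appears only in the return type, so calls must be fully qualified:
// type inference has no argument from which to deduce T.
fn load_bundled<T: PluginCategory>(name: &str) -> Option<T::Plugin> {
    println!("loading from config key {}", <T as PluginCategory>::config_key());
    <T as PluginCategory>::bundled(name)
}

fn main() {
    let plugin = load_bundled::<OovProviders>("simple_oov");
    println!("{:?}", plugin);
}
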
Errors in the README instructions?

1. Shouldn't the following command in the README be pip install --upgrade 'sudachipy>=0.6.2'?

pip install --update 'sudachipy>=0.6.2'

2. The following command does not work; it fails with the error: error: found a virtual manifest at /(omitted)/sudachi.rs/Cargo.toml instead of a package manifest.

cargo install --path .

opened by microwavePC 2

implement analysis memory reuse via output parameters

Fixes #184

Implements out-parameters for Tokenizer.tokenize() and Morpheme.split() in the Python API.

For the memory sharing to actually be useful, I had to refactor the internal MorphemeList to allow multiple references to the input data while keeping a distinct list of morphemes for each. Let's welcome Arc<RefCell<X>> into our codebase. The Python MorphemeListWrapper has also changed. As a side effect, there is no copy in custom pretokenizers (win). I still need to document the semantics of everything, but will do that in a documentation pass.

python
opened by eiennohito 2

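The Arc<RefCell<X>> mentioned above is what lets several morpheme lists point at one shared input while each keeps its own morphemes. A schematic sketch of that sharing (types are simplified; note that Arc<RefCell<..>> is not Sync, which is tolerable behind Python's GIL but would need a Mutex in ordinary multithreaded Rust):

use std::cell::RefCell;
use std::sync::Arc;

// Shared, reusable input buffer; every analysis result points at it.
struct InputBuffer {
    text: String,
}

// Each list keeps its own morpheme boundaries but shares the input.
struct MorphemeList {
    input: Arc<RefCell<InputBuffer>>,
    boundaries: Vec<(usize, usize)>, // byte ranges into the input text
}

fn main() {
    let input = Arc::new(RefCell::new(InputBuffer {
        text: "東京都".to_string(),
    }));

    // A tokenize() result and a split() result share one allocation
    // instead of each copying the input.
    let tokens = MorphemeList {
        input: Arc::clone(&input),
        boundaries: vec![(0, 9)], // 東京都
    };
    let splits = MorphemeList {
        input: Arc::clone(&input),
        boundaries: vec![(0, 6), (6, 9)], // 東京 | 都
    };

    for list in [&tokens, &splits] {
        let buf = list.input.borrow();
        for (b, e) in &list.boundaries {
            println!("{}", &buf.text[*b..*e]);
        }
    }
}
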
Add HuggingFace-compatible PreTokenizer

Depends on PR #176 (includes its comments)

Fixes #38 Fixes #166 Fixes #168

Created as Dictionary.pre_tokenizer([mode]), where the default mode is C.

The implementation can be used from multiple threads, with a single PreTokenizer instance shared between all of them. It creates a thread-local Sudachi tokenizer and uses it to perform the actual tokenization (see the sketch below). The implementation also releases the GIL while doing the analysis, so it should achieve some speedup when used from multiple threads.

python
opened by eiennohito 2

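The per-thread tokenizer described above maps naturally onto Rust's thread_local! storage: one shared pre-tokenizer handle, with each thread lazily constructing its own analyzer on first use. A schematic sketch with a stand-in tokenizer type:

use std::cell::RefCell;
use std::thread;

// Stand-in for a real Sudachi tokenizer (which is not Sync).
struct Tokenizer {
    id: thread::ThreadId,
}

impl Tokenizer {
    fn new() -> Self {
        Tokenizer { id: thread::current().id() }
    }
    fn tokenize(&self, text: &str) -> String {
        format!("{:?} tokenized {:?}", self.id, text)
    }
}

thread_local! {
    // Each thread lazily constructs its own tokenizer on first use.
    static TOKENIZER: RefCell<Option<Tokenizer>> = RefCell::new(None);
}

fn pre_tokenize(text: &str) -> String {
    TOKENIZER.with(|t| {
        let mut t = t.borrow_mut();
        t.get_or_insert_with(Tokenizer::new).tokenize(text)
    })
}

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| thread::spawn(|| println!("{}", pre_tokenize("膠着語"))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
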
Add python api document

#98

Generate the docs:

• install the sudachi.rs python binding: pip install setuptools-rust; python setup.py develop
• install sphinx and the theme: pip install sphinx sphinx-rtd-theme
• run make html in the docs directory
  • the html will be generated under docs/build/html

opened by mh-northlander 2

aarch64 Linux Wheels

I noticed you don't have wheels for aarch64 Linux. Do you have any plans to release such wheels?

GitHub Actions supports this architecture, so it looks like you could probably just make a small change to the current wheel-building actions.

opened by polm 3

Can the default resource files be embedded at compile time?

Hi, thanks for developing this great library. I played with it a bit from R (my repo), but I struggled with the following error when I tried to make a static binary for users who don't have Rust installed. It seems the error occurs because Config::new() cannot find the files under the resources directory.

Io(Os { code: 3, kind: NotFound, message: "指定されたパスが見つかりません。" })

As the files are not very large, is it possible to embed them somehow at compile time, e.g., using the include_str! macro? (See the embedding sketch in the Setup section above for the idea.)

opened by yutannihilation 1

Python Exception Types

I noticed that if the input is too long in Python, an exception is thrown, but it's a plain Exception, not a ValueError or something more specific. I see in the Rust code there is a variety of specific error types.

I'm not familiar with Rust, but surely it's possible to have the Python code throw something more specific, like an InputTooLongException?

opened by polm 1

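For reference, pyo3 (which the Python binding uses) supports defining custom Python exception types from Rust via its create_exception! macro. A sketch of what a more specific error could look like; the name InputTooLongError, the length limit, and the function are illustrative, not the actual SudachiPy code:

use pyo3::create_exception;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Declared as a subclass of ValueError, so existing `except ValueError`
// handlers keep working while `except InputTooLongError` becomes possible.
create_exception!(sudachipy, InputTooLongError, PyValueError);

#[pyfunction]
fn tokenize_checked(text: &str) -> PyResult<usize> {
    const MAX_INPUT_BYTES: usize = 49_149; // illustrative limit
    if text.len() > MAX_INPUT_BYTES {
        return Err(InputTooLongError::new_err(format!(
            "input is {} bytes; the maximum is {}",
            text.len(),
            MAX_INPUT_BYTES
        )));
    }
    Ok(text.len())
}
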