7 Repositories
Rust Segmentation Libraries
Viterbi-based accelerated tokenizer (Python wrapper)
🐍 python-vibrato 🎤 Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra
ik-analyzer for Rust; Chinese tokenizer for Tantivy
ik-rs ik-analyzer for Rust support Tantivy Usage Chinese Segment let mut ik = IKSegmenter::new(); let text = "中华人民共和国"; let tokens = ik.to
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
🐍 python-vaporetto 🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto. Installation
A rule based sentence segmentation library.
cutters A rule based sentence segmentation library. 🚧 This library is experimental. 🚧 Features Full UTF-8 support. Robust parsing. Language specific
An official Sudachi clone in Rust 🦀
sudachi.rs - English README 2021-12-09 UPDATE: 0.6.2 Release Try it: pip install --upgrade 'sudachipy>=0.6.2' sudachi.rs is a Rust implementation of Su
An official Sudachi clone in Rust (incomplete) 🦀
2021-07-07 UPDATE: The official Sudachi team will take over this project (cf. 日本語形態素解析器 SudachiPy の 現状と今後について ("The current state and future of the Japanese morphological analyzer SudachiPy") - Speaker Deck) sudachi.rs An official S
Semantic text segmentation. For sentence boundary detection, compound splitting and more.
NNSplit A tool to split text using a neural network. The main application is sentence boundary detection, but e.g. compound splitting for German is a