🛥 Vaporetto is a fast and lightweight tokenizer based on pointwise prediction. This is a Python wrapper for Vaporetto.

Last update: Dec 22, 2022

Overview

🐍 python-vaporetto 🛥

Installation

To install python-vaporetto, run the following command:

$ pip install vaporetto

Alternatively, you can build it from source:

$ python -m venv .env
$ source .env/bin/activate
$ pip install maturin
$ maturin develop -r

Example Usage

python-vaporetto does not include model files. Before tokenizing, follow the Vaporetto documentation to download a distributed model or to train your own.

# Import vaporetto module
import vaporetto

# Load the model file
with open('path/to/model.zst', 'rb') as fp:
    model = fp.read()

# Create a Vaporetto tokenizer instance with tag prediction enabled
tokenizer = vaporetto.Vaporetto(model, predict_tags=True)

# Tokenize
tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
#=> 'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
len(tokens)
#=> 6
tokens[0].surface()
#=> 'まぁ'
tokens[0].tag(0)
#=> '名詞'
tokens[0].tag(1)
#=> 'マー'
[token.surface() for token in tokens]
#=> ['まぁ', '社長', 'は', '火星', '猫', 'だ']
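
In the example above, the string returned by tokenize_to_string separates tokens with spaces and the fields of each token with slashes. When only that string is available (for example, in a log file), it can be split back into surfaces and tags. The following is a minimal sketch: parse_tagged is a hypothetical helper, not part of python-vaporetto, and it assumes that neither surfaces nor tags contain spaces or slashes.

```python
def parse_tagged(line):
    """Split 'surface/tag1/tag2 ...' into a list of (surface, [tags]) pairs.

    Hypothetical helper, not part of python-vaporetto. Assumes that neither
    surfaces nor tags contain ' ' or '/'.
    """
    pairs = []
    for item in line.split():
        # The first field is the surface; any remaining fields are tags
        surface, *tags = item.split('/')
        pairs.append((surface, tags))
    return pairs


# Output string copied from the tokenize_to_string example above
line = 'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'
parsed = parse_tagged(line)
surfaces = [surface for surface, _ in parsed]
```

When a tokenizer instance is at hand, the Token objects returned by tokenize expose the same information through surface() and tag(), so parsing the string is only worthwhile when the objects themselves are no longer available.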

You can also use KyTea's models as follows:

with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:
    model = fp.read()

tokenizer = vaporetto.Vaporetto.create_from_kytea_model(model)

Note: Vaporetto does not support tag prediction with KyTea's models.
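
Because the two model formats are loaded through different entry points, a small dispatcher can pick the right one. The sketch below is only an illustration: model_kind and load_tokenizer are hypothetical helpers, and choosing by file extension (.mod for KyTea, anything else treated as a Vaporetto model) is a convention assumed here, not something the library enforces. Only the two constructors shown above are called.

```python
from pathlib import Path


def model_kind(path):
    """Guess the model format from the file extension (assumed convention)."""
    return 'kytea' if Path(path).suffix == '.mod' else 'vaporetto'


def load_tokenizer(path, predict_tags=False):
    """Hypothetical helper: build a tokenizer from a model file on disk."""
    import vaporetto  # imported lazily so model_kind() works without it

    model = Path(path).read_bytes()
    if model_kind(path) == 'kytea':
        # Tag prediction is not supported with KyTea models (see the note
        # above), so predict_tags is ignored on this branch.
        return vaporetto.Vaporetto.create_from_kytea_model(model)
    return vaporetto.Vaporetto(model, predict_tags=predict_tags)
```

For example, load_tokenizer('path/to/model.zst', predict_tags=True) would take the Vaporetto branch, while a path ending in .mod would go through create_from_kytea_model.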

Documentation

Use the help function to show the API reference.

import vaporetto
help(vaporetto)

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Disclaimer

This software is developed by LegalForce, Inc., but is not an officially supported LegalForce product.
