🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

Overview

🐍 python-vaporetto 🛥

Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.

PyPI Build Status

Installation

To use Vaporetto, run the following command:

$ pip install vaporetto

Or you can also build from the source:

$ python -m venv .env
$ source .env/bin/activate
$ pip install maturin
$ maturin develop -r

Example Usage

python-vaporetto does not contain model files. To perform tokenization, follow the document of Vaporetto to download distribution models or train your own models beforehand.

# Import vaporetto module
import vaporetto

# Load the model file
with open('path/to/model.zst', 'rb') as fp:
    model = fp.read()

# Create an instance of the Vaporetto
tokenizer = vaporetto.Vaporetto(model, predict_tags = True)

# Tokenize
tokenizer.tokenize_to_string('まぁ社長は火星猫だ')
#=> 'まぁ/名詞/マー 社長/名詞/シャチョー は/助詞/ワ 火星/名詞/カセー 猫/名詞/ネコ だ/助動詞/ダ'

tokens = tokenizer.tokenize('まぁ社長は火星猫だ')
len(tokens)
#=> 6
tokens[0].surface()
#=> 'まぁ'
tokens[0].tag(0)
#=> '名詞'
tokens[0].tag(1)
#=> 'マー'
[token.surface() for token in tokens]
#=> ['まぁ', '社長', 'は', '火星', '猫', 'だ']

You can also use KyTea's models as follows:

with open('path/to/jp-0.4.7-5.mod', 'rb') as fp:
    model = fp.read()

tokenizer = vaporetto.Vaporetto.create_from_kytea_model(model)

Note: Vaporetto does not support tag prediction with KyTea's models.

Documentation

Use the help function to show the API reference.

import vaporetto
help(vaporetto)

Speed Comparison

License

Licensed under either of

at your option.

Disclaimer

This software is developed by LegalForce, Inc., but not an officially supported LegalForce product.

You might also like...
A Rust wrapper for the Text synthesization service TextSynth API

A Rust wrapper for the Text synthesization service TextSynth API

nombytes is a library that provides a wrapper for the bytes::Bytes byte container for use with nom.

NomBytes nombytes is a library that provides a wrapper for the bytes::Bytes byte container for use with nom. I originally made this so that I could ha

A lightweight library with vehicle tuning utilities.

A lightweight library with vehicle tuning utilities. This includes utilities for communicating with OBD-II services, firmware downloading/flashing, and table modifications.

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

🎼 🧬 lightmotif A lightweight platform-accelerated library for biological motif scanning using position weight matrices. 🗺️ Overview Motif scanning

Simple NLP in Rust with Python bindings

vtext NLP in Rust with Python bindings This package aims to provide a high performance toolkit for ingesting textual data for machine learning applica

A simple and fast linear algebra library for games and graphics

glam A simple and fast 3D math library for games and graphics. Development status glam is in beta stage. Base functionality has been implemented and t

Checks all your documentation for spelling and grammar mistakes with hunspell and a nlprule based checker for grammar

cargo-spellcheck Check your spelling with hunspell and/or nlprule. Use Cases Run cargo spellcheck --fix or cargo spellcheck fix to fix all your docume

Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Releases(v0.1.1)
Owner
null
Viterbi-based accelerated tokenizer (Python wrapper)

?? python-vibrato ?? Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wra

null 20 Dec 29, 2022
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models

rust-tokenizers Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigra

null 165 Jan 1, 2023
A WHATWG-compliant HTML5 tokenizer and tag soup parser

html5gum html5gum is a WHATWG-compliant HTML tokenizer. use std::fmt::Write; use html5gum::{Tokenizer, Token}; let html = "<title >hello world</tit

Markus Unterwaditzer 129 Dec 30, 2022
The Bytepiece Tokenizer Implemented in Rust.

bytepiece Implementation of Su's bytepiece. Bytepiece is a new tokenize method, which uses UTF-8 Byte as unigram to process text. It needs little prep

Yam(长琴) 11 Oct 2, 2023
A lightning-fast Sanskrit toolkit. For Python bindings, see `vidyut-py`.

Vidyut मा भूदेवं क्षणमपि च ते विद्युता विप्रयोगः ॥ Vidyut is a lightning-fast toolkit for processing Sanskrit text. Vidyut aims to provide standard co

Ambuda 14 Dec 30, 2022
A lightweight and snappy crate to remove emojis from a string.

A lightweight and snappy crate to remove emojis from a string.

Tejas Ravishankar 8 Jul 19, 2022
lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike.

lingua-py lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike. Installation pip install l

messense 7 Dec 30, 2022
Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition). It is written in Rust and accessible through a Python API.

Quickner ⚡ A simple, fast, and easy to use NER annotator for Python Quickner is a new tool to quickly annotate texts for NER (Named Entity Recognition

Omar MHAIMDAT 7 Mar 3, 2023
Rust wrapper for the BlingFire tokenization library

BlingFire in Rust blingfire is a thin Rust wrapper for the BlingFire tokenization library. Add the library to Cargo.toml to get started cargo add blin

Re:infer 14 Sep 5, 2022
Wrapper around Microsoft CNTK library

Bindings for CNTK library Simple low level bindings for CNTK library from Microsoft. API Documentation Status Currently exploring ways how to interact

Vlado Boza 21 Nov 30, 2021