Viterbi-based accelerated tokenizer (Python wrapper)

Last update: Dec 29, 2022

🐍 python-vibrato 🎤

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wrapper for Vibrato.
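As a rough illustration of what Viterbi-based tokenization means, here is a toy sketch in pure Python. It is not Vibrato's implementation: the function `viterbi_tokenize`, the `word_costs` table, and the `unknown_cost` value are all hypothetical. It also scores each word on its own, whereas analyzers of this kind typically add connection costs between adjacent tokens.

```python
def viterbi_tokenize(text, word_costs, unknown_cost=100):
    """Split `text` into the cheapest sequence of words (toy model).

    word_costs: hypothetical dict mapping a surface string to its cost.
    unknown_cost: hypothetical cost of a single character not in the dict.
    """
    n = len(text)
    inf = float("inf")
    best_cost = [inf] * (n + 1)  # best_cost[i]: cheapest way to cover text[:i]
    back = [0] * (n + 1)         # back[i]: start of the last word ending at i
    best_cost[0] = 0
    max_len = max(map(len, word_costs), default=1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_costs:
                cost = word_costs[piece]
            elif end - start == 1:
                cost = unknown_cost  # fall back to a one-character unknown word
            else:
                continue
            total = best_cost[start] + cost
            if total < best_cost[end]:
                best_cost[end] = total
                back[end] = start

    # Follow the back pointers from the end of the text to recover the path.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    tokens.reverse()
    return tokens


# Made-up costs, chosen only for this demonstration.
toy_costs = {"社長": 10, "は": 5, "火星": 10, "猫": 8, "だ": 5, "火": 30, "星": 30}
print(viterbi_tokenize("社長は火星猫だ", toy_costs))
```

Because every position keeps only its cheapest incoming path, the search is linear in the text length for a bounded word length, which is the property that makes this approach fast.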

Installation

Install pre-built package from PyPI

Run the following command:

$ pip install vibrato

Build from source

Install the Rust compiler beforehand by following its documentation. python-vibrato uses pyproject.toml, so you also need pip version 19 or later:

$ pip install --upgrade pip

After setting up the environment, you can install python-vibrato as follows:

$ pip install git+https://github.com/daac-tools/python-vibrato

Example Usage

python-vibrato does not contain model files. Before tokenizing, follow the Vibrato documentation to download a distributed model or train your own.

Models must be compatible with the Vibrato version that this package wraps. Check that version as shown below:

import vibrato
vibrato.VIBRATO_VERSION
#=> "0.3.3"

Examples:

import vibrato

with open('path/to/system.dic', 'rb') as fp:
    dict_data = fp.read()
tokenizer = vibrato.Vibrato(dict_data)

tokens = tokenizer.tokenize('社長は火星猫だ')

len(tokens)
#=> 5

list(tokens)
#=> [Token { surface: "社長", feature: "名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,," },
#    Token { surface: "は", feature: "助詞,係助詞,*,*,*,*,は,ハ,ワ,," },
#    Token { surface: "火星", feature: "名詞,一般,*,*,*,*,火星,カセイ,カセイ,," },
#    Token { surface: "猫", feature: "名詞,一般,*,*,*,*,猫,ネコ,ネコ,," },
#    Token { surface: "だ", feature: "助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ,," }]

tokens[0].surface()
#=> '社長'

tokens[0].feature()
#=> '名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,,'

tokens[0].start()
#=> 0

tokens[0].end()
#=> 2
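The output above suggests that `start()` and `end()` are character offsets into the input ("社長" spans 0 to 2), so slicing the input with them should give back the surface. The sketch below collects `(surface, start, end)` spans and checks that assumption. `token_spans` and `FakeToken` are hypothetical names; `FakeToken` is only a stand-in so the snippet runs without a dictionary file, and it mimics just the accessor methods shown above.

```python
class FakeToken:
    """Hypothetical stand-in with the accessors used in the example above."""

    def __init__(self, surface, start, end):
        self._surface, self._start, self._end = surface, start, end

    def surface(self):
        return self._surface

    def start(self):
        return self._start

    def end(self):
        return self._end


def token_spans(text, tokens):
    """Return (surface, start, end) tuples, assuming character offsets."""
    spans = []
    for token in tokens:
        start, end = token.start(), token.end()
        # Assumption under test: offsets index characters, not bytes.
        if text[start:end] != token.surface():
            raise ValueError(f"offsets {start}:{end} do not match {token.surface()!r}")
        spans.append((token.surface(), start, end))
    return spans


text = "社長は火星猫だ"
fake_tokens = [FakeToken("社長", 0, 2), FakeToken("は", 2, 3), FakeToken("火星", 3, 5)]
spans = token_spans(text, fake_tokens)
print(spans)
```

With a real tokenizer, the same function could be called on the result of `tokenizer.tokenize(text)` in place of `fake_tokens`.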

Documentation

Use the help function to show the API reference.

import vibrato
help(vibrato)

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.


Comments

Cache PyString only when the surface is registered in dictionary

The current implementation caches every Python string in memory. If the tokenizer receives a large amount of text that produces unknown words, all of them are cached, and the process can run out of memory.

This branch stops caching unknown words.

opened by vbkaisetsu 0
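The idea in this comment can be sketched in pure Python as follows. This is only an illustration of the caching policy, not the code in the branch, which lives in the Rust extension. `SurfaceCache`, `to_str`, and `dictionary_surfaces` are hypothetical names, and decoding bytes stands in for creating a Python string.

```python
class SurfaceCache:
    """Cache converted strings only for surfaces registered in the dictionary."""

    def __init__(self, dictionary_surfaces):
        self._known = frozenset(dictionary_surfaces)
        self._cache = {}

    def to_str(self, raw):
        """Convert raw UTF-8 bytes to str, reusing cached results when possible."""
        cached = self._cache.get(raw)
        if cached is not None:
            return cached
        text = raw.decode("utf-8")
        # Unknown words are returned without being stored, so the cache size
        # is bounded by the dictionary size rather than by the input volume.
        if text in self._known:
            self._cache[raw] = text
        return text

    def __len__(self):
        return len(self._cache)


cache = SurfaceCache({"猫", "火星"})
known = cache.to_str("猫".encode("utf-8"))
unknown = cache.to_str("火星猫".encode("utf-8"))
print(known, unknown, len(cache))
```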