Viterbi-based accelerated tokenizer (Python wrapper)

Overview

🐍 python-vibrato 🎤

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm. This is a Python wrapper for Vibrato.

Installation

Install pre-built package from PyPI

Run the following command:

$ pip install vibrato

Build from source

You need to install the Rust compiler beforehand, following its documentation. python-vibrato uses pyproject.toml, so you also need to upgrade pip to version 19 or later.

$ pip install --upgrade pip

After setting up the environment, you can install python-vibrato as follows:

$ pip install git+https://github.com/daac-tools/python-vibrato

Example Usage

python-vibrato does not include model files. Before tokenizing, follow the Vibrato documentation to download a distributed model or train your own.

Check the version number as shown below to use compatible models:

import vibrato
vibrato.VIBRATO_VERSION
#=> "0.3.3"
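If you want to check the version in a script rather than by eye, a comparison along the following lines may help. This is only a sketch: `parse_version` and `same_minor_series` are hypothetical helpers, not part of python-vibrato, and the rule that a model matches when the major and minor numbers agree is an assumption made for illustration. Consult the Vibrato documentation for the actual compatibility policy.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as "0.3.3" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def same_minor_series(runtime: str, model_built_for: str) -> bool:
    """Return True when both versions share the same major and minor numbers.

    ASSUMPTION: treating "same major.minor" as compatible is illustrative only.
    """
    return parse_version(runtime)[:2] == parse_version(model_built_for)[:2]


# In real use, pass vibrato.VIBRATO_VERSION as the first argument.
print(same_minor_series("0.3.3", "0.3.1"))
print(same_minor_series("0.3.3", "0.5.0"))
```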

Examples:

import vibrato

with open('path/to/system.dic', 'rb') as fp:
    dict_data = fp.read()
tokenizer = vibrato.Vibrato(dict_data)

tokens = tokenizer.tokenize('社長は火星猫だ')

len(tokens)
#=> 5

list(tokens)
#=> [Token { surface: "社長", feature: "名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,," },
#    Token { surface: "は", feature: "助詞,係助詞,*,*,*,*,は,ハ,ワ,," },
#    Token { surface: "火星", feature: "名詞,一般,*,*,*,*,火星,カセイ,カセイ,," },
#    Token { surface: "猫", feature: "名詞,一般,*,*,*,*,猫,ネコ,ネコ,," },
#    Token { surface: "だ", feature: "助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ,," }]

tokens[0].surface()
#=> '社長'

tokens[0].feature()
#=> '名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,,'

tokens[0].start()
#=> 0

tokens[0].end()
#=> 2
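The accessors above can be combined to turn a token list into plain records for further processing. The following is a minimal sketch, not part of the library: `tokens_to_rows` is a hypothetical helper, and `StubToken` is a stand-in with the same four methods so that the example runs without a dictionary file. It assumes the feature string is comma-separated with no quoted commas, as in the output shown above; other dictionaries may need a proper CSV parser.

```python
from typing import Any, Dict, Iterable, List


class StubToken:
    """Stand-in for a vibrato token (hypothetical), used only for this demo."""

    def __init__(self, surface: str, feature: str, start: int, end: int):
        self._surface, self._feature = surface, feature
        self._start, self._end = start, end

    def surface(self) -> str:
        return self._surface

    def feature(self) -> str:
        return self._feature

    def start(self) -> int:
        return self._start

    def end(self) -> int:
        return self._end


def tokens_to_rows(tokens: Iterable[Any]) -> List[Dict[str, Any]]:
    """Collect surface, offsets, and split feature fields for each token."""
    rows = []
    for token in tokens:
        # ASSUMPTION: features are comma-separated without quoted commas.
        fields = token.feature().split(",")
        rows.append({
            "surface": token.surface(),
            "start": token.start(),
            "end": token.end(),
            "pos": fields[0],  # first field is the part of speech in the example above
            "features": fields,
        })
    return rows


demo_tokens = [
    StubToken("社長", "名詞,一般,*,*,*,*,社長,シャチョウ,シャチョー,,", 0, 2),
    StubToken("は", "助詞,係助詞,*,*,*,*,は,ハ,ワ,,", 2, 3),
]
for row in tokens_to_rows(demo_tokens):
    print(row["surface"], row["pos"], row["start"], row["end"])
```

With a real tokenizer, passing the result of `tokenizer.tokenize(...)` in place of `demo_tokens` should work the same way, since the helper only relies on the four accessor methods shown above.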

Documentation

Use the help function to show the API reference.

import vibrato
help(vibrato)

License

Licensed under either of

at your option.
