A morphological analysis library.

Overview

Lindera

License: MIT | Chat: https://gitter.im/lindera-morphology/lindera

A Japanese morphological analysis library written in Rust. This project is a fork of fulmicoton's kuromoji-rs.

Lindera aims to be a library that is easy to install and provides concise APIs for various Rust applications.

Build

The following is required to build Lindera:

  • Rust >= 1.46.0

% cargo build --release

Usage

Basic example

This example covers the basic usage of Lindera.

It will:

  • Create a tokenizer in normal mode
  • Tokenize the input text
  • Output the tokens

use lindera::tokenizer::Tokenizer;
use lindera_core::core::viterbi::Mode;

fn main() -> std::io::Result<()> {
    // create tokenizer
    let mut tokenizer = Tokenizer::new(Mode::Normal, "");

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --example basic_example

You can see the result as follows:

関西国際空港
限定
トートバッグ

User dictionary example

You can provide user dictionary entries alongside the default system dictionary. The user dictionary should be a CSV file with the following format.

<surface_form>,<part_of_speech>,<reading>

For example:

% cat userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
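For reference, a row in this simplified format can be split into its three fields with the standard library alone (a sketch only; Lindera's actual builder uses a proper CSV parser):

```rust
// Split one simplified user-dictionary row into its three fields:
// <surface_form>,<part_of_speech>,<reading>
fn parse_row(line: &str) -> Option<(String, String, String)> {
    let mut fields = line.splitn(3, ',');
    Some((
        fields.next()?.to_string(),
        fields.next()?.to_string(),
        fields.next()?.to_string(),
    ))
}

fn main() {
    let row = "東京スカイツリー,カスタム名詞,トウキョウスカイツリー";
    let (surface, pos, reading) = parse_row(row).unwrap();
    println!("{} / {} / {}", surface, pos, reading);
}
```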

With a user dictionary, the Tokenizer is created as follows:

use lindera::tokenizer::Tokenizer;
use lindera_core::core::viterbi::Mode;

fn main() -> std::io::Result<()> {
    // create tokenizer
    let mut tokenizer = Tokenizer::new_with_userdic(Mode::Normal, "", "resources/userdic.csv");

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です");

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cd lindera/lindera
% cargo run --example userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です

API reference

The API reference is available. Please see the following URL:

Comments
  • lindera-ipadic randomly fails during build

    When compiling lindera, we frequently get a build error:

     error: failed to run custom build command for `lindera-ipadic v0.10.0`
    
    Caused by:
      process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
      --- stdout
      cargo:rerun-if-changed=build.rs
      cargo:rerun-if-changed=Cargo.toml
    
      --- stderr
      Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }
    

    It seems to be related to dictionaries.

    Any idea of what the reason could be? Perhaps the Google Drive download? 🤔

    opened by ManyTheFish 10
  • Downloading and decompressing dictionaries takes a lot of time

    Hey @mosuka,

    We were facing compilation slowdowns at Meilisearch recently and investigated; we found out that it was lindera-ipadic that was taking a long time, probably downloading the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.

    If you want to see how long it takes on our side, you can execute the commands below and open the generated HTML report.

    rustup update
    cargo +nightly build --timings
    

    But as we can see, the CPU is idle for a long time when it builds.

    opened by Kerollmops 9
  • Switch FST to DoubleArrayTrie

    Switches the FST library to yada (double-array trie) in PrefixDict. Requires Rust >= 1.46.0 to build.

    This PR breaks the data structure from the previous version.

    Still a work in progress. The remaining tasks are:

    • [x] Change .fst file to .da
    • [x] Add some tests
    • [x] Update CHANGES.md
    • [x] Check benches
      • [x] Debug an error on long text bench
    • ~~Check/test lindera-server~~
    • [x] Check/test lindera-tantivy
      • [x] Need tokenizer to be Clone-able.
    • ~~Change other builders, e.g. neologd, unidic~~
      • After new version released
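    The operation a PrefixDict performs during lattice construction is common-prefix search: find every dictionary key that is a prefix of the remaining input. A double-array trie answers this in time proportional to the key length; the stdlib-only sketch below emulates the same operation naively, just to illustrate the contract (names are illustrative, not this PR's code):

```rust
// Common-prefix search: return every key that is a prefix of `input`.
// A double-array trie does this in one pass over `input`; the linear
// scan below only illustrates the operation itself.
fn common_prefix_search<'a>(keys: &[&'a str], input: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| input.starts_with(*k)).collect()
}

fn main() {
    let keys = ["関西", "関西国際", "関西国際空港", "空港"];
    let hits = common_prefix_search(&keys, "関西国際空港限定");
    println!("{:?}", hits); // ["関西", "関西国際", "関西国際空港"]
}
```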
    enhancement 
    opened by johtani 9
  • Build tokenizer for ko/ja

    Hello, I'm trying to build a tokenizer app which supports Korean/Japanese with the lindera module. It seems Japanese is supported by default, but Korean needs a dictionary built with https://github.com/lindera-morphology/lindera-ko-dic-builder.

    Is there some guide to use this?

    opened by kination 8
  • Can't build lindera-ipadic on Raspberry Pi 4B

    Lindera-ipadic is a requirement of the zola static website generator written in Rust.

    During the zola build, it fails while building lindera-ipadic with this error: memory allocation of 805306368 bytes failed; error: could not compile lindera-ipadic.

    Environment: Raspberry Pi 4B, 4GB memory, debian.

    I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.

    Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.

    Thanks.

    opened by ncoleman 7
  • Lindera doesn’t build

    Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.

    ...
        Checking lindera-decompress v0.13.5
        Checking bstr v0.2.17
        Checking lindera-core v0.13.5
        Checking csv v1.1.6
       Compiling character_converter v2.1.0
        Checking lindera-unidic-builder v0.13.5
        Checking lindera-ipadic-builder v0.13.5
        Checking lindera-dictionary v0.13.5
        Checking lindera-ko-dic-builder v0.13.5
        Checking lindera-cc-cedict-builder v0.13.5
       Compiling lindera-ipadic v0.13.5
        Checking lindera v0.13.5
    error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
      --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
       |
    64 |             _ => Err(LinderaErrorKind::DictionaryTypeError
       |                                        ^^^^^^^^^^^^^^^^^^^
       |                                        |
       |                                        variant or associated item not found in `lindera_core::error::LinderaErrorKind`
       |                                        help: there is a variant with a similar name: `DictionaryLoadError`
    
    error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
      --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
       |
    84 |             _ => Err(LinderaErrorKind::UserDictionaryTypeError
       |                                        ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`
    
    For more information about this error, try `rustc --explain E0599`.
    error: could not compile `lindera` due to 2 previous errors
    

    You can check this repository to reproduce the issue: https://github.com/meilisearch/charabia at sha 82c9f3b.

    opened by irevoire 5
  • Question for user dictionary parsing when using non-compressed local dictionary

    While using a non-compressed local dictionary along with a user dictionary, build_user_dict fails with the error user dictionary path is not set.. I think the related code is here, and I want to confirm whether it is OK to fall back to IpadicBuilder for user dictionary parsing when using a local dictionary.

    bug 
    opened by ypenglyn 5
  • Make the binary smaller by compressing the dictionary

    I made the binary smaller by compressing the dictionary. (In exchange, decompression now takes time at startup.) I implemented this feature in my lindera-js with the aim of making the executable file smaller.

    If you are interested in this feature, I'd like to include it in this repository, using a feature flag to toggle it on and off.

    Achievement:

    cargo build --release
    ls -lah target/release/
    

    The executable target/release/lindera shrinks from 72M to 11M, but the bench-constructor result is slower than without compression.

    bench-constructor       time:   [28.210 ms 28.319 ms 28.431 ms] 
    
    bench-constructor       time:   [590.35 ms 591.92 ms 593.58 ms]                              
    
    opened by higumachan 4
  • Use Unicode values instead of `HashMap` for performance

    I was using this library in a personal project and noticed a place where you can improve performance. Instead of using a static HashMap<char, char>, you can significantly improve performance by processing the character codes directly. Hopefully this suggestion is welcome.

    A few notes:

    • I noticed that the KATAKANA_DAKUON_MAP contains a single handakuten→dakuten conversion: パ→バ. Not sure if this was intentional, so I commented it out for now.
    • There are some uncommon characters that are not covered: ゔ・ヴ, ヷ・ヸ・ヹ・ヺ. I'm assuming the exclusion of the latter four is intentional, but I wonder about the first two.
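    Presumably the improvement exploits the fact that the hiragana block (U+3041 to U+3096) and the katakana block (U+30A1 to U+30F6) are laid out in parallel, exactly 0x60 code points apart, so detection and conversion reduce to integer arithmetic. A minimal sketch of the idea (not this PR's actual code):

```rust
// Hiragana (U+3041..=U+3096) and katakana (U+30A1..=U+30F6) occupy
// parallel Unicode ranges, 0x60 code points apart.
fn is_hiragana(c: char) -> bool {
    ('\u{3041}'..='\u{3096}').contains(&c)
}

fn hiragana_to_katakana(c: char) -> char {
    if is_hiragana(c) {
        // Adding 0x60 always lands inside the katakana block, so the
        // conversion can never produce an invalid scalar value.
        char::from_u32(c as u32 + 0x60).unwrap()
    } else {
        c
    }
}

fn main() {
    let converted: String = "とうきょう".chars().map(hiragana_to_katakana).collect();
    println!("{}", converted); // トウキョウ
}
```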

    Output of bench_hiragana_detection test:

    new: 197ms 4000000
    old: 4248ms 4000000
    

    Output of bench_hiragana_conversion test:

    new: 129ms 4000000
    old: 934ms 4000000
    
    opened by encody 3
  • Reconsider default LZMA dependency without any option to avoid it

    Issue

    The PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency. This forces all users to install the external library liblzma to be able to compile Lindera.

    In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.

    Context

    In Meilisearch we plan to use Lindera to tokenize Japanese texts, but we don't want to ask our users to install external libraries manually, in order to keep Meilisearch easy to install and easy to use.

    Potential solutions

    • reconsider #139
    • choose a compression library that doesn't need a manually installed library (vendoring or rust library)
    • provide a feature flag to choose the compression method
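    The third option might look like the following in Lindera's Cargo.toml (the feature name and exact dependency wiring here are hypothetical, just to illustrate the shape of the flag):

```toml
[features]
# Hypothetical flag: compression on by default; consumers opt out with
# `default-features = false` in their own Cargo.toml.
default = ["compress"]
compress = ["lzma-rs"]

[dependencies]
lzma-rs = { version = "0.2", optional = true }
```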

    Thanks for maintaining Lindera 😊

    opened by ManyTheFish 3
  • implement binary data reading and writing for user dictionary

    Pull Request

    Problem

    The more lines in userdic.csv, the longer it will take to create the tokenizer.

    In my environment, with 180,000 rows of CSV, it takes about 14 seconds to run lindera/examples/userdic_example.rs.

    Solution

    Create a binary version of the user dictionary in advance. How about loading it in the tokenizer?

    With the same CSV data of 180,000 rows, it takes about 7 seconds to run lindera/examples/userdic_example.rs using the binary loading method.
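    The speed-up makes sense: loading a pre-compiled binary blob skips CSV parsing entirely. A stdlib-only round-trip sketch of the idea, with a hypothetical length-prefixed format (not this PR's actual code or format):

```rust
use std::convert::TryInto;

// Hypothetical length-prefixed binary format for user-dictionary rows:
// each of the three fields is stored as a little-endian u32 byte length
// followed by the field's UTF-8 bytes.
fn encode(entries: &[[String; 3]]) -> Vec<u8> {
    let mut buf = Vec::new();
    for entry in entries {
        for field in entry {
            let bytes = field.as_bytes();
            buf.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
            buf.extend_from_slice(bytes);
        }
    }
    buf
}

// Loading the blob is a straight walk over the bytes: no CSV parsing,
// quoting rules, or per-line splitting.
fn decode(mut buf: &[u8]) -> Vec<[String; 3]> {
    let mut entries = Vec::new();
    while !buf.is_empty() {
        let mut entry = [String::new(), String::new(), String::new()];
        for field in entry.iter_mut() {
            let (len_bytes, rest) = buf.split_at(4);
            let len = u32::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
            *field = String::from_utf8(rest[..len].to_vec()).unwrap();
            buf = &rest[len..];
        }
        entries.push(entry);
    }
    entries
}

fn main() {
    let entries = vec![[
        "東京スカイツリー".to_string(),
        "カスタム名詞".to_string(),
        "トウキョウスカイツリー".to_string(),
    ]];
    let blob = encode(&entries);
    assert_eq!(decode(&blob), entries);
    println!("round-trip ok, {} bytes", blob.len());
}
```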

    The code is a proposed implementation. Please let us know what you think of this approach.

    opened by abetomo 3
  • Bump actions/upload-artifact from 2.3.1 to 3.1.1

    Bumps actions/upload-artifact from 2.3.1 to 3.1.1.

    Release notes

    Sourced from actions/upload-artifact's releases.

    v3.1.1

    • Update actions/core package to latest version to remove set-output deprecation warning #351

    v3.1.0

    What's Changed

    v3.0.0

    What's Changed

    • Update default runtime to node16 (#293)
    • Update package-lock.json file version to 2 (#302)

    Breaking Changes

    With the update to Node 16, all scripts will now be run with Node 16 rather than Node 12.

    Commits
    • 83fd05a Bump actions-core to v1.10.0 (#356)
    • 3cea537 Merge pull request #327 from actions/robherley/artifact-1.1.0
    • 849aa77 nvm use 12 & npm run release
    • 4d39869 recompile with correct ncc version
    • 2e0d362 bump @actions/artifact to 1.1.0
    • 09a5d6a Merge pull request #320 from actions/dependabot/npm_and_yarn/ansi-regex-4.1.1
    • 189315d Bump ansi-regex from 4.1.0 to 4.1.1
    • d159c2d Merge pull request #297 from actions/dependabot/npm_and_yarn/ajv-6.12.6
    • c26a7ba Bump ajv from 6.11.0 to 6.12.6
    • 6ed6c72 Merge pull request #303 from actions/dependabot/npm_and_yarn/yargs-parser-13.1.2
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies github_actions 
    opened by dependabot[bot] 0
Releases (v0.19.4)

Owner

Lindera Morphology
Morphological analysis libraries and commands.
Font independent text analysis support for shaping and layout.

lipi Lipi (Sanskrit for 'writing, letters, alphabet') is a pure Rust crate that provides font independent text analysis support for shaping and layout

Chad Brokaw 12 Sep 22, 2022
A command-line tool and library for generating regular expressions from user-provided test cases

Table of Contents What does this tool do? Do I still need to learn to write regexes then? Current features How to install? 4.1 The command-line tool 4

Peter M. Stahl 5.8k Dec 30, 2022
An efficient and powerful Rust library for word wrapping text.

Textwrap Textwrap is a library for wrapping and indenting text. It is most often used by command-line programs to format dynamic output nicely so it l

Martin Geisler 322 Dec 26, 2022
⏮ ⏯ ⏭ A Rust library to easily read forwards, backwards or randomly through the lines of huge files.

EasyReader The main goal of this library is to allow long navigations through the lines of large files, freely moving forwards and backwards or gettin

Michele Federici 81 Dec 6, 2022
Natural language detection library for Rust. Try demo online: https://www.greyblake.com/whatlang/

Whatlang Natural language detection for Rust with focus on simplicity and performance. Content Features Get started Documentation Supported languages

Sergey Potapov 805 Dec 28, 2022
A Rust library for generically joining iterables with a separator

joinery A Rust library for generically joining iterables with a separator. Provides the tragically missing string join functionality to rust. extern c

Nathan West 72 Dec 16, 2022
👄 The most accurate natural language detection library in the Rust ecosystem, suitable for long and short text alike

Table of Contents What does this library do? Why does this library exist? Which languages are supported? How good is it? Why is it better than other l

Peter M. Stahl 569 Jan 3, 2023
A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

nlprule A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based app

Benjamin Minixhofer 496 Jan 8, 2023
Rust wrapper for the BlingFire tokenization library

BlingFire in Rust blingfire is a thin Rust wrapper for the BlingFire tokenization library. Add the library to Cargo.toml to get started cargo add blin

Re:infer 14 Sep 5, 2022
A Rust library containing an offline version of webster's dictionary.

webster-rs A Rust library containing an offline version of webster's dictionary. Add to Cargo.toml webster = 0.3.0 Simple example: fn main() { le

Grant Handy 12 Sep 27, 2022
Wrapper around Microsoft CNTK library

Bindings for CNTK library Simple low level bindings for CNTK library from Microsoft. API Documentation Status Currently exploring ways how to interact

Vlado Boza 21 Nov 30, 2021
Rust-nlp is a library to use Natural Language Processing algorithm with RUST

nlp Rust-nlp Implemented algorithm Distance Levenshtein (Explanation) Jaro / Jaro-Winkler (Explanation) Phonetics Soundex (Explanation) Metaphone (Exp

Simon Paitrault 34 Dec 20, 2022
A lightweight library with vehicle tuning utilities.

A lightweight library with vehicle tuning utilities. This includes utilities for communicating with OBD-II services, firmware downloading/flashing, and table modifications.

LibreTuner 6 Oct 3, 2022
lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike.

lingua-py lingua-rs Python binding. An accurate natural language detection library, suitable for long and short text alike. Installation pip install l

messense 7 Dec 30, 2022
A small random number generator hacked on top of Rust's standard library. An exercise in pointlessness.

attorand from 'atto', meaning smaller than small, and 'rand', short for random. A small random number generator hacked on top of Rust's standard libra

Isaac Clayton 1 Nov 24, 2021
Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents.

Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents. The library provides strategies to act on objects that implement certain document traits (NaiveDocument, ProcessedDocument, ExpandableDocument).

Ferris Tseng 13 Oct 31, 2022
A simple and fast linear algebra library for games and graphics

glam A simple and fast 3D math library for games and graphics. Development status glam is in beta stage. Base functionality has been implemented and t

Cameron Hart 953 Jan 3, 2023
A small rust library for creating regex-based lexers

A small rust library for creating regex-based lexers

nph 1 Feb 5, 2022
A rule based sentence segmentation library.

cutters A rule based sentence segmentation library. ?? This library is experimental. ?? Features Full UTF-8 support. Robust parsing. Language specific

null 11 Jul 29, 2022