A morphological analysis library.

Lindera Morphology

Last update: Dec 27, 2022

Overview

Lindera

A Japanese morphological analysis library in Rust. This project is a fork of fulmicoton's kuromoji-rs.

Lindera aims to build a library which is easy to install and provides concise APIs for various Rust applications.

Build

The following products are required to build:

Rust >= 1.46.0

% cargo build --release

Usage

Basic example

This example covers the basic usage of Lindera.

It will:

Create a tokenizer in normal mode
Tokenize the input text
Output the tokens

use lindera::tokenizer::Tokenizer;
use lindera_core::core::viterbi::Mode;

fn main() -> std::io::Result<()> {
    // create tokenizer
    let mut tokenizer = Tokenizer::new(Mode::Normal, "");

    // tokenize the text
    let tokens = tokenizer.tokenize("関西国際空港限定トートバッグ");

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run as follows:

% cargo run --example basic_example

You can see the result as follows:

関西国際空港
限定
トートバッグ

User dictionary example

You can supply user dictionary entries in addition to the default system dictionary. The user dictionary should be a CSV file with the following format.

<surface_form>,<part_of_speech>,<reading>

For example:

% cat userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
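
To make the three-column layout concrete, here is a minimal sketch of how one such line could be split into its fields. It uses only the standard library and is not Lindera's own parser; the `UserDictEntry` struct and the `parse_userdic_line` function are hypothetical names introduced for illustration, and the sketch assumes fields never contain quoted commas.

```rust
/// One row of the simple user dictionary CSV (hypothetical type, not part of Lindera).
#[derive(Debug, PartialEq)]
struct UserDictEntry {
    surface: String,
    part_of_speech: String,
    reading: String,
}

/// Parse a single `<surface_form>,<part_of_speech>,<reading>` line.
/// Returns `None` unless the line has exactly three non-empty fields.
/// Assumption: no quoted commas, so a plain split on ',' is enough.
fn parse_userdic_line(line: &str) -> Option<UserDictEntry> {
    let fields: Vec<&str> = line.trim().split(',').map(str::trim).collect();
    match fields.as_slice() {
        [surface, pos, reading]
            if !surface.is_empty() && !pos.is_empty() && !reading.is_empty() =>
        {
            Some(UserDictEntry {
                surface: surface.to_string(),
                part_of_speech: pos.to_string(),
                reading: reading.to_string(),
            })
        }
        _ => None,
    }
}

fn main() {
    let csv = "東京スカイツリー,カスタム名詞,トウキョウスカイツリー\n\
               東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン";
    for line in csv.lines() {
        match parse_userdic_line(line) {
            Some(entry) => println!("{:?}", entry),
            None => eprintln!("skipping malformed line: {}", line),
        }
    }
}
```

A check like this can be handy for validating a large userdic.csv before handing it to the tokenizer.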

With a user dictionary, the Tokenizer is created as follows:

use lindera::tokenizer::Tokenizer;
use lindera_core::core::viterbi::Mode;

fn main() -> std::io::Result<()> {
    // create tokenizer
    let mut tokenizer = Tokenizer::new_with_userdic(Mode::Normal, "", "resources/userdic.csv");

    // tokenize the text
    let tokens = tokenizer.tokenize("東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です");

    // output the tokens
    for token in tokens {
        println!("{}", token.text);
    }

    Ok(())
}

The above example can be run with cargo run --example:

% cd lindera/lindera
% cargo run --example userdic_example
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です

API reference

The API reference is available. Please see the following URL:

lindera

Comments

Lindera-ipadic randomly has an issue during build

When compiling lindera we frequently get a build error:

 error: failed to run custom build command for `lindera-ipadic v0.10.0`

Caused by:
  process didn't exit successfully: `D:\a\milli\milli\target\release\build\lindera-ipadic-caf28ea0e76b9e29\build-script-build` (exit code: 1)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=Cargo.toml

  --- stderr
  Error: Custom { kind: UnexpectedEof, error: TarError { desc: "failed to iterate over archive", io: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" } } }

It seems to be related to dictionaries.

Any idea what the cause could be? The Google Drive download? 🤔

opened by ManyTheFish 10

Downloading and decompressing dictionaries takes a lot of time
Hey @mosuka,

We were facing compilation slowdowns at Meilisearch recently and investigated. We found out that lindera-ipadic was taking a lot of time, probably to download the mecab-ipadic-2.7.0-20070801.tar.gz tarball from SourceForge.

If you want to look at the time it takes on our side, you can just execute the below command and open the generated HTML report.

rustup update
cargo +nightly build --timings

But as we can see, the CPU is idle for a long time when it builds.
opened by Kerollmops 9
Switch FST to DoubleArrayTrie
Switch the FST library to yada (Double Array Trie) in PrefixDict. Requires Rust version >= 1.46.0 to build.

This PR breaks the data structure from the previous version.

Still a work in progress. The remaining tasks are:

[x] Change .fst file to .da

[x] Add some tests

[x] Update CHANGES.md

[x] Check benches

[x] Debug an error on long text bench

~~Check/test lindera-server~~

[x] Check/test lindera-tantivy

[x] Need tokenizer to be Clone-able.

~~Change other builders, e.g. neologd, unidic~~

After new version released

enhancement
opened by johtani 9
Build tokenizer for ko/ja

Hello, I'm trying to build a tokenizer app that supports Korean and Japanese with the lindera module. It seems Japanese is supported by default, but Korean requires building a dictionary with https://github.com/lindera-morphology/lindera-ko-dic-builder.

Is there a guide for using this?

opened by kination 8
Can't build lindera-ipadic on Raspberry Pi 4B

Lindera-ipadic is a requirement of the zola static website generator written in Rust.

During the zola build, it fails while building lindera-ipadic with this error: memory allocation of 805306368 bytes failed error: could not compile lindera-ipadic.

Environment: Raspberry Pi 4B, 4GB memory, debian.

I have tried to give it more contiguous memory by rebooting and trying again with a fresh system and no user apps running. Even then, the system apparently can't give it 800MB (!) of presumably contiguous memory. free -mh shows 2.7GB free, but not contiguous, I imagine.

Zola developers have asked me to report this to you. They do not think lindera-ipadic requires 800MB to build.

Thanks.

opened by ncoleman 7

Lindera doesn’t build

Currently, we can’t import lindera in the latest version. It doesn’t build, and since the change has been pushed as a minor version, it probably broke every project relying on lindera.

...
    Checking lindera-decompress v0.13.5
    Checking bstr v0.2.17
    Checking lindera-core v0.13.5
    Checking csv v1.1.6
   Compiling character_converter v2.1.0
    Checking lindera-unidic-builder v0.13.5
    Checking lindera-ipadic-builder v0.13.5
    Checking lindera-dictionary v0.13.5
    Checking lindera-ko-dic-builder v0.13.5
    Checking lindera-cc-cedict-builder v0.13.5
   Compiling lindera-ipadic v0.13.5
    Checking lindera v0.13.5
error[E0599]: no variant or associated item named `DictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:64:40
   |
64 |             _ => Err(LinderaErrorKind::DictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^
   |                                        |
   |                                        variant or associated item not found in `lindera_core::error::LinderaErrorKind`
   |                                        help: there is a variant with a similar name: `DictionaryLoadError`

error[E0599]: no variant or associated item named `UserDictionaryTypeError` found for enum `lindera_core::error::LinderaErrorKind` in the current scope
  --> /Users/irevoire/.cargo/registry/src/github.com-1ecc6299db9ec823/lindera-0.13.5/src/tokenizer.rs:84:40
   |
84 |             _ => Err(LinderaErrorKind::UserDictionaryTypeError
   |                                        ^^^^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `lindera_core::error::LinderaErrorKind`

For more information about this error, try `rustc --explain E0599`.
error: could not compile `lindera` due to 2 previous errors

You can use this repository to reproduce the issue: https://github.com/meilisearch/charabia at sha 82c9f3b

opened by irevoire 5

Question for user dictionary parsing when using non-compressed local dictionary

When using a non-compressed local dictionary along with a user dictionary, build_user_dict fails with the error "user dictionary path is not set.". I think the related code is here, and I want to confirm whether it is OK for user dictionary parsing to fall back to IpadicBuilder when a local dictionary is used.
bug

opened by ypenglyn 5
Make the binary smaller by compressing the dictionary
I made the binary smaller by compressing the dictionary. (The trade-off is that decompression time is paid up front at runtime.) I implemented this feature in my lindera-js with the aim of making the executable file smaller.

If you are interested in this feature, I'd like to include it in this repository behind a feature flag so it can be toggled on and off.

Achievement:

cargo build --release
ls -lah target/release/

The size of the executable target/release/lindera went from 72M to 11M, but the bench-constructor result is slower than without compression.

bench-constructor time: [28.210 ms 28.319 ms 28.431 ms]

bench-constructor time: [590.35 ms 591.92 ms 593.58 ms]
opened by higumachan 4
Use Unicode values instead of `HashMap` for performance
I was using this library in a personal project and noticed a place where you can improve performance. Instead of using a static HashMap<char, char>, you can significantly improve performance by processing the character codes directly. Hopefully this suggestion is welcome.

A few notes:

I noticed that the KATAKANA_DAKUON_MAP contains a single handakuten→dakuten conversion: パ→バ. Not sure if this was intentional, so I commented it out for now.

There are some uncommon characters that are not covered: ゔ・ヴ, ヷ・ヸ・ヹ・ヺ. I'm assuming the exclusion of the latter four is intentional, but I wonder about the first two.

Output of bench_hiragana_detection test:

new: 197ms 4000000
old: 4248ms 4000000

Output of bench_hiragana_conversion test:

new: 129ms 4000000
old: 934ms 4000000
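
For readers unfamiliar with the idea, the following is a minimal sketch of character-code based detection and conversion, not the code from this pull request. It relies on the Unicode layout in which the hiragana letters ぁ (U+3041) through ゖ (U+3096) sit exactly 0x60 code points below their katakana counterparts ァ (U+30A1) through ヶ (U+30F6). The function names `is_hiragana` and `hiragana_to_katakana` are hypothetical, and including ゔ in the range is a choice made for this sketch only.

```rust
/// True for the hiragana letters ぁ (U+3041) through ゖ (U+3096).
/// A range check replaces a lookup in a static HashMap<char, char>.
fn is_hiragana(c: char) -> bool {
    matches!(c, '\u{3041}'..='\u{3096}')
}

/// Convert one hiragana letter to katakana by adding the fixed 0x60 offset.
/// Any other character is returned unchanged.
fn hiragana_to_katakana(c: char) -> char {
    if is_hiragana(c) {
        // The shifted value lands in U+30A1..=U+30F6, which is always a valid char;
        // the fallback only exists to avoid a panic path.
        char::from_u32(c as u32 + 0x60).unwrap_or(c)
    } else {
        c
    }
}

fn main() {
    let converted: String = "とうきょうスカイツリー駅"
        .chars()
        .map(hiragana_to_katakana)
        .collect();
    println!("{}", converted);
}
```

Because the check is a pair of integer comparisons and the conversion a single addition, no hashing or table lookup is involved, which is consistent with the kind of speed-up reported above.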
opened by encody 3
Reconsider default LZMA dependency without any option to avoid it
Issue

PR #139, introduced in v0.9.0, makes LZMA (rust-lzma or lzma-rs) a mandatory dependency. This forces all users to install the external library liblzma to be able to compile Lindera.

In comparison, v0.8.1 only requires adding lindera to the project's Cargo.toml.

Context

In Meilisearch we plan to use Lindera to tokenize Japanese texts, but we don't want to ask our users to install external libraries manually, in order to keep Meilisearch easy to install and easy to use.

Potential solutions

reconsider #139

choose a compression library that doesn't need a manually installed library (vendoring or Rust library):

    flate2

    snap

    brotli ? (🤷)

provide a feature flag to choose the compression method

Thanks for maintaining Lindera 😊
opened by ManyTheFish 3
implement binary data reading and writing for user dictionary

Pull Request

Problem

The more lines in userdic.csv, the longer it will take to create the tokenizer.

In my environment, with 180,000 rows of CSV, it takes about 14 seconds to run lindera/examples/userdic_example.rs.

Solution

Create a binary version of the user dictionary in advance. How about loading it in the tokenizer?

With the same 180,000-row CSV data, it takes about 7 seconds to run lindera/examples/userdic_example.rs using the binary loading method.

The code is a proposed implementation. Please let us know what you think of this approach.
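
The general "serialize once, load many times" idea behind this proposal can be sketched with the standard library alone. The example below is not the format or code used in this pull request; it is a hypothetical length-prefixed layout (a little-endian u32 count, then each string as a u32 byte length followed by UTF-8 bytes), and `write_entries` and `read_entries` are names introduced only for illustration. A real implementation would more likely store the already-built dictionary structures rather than the raw rows.

```rust
use std::io::{self, Cursor, Read, Write};

/// (surface form, part of speech, reading); a hypothetical stand-in for a dictionary row.
type Entry = (String, String, String);

/// Write a string as a little-endian u32 byte length followed by its UTF-8 bytes.
fn write_str<W: Write>(w: &mut W, s: &str) -> io::Result<()> {
    w.write_all(&(s.len() as u32).to_le_bytes())?;
    w.write_all(s.as_bytes())
}

/// Read a string written by `write_str`.
fn read_str<R: Read>(r: &mut R) -> io::Result<String> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_le_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Serialize all entries: an entry count, then three strings per entry.
fn write_entries<W: Write>(w: &mut W, entries: &[Entry]) -> io::Result<()> {
    w.write_all(&(entries.len() as u32).to_le_bytes())?;
    for (surface, pos, reading) in entries {
        write_str(w, surface)?;
        write_str(w, pos)?;
        write_str(w, reading)?;
    }
    Ok(())
}

/// Load entries previously written by `write_entries`.
fn read_entries<R: Read>(r: &mut R) -> io::Result<Vec<Entry>> {
    let mut count = [0u8; 4];
    r.read_exact(&mut count)?;
    let mut entries = Vec::new();
    for _ in 0..u32::from_le_bytes(count) {
        entries.push((read_str(r)?, read_str(r)?, read_str(r)?));
    }
    Ok(entries)
}

fn main() -> io::Result<()> {
    let entries = vec![(
        "東京スカイツリー".to_string(),
        "カスタム名詞".to_string(),
        "トウキョウスカイツリー".to_string(),
    )];
    // Done once, ahead of time (here into memory instead of a file).
    let mut bytes = Vec::new();
    write_entries(&mut bytes, &entries)?;
    // Done at tokenizer start-up: no CSV parsing needed.
    let loaded = read_entries(&mut Cursor::new(bytes))?;
    println!("{:?}", loaded);
    Ok(())
}
```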

opened by abetomo 3
Bump actions/upload-artifact from 2.3.1 to 3.1.1
Bumps actions/upload-artifact from 2.3.1 to 3.1.1.

Release notes

Sourced from actions/upload-artifact's releases.

v3.1.1

Update actions/core package to latest version to remove set-output deprecation warning #351

v3.1.0

What's Changed

Bump @actions/artifact to v1.1.0 (actions/upload-artifact#327)

Adds checksum headers on artifact upload (actions/toolkit#1095) (actions/toolkit#1063)

v3.0.0

What's Changed

Update default runtime to node16 (#293)

Update package-lock.json file version to 2 (#302)

Breaking Changes

With the update to Node 16, all scripts will now be run with Node 16 rather than Node 12.

Commits

83fd05a Bump actions-core to v1.10.0 (#356)

3cea537 Merge pull request #327 from actions/robherley/artifact-1.1.0

849aa77 nvm use 12 & npm run release

4d39869 recompile with correct ncc version

2e0d362 bump @actions/artifact to 1.1.0

09a5d6a Merge pull request #320 from actions/dependabot/npm_and_yarn/ansi-regex-4.1.1

189315d Bump ansi-regex from 4.1.0 to 4.1.1

d159c2d Merge pull request #297 from actions/dependabot/npm_and_yarn/ajv-6.12.6

c26a7ba Bump ajv from 6.11.0 to 6.12.6

6ed6c72 Merge pull request #303 from actions/dependabot/npm_and_yarn/yargs-parser-13.1.2

Additional commits viewable in compare view

dependencies github_actions
opened by dependabot[bot] 0