🦞 Crawdad: ChaRActer-Wise Double-Array Dictionary


Overview

Crawdad is a library of natural language dictionaries using character-wise double-array tries. The implementation is optimized for strings of multibyte characters, enabling fast text processing on languages such as Japanese or Chinese.

Experimental results can be found in the Wiki.

What Crawdad can do

  • Key-value mapping: Crawdad stores a set of string keys mapped to arbitrary integer values.
  • Exact match: Crawdad supports a fast lookup for an input key.
  • Common prefix search: Crawdad supports fast common prefix search that can be used to enumerate all keys appearing in a text.

Data structures

Crawdad provides two trie implementations:

  • crawdad::Trie is a standard trie form that often provides the fastest queries.
  • crawdad::MpTrie is a minimal-prefix trie form that is memory-efficient for long strings.
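The minimal-prefix idea behind MpTrie can be illustrated as follows. This is a conceptual sketch, not crawdad's actual layout, and `tail_suffix` is a hypothetical helper: the unique suffix of a key below its last branching point can be stored once in a flat tail buffer instead of as a chain of trie nodes, which saves memory on long keys.

```rust
// Conceptual sketch: find the part of `key` that shares no prefix with
// any other key, i.e., the part that could live in a flat tail buffer.
fn tail_suffix<'a>(key: &'a str, others: &[&str]) -> &'a str {
    // Length (in chars) of the longest prefix shared with any other key.
    let shared = others
        .iter()
        .map(|o| {
            key.chars()
                .zip(o.chars())
                .take_while(|(a, b)| a == b)
                .count()
        })
        .max()
        .unwrap_or(0);
    // Convert the char count back to a byte offset into `key`.
    let byte_idx = key
        .char_indices()
        .nth(shared)
        .map(|(i, _)| i)
        .unwrap_or(key.len());
    &key[byte_idx..]
}

fn main() {
    // "アイスクリーム" shares "アイス" with the other key, so only
    // "クリーム" needs node-chain-external storage.
    assert_eq!(tail_suffix("アイスクリーム", &["アイス"]), "クリーム");
}
```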

Installation

To use crawdad, depend on it in your Cargo manifest:

# Cargo.toml

[dependencies]
crawdad = "0.1"

Disclaimer

This software is developed by LegalForce, Inc., but it is not an officially supported LegalForce product.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

For the software under bench/data, follow the license terms of each package.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • Support unsorted keys

    We often want to build a trie from unsorted search patterns, such as hashmap keys, but building with such a pattern set currently fails with an error message.

    Sorting the pattern set before every build is inconvenient, so unsorted pattern sets should be supported.
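    Until unsorted input is supported, a workaround is to sort and deduplicate the pattern set before building. A minimal sketch (the commented-out `from_keys` call mirrors the API used elsewhere on this page):

```rust
fn main() {
    // Patterns in arbitrary order, e.g., collected from hashmap keys.
    let mut keys = vec!["東京", "アイス", "京都", "アイス"];
    // Sort and deduplicate before building, since the builder
    // currently rejects unsorted input.
    keys.sort_unstable();
    keys.dedup();
    assert_eq!(keys, vec!["アイス", "京都", "東京"]);
    // let trie = crawdad::Trie::from_keys(&keys).unwrap();
}
```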

    opened by vbkaisetsu 3
  • Represent a mapped code in 2 bytes

    In this PR, I attempted to restrict the representation of a mapped code to 2 bytes.

    This restriction allows us to

    • implement CodeMapper as a 2-byte array, and
    • implement the tail of MP-trie with 2-byte alignment (i.e., skipping unpacking integers from a byte sequence).

    Summary

    Unfortunately, these modifications do NOT provide meaningful improvements, so I am closing this PR.

    Background

    I first investigated the number of distinct characters in several word dictionaries. As the following results show, two bytes are enough to represent a mapped code in practice.

    | Dataset | #chars |
    |---|---:|
    | jawiki-20220720-all-titles-in-ns0 | 11132 |
    | zhwiki-20220720-all-titles-in-ns0 | 16829 |
    | kowiki-20220720-all-titles-in-ns0 | 4780 |
    | mecab-ipadic-2.7.0 | 5443 |
    | mecab-ipadic-neologd-0.0.7 | 8904 |
    | mecab-cedict-20220509 | 14362 |
    | mecab-kodic-2.1.1 | 9554 |
    | unidic | 7545 |

    Implement CodeMapper as a 2-byte array.

    Second, I changed CodeMapper::table from Vec<u32> to Vec<u16> and ran the benchmark (see https://github.com/daac-tools/crawdad/commit/f24d38df3bcc5c78fb3ab739900b93a558a8dbcf).

    This modification is minor, but surprisingly it degraded search performance, perhaps due to the cost of casting mapped codes from u16 to u32.
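    A simplified, hypothetical sketch of the change (not the actual crawdad code), showing where the widening cast sits on the lookup path:

```rust
// Hypothetical, simplified sketch: the mapping table stores 2-byte
// codes and widens them to u32 on every lookup.
struct CodeMapper {
    table: Vec<u16>, // was Vec<u32>
}

impl CodeMapper {
    fn get(&self, c: char) -> Option<u32> {
        // The u16 -> u32 widening on this hot path is the suspected
        // cause of the regression.
        self.table.get(c as usize).map(|&code| u32::from(code))
    }
}

fn main() {
    let mapper = CodeMapper { table: vec![0, 7, 42] };
    assert_eq!(mapper.get('\u{0002}'), Some(42));
    assert_eq!(mapper.get('あ'), None); // out of table range
}
```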

    unidic/exact/crawdad/trie                                                                            
                            time:   [11.060 µs 11.096 µs 11.128 µs]
                            change: [+13.713% +14.242% +14.703%] (p = 0.00 < 0.05)
                            Performance has regressed.
    unidic/exact/crawdad/mptrie                                                                            
                            time:   [16.076 µs 16.190 µs 16.293 µs]
                            change: [+4.5033% +5.5107% +6.3737%] (p = 0.00 < 0.05)
                            Performance has regressed.
    

    The memory savings were also modest:

    • Trie: 9.405 MiB -> 9.077 MiB
    • MpTrie: 10.967 MiB -> 10.640 MiB

    Implement the TAIL of MP-trie with 2-byte alignment

    Third, I implemented MpTrie::tails as Vec<u16> to skip unpacking mapped codes from a byte sequence (in addition to the above modification).

    The implementation is the one in this PR. Unfortunately, we did not see any meaningful improvement.

    UniDic

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 22.591 | 10.967 |
    | New | 20.763 | 11.723 |

    Jawiki-titles

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 51.710 | 46.148 |
    | New | 51.579 | 49.414 |

    Zhwiki-titles (Chinese)

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 42.558 | 50.401 |
    | New | 48.005 | 52.440 |

    opened by kampersanda 1
  • Simplify and speed up common prefix searcher

    This PR simplifies the implementation of common prefix search and avoids storing mapped characters in an auxiliary data structure. With this modification, the search speed was improved as follows.

    unidic/cps/crawdad/trie time:   [6.9536 ms 6.9676 ms 6.9839 ms]
                            change: [-14.553% -13.982% -13.439%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 2 outliers among 10 measurements (20.00%)
      2 (20.00%) high mild
    unidic/cps/crawdad/mptrie
                            time:   [8.1081 ms 8.1424 ms 8.1795 ms]
                            change: [-18.109% -17.562% -16.983%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 1 outliers among 10 measurements (10.00%)
      1 (10.00%) high mild
    

    Discussion

    This modification involves the following trade-offs.

    • Pros:
      • The API becomes simpler.
      • The auxiliary data structure for storing mapped characters is removed.
    • Cons:
      • The same characters in a haystack are mapped repeatedly.
      • Byte positions are omitted from matched results.

    The experimental results indicate that storing and accessing mapped characters in an array is more costly than mapping characters repeatedly. The new, simpler API should also be easier to use, since users no longer need to create an instance to store mapped characters. Byte positions are not needed by all users, so this library need not provide them.
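    For illustration only, here is what a common prefix search computes, written naively over a plain key list; the trie answers the same query in a single scan of the haystack prefix rather than checking every key.

```rust
// Illustrative only: return every key that is a prefix of `haystack`,
// i.e., the matches a common prefix search reports at one position.
fn common_prefix_search<'a>(keys: &[&'a str], haystack: &str) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|&k| haystack.starts_with(k))
        .collect()
}

fn main() {
    let keys = ["東京", "東京都", "京都"];
    // At the start of "東京都庁", both "東京" and "東京都" match.
    assert_eq!(common_prefix_search(&keys, "東京都庁"), vec!["東京", "東京都"]);
}
```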

    opened by kampersanda 1
  • Proposal to change `common_prefix_searcher()` API

    Currently, common_prefix_searcher() takes a slice generated by map_text(). However, there is no mechanism linking the slice to the trie, so users can pass a slice generated by a different trie to common_prefix_searcher():

    // example
    let keys1 = vec!["アイス", "アイスクリーム"];
    let trie1 = MpTrie::from_keys(&keys1).unwrap();
    let keys2 = vec!["東京", "京都"];
    let trie2 = MpTrie::from_keys(&keys2).unwrap();
    
    let mut mapped = vec![];
    trie1.map_text("東京都", &mut mapped);
    
    trie2.common_prefix_searcher(&mapped[1..]);
    

    To avoid this problem, this issue suggests changing common_prefix_searcher() to return an intermediate structure with a mapped text and a reference to the trie:

    impl Trie {
        /// Returns an intermediate structure.
        pub fn common_prefix_searcher(&self) -> CommonPrefixSearcher<'_> {
            CommonPrefixSearcher {
                trie: self,
                mapped_text: vec![],
            }
        }
    }

    pub struct CommonPrefixSearcher<'t> {
        trie: &'t Trie,
        mapped_text: Vec<MappedChar>,
    }

    impl<'t> CommonPrefixSearcher<'t> {
        /// Updates the haystack.
        pub fn set_text(&mut self, text: &str) {
            self.mapped_text.clear();
            ...
        }

        /// Returns an iterator of the common prefix search.
        pub fn search(&self, range: Range<usize>) -> CommonPrefixSearchIter {
            ...
        }
    }

    impl Iterator for CommonPrefixSearchIter {
        ...
    }
    

    Usage:

    let keys = vec!["東京", "京都"];
    let trie = MpTrie::from_keys(&keys).unwrap();
    
    let mut searcher = trie.common_prefix_searcher();
    
    searcher.set_text("東京都");
    searcher.search(1..);
    
    enhancement 
    opened by vbkaisetsu 1
  • Add skipping approach for faster construction

    This PR implements a skipping approach for faster construction.

    On UniDic, the construction times were improved from

    [crawdad/trie]
    heap_bytes: 9795880 bytes, 9.342 MiB
    num_elems: 1138688
    num_vacants: 119677
    vacant_ratio: 0.105
    construction: 7.125 [sec]
    [crawdad/mptrie]
    heap_bytes: 11500152 bytes, 10.967 MiB
    num_elems: 1024000
    num_vacants: 180321
    vacant_ratio: 0.176
    construction: 15.686 [sec]
    

    to

    [crawdad/trie]
    heap_bytes: 9861416 bytes, 9.405 MiB
    num_elems: 1146880
    num_vacants: 127869
    vacant_ratio: 0.111
    construction: 3.333 [sec]
    [crawdad/mptrie]
    heap_bytes: 11500152 bytes, 10.967 MiB
    num_elems: 1024000
    num_vacants: 180321
    vacant_ratio: 0.176
    construction: 4.526 [sec]
    

    In this version, I do not add a function to set Builder::num_free_blocks, to avoid complicating the API (i.e., num_free_blocks is always fixed at 16).
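    The skipping idea can be sketched as follows. This is a conceptual illustration under assumed names (search_start, BLOCK_LEN are hypothetical), not the PR's actual code: when probing for a vacant slot during construction, only the most recent num_free_blocks blocks are scanned, and older, mostly full blocks are skipped entirely.

```rust
// Conceptual sketch of the skipping approach (assumed names).
const NUM_FREE_BLOCKS: usize = 16; // fixed in this version
const BLOCK_LEN: usize = 256;      // elements per block (assumed)

// Index of the first element still probed for a vacant slot; every
// element before it belongs to a skipped (assumed-packed) block.
fn search_start(num_blocks: usize) -> usize {
    num_blocks.saturating_sub(NUM_FREE_BLOCKS) * BLOCK_LEN
}

fn main() {
    // With fewer than 16 blocks, every element is still scanned.
    assert_eq!(search_start(10), 0);
    // With 100 blocks, the first 84 blocks are skipped.
    assert_eq!(search_start(100), 84 * 256);
}
```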

    opened by kampersanda 0
Releases

  • v0.4.0