🦞 Crawdad: ChaRActer-Wise Double-Array Dictionary


Overview

Crawdad is a library of natural language dictionaries using character-wise double-array tries. The implementation is optimized for strings of multibyte characters, enabling fast text processing on languages such as Japanese or Chinese.

Experimental results can be found in the Wiki.

What Crawdad can do

  • Key-value mapping: Crawdad stores a set of string keys mapped to arbitrary integer values.
  • Exact match: Crawdad supports a fast lookup for an input key.
  • Common prefix search: Crawdad supports fast common prefix search that can be used to enumerate all keys appearing in a text.

Data structures

Crawdad provides two trie implementations:

  • crawdad::Trie is a standard trie form that often provides the fastest queries.
  • crawdad::MpTrie is a minimal-prefix trie form that is memory-efficient for long strings.
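The minimal-prefix idea behind MpTrie can be illustrated as follows. This is a conceptual sketch, not crawdad's actual layout, and `tail_suffix` is a hypothetical helper: the unique suffix of a key below its last branching point can be stored once in a flat tail buffer instead of as a chain of trie nodes, which saves memory on long keys.

```rust
// Conceptual sketch: find the part of `key` that shares no prefix with
// any other key, i.e., the part that could live in a flat tail buffer.
fn tail_suffix<'a>(key: &'a str, others: &[&str]) -> &'a str {
    // Length (in chars) of the longest prefix shared with any other key.
    let shared = others
        .iter()
        .map(|o| {
            key.chars()
                .zip(o.chars())
                .take_while(|(a, b)| a == b)
                .count()
        })
        .max()
        .unwrap_or(0);
    // Convert the char count back to a byte offset into `key`.
    let byte_idx = key
        .char_indices()
        .nth(shared)
        .map(|(i, _)| i)
        .unwrap_or(key.len());
    &key[byte_idx..]
}

fn main() {
    // "アイスクリーム" shares "アイス" with the other key, so only
    // "クリーム" needs node-chain-external storage.
    assert_eq!(tail_suffix("アイスクリーム", &["アイス"]), "クリーム");
}
```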

Installation

To use crawdad, depend on it in your Cargo manifest:

# Cargo.toml

[dependencies]
crawdad = "0.1"

Disclaimer

This software is developed by LegalForce, Inc., but it is not an officially supported LegalForce product.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

For the software under bench/data, follow the license terms of each package.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • Support unsorted keys

    We often want to build a trie from unsorted search patterns, such as hashmap keys, but building with such a pattern set currently fails with an error message.

    Sorting the pattern set before every build is inconvenient, so unsorted pattern sets should be supported.
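    Until unsorted input is supported, a workaround is to sort and deduplicate the pattern set before building. A minimal sketch (the commented-out `from_keys` call mirrors the API used elsewhere on this page):

```rust
fn main() {
    // Patterns in arbitrary order, e.g., collected from hashmap keys.
    let mut keys = vec!["東京", "アイス", "京都", "アイス"];
    // Sort and deduplicate before building, since the builder
    // currently rejects unsorted input.
    keys.sort_unstable();
    keys.dedup();
    assert_eq!(keys, vec!["アイス", "京都", "東京"]);
    // let trie = crawdad::Trie::from_keys(&keys).unwrap();
}
```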

    opened by vbkaisetsu 3
  • Represent a mapped code in 2 bytes

    In this PR, I attempted to restrict the representation of a mapped code to 2 bytes.

    This restriction allows us to

    • implement CodeMapper as a 2-byte array, and
    • implement the tail of MP-trie with 2-byte alignment (i.e., skipping unpacking integers from a byte sequence).

    Summary

    Unfortunately, these modifications do NOT provide meaningful improvements, so I am closing this PR.

    Background

    I first investigated the number of distinct characters in several word dictionaries. As the following results show, two bytes are enough to represent a mapped code in practice.

    | Dataset | #chars |
    |---|---:|
    | jawiki-20220720-all-titles-in-ns0 | 11132 |
    | zhwiki-20220720-all-titles-in-ns0 | 16829 |
    | kowiki-20220720-all-titles-in-ns0 | 4780 |
    | mecab-ipadic-2.7.0 | 5443 |
    | mecab-ipadic-neologd-0.0.7 | 8904 |
    | mecab-cedict-20220509 | 14362 |
    | mecab-kodic-2.1.1 | 9554 |
    | unidic | 7545 |

    Implement CodeMapper as a 2-byte array.

    Second, I changed CodeMapper::table from Vec<u32> to Vec<u16> and ran the benchmark (see https://github.com/daac-tools/crawdad/commit/f24d38df3bcc5c78fb3ab739900b93a558a8dbcf).

    This modification is minor, but surprisingly it degraded search performance, perhaps due to the cost of casting mapped codes from u16 to u32.
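    A simplified, hypothetical sketch of the change (not the actual crawdad code), showing where the widening cast sits on the lookup path:

```rust
// Hypothetical, simplified sketch: the mapping table stores 2-byte
// codes and widens them to u32 on every lookup.
struct CodeMapper {
    table: Vec<u16>, // was Vec<u32>
}

impl CodeMapper {
    fn get(&self, c: char) -> Option<u32> {
        // The u16 -> u32 widening on this hot path is the suspected
        // cause of the regression.
        self.table.get(c as usize).map(|&code| u32::from(code))
    }
}

fn main() {
    let mapper = CodeMapper { table: vec![0, 7, 42] };
    assert_eq!(mapper.get('\u{0002}'), Some(42));
    assert_eq!(mapper.get('あ'), None); // out of table range
}
```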

    unidic/exact/crawdad/trie                                                                            
                            time:   [11.060 µs 11.096 µs 11.128 µs]
                            change: [+13.713% +14.242% +14.703%] (p = 0.00 < 0.05)
                            Performance has regressed.
    unidic/exact/crawdad/mptrie                                                                            
                            time:   [16.076 µs 16.190 µs 16.293 µs]
                            change: [+4.5033% +5.5107% +6.3737%] (p = 0.00 < 0.05)
                            Performance has regressed.
    

    The memory savings were also modest:

    • Trie: 9.405 MiB -> 9.077 MiB
    • MpTrie: 10.967 MiB -> 10.640 MiB

    Implement the TAIL of MP-trie with 2-byte alignment

    Third, I implemented MpTrie::tails as Vec<u16> to skip unpacking mapped codes from a byte sequence (in addition to the above modification).

    The implementation is the one in this PR. Unfortunately, we did not see any meaningful improvement.

    UniDic

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 22.591 | 10.967 |
    | New | 20.763 | 11.723 |

    Jawiki-titles

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 51.710 | 46.148 |
    | New | 51.579 | 49.414 |

    Zhwiki-titles (Chinese)

    | Version | Exact Match (ns/query) | Memory (MiB) |
    |---|---:|---:|
    | Old | 42.558 | 50.401 |
    | New | 48.005 | 52.440 |

    opened by kampersanda 1
  • Simplify and speed up common prefix searcher

    This PR simplifies the implementation of common prefix search and avoids storing mapped characters in an auxiliary data structure. With this modification, the search speed was improved as follows.

    unidic/cps/crawdad/trie time:   [6.9536 ms 6.9676 ms 6.9839 ms]
                            change: [-14.553% -13.982% -13.439%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 2 outliers among 10 measurements (20.00%)
      2 (20.00%) high mild
    unidic/cps/crawdad/mptrie
                            time:   [8.1081 ms 8.1424 ms 8.1795 ms]
                            change: [-18.109% -17.562% -16.983%] (p = 0.00 < 0.05)
                            Performance has improved.
    Found 1 outliers among 10 measurements (10.00%)
      1 (10.00%) high mild
    

    Discussion

    This modification involves the following trade-offs.

    • Pros:
      • The API becomes simpler.
      • The auxiliary data structure for storing mapped characters is removed.
    • Cons:
      • The same characters in a haystack are mapped repeatedly.
      • Byte positions are omitted from matched results.

    The experimental results indicate that storing and accessing mapped characters in an array is more costly than mapping characters repeatedly. The new, simpler API should also be easier to use, since users no longer need to create an instance to store mapped characters. Byte positions are not needed by all users, so this library need not provide them.
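    For illustration only, here is what a common prefix search computes, written naively over a plain key list; the trie answers the same query in a single scan of the haystack prefix rather than checking every key.

```rust
// Illustrative only: return every key that is a prefix of `haystack`,
// i.e., the matches a common prefix search reports at one position.
fn common_prefix_search<'a>(keys: &[&'a str], haystack: &str) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|&k| haystack.starts_with(k))
        .collect()
}

fn main() {
    let keys = ["東京", "東京都", "京都"];
    // At the start of "東京都庁", both "東京" and "東京都" match.
    assert_eq!(common_prefix_search(&keys, "東京都庁"), vec!["東京", "東京都"]);
}
```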

    opened by kampersanda 1
  • Proposal to change `common_prefix_searcher()` API

    Currently, common_prefix_searcher() takes a slice generated by map_text(). However, there is no mechanism linking the slice to the trie, so users can pass a slice generated by a different trie to common_prefix_searcher():

    // example
    let keys1 = vec!["アイス", "アイスクリーム"];
    let trie1 = MpTrie::from_keys(&keys1).unwrap();
    let keys2 = vec!["東京", "京都"];
    let trie2 = MpTrie::from_keys(&keys2).unwrap();
    
    let mut mapped = vec![];
    trie1.map_text("東京都", &mut mapped);
    
    trie2.common_prefix_searcher(&mapped[1..]);
    

    To avoid this problem, this issue suggests changing common_prefix_searcher() to return an intermediate structure with a mapped text and a reference to the trie:

    impl Trie {
        /// Returns an intermediate structure.
        pub fn common_prefix_searcher(&self) -> CommonPrefixSearcher<'_> {
            CommonPrefixSearcher {
                trie: self,
                mapped_text: vec![],
            }
        }
    }

    pub struct CommonPrefixSearcher<'t> {
        trie: &'t Trie,
        mapped_text: Vec<MappedChar>,
    }

    impl<'t> CommonPrefixSearcher<'t> {
        /// Updates the haystack.
        pub fn set_text(&mut self, text: &str) {
            self.mapped_text.clear();
            ...
        }

        /// Returns an iterator of the common prefix search.
        pub fn search(&self, range: Range<usize>) -> CommonPrefixSearchIter {
            ...
        }
    }

    impl Iterator for CommonPrefixSearchIter {
        ...
    }
    

    Usage:

    let keys = vec!["東京", "京都"];
    let trie = MpTrie::from_keys(&keys).unwrap();
    
    let mut searcher = trie.common_prefix_searcher();
    
    searcher.set_text("東京都");
    searcher.search(1..);
    
    enhancement 
    opened by vbkaisetsu 1
  • Add skipping approach for faster construction

    This PR implements a skipping approach for faster construction.

    On UniDic, the construction times were improved from

    [crawdad/trie]
    heap_bytes: 9795880 bytes, 9.342 MiB
    num_elems: 1138688
    num_vacants: 119677
    vacant_ratio: 0.105
    construction: 7.125 [sec]
    [crawdad/mptrie]
    heap_bytes: 11500152 bytes, 10.967 MiB
    num_elems: 1024000
    num_vacants: 180321
    vacant_ratio: 0.176
    construction: 15.686 [sec]
    

    to

    [crawdad/trie]
    heap_bytes: 9861416 bytes, 9.405 MiB
    num_elems: 1146880
    num_vacants: 127869
    vacant_ratio: 0.111
    construction: 3.333 [sec]
    [crawdad/mptrie]
    heap_bytes: 11500152 bytes, 10.967 MiB
    num_elems: 1024000
    num_vacants: 180321
    vacant_ratio: 0.176
    construction: 4.526 [sec]
    

    In this version, I do not add a function to set Builder::num_free_blocks, to avoid complicating the API (i.e., num_free_blocks is always fixed at 16).
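    The skipping idea can be sketched as follows. This is a conceptual illustration under assumed names (search_start, BLOCK_LEN are hypothetical), not the PR's actual code: when probing for a vacant slot during construction, only the most recent num_free_blocks blocks are scanned, and older, mostly full blocks are skipped entirely.

```rust
// Conceptual sketch of the skipping approach (assumed names).
const NUM_FREE_BLOCKS: usize = 16; // fixed in this version
const BLOCK_LEN: usize = 256;      // elements per block (assumed)

// Index of the first element still probed for a vacant slot; every
// element before it belongs to a skipped (assumed-packed) block.
fn search_start(num_blocks: usize) -> usize {
    num_blocks.saturating_sub(NUM_FREE_BLOCKS) * BLOCK_LEN
}

fn main() {
    // With fewer than 16 blocks, every element is still scanned.
    assert_eq!(search_start(10), 0);
    // With 100 blocks, the first 84 blocks are skipped.
    assert_eq!(search_start(100), 84 * 256);
}
```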

    opened by kampersanda 0
Releases

  • v0.4.0