🐎 Daac Horse: Double-Array Aho-Corasick in Rust

Overview

🐎 daachorse

Daac Horse: Double-Array Aho-Corasick

Crates.io Documentation

Overview

A fast implementation of the Aho-Corasick algorithm using Double-Array Trie.

Examples

use daachorse::DoubleArrayAhoCorasick;

let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_overlapping_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.pattern()));

let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.pattern()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.pattern()));

assert_eq!(None, it.next());

Disclaimer

This software is developed by LegalForce, Inc., but not an officially supported LegalForce product.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Comments
  • Memory efficient management of outputs

    Memory efficient management of outputs

    Daachorse handles a value-length pair as a pattern's output (see https://github.com/daac-tools/daachorse/blob/main/src/lib.rs#L289-L292)

    In the current implementation, 31 bits and 32 bits are assigned to length and value, respectively. (1 bit is used for the flag,) But, in many cases, the assignment is too rich. For example, when the maximum length is 255, 1 byte is sufficient to represent.

    If we know the maximum length and value, we can memory-efficiently store members on byte-aligned memory. For example, if a length is represented in 1 byte and a value (with flag) is represented in 3 bytes, we can interleave them in a byte array outputs as follows.

    outputs[0] = length 1
    outputs[1] = value 1
    outputs[2] = value 1
    outputs[3] = value 1
    outputs[4] = length 2
    outputs[5] = value 2
    outputs[6] = value 2
    outputs[7] = value 2
    ...
    
    enhancement 
    opened by kampersanda 5
  • Add serialization and deserialization functions

    Add serialization and deserialization functions

    This branch adds the following serialization and deserialization functions for convenience.

    • serialize()
    • serialize_to_vec()
    • deserialize_unchecked()
    • deserialize_from_slice_unchecked()

    Deserialization functions do not validate the given automaton, so these are defined as unsafe functions. It means that the deserialization should only be used when the source of the data is completely trusted.

    serialize() and deserialize_unchecked() use traits under std::io, so these functions are unavailable in the no_std environment. For bare-metal programming, they must use serialize_to_vec() and deserialize_from_slice_unchecked() instead.

    opened by vbkaisetsu 2
  • Remove INVALID_BASE and use Option<NonZeroU32> for base

    Remove INVALID_BASE and use Option for base

    I am not sure if the following part is correct, could you check carefully? https://github.com/daac-tools/daachorse/blob/4794bcc2d3525dfcb92c93c54ff0a4185c0e7c8e/src/charwise/builder.rs#L324-L327

    opened by vbkaisetsu 1
  • Check overflow in BuildHelper

    Check overflow in BuildHelper

    If users specify the large number to num_free_blocks, the capacity size will be wrapped. This branch checks the value before calculating the capacity size.

    Blocked by #52

    opened by vbkaisetsu 1
  • Generalize the auxiliary data structure in double-array construction

    Generalize the auxiliary data structure in double-array construction

    This PR divided Extra from builder.rs and generalized the auxiliary data structure in double-array construction. BuildHelper can be used in both the cases of bytewise and charwise, although I will modify charwise/builder.rs in another PR.

    opened by kampersanda 1
  • Add `n_free_blocks()` option

    Add `n_free_blocks()` option

    This branch removes the following constants:

    • INIT_CAPACITY: states is initialized by with_capacity() with this value, but it is often reallocated for a large automaton. On the other hand, this capacity is too large for small devices.
    • FREE_BLOCKS and FREE_STATES: Some small devices do not have enough memory, so I added an option to change these parameters.
    opened by vbkaisetsu 1
  • Refactor const params

    Refactor const params

    I moved constant parameters used only in builder.rs to the file. And, FAIL_INVALID is replaced into FAIL_MAX because a default fail value can be set to zero and the constant parameter will be used only for indicating the maximum possible value.

    opened by kampersanda 1
  • Rename some functions following API guidelines

    Rename some functions following API guidelines

    This branch renames some functions as follows:

    • Removes the prefix get_.
    • Removes the prefix set_ from zero-cost functions, adds the suffix _mut, and changes to return the mutable reference.

    For some structures, leave set_ for consistency.

    opened by vbkaisetsu 0
  • Enable sparse-registry on nightly Cargo

    Enable sparse-registry on nightly Cargo

    This branch enables sparse-registry on Nightly Cargo.

    The build time on Nightly is reduced as follows: https://github.com/daac-tools/daachorse/actions/runs/2660385065

    opened by vbkaisetsu 0
Releases(v1.0.0)
Owner
null
A full-text search and indexing server written in Rust.

Bayard Bayard is a full-text search and indexing server written in Rust built on top of Tantivy that implements Raft Consensus Algorithm and gRPC. Ach

Bayard Search 1.8k Dec 26, 2022
AI-powered search engine for Rust

txtai: AI-powered search engine for Rust txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform simi

NeuML 69 Jan 2, 2023
A full-text search engine in rust

Toshi A Full-Text Search Engine in Rust Please note that this is far from production ready, also Toshi is still under active development, I'm just slo

Toshi Search 3.8k Jan 7, 2023
Official Elasticsearch Rust Client

elasticsearch   Official Rust Client for Elasticsearch. Full documentation is available at https://docs.rs/elasticsearch The project is still very muc

elastic 574 Jan 5, 2023
Strongly typed Elasticsearch DSL written in Rust

Strongly typed Elasticsearch DSL written in Rust This is an unofficial library and doesn't yet support all the DSL, it's still work in progress. Featu

null 173 Jan 6, 2023
Searching for plain-text files for lines that match a given string. Built with Rust.

Getting Started This is a minimal grep command-line utility built on Rust. It provides searching for plain-text files for lines that match a given str

Harsh Karande 0 Dec 31, 2021
A Rust API search engine

Roogle Roogle is a Rust API search engine, which allows you to search functions by names and type signatures. Progress Available Queries Function quer

Roogle 342 Dec 26, 2022
Easy c̵̰͠r̵̛̠ö̴̪s̶̩̒s̵̭̀-t̶̲͝h̶̯̚r̵̺͐e̷̖̽ḁ̴̍d̶̖̔ ȓ̵͙ė̶͎ḟ̴͙e̸̖͛r̶̖͗ë̶̱́ṉ̵̒ĉ̷̥e̷͚̍ s̷̹͌h̷̲̉a̵̭͋r̷̫̊ḭ̵̊n̷̬͂g̵̦̃ f̶̻̊ơ̵̜ṟ̸̈́ R̵̞̋ù̵̺s̷̖̅ţ̸͗!̸̼͋

Rust S̵̓i̸̓n̵̉ I̴n̴f̶e̸r̵n̷a̴l mutability! Howdy, friendly Rust developer! Ever had a value get m̵̯̅ð̶͊v̴̮̾ê̴̼͘d away right under your nose just when

null 294 Dec 23, 2022
Message authentication code algorithms written in pure Rust

RustCrypto: Message Authentication Codes Collection of Message Authentication Code (MAC) algorithms written in pure Rust. Supported Algorithms Algorit

Rust Crypto 155 Dec 31, 2022
Experimenting with Rust implementations of IP address lookup algorithms.

This repository contains some very rough experimentation with Rust implementations of IP address lookup algorithms, both my own and comparison with th

Ximon Eighteen 0 Jan 10, 2022
An example of web application by using Rust and Axum with Clean Architecture.

stock-metrics Stock price and stats viewer. Getting Started Middleware Launch the middleware by executing docker compose: cd local-middleware docker c

Yuki Toyoda 62 Dec 10, 2022
2 and 3-dimensional collision detection library in Rust.

2D Documentation | 3D Documentation | User Guide | Forum ⚠️ **This crate is now passively-maintained. It is being superseded by the Parry project.** ⚠

dimforge 914 Dec 24, 2022
Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

Yuxiang Liu 5 Mar 9, 2022
Image search example by approximate nearest-neighbor library In Rust

rust-ann-search-example Image search example by approximate nearest-neighbor library In Rust use - tensorflow 0.17.0 - pretrain ResNet50 - hora (Ru

vaaaaanquish 8 Jan 3, 2022
Simplified Find command made with Rust.

Hunt Hunt is a (highly-opinionated) simplified Find command made with Rust. It searches a file/folder by name on the entire drive. If the --first flag

LyonSyonII 65 Jan 4, 2023
Tantivy is a full text search engine library written in Rust.

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

Quickwit OSS 7.4k Dec 30, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

Quickwit OSS 7.5k Jan 9, 2023
ik-analyzer for rust; chinese tokenizer for tantivy

ik-rs ik-analyzer for Rust support Tantivy Usage Chinese Segment let mut ik = IKSegmenter::new(); let text = "中华人民共和国"; let tokens = ik.to

Shen Yanchao 4 Dec 26, 2022
Python bindings for Milli, the embeddable Rust-based search engine powering Meilisearch

milli-py Python bindings for Milli, the embeddable Rust-based search engine powering Meilisearch. Due to limitations around Rust lifecycles, methods a

Alexandro Sanchez 92 Feb 21, 2023