A fast implementation of Aho-Corasick in Rust.

Overview

aho-corasick

A library for finding occurrences of many patterns at once with SIMD acceleration in some cases. This library provides multiple pattern search principally through an implementation of the Aho-Corasick algorithm, which builds a finite state machine for executing searches in linear time. Features include case insensitive matching, overlapping matches, fast searching via SIMD, optional full DFA construction, and search & replace in streams.

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/aho-corasick

Usage

Add this to your Cargo.toml:

[dependencies]
aho-corasick = "0.7"

and this to your crate root (if you're using Rust 2015):

extern crate aho_corasick;

Example: basic searching

This example shows how to search for occurrences of multiple patterns simultaneously. Each match includes the pattern that matched along with the byte offsets of the match.

use aho_corasick::AhoCorasick;

let patterns = &["apple", "maple", "Snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";

let ac = AhoCorasick::new(patterns);
let mut matches = vec![];
for mat in ac.find_iter(haystack) {
    matches.push((mat.pattern(), mat.start(), mat.end()));
}
assert_eq!(matches, vec![
    (1, 13, 18),
    (0, 28, 33),
    (2, 43, 50),
]);
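
The overview also lists overlapping matches as a feature. Here is a minimal sketch using find_overlapping_iter with the same patterns and haystack; the extra match is the apple hidden inside Snapple:

use aho_corasick::AhoCorasick;

let patterns = &["apple", "maple", "Snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";

let ac = AhoCorasick::new(patterns);
let matches: Vec<(usize, usize, usize)> = ac
    .find_overlapping_iter(haystack)
    .map(|mat| (mat.pattern(), mat.start(), mat.end()))
    .collect();
// Four matches rather than three: `apple` also occurs inside `Snapple`.
assert_eq!(4, matches.len());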

Example: case insensitivity

This is like the previous example, but matches Snapple case insensitively using AhoCorasickBuilder:

use aho_corasick::AhoCorasickBuilder;

let patterns = &["apple", "maple", "snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";

let ac = AhoCorasickBuilder::new()
    .ascii_case_insensitive(true)
    .build(patterns);
let mut matches = vec![];
for mat in ac.find_iter(haystack) {
    matches.push((mat.pattern(), mat.start(), mat.end()));
}
assert_eq!(matches, vec![
    (1, 13, 18),
    (0, 28, 33),
    (2, 43, 50),
]);

Example: replacing matches in a stream

This example shows how to execute a search and replace on a stream without loading the entire stream into memory first.

use aho_corasick::AhoCorasick;

let patterns = &["fox", "brown", "quick"];
let replace_with = &["sloth", "grey", "slow"];

// In a real example, these might be `std::fs::File`s instead. All you need to
// do is supply a pair of `std::io::Read` and `std::io::Write` implementations.
let rdr = "The quick brown fox.";
let mut wtr = vec![];

let ac = AhoCorasick::new(patterns);
ac.stream_replace_all(rdr.as_bytes(), &mut wtr, replace_with)
    .expect("stream_replace_all failed");
assert_eq!(b"The slow grey sloth.".to_vec(), wtr);

Example: finding the leftmost first match

In the textbook description of Aho-Corasick, its formulation is typically structured such that it reports all possible matches, even when they overlap with one another. In many cases, overlapping matches may not be desired, such as when finding all successive non-overlapping matches, like you might with a standard regular expression.

Unfortunately the "obvious" way to modify the Aho-Corasick algorithm to do this doesn't always work in the expected way, since it will report matches as soon as they are seen. For example, consider matching the regex Samwise|Sam against the text Samwise. Most regex engines (that are Perl-like, or non-POSIX) will report Samwise as a match, but the standard Aho-Corasick algorithm modified for reporting non-overlapping matches will report Sam.

A novel contribution of this library is the ability to change the match semantics of Aho-Corasick (without additional search time overhead) such that Samwise is reported instead. For example, here's the standard approach:

use aho_corasick::AhoCorasick;

let patterns = &["Samwise", "Sam"];
let haystack = "Samwise";

let ac = AhoCorasick::new(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("Sam", &haystack[mat.start()..mat.end()]);

And now here's the leftmost-first version, which matches how a Perl-like regex will work:

use aho_corasick::{AhoCorasickBuilder, MatchKind};

let patterns = &["Samwise", "Sam"];
let haystack = "Samwise";

let ac = AhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostFirst)
    .build(patterns);
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);

In addition to leftmost-first semantics, this library also supports leftmost-longest semantics, which match the POSIX behavior of a regular expression alternation. See MatchKind in the docs for more details.
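
As a small sketch of the difference, flipping the pattern order from the example above shows where the two semantics diverge: leftmost-first would report Sam here, while leftmost-longest reports Samwise.

use aho_corasick::{AhoCorasickBuilder, MatchKind};

let patterns = &["Sam", "Samwise"];
let haystack = "Samwise";

let ac = AhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostLongest)
    .build(patterns);
let mat = ac.find(haystack).expect("should have a match");
// Leftmost-longest prefers the longest match starting at the leftmost
// position, so `Samwise` wins even though `Sam` comes first in the list.
assert_eq!("Samwise", &haystack[mat.start()..mat.end()]);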

Minimum Rust version policy

This crate's minimum supported rustc version is 1.28.0.

The current policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if crate 1.0 requires Rust 1.20.0, then crate 1.0.z for all values of z will also require Rust 1.20.0 or newer. However, crate 1.y for y > 0 may require a newer minimum version of Rust.

In general, this crate will be conservative with respect to the minimum supported version of Rust.

Future work

Here are some plans for the future:

  • Assuming the current API is sufficient, I'd like to commit to it and release a 1.0 version of this crate some time in the next 6-12 months.
  • Support stream searching with leftmost match semantics. Currently, only standard match semantics are supported. Getting this right seems possible, but is tricky since the match state needs to be propagated through multiple searches. (With standard semantics, as soon as a match is seen the search ends.)

Comments
  • Substring Value

    Hi, how do I use this crate to return a string value, so that "I like maple apples." would return "I like maple" for pattern: 0, start: 0, end: 11?

    opened by AlexCreamer 14
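
    A minimal sketch of one way to recover the matched text: slice the haystack with the reported byte offsets (the pattern and haystack below are illustrative assumptions, not the exact data from the question):

    use aho_corasick::AhoCorasick;

    let patterns = &["maple"];
    let haystack = "I like maple apples.";

    let ac = AhoCorasick::new(patterns);
    for mat in ac.find_iter(haystack) {
        // The matched text is simply the haystack slice at the match offsets.
        let matched = &haystack[mat.start()..mat.end()];
        assert_eq!("maple", matched);
    }
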
  • Discussion: Location for Python wrapper

    Hi,

    I think we briefly talked on Hacker News about a preliminary experiment I did wrapping your library for Python. After further testing, it appears meaningfully faster than existing alternatives (pyahocorasick, cyac), so I am going to be turning this into a real Python package.

    There are two ways this could work:

    • Separate repository.
    • It's included in this repository. This is the approach taken by https://github.com/ritchie46/polars, which has both the Rust package and the corresponding Python wrapper in the same repository.

    In either case I would write the code, so not asking for any of your effort in the first pass.

    The benefits of the latter are that things like new releases of the Rust package could (probably?) automatically trigger Python releases with the latest Rust version. The downsides to you are that the Python package is now on your critical path to releases etc., so plausibly it will eventually cause extra work for you in the future.

    What do you prefer?

    opened by itamarst 13
  • Error "use of undeclared trait name `IntoIterator`" when building.

    I'm building rusti, which is based on this project, and got this error preventing me from installing it. It happens with the 0.3.0 version; the whole log is listed below:

    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:208:22: 208:42 error: use of undeclared trait name `IntoIterator`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:208             where I: IntoIterator<Item=P> {
                                                                                                                             ^~~~~~~~~~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:221:22: 221:42 error: use of undeclared trait name `IntoIterator`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:221             where I: IntoIterator<Item=P> {
                                                                                                                             ^~~~~~~~~~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:467:55: 467:75 error: use of undeclared trait name `IntoIterator`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/aho-corasick-0.3.0/src/lib.rs:467     fn from_iter<T>(it: T) -> AcAutomaton<S> where T: IntoIterator<Item=S> {
                                                                                                                                                              ^~~~~~~~~~~~~~~~~~~~
    error: aborting due to 3 previous errors
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119:9: 119:21 error: failed to resolve. Use of undeclared type or module `Self`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119         Self::new_in(&env::temp_dir())
                                                                                                              ^~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119:9: 119:21 error: unresolved name `Self::new_in`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/named.rs:119         Self::new_in(&env::temp_dir())
                                                                                                              ^~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47:9: 47:21 error: failed to resolve. Use of undeclared type or module `Self`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47         Self::new_in(&env::temp_dir())
                                                                                                               ^~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47:9: 47:21 error: unresolved name `Self::new_in`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:47         Self::new_in(&env::temp_dir())
                                                                                                               ^~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65:9: 65:24 errorBuild failed, waiting for other jobs to finish...
    : failed to resolve. Use of undeclared type or module `Self`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65         Self::shared_in(&env::temp_dir(), count)
                                                                                                               ^~~~~~~~~~~~~~~
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65:9: 65:24 error: unresolved name `Self::shared_in`
    /Users/windrunner/.cargo/registry/src/github.com-1ecc6299db9ec823/tempfile-1.1.0/src/unnamed.rs:65         Self::shared_in(&env::temp_dir(), count)
                                                                                                               ^~~~~~~~~~~~~~~
    error: aborting due to 6 previous errors
    Could not compile `aho-corasick`.
    
    To learn more, run the command again with --verbose.
    

    The relevant code is here: https://github.com/BurntSushi/aho-corasick/blob/master/src%2Flib.rs#L208

    opened by kxxoling 13
  • why does this library return byte offsets instead of codepoint offsets?

    I'm using your package to match some patterns against long pieces of text, sometimes containing non-ASCII characters. When I investigate the matches I find that the offsets correspond to byte-offsets, not character offsets. This is confusing. It would be great to get character offsets instead, or as an option.

    character at position 36 is code point 8217 in this example: The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll

    So a match searching for eBook correctly returns 22 as the offset, while a match searching for Adventures returns 41, though it should be 39.

    opened by pavelgrib 12
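
    As a rough sketch, a byte offset can be converted to a code point offset by counting the chars that precede the match (using the title text quoted above; the ’ is a 3-byte code point, which is why the two offsets diverge):

    use aho_corasick::AhoCorasick;

    let patterns = &["Adventures"];
    let haystack = "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll";

    let ac = AhoCorasick::new(patterns);
    let mat = ac.find(haystack).expect("should have a match");
    // Count the characters before the match to get a code point offset.
    let char_offset = haystack[..mat.start()].chars().count();
    assert_eq!(41, mat.start()); // byte offset
    assert_eq!(39, char_offset);  // code point offset
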
  • Motion to grant V1.0.0 status

    I move that you version the current (or next) release as version 1.0.0. Recall the Semantic Versioning spec:

    1. Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

    2. Version 1.0.0 defines the public API. The way in which the version number is incremented after this release is dependent on this public API and how it changes.

    Commits to this repo are now sparse. The last commit was 7 months ago. This is strong evidence of its API stability. Indeed, at this point, given the importance of this crate and how widespread its use is, changing the public API at any time is impractical anyway. The crate is functionally at v1. Officially bumping the version would just align the version number to its de facto use.

    opened by rljacobson 9
  • perf: Skip the first goto call in full construction (-35%)

    • perf: Skip the first goto call in full construction (-35%)

    This is probably less significant when a larger number of needles is used, but it should still be a decent speed improvement.

     name                     old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
     bench_construction       131,148      122,956            -8,192   -6.25%   x 1.07
     bench_full_construction  309,493      200,595          -108,898  -35.19%   x 1.54
    
    • perf: Use for_each_transition in AcAutomaton::fill (-84%)
     name                     old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
     bench_construction       167,415      25,804           -141,611  -84.59%   x 6.49
     bench_full_construction  275,170      133,253          -141,917  -51.57%   x 2.07
    

    (The two benchmarks were done on different computers)

    opened by Marwes 8
  • Add support for async/non-blocking

    Unless I'm missing something, aho-corasick either requires the entire haystack, or a Read on which it blocks.

    It would be nice if you could feed it bytes as they come in.

    opened by jakajancar 7
  • next_state?

    We were using (abusing?) v0.6's exposure of next_state, get_match, and has_match in order to run on streams (running over a chunk/frame at a time). Can this API be brought back?

    opened by Triften 6
  • RFC: remove configuration knobs

    This RFC is basically a mirror of https://github.com/BurntSushi/regex-automata/issues/7

    The idea here is that I would like to make premultiply=true, byte_classes=true be the only option that one can choose for DFAs. (It is currently the default for DFAs.) I think it was a mistake to expose these knobs, as disabling them doesn't really confer benefits (if any). But they are quite costly in terms of code size and clarity.

    Are there any strenuous objections to removing these options?

    opened by BurntSushi 6
  • RareBytesTwo prefilter could cause false negative matches

    Hi,

    We recently got a user report about a path matching correctness issue in our library using globset. I have bisected the issue to https://github.com/BurntSushi/aho-corasick/commit/063ca0d253b1cfcef9f6b5533bee0d1fd2737c33. A test case that reproduces it can be found at https://gist.github.com/quark-zju/600f9f869f83384b87c66a4408c78bef#file-lib-rs-L523. I wonder if you have any quick insights about what went wrong in this area. Meanwhile, I'm still trying to understand the related code and further minimize the test.

    opened by quark-zju 6
  • Add experimental feature flagged no_std support

    What

    The addition of feature flags, optional dependencies, and conditional compilation blocks to make aho-corasick work in either std or no_std + alloc environments.

    The default-on "use_std" feature flag controls whether or not to compile in #![no_std] mode.

    The alloc feature flag must be specified and the toolchain fixed to nightly-2018-03-07 for the no_std experimental mode to operate, due to the rather unpleasant pinning of core_io to particular versions.

    How

    To build for std, nothing changes. cargo build

    To build for no_std:

    cargo build --no-default-features --features "alloc"
    

    The test suites pass as normal in either mode, with the exception of one test which relies on HashSet (not readily available in alloc) and one doc-test that makes use of the filesystem.

    Why

    This PR is part of a thought experiment regarding what it will take to get regex and its dependencies to operate in no_std + alloc mode. In particular, it is aimed at prompting discussion about whether the general approach is acceptable and feeling out options for the "io in no_std" situation. The option used in this experiment thus far, https://github.com/jethrogb/rust-core_io , is thoroughly unofficial and, as mentioned, is both unfortunately version-pinned and somewhat annoying to interoperate with.

    opened by ZackPierce 6
  • For Python wrapper, is it worth switching to new NFA, or sticking to DFA?

    I explicitly chose DFA for the previous version with the goal of maximum runtime performance, since that's the typical Python tradeoff for bulk data processing: live with slow setup in return for fast bulk operations.

    It sounds like the new contiguous NFA is a lot faster in 1.0, though. What do you think?

    opened by itamarst 2