Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

Overview

triple_accel


Although vectorized SIMD code allows for up to 20-30x speedups over its scalar counterparts, the difficulty of handling platform-dependent SIMD code makes SIMD routines less attractive. The goal of this library is to provide an easy-to-use abstraction over SIMD edit distance routines that falls back to scalar routines if the target CPU architecture is not supported. Additionally, all limitations and tradeoffs of the edit distance routines are stated upfront, so the user knows exactly what to expect. Finally, this library aims to deliver performance boosts on both short and longer strings, so it can be used for a variety of tasks, from bioinformatics to natural language processing.

triple_accel is very lightweight: its only dependencies on other crates are for benchmarking. It can be built on machines whose CPUs lack AVX2 or SSE4.1 support, and it can also run on machines without SIMD support by automatically falling back to scalar alternatives.

Install

Add

triple_accel = "*"

to the [dependencies] section of your Cargo.toml. This library is available on crates.io.

Alternatively, you can clone this repository and run

cargo build --release

In general, for maximum efficiency, use RUSTFLAGS="-C target-cpu=native" if portability is not an issue.

Tests

You can run tests with

cargo test

after cloning the repository.

Continuous integration is used to ensure that the code passes all tests on the latest Linux, Windows, and Mac platforms. Additionally, crate feature flags like jewel-sse, jewel-avx, jewel-8bit, jewel-16bit, and jewel-32bit are used to override the default automatic detection of CPU features, so all code paths can be thoroughly tested in continuous integration. When the debug feature flag is specified, the exact underlying vector type that is used is printed.

Benchmarks

Benchmarks can be run with

cargo bench

Docs

The docs are available on docs.rs. To build them on your machine, run

cargo doc

Features

This library provides routines for both searching for some needle string in a haystack string and calculating the edit distance between two strings. Hamming distance (mismatches only), Levenshtein distance (mismatches + gaps), and restricted Damerau-Levenshtein distance (transpositions + mismatches + gaps) are supported, along with arbitrary edit costs. This library provides a simple interface, in addition to powerful lower-level control over the edit distance calculations.
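To make the cost models concrete, here is a minimal scalar sketch (not triple_accel's actual implementation) of the restricted Damerau-Levenshtein recurrence, where mismatches, gaps, and adjacent transpositions each cost 1:

```rust
// Restricted Damerau-Levenshtein (optimal string alignment) distance:
// mismatches, insertions/deletions, and adjacent transpositions cost 1.
fn rdamerau(a: &[u8], b: &[u8]) -> u32 {
    let (n, m) = (a.len(), b.len());
    let mut dp = vec![vec![0u32; m + 1]; n + 1];
    for i in 0..=n { dp[i][0] = i as u32; } // delete all of a's prefix
    for j in 0..=m { dp[0][j] = j as u32; } // insert all of b's prefix
    for i in 1..=n {
        for j in 1..=m {
            let cost = (a[i - 1] != b[j - 1]) as u32;
            let mut best = (dp[i - 1][j] + 1)      // deletion
                .min(dp[i][j - 1] + 1)             // insertion
                .min(dp[i - 1][j - 1] + cost);     // match or mismatch
            // adjacent transposition, e.g. "cd" vs "dc"
            if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1] {
                best = best.min(dp[i - 2][j - 2] + 1);
            }
            dp[i][j] = best;
        }
    }
    dp[n][m]
}
```

Dropping the transposition case yields plain Levenshtein distance, and restricting to equal-length strings with substitutions only yields Hamming distance.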

At runtime, the implementation for a certain algorithm is selected based on CPU support, going down the list:

  1. Vectorized implementation with 256-bit AVX vectors, if AVX2 is supported.
  2. Vectorized implementation with 128-bit SSE vectors, if SSE4.1 is supported.
  3. Scalar implementation.

Currently, vectorized SIMD implementations are only available for x86 or x86-64 CPUs. However, because the implementation is selected at runtime, the library can be compiled on a machine that supports those SIMD intrinsics and still be used on machines that do not. Additionally, the internal data structure for storing vectors and the bit width of the values in the vectors are selected at runtime for maximum efficiency and accuracy, given the lengths of the input strings.
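The selection logic can be sketched with the standard library's runtime feature detection. The kernel names below are illustrative, not triple_accel's internals, and both SIMD branches call the scalar kernel as a stand-in for real vectorized code:

```rust
// Scalar fallback kernel: count mismatching bytes.
fn hamming_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b.iter()).filter(|(x, y)| x != y).count() as u32
}

// Dispatch sketch: prefer AVX2, then SSE4.1, then scalar.
fn hamming_dispatch(a: &[u8], b: &[u8]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // A real implementation would call a 256-bit AVX2 kernel here.
            return hamming_scalar(a, b);
        }
        if is_x86_feature_detected!("sse4.1") {
            // A real implementation would call a 128-bit SSE4.1 kernel here.
            return hamming_scalar(a, b);
        }
    }
    // Non-x86 targets, or x86 CPUs without SIMD support.
    hamming_scalar(a, b)
}
```

Because the check happens at runtime, the same binary works regardless of which features the host CPU actually supports.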

Limitations

Due to the use of SIMD intrinsics, only binary strings that are represented with u8 bytes are supported. Unicode strings are not currently supported.
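For ASCII input this limitation is mostly harmless, since each character is exactly one byte; a quick illustration of where byte-level comparison diverges from character-level comparison:

```rust
fn main() {
    // ASCII strings map one character to one u8, so as_bytes() works directly.
    let a: &[u8] = "kitten".as_bytes();
    assert_eq!(a.len(), 6);

    // Non-ASCII characters occupy multiple bytes in UTF-8, so a byte-level
    // edit distance no longer counts characters: "é" alone is two bytes.
    assert_eq!("é".as_bytes().len(), 2);
    assert_eq!("é".chars().count(), 1);
}
```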

Examples

triple_accel provides a simple, easy-to-use interface for common edit distance operations. For example, calculating the Hamming distance (number of mismatches) between two strings takes just a few lines:

use triple_accel::*;

let a = b"abcd";
let b = b"abcc";

let dist = hamming(a, b);
assert!(dist == 1);

By default, SIMD will be used if possible. Similarly, we can easily calculate the Levenshtein distance (character mismatches and gaps all have a cost of 1) between two strings with the following code:

use triple_accel::*;

let a = b"abc";
let b = b"abcd";

let dist = levenshtein_exp(a, b);
assert!(dist == 1);

This uses exponential search to estimate the number of edits between a and b, which makes it more efficient than the alternative levenshtein function when the number of edits between a and b is low.
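The exponential search idea can be sketched in scalar code. The helper names below are hypothetical, and a real bounded check would only fill an O(k)-wide band of the DP matrix rather than the whole thing:

```rust
// Scalar Levenshtein distance (full dynamic programming table).
fn levenshtein(a: &[u8], b: &[u8]) -> u32 {
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + (ca != cb) as u32; // match or mismatch
            let del = prev[j + 1] + 1;             // gap in b
            let ins = curr[j] + 1;                 // gap in a
            curr.push(sub.min(del).min(ins));
        }
        prev = curr;
    }
    *prev.last().unwrap()
}

// Bounded check: Some(dist) if dist <= k, else None. (A real bounded
// implementation restricts the DP to a band, so it runs in O(nk) time.)
fn levenshtein_bounded(a: &[u8], b: &[u8], k: u32) -> Option<u32> {
    let d = levenshtein(a, b);
    if d <= k { Some(d) } else { None }
}

// Exponential search: double the bound k until the bounded check succeeds.
// Cheap when the true distance is small, since early rounds do little work.
fn levenshtein_exp_sketch(a: &[u8], b: &[u8]) -> u32 {
    let mut k = 1;
    loop {
        if let Some(d) = levenshtein_bounded(a, b, k) {
            return d;
        }
        k *= 2;
    }
}
```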

In addition to edit distance routines, triple_accel also provides search routines. These routines return an iterator over matches that indicate where the needle string matches the haystack string. triple_accel will attempt to maximize the length of matches that end at the same position and remove shorter matches when some matches fully overlap.

use triple_accel::*;

let needle = b"helllo";
let haystack = b"hello world";

let matches: Vec<Match> = levenshtein_search(needle, haystack).collect();
// note: start index is inclusive, end index is exclusive!
assert!(matches == vec![Match{start: 0, end: 5, k: 1}]);
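The search behavior is easiest to see against a naive baseline. The sketch below (hypothetical helper names, quadratic time, and no pruning of fully overlapping matches across different end positions, unlike the library) reports the best match ending at each position within a bound k:

```rust
// Scalar Levenshtein distance (full dynamic programming table).
fn levenshtein(a: &[u8], b: &[u8]) -> u32 {
    let mut prev: Vec<u32> = (0..=b.len() as u32).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i as u32 + 1];
        for (j, &cb) in b.iter().enumerate() {
            let sub = prev[j] + (ca != cb) as u32;
            curr.push(sub.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    *prev.last().unwrap()
}

// Naive search: for every end position in the haystack, find the
// lowest-cost substring ending there that is within k edits of the needle.
// Returns (start, end, cost) with start inclusive and end exclusive.
fn search_naive(needle: &[u8], haystack: &[u8], k: u32) -> Vec<(usize, usize, u32)> {
    let mut matches = Vec::new();
    for end in 1..=haystack.len() {
        let mut best: Option<(usize, u32)> = None;
        for start in 0..end {
            let d = levenshtein(needle, &haystack[start..end]);
            if d <= k && best.map_or(true, |(_, bd)| d < bd) {
                best = Some((start, d));
            }
        }
        if let Some((start, d)) = best {
            matches.push((start, end, d));
        }
    }
    matches
}
```

The library's vectorized search computes the same kind of result in a single DP pass over the haystack instead of re-running the alignment for every substring.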

Sometimes, it is necessary to use the slightly lower-level, but more powerful, routines that triple_accel provides. For example, it is possible to allow transpositions (character swaps) that have a cost of 1, in addition to mismatches and gaps:

use triple_accel::levenshtein::*;

let a = b"abcd";
let b = b"abdc";
let k = 2; // upper bound on allowed cost
let trace_on = false; // return edit traceback?

let dist = levenshtein_simd_k_with_opts(a, b, k, trace_on, RDAMERAU_COSTS);
// note: dist may be None if a and b do not match within a cost of k
assert!(dist.unwrap().0 == 1);

Don't let the name of the function fool you! levenshtein_simd_k_with_opts will still fall back to the scalar implementation if AVX2 or SSE4.1 support is not available. It just prefers to use SIMD where possible.

For most common cases, the re-exported functions are enough, and the low level functions do not have to be used directly.

License

MIT

Contributing

Read the contributing guidelines here.

Code of Conduct

Read the code of conduct here.

Why the name "triple_accel"?

Because "Time Altar - Triple Accel" is a magical ability used by Kiritsugu Emiya to boost his speed and reaction time in Fate/Zero. There are also some other references to the Fate series...

Comments
  • add &str support


If I am not totally wrong, &str can be supported quite easily. As the algorithm only ever checks for equality, the input type can be any T: PartialEq.

    opened by s3bk 7
  • Implement Jaccard index?


    Hi @Daniel-Liu-c0deb0t!

I didn't dig down too much into the codebase, but I wonder if there is interest in adding a Jaccard index function (as described in Faster Population Counts Using AVX2 Instructions). The implementation from the paper is at https://github.com/CountOnes/hamming_weight/blob/1dd7554c0fc39e01c9d7fa54372fd4eccf458875/src/avx_jaccard_index.c and while it is a bit heavy on the macros, I think it could be translated using the infra you already have in triple_accel?

    (This would be incredibly useful for me, so I can start looking deeper into it if you're also interested =])

    opened by luizirber 5
  • Disable Debug Output


Hi, I'm pretty new to Rust so I'm not sure if there is a general way, but is it possible to disable the debug output (e.g. "Debug: Levenshtein Jewel vector type Avx1x32x8 for target "avx2".")? I want to still debug other parts of the code, but I don't need the debug output of this library each time I use it.

    Best, Nicolas

    opened by ncioj10 4
  • Feature request: bounded hamming distance


I'm working on an application in which I'll be computing the Hamming distance between two strings but only care if it's below k. There's a bounded Levenshtein method but no corresponding Hamming function. Is it possible to get a bounded Hamming distance?

    opened by Benjamin-Lee 2
  • Custom comparison function


    Forgive me if this is a silly question, but do you think it would be possible to provide an alternate hamming distance function that can take a custom comparison function?

    For instance, in bioinformatics, if I want the hamming distance between two sequences, but I want to ignore Ns.

    For example

    fn hamming(a: &[u8], b: &[u8], f: &dyn Fn(u8, u8) -> u64) -> u64 {
        a.iter().zip(b).fold(0, |acc, (x, y)| acc + f(*x, *y))
    }
    
    fn main() {
        let s1 = b"ACGT";
        let s2 = b"ANCT";
        fn dist(a: u8, b: u8) -> u64 { (a != b'N' && b != b'N' && a != b) as u64 }
        
        assert_eq!(dist(s1[2], s2[2]), 1);
        assert_eq!(dist(s1[0], s2[0]), 0);
        assert_eq!(dist(s1[1], s2[1]), 0);
        
        assert_eq!(hamming(s1, s2, &dist), 1);
    }
    
    opened by mbhall88 2
  • Could not compile `triple_accel`


    System information:

    Linux tianyi-optiplex3070 5.6.15-1-MANJARO #1 SMP PREEMPT Wed May 27 20:38:56 UTC 2020 x86_64 GNU/Linux
    

    Cargo version:

    cargo 1.48.0-nightly (875e01232 2020-09-08)
    

    Error message:

    error: attribute should be applied to a function
        --> /home/tianyi/.cargo/registry/src/mirrors.ustc.edu.cn-61ef6e0cd06fb9b8/triple_accel-0.3.2/src/jewel.rs:1500:21
         |
    1499 | /                 unsafe {
    1500 | |                     #![target_feature(enable = "sse4.1")]
         | |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    1501 | |                     write!(f, "[")?;
    1502 | |
    ...    |
    1528 | |                     write!(f, "]")
    1529 | |                 }
         | |_________________- not a function
    ...
    1540 |   create_sse_nx16x8!(Sse16x16x8, 16);
         |   ----------------------------------- in this macro invocation
         |
         = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
    
    error: aborting due to 13 previous errors
    
    error: could not compile `triple_accel`
    
    Caused by:
      process didn't exit successfully: `rustc --crate-name triple_accel --edition=2018 /home/tianyi/.cargo/registry/src/mirrors.ustc.edu.cn-61ef6e0cd06fb9b8/triple_accel-0.3.2/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi --crate-type lib --emit=dep-info,metadata,link -C embed-bitcode=no -C debuginfo=2 --cfg 'feature="default"' --cfg 'feature="jewel-16bit"' --cfg 'feature="jewel-32bit"' --cfg 'feature="jewel-8bit"' --cfg 'feature="jewel-avx"' --cfg 'feature="jewel-sse"' -C metadata=ea803cb5cc9b7c41 -C extra-filename=-ea803cb5cc9b7c41 --out-dir /home/tianyi/a/target/debug/deps -L dependency=/home/tianyi/a/target/debug/deps --cap-lints allow` (exit code: 1)
    

    (multiples errors of the same kind, all saying attribute should be applied to a function)

    opened by TianyiShi2001 2
  • Add exponential search with options


This allows doing exponential search for cost models other than Levenshtein, and allows returning a trace from the exponential search version.

    Also includes:

    • auto formatting
    • some small cleanups / simplifications. Let me know if I should revert those
    opened by RagnarGrootKoerkamp 0
  • Traceback inaccurate with gap open costs


Need to keep track of traceback for the gap matrices (whether continuing or opening a gap). Currently, following gap transitions just ends up on an entry in the dp matrix (not the gap matrices), even if it's continuing a gap.

    opened by Daniel-Liu-c0deb0t 0
  • Include benchmarks in README.md


    It would be helpful to include sample benchmark results (probably as X times speedup at each size, maybe in graph format?) in the README so that users could have context for what performance gains can be expected (and whether everything performs as expected).

If I have time, I can compile some results, but I don't know if there's a convenient criterion API to generate such a table automatically (it does produce JSON files, which could be tabulated using a script).

    opened by philippeitis 0
Owner

Daniel Liu