🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. (Python wrapper for daachorse)

Overview

python-daachorse

daachorse is a fast implementation of the Aho-Corasick algorithm using the compact double-array data structure. This package provides its Python wrapper.

Installation

To install daachorse, run the following command:

$ pip install daachorse

Example usage

Daachorse provides several search options, ranging from standard Aho-Corasick matching to trickier variants such as leftmost-longest and leftmost-first matching. All of them run fast thanks to the double-array data structure and can be plugged into your application as shown below.

Finding overlapped occurrences

To find all occurrences of the registered patterns, including ones that overlap in the input text, use find_overlapping(). When you instantiate a new automaton, each pattern is assigned a unique identifier in input order. Each match is reported as a tuple of the start position, the end position (exclusive), and the pattern identifier, with positions counted in characters.

>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.Automaton(patterns)
>>> pma.find_overlapping('abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
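
The result format can be cross-checked with a brute-force scan. The following is a minimal pure-Python sketch, not how daachorse works internally: naive_find_overlapping is a hypothetical helper written for illustration, and the order in which it reports matches that end at the same position is an assumption.

```python
def naive_find_overlapping(patterns, text):
    """Brute-force reference: report every occurrence of every pattern.

    Returns (start, end, pattern_id) tuples, ordered by end position.
    This is NOT the double-array automaton; it only mirrors the output shape.
    """
    matches = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            for pattern_id, pattern in enumerate(patterns):
                # Pattern IDs follow the input order of the pattern list.
                if text[start:end] == pattern:
                    matches.append((start, end, pattern_id))
    return matches


print(naive_find_overlapping(['bcd', 'ab', 'a'], 'abcd'))
# [(0, 1, 2), (0, 2, 1), (1, 4, 0)]
```

This scan compares every substring against every pattern, so it is only suitable for checking small inputs.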

Finding non-overlapped occurrences with standard matching

If you do not want to allow positional overlap, use find() instead. It performs the search on the Aho-Corasick automaton and, in each iteration, reports the pattern that is found first.

>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.Automaton(patterns)
>>> pma.find('abcd')
[(0, 1, 2), (1, 4, 0)]
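
One way to read this result: the match that ends earliest is reported, and the search resumes from the end of that match. The sketch below assumes exactly that behaviour and reproduces the output above with a brute-force scan. Both naive_find and first_match_from are hypothetical helpers, not part of the daachorse API.

```python
def first_match_from(patterns, text, pos):
    """Return the match with the smallest end position that starts at or after pos."""
    for end in range(pos + 1, len(text) + 1):
        for start in range(pos, end):
            for pattern_id, pattern in enumerate(patterns):
                if text[start:end] == pattern:
                    return (start, end, pattern_id)
    return None


def naive_find(patterns, text):
    """Brute-force reference for non-overlapping standard matching (assumed semantics)."""
    matches = []
    pos = 0
    while pos < len(text):
        match = first_match_from(patterns, text, pos)
        if match is None:
            break
        matches.append(match)
        pos = match[1]  # resume right after the reported match
    return matches


print(naive_find(['bcd', 'ab', 'a'], 'abcd'))
# [(0, 1, 2), (1, 4, 0)]
```

Here 'a' is reported before 'ab' because it ends first, which is why 'ab' never appears in the result.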

Finding non-overlapped occurrences with longest matching

If you want to find the longest pattern without positional overlap in each iteration, pass MATCH_KIND_LEFTMOST_LONGEST when constructing the automaton.

>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.Automaton(patterns, daachorse.MATCH_KIND_LEFTMOST_LONGEST)
>>> pma.find('abcd')
[(0, 4, 2)]
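
The leftmost-longest rule can be sketched as: at the leftmost position where any pattern matches, pick the longest one, then continue after it. The code below is an illustrative brute-force version under that assumption. naive_leftmost_longest is a hypothetical name, and the real library uses the double-array automaton instead of this scan.

```python
def naive_leftmost_longest(patterns, text):
    """Brute-force reference for leftmost-longest matching (assumed semantics)."""
    matches = []
    pos = 0
    while pos < len(text):
        # Collect (pattern_id, pattern) pairs that match exactly at pos.
        candidates = [
            (pattern_id, pattern)
            for pattern_id, pattern in enumerate(patterns)
            if pattern and text.startswith(pattern, pos)
        ]
        if not candidates:
            pos += 1  # nothing starts here; try the next character
            continue
        # Among the patterns starting at pos, prefer the longest one.
        pattern_id, pattern = max(candidates, key=lambda c: len(c[1]))
        matches.append((pos, pos + len(pattern), pattern_id))
        pos += len(pattern)
    return matches


print(naive_leftmost_longest(['ab', 'a', 'abcd'], 'abcd'))
# [(0, 4, 2)]
```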

Finding non-overlapped occurrences with leftmost-first matching

If you want to find the earliest registered pattern among those starting at the search position, use MATCH_KIND_LEFTMOST_FIRST.

This is the so-called leftmost-first match, a somewhat tricky search option. For example, in the following code, ab is reported because it is the earliest registered pattern.

>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.Automaton(patterns, daachorse.MATCH_KIND_LEFTMOST_FIRST)
>>> pma.find('abcd')
[(0, 2, 0)]
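
To see why registration order matters here, the rule can be sketched as a brute-force scan: at the leftmost position where any pattern matches, the pattern with the smallest identifier wins. naive_leftmost_first is a hypothetical helper written under that assumption and is not part of daachorse.

```python
def naive_leftmost_first(patterns, text):
    """Brute-force reference for leftmost-first matching (assumed semantics)."""
    matches = []
    pos = 0
    while pos < len(text):
        # Patterns are tried in registration order, so the first hit
        # is the earliest registered pattern that starts at pos.
        for pattern_id, pattern in enumerate(patterns):
            if pattern and text.startswith(pattern, pos):
                matches.append((pos, pos + len(pattern), pattern_id))
                pos += len(pattern)
                break
        else:
            pos += 1  # no pattern starts here; move on
    return matches


print(naive_leftmost_first(['ab', 'a', 'abcd'], 'abcd'))
# [(0, 2, 0)]
print(naive_leftmost_first(['abcd', 'ab', 'a'], 'abcd'))
# [(0, 4, 0)]
```

Registering 'abcd' first changes the result, which is the practical difference from leftmost-longest matching.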

License

Licensed under either of

at your option.

For software under tests/data, follow the license terms of each piece of software.

Releases(v0.1.6)
Owner
Koichi Akabe