The Ergex Regular Expression Library

Introduction

Ergex is a regular expression library that does a few things rather differently from many other libraries. None of the others quite did what I wanted, so I created ergex.

It was designed to be particularly well suited to analyzing network data and to working with highly fragmented, very large documents held in distributed storage.

Additionally, the ShrinkSet and push-oriented implementation of Aho-Corasick are probably worth looking at on their own.

Features

Ergex is a regular expression library with a few overarching design goals:

Push-Oriented

Unlike most regular expression libraries, ergex is push-oriented: the caller pushes input to it in pieces as they become available, and it looks for matches even when those pieces are widely separated in time or space.
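To make the idea concrete, here is a minimal sketch of push-oriented matching for a single literal byte string. It is not ergex's API: `PushMatcher` and its methods are hypothetical names, and the Knuth-Morris-Pratt table stands in for ergex's real matching engine. The point is only that the matcher keeps its state between calls, so a match may begin in one chunk and end in a later one.

```rust
/// Hypothetical push-oriented matcher for one literal byte string.
/// Illustrative only; this is not ergex's API.
pub struct PushMatcher {
    needle: Vec<u8>,
    fail: Vec<usize>, // KMP failure table
    state: usize,     // number of needle bytes matched so far
    offset: usize,    // total number of bytes pushed so far
}

impl PushMatcher {
    pub fn new(needle: &[u8]) -> Self {
        assert!(!needle.is_empty(), "needle must not be empty");
        // Standard KMP preprocessing: fail[i] is the length of the longest
        // proper prefix of needle[..=i] that is also a suffix of it.
        let mut fail = vec![0usize; needle.len()];
        let mut k = 0;
        for i in 1..needle.len() {
            while k > 0 && needle[i] != needle[k] {
                k = fail[k - 1];
            }
            if needle[i] == needle[k] {
                k += 1;
            }
            fail[i] = k;
        }
        PushMatcher { needle: needle.to_vec(), fail, state: 0, offset: 0 }
    }

    /// Push one chunk of input. Returns the absolute end offsets of every
    /// match completed inside this chunk, including matches that started
    /// in an earlier chunk.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<usize> {
        let mut ends = Vec::new();
        for &b in chunk {
            while self.state > 0 && b != self.needle[self.state] {
                self.state = self.fail[self.state - 1];
            }
            if b == self.needle[self.state] {
                self.state += 1;
            }
            self.offset += 1;
            if self.state == self.needle.len() {
                ends.push(self.offset);
                // Fall back so that overlapping matches are still found.
                self.state = self.fail[self.state - 1];
            }
        }
        ends
    }
}
```

Pushing `b"hay nee"` and then `b"dle hay"` into a matcher built for `b"needle"` reports a match ending at offset 10, even though neither chunk contains the whole needle. For brevity this sketch returns a freshly allocated `Vec` from each call, which is not how ergex reports matches.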

The most common expected use case would be for network traffic analysis: ergex can take the contents of network packets as they arrive. However, ergex was originally written for a reimplementation of the sam text editor, which has a dual-process architecture in which parts of the file may be widely separated across a network connection.

Push-orientation is rare; the only major example I'm aware of is Hyperscan. Reading from streams (in a pull fashion) is more common but still relatively rare.

Ergex comes with a novel implementation of the Aho-Corasick algorithm that is likewise push-oriented.
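For readers unfamiliar with the algorithm, the following is a textbook Aho-Corasick automaton with a push-style interface, written as a self-contained sketch. It is not ergex's implementation: the names `Automaton` and `Scratch` are borrowed loosely from the text, their fields and methods are assumptions, and matches are collected into a `Vec` instead of being returned as a lazy iterator.

```rust
use std::collections::VecDeque;

const NONE: usize = usize::MAX;

/// Hypothetical multi-pattern automaton (dense DFA form). Not ergex's types.
pub struct Automaton {
    next: Vec<[usize; 256]>, // next[state][byte] -> state
    out: Vec<Vec<usize>>,    // ids of the patterns that end at each state
}

/// Per-stream state carried between pushes.
#[derive(Default)]
pub struct Scratch {
    state: usize,
    offset: usize,
}

impl Automaton {
    /// Patterns are assumed to be non-empty.
    pub fn new(patterns: &[&str]) -> Self {
        let mut next = vec![[NONE; 256]];
        let mut out: Vec<Vec<usize>> = vec![Vec::new()];
        // 1. Build the trie of all patterns.
        for (id, pat) in patterns.iter().enumerate() {
            let mut s = 0;
            for b in pat.bytes() {
                if next[s][b as usize] == NONE {
                    let new_state = next.len();
                    next.push([NONE; 256]);
                    out.push(Vec::new());
                    next[s][b as usize] = new_state;
                }
                s = next[s][b as usize];
            }
            out[s].push(id);
        }
        // 2. Breadth-first pass: compute failure links, inherit outputs,
        //    and fill in every missing edge so matching never backtracks.
        let mut fail = vec![0usize; next.len()];
        let mut queue = VecDeque::new();
        for b in 0..256 {
            let t = next[0][b];
            if t == NONE {
                next[0][b] = 0;
            } else {
                queue.push_back(t); // depth-1 states fail to the root
            }
        }
        while let Some(s) = queue.pop_front() {
            let inherited = out[fail[s]].clone();
            out[s].extend(inherited);
            for b in 0..256 {
                let t = next[s][b];
                if t == NONE {
                    next[s][b] = next[fail[s]][b];
                } else {
                    fail[t] = next[fail[s]][b];
                    queue.push_back(t);
                }
            }
        }
        Automaton { next, out }
    }

    /// Push a chunk; returns (pattern id, absolute end offset) pairs.
    pub fn push(&self, scratch: &mut Scratch, chunk: &[u8]) -> Vec<(usize, usize)> {
        let mut found = Vec::new();
        for &b in chunk {
            scratch.state = self.next[scratch.state][b as usize];
            scratch.offset += 1;
            for &id in &self.out[scratch.state] {
                found.push((id, scratch.offset));
            }
        }
        found
    }
}
```

With the patterns `he`, `she`, and `hers`, pushing `us`, then `he`, then `rs` reports `she` and `he` ending at offset 4 and `hers` ending at offset 6, although each pattern straddles a chunk boundary. The dense 256-entry rows trade memory for simplicity; a real implementation would likely use a more compact representation.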

Simultaneous Matching

Ergex supports matching arbitrarily many expressions simultaneously. That is, as you push data to ergex, it will test all of the expressions you've provided at once and report back all that match.

Regular expressions are compiled into "databases" consisting of arbitrarily many expressions.

No Allocations at Matching Time

Ergex performs no memory allocations during matching. A "scratch" structure is allocated to store state during matching; the scratch structure is of a fixed size and can be reused.

Thread-Safe, Lock-Free Matching

Multiple threads, each with their own scratch structures, can perform matching independently.
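The database/scratch split described above can be sketched as follows. Everything here is hypothetical and deliberately tiny: each "expression" is just a set of bytes, `Database`, `Scratch`, and `match_all` are stand-in names with assumed signatures, and the scratch takes the database as an argument instead of holding a reference to it. The sketch shows the pattern only: the shared data is read-only, all mutable state lives in a fixed-size scratch allocated once, and each thread owns its own scratch, so no locks are needed.

```rust
use std::thread;

/// Hypothetical read-only "database": maps each byte to an expression id.
pub struct Database {
    class: [Option<usize>; 256],
    n_exprs: usize,
}

impl Database {
    /// Each "expression" is just a set of bytes, to keep the sketch short.
    pub fn new(exprs: &[&str]) -> Self {
        let mut class = [None; 256];
        for (id, set) in exprs.iter().enumerate() {
            for b in set.bytes() {
                class[b as usize] = Some(id);
            }
        }
        Database { class, n_exprs: exprs.len() }
    }
}

/// Per-thread mutable state, sized once from the database and then reused.
pub struct Scratch {
    hits: Box<[u64]>,
}

impl Scratch {
    /// The only allocation happens here, before matching starts.
    pub fn new(db: &Database) -> Self {
        Scratch { hits: vec![0; db.n_exprs].into_boxed_slice() }
    }

    /// Matching touches only the preallocated counters: no allocation.
    pub fn push(&mut self, db: &Database, chunk: &[u8]) {
        for &b in chunk {
            if let Some(id) = db.class[b as usize] {
                self.hits[id] += 1;
            }
        }
    }

    /// Clear the counters so the same scratch can serve a new input.
    pub fn reset(&mut self) {
        self.hits.iter_mut().for_each(|h| *h = 0);
    }

    pub fn hits(&self) -> &[u64] {
        &self.hits
    }
}

/// Match several inputs in parallel against one shared database.
/// Scoped threads let every worker borrow `db` without locks or `Arc`.
pub fn match_all(db: &Database, inputs: &[&str]) -> Vec<Vec<u64>> {
    thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|input| {
                s.spawn(move || {
                    let mut scratch = Scratch::new(db);
                    scratch.push(db, input.as_bytes());
                    scratch.hits().to_vec()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

`std::thread::scope` requires Rust 1.63 or later.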

No Pathological Cases

There are no pathological expressions: expressions are matched in O(n*m) time, where n is the length of the expression and m is the length of the input.
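A bound of this shape typically comes from tracking the set of all live automaton states at once instead of backtracking. The sketch below shows that technique on a much smaller language than regular expressions: glob patterns in which `?` matches any single byte and `*` matches any run of bytes. It is an illustration of state-set simulation, not ergex's matcher, and `glob_match` is a hypothetical name.

```rust
/// Match `pattern` against the whole of `input` without backtracking.
/// `?` matches any single byte; `*` matches any run of bytes, including none.
pub fn glob_match(pattern: &[u8], input: &[u8]) -> bool {
    let n = pattern.len();
    // active[i] means: the first i pattern bytes can match the input seen so far.
    let mut active = vec![false; n + 1];
    let mut next = vec![false; n + 1];
    active[0] = true;
    close(pattern, &mut active);
    for &b in input {
        next.iter_mut().for_each(|s| *s = false);
        // Each input byte costs O(n): one pass over the state set.
        for i in 0..n {
            if !active[i] {
                continue;
            }
            match pattern[i] {
                b'*' => next[i] = true,     // the star absorbs this byte
                b'?' => next[i + 1] = true, // any single byte advances
                c if c == b => next[i + 1] = true,
                _ => {}
            }
        }
        close(pattern, &mut next);
        std::mem::swap(&mut active, &mut next);
    }
    active[n]
}

/// Epsilon closure: a `*` may match nothing, so state i also implies i + 1.
fn close(pattern: &[u8], states: &mut [bool]) {
    for i in 0..pattern.len() {
        if states[i] && pattern[i] == b'*' {
            states[i + 1] = true;
        }
    }
}
```

A pattern such as `*a*a*a*a*a*a*a*a*b` run against ten thousand `a` bytes, which would make a naive backtracking matcher explore an enormous number of alternatives, finishes here after one pass over the state set per input byte.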

Selective Disabling of Expressions

Ergex allows expressions in a given scratch space to be selectively disabled. It uses a novel data structure (called a ShrinkSet) to allow resetting of the enabled set in O(1) time.
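One way to get a removal-only set with constant-time reset is a variation on the classic sparse-set layout, sketched below. This is an assumption about the general technique, not a copy of ergex's ShrinkSet: the field names, method names, and details are made up for illustration.

```rust
/// A set over 0..capacity that starts full, supports O(1) removal and
/// membership tests, and can be restored to full in O(1).
/// Illustrative sketch; not ergex's actual ShrinkSet.
pub struct ShrinkSet {
    dense: Vec<usize>,  // a permutation of 0..capacity; the first `len` are members
    sparse: Vec<usize>, // sparse[x] is the position of x inside `dense`
    len: usize,         // number of members currently in the set
}

impl ShrinkSet {
    pub fn new(capacity: usize) -> Self {
        ShrinkSet {
            dense: (0..capacity).collect(),
            sparse: (0..capacity).collect(),
            len: capacity,
        }
    }

    pub fn contains(&self, x: usize) -> bool {
        x < self.sparse.len() && self.sparse[x] < self.len
    }

    /// Remove `x` by swapping it to just past the live prefix of `dense`.
    /// Returns false if `x` was not a member.
    pub fn remove(&mut self, x: usize) -> bool {
        if !self.contains(x) {
            return false;
        }
        let i = self.sparse[x];
        let last = self.len - 1;
        let y = self.dense[last];
        self.dense.swap(i, last);
        self.sparse[x] = last;
        self.sparse[y] = i;
        self.len = last;
        true
    }

    /// Re-enable every element. `dense` is still a permutation of
    /// 0..capacity, so widening the live prefix is all that is needed.
    pub fn reset(&mut self) {
        self.len = self.dense.len();
    }

    /// Iterate over the current members (in no particular order).
    pub fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.dense[..self.len].iter().copied()
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```

Because removal only permutes `dense` and never discards an element, `reset` does not have to touch the arrays at all. That is what would make it cheap to disable expressions while matching one input and then start the next input with everything enabled again.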

POSIX-Compatible Matching and Submatching

Ergex supports (almost) POSIX-compatible matching, including POSIX-compatible submatch extraction.

UTF-8 and Byte-Oriented Matching

Ergex supports matching both UTF-8 encoded text and raw bytes, and the two encodings may be mixed in the same expression.

Fast Enough

Ergex is fairly fast. Running cargo test --release takes about ten seconds on my laptop.

Safe

Ergex is written in 100% safe Rust.

Credits

Ergex stands on the shoulders of giants: it uses the excellent regex-syntax crate for parsing expressions.

Current Status

Ergex is still very much a work in progress. Most of the goals stated above have been met, but the API is not where I want it to be and there is essentially no documentation. There is a fairly hefty test suite, however.

Due to some stuff coming up in my personal life (everything's okay, thank you), I haven't been able to devote as much time as I'd like to ergex lately -- I hope to get back to it soon but I put everything here on GitHub in case anyone was interested in it.

License

Please see the LICENSE file.

Contact

Drop me a line at [email protected].

Comments
  • Lifetimes are problematic for AhoCorasickScratch

    The way lifetimes are implemented for AhoCorasickScratch makes pushing bytes slightly problematic, requiring copies in some situations where they wouldn't otherwise be needed.

    Might need to just go back to the callback-based push-AC, which didn't have these problems but was less elegant.

    bug 
    opened by deadpixi 0
  • Thread-safety isn't correct

    Scratch structures hold immutable references to their Database. This prevents a Scratch from moving to a new thread, though it should be possible. I think it would be with crossbeam, but I haven't tested this yet.

    enhancement help wanted 
    opened by deadpixi 0
  • Ergex's regex API should match the Aho-Corasick API

    The push-oriented implementation of Aho-Corasick returns an iterator over the matches it finds. The public regex API instead reports matches via callbacks. I'd like to see the public API also return an iterator.

    Something like:

    
    for m in scratch.push(bytes) {
        // do things
    }
    

    A big question is how to handle scratch.finish, which consumes the scratch but also might return matches.

    enhancement help wanted 
    opened by deadpixi 2
  • Pulse remains unchecked.

    The matching code allows Handlers to provide a pulse interval, expressed as a number of VM instructions, that lets the Handler stop matching early. This interval is not actually checked anywhere.

    bug beginner-friendly 
    opened by deadpixi 0
Owner
Rob King