Rust library for regular expressions using "fancy" features like look-around and backreferences

Overview

fancy-regex

A Rust library for compiling and matching regular expressions. It uses a hybrid regex implementation designed to support a relatively rich set of features. In particular, it uses backtracking to implement "fancy" features such as look-around and backreferences, which are not supported in purely NFA-based implementations (exemplified by RE2, and implemented in Rust in the regex crate).


A goal is to be as efficient as possible. For a given regex, the NFA implementation has asymptotic running time linear in the length of the input, while in the general case a backtracking implementation has exponential blowup. An example given in Static Analysis for Regular Expression Exponential Runtime via Substructural Logics is:

import re
re.compile('(a|b|ab)*bc').match('ab' * 28 + 'ac')

In Python (tested on both 2.7 and 3.5), this match takes 91s, and doubles for each additional repeat of 'ab'.

Thus, many proponents advocate a purely NFA (nondeterministic finite automaton) based approach. Even so, backreferences and look-around do add richness to regexes, and they are commonly used in applications such as syntax highlighting for text editors. In particular, TextMate's syntax definitions, based on the Oniguruma backtracking engine, are now used in a number of other popular editors, including Sublime Text and Atom. These syntax definitions routinely use backreferences and look-around. For example, the following regex captures a single-line Rust raw string:

r(#*)".*?"\1

There is no NFA that can express this simple and useful pattern. Yet, a backtracking implementation handles it efficiently.
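For illustration, here is a small hedged sketch of matching that pattern through the crate's public API (Regex::new and is_match are real fancy-regex calls; the sample haystack is made up):

use fancy_regex::Regex;

fn main() -> Result<(), fancy_regex::Error> {
    // The backreference \1 forces the closing '#' run to match the opening one.
    let re = Regex::new(r#"r(#*)".*?"\1"#)?;
    assert!(re.is_match(r##"let s = r#"a "raw" string"#;"##)?);
    Ok(())
}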

This package is one of the first to handle both cases well. The exponential blowup case above runs in 258ns. Thus, it should be a very appealing alternative for applications that require both richness and performance.
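As a minimal sketch through the public API (timings vary by machine, but the match returns essentially instantly because the pattern contains no fancy features and is delegated wholesale to the NFA engine):

use fancy_regex::Regex;

fn main() -> Result<(), fancy_regex::Error> {
    // No backrefs or look-around here, so the whole pattern is delegated
    // to the inner NFA engine and runs in time linear in the input.
    let re = Regex::new("(a|b|ab)*bc")?;
    let haystack = "ab".repeat(28) + "ac";
    assert!(!re.is_match(&haystack)?);
    Ok(())
}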

A warning about worst-case performance

NFA-based approaches give strong guarantees about worst-case performance. For regexes that contain "fancy" features such as backreferences and look-around, this module gives no corresponding guarantee. If an attacker can control the regular expressions that will be matched against, they will be able to successfully mount a denial-of-service attack. Be warned.

See PERFORMANCE.md for some examples.

A hybrid approach

One workable approach is to detect the presence of "fancy" features, and choose either an NFA implementation or a backtracker depending on whether they are used.

However, this module attempts to be more fine-grained. Instead, it implements a true hybrid approach. In essence, it is a backtracking VM (as well explained in Regular Expression Matching: the Virtual Machine Approach) in which one of the "instructions" in the VM delegates to an inner NFA implementation (in Rust, the regex crate, though a similar approach would certainly be possible using RE2 or the Go regexp package). Then there's an analysis which decides for each subexpression whether it is "hard", or can be delegated to the NFA matcher. At the moment, it is eager, and delegates as much as possible to the NFA engine.

Theory

(This section is written in a somewhat informal style; I hope to expand on it)

The fundamental idea is that it's a backtracking VM like PCRE, but as much as possible it delegates to an "inner" RE engine like RE2 (in this case, the Rust one). For the sublanguage not using fancy features, the library becomes a thin wrapper.

Otherwise, you do an analysis to figure out what you can delegate and what you have to backtrack. I was thinking it might be tricky, but it's actually quite simple. In the first phase, you just label each subexpression as "hard" (backrefs, look-around, groups that get referenced in a backref, etc.), and bubble that up. You also do a little extra analysis, mostly determining whether an expression has constant match length, and the minimum length.
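A minimal, hypothetical sketch of that first phase (the Expr and Info types are simplified stand-ins, not the crate's real AST or analysis code):

#[derive(Debug)]
enum Expr {
    Literal(String),
    Backref(usize),
    LookAhead(Box<Expr>),
    Concat(Vec<Expr>),
}

#[derive(Debug)]
struct Info {
    hard: bool,               // needs the backtracking VM
    const_len: Option<usize>, // Some(n) if every match is exactly n bytes
}

fn analyze(expr: &Expr) -> Info {
    match expr {
        Expr::Literal(s) => Info { hard: false, const_len: Some(s.len()) },
        // Backreferences and look-around are inherently "hard".
        Expr::Backref(_) => Info { hard: true, const_len: None },
        Expr::LookAhead(inner) => {
            let _ = analyze(inner); // children are analyzed too
            Info { hard: true, const_len: Some(0) } // consumes no input
        }
        Expr::Concat(children) => {
            let infos: Vec<Info> = children.iter().map(analyze).collect();
            Info {
                // "hard" bubbles up from any child.
                hard: infos.iter().any(|i| i.hard),
                // Constant length only if every child has constant length.
                const_len: infos
                    .iter()
                    .map(|i| i.const_len)
                    .try_fold(0usize, |acc, len| len.map(|n| acc + n)),
            }
        }
    }
}

fn main() {
    // Roughly "a\1": the backref makes the whole concat hard.
    let e = Expr::Concat(vec![Expr::Literal("a".into()), Expr::Backref(1)]);
    println!("{:?}", analyze(&e));
}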

The second phase is top down, and you carry a context, also a boolean indicating whether it's "hard" or not. Intuitively, a hard context is one in which the match length will affect future backtracking.

If the subexpression is easy and the context is easy, generate an instruction in the VM that delegates to the inner NFA implementation. Otherwise, generate VM code as in a backtracking engine. Most expression nodes are pretty straightforward; the only interesting case is concat (a sequence of subexpressions).

Even that one is not terribly complex. First, determine a prefix of easy nodes of constant match length (this won't affect backtracking, so safe to delegate to NFA). Then, if your context is easy, determine a suffix of easy nodes. Both of these delegate to NFA. For the ones in between, recursively compile. In an easy context, the last of these also gets an easy context; everything else is generated in a hard context. So, conceptually, hard context flows from right to left, and from parents to children.
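Here is a similarly hypothetical sketch of that prefix/suffix rule for concat, operating on per-child analysis results like the Info values above (again, simplified names rather than the crate's internals):

// Per-child analysis results, as in the previous sketch.
struct ChildInfo {
    hard: bool,
    const_len: Option<usize>,
}

// Returns how many children can be delegated as a prefix and as a suffix;
// everything in between is compiled recursively for the backtracking VM.
fn split_concat(children: &[ChildInfo], easy_context: bool) -> (usize, usize) {
    // Prefix: easy children of constant match length. Their match length
    // cannot affect later backtracking, so they are safe to hand to the NFA.
    let prefix = children
        .iter()
        .take_while(|c| !c.hard && c.const_len.is_some())
        .count();
    // Suffix: easy children, usable only when the surrounding context is easy,
    // i.e. nothing to the right cares where this concat's match ends.
    let suffix = if easy_context {
        children[prefix..].iter().rev().take_while(|c| !c.hard).count()
    } else {
        0
    };
    (prefix, suffix)
}

fn main() {
    let children = [
        ChildInfo { hard: false, const_len: Some(1) }, // e.g. a literal
        ChildInfo { hard: true, const_len: None },     // e.g. a backref
        ChildInfo { hard: false, const_len: None },    // e.g. ".*"
    ];
    assert_eq!(split_concat(&children, true), (1, 1));
}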

Current status

Still in development, though the basic ideas are in place. Currently, the following features are missing:

  • Procedure calls and recursive expressions

Acknowledgements

Many thanks to Andrew Gallant for stimulating conversations that inspired this approach, as well as for creating the excellent regex crate.

Authors

The main author is Raph Levien, with many contributions from Robin Stocker.

Contributions

We gladly accept contributions via GitHub pull requests. Please see CONTRIBUTING.md for more details.

This project started out as a Google 20% project, but none of the authors currently work at Google so it has been forked to be community-maintained.

Comments
  • Update owners on crates.io and release new version

    With the move to a new repo, we should do a release to update the metadata on https://crates.io/crates/fancy-regex

    There were also many merged bugfixes. Maybe we should also start writing a CHANGELOG.md file.

    @raphlinus Do you mind adding me and the group as a whole as an owner to the crate?:

    cargo owner --add robinst
    cargo owner --add github:fancy-regex:owners
    

    (The first command gives me the permissions to extend owners as well, see the docs here: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-owner)

    opened by robinst 15
  • Initial named group support

    I implemented basic named group support based on your comments in issue #34. Everything is done in parser.rs by keeping track of the group number being parsed and using a HashMap to map named group IDs to the numbers.

    The result was being able to pass one more internal test and eight more Oniguruma tests. I'm a little confused as to how (why?) they support some things like non-unique group names. I'm also struggling to grasp the difference between "\g" and "\k" in this context.

    opened by rxt1077 12
  • Improve parse errors by showing the position they occurred at

    Improve parse errors by showing the position they occurred at. The motivation for this is that it becomes easier to debug when adding support for new syntax.

    opened by keith-hall 9
  • Add escape function

    The regex crate provides an escape(text: &str) -> String function which “escapes all regular expression meta characters in text”. This is especially useful when the regex pattern contains a dynamic value such as user input or a network response.

    For example, a search that includes the user-input id_name might use the pattern

    format!("^(?:id|ref)={};", id_name)
    

    where id_name could include characters like $, +, or ?. These characters would get interpreted as special regex sequences instead of plain text. Escaping the dynamic value would prevent this:

    format!("^(?:id|ref)={};", fancy_regex::escape(id_name))
    

    Currently I use the fancy_regex crate for matching and the regex crate for quoting, but this is error-prone, since fancy_regex adds a number of regex-pattern syntax extensions which are not handled by regex::escape.
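    For reference, a sketch of that workaround (assuming both regex and fancy_regex as dependencies; build_pattern is a hypothetical helper):

    // Hypothetical helper for illustration; `id_name` is untrusted input.
    fn build_pattern(id_name: &str) -> String {
        format!("^(?:id|ref)={};", regex::escape(id_name))
    }

    fn main() {
        let re = fancy_regex::Regex::new(&build_pattern("item$1+?")).unwrap();
        assert!(re.is_match("id=item$1+?;").unwrap());
    }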

    enhancement help wanted good first issue 
    opened by florianpircher 7
  • Q: `case_insensitive` unsupported?

    Hi, any chance of having case_insensitive similar to what is supported in the "regex" crate?

    let new_regex = fancy_regex::RegexBuilder::new(&unescaped_source)
        // .case_insensitive(!self.match_case.clone()) // unsupported by fancy_regex
        .build()
        .unwrap();
    
    help wanted good first issue 
    opened by 4ntoine 6
  • Enable selection of regex crate features

    Allow users to disable any of the unicode and perf-* features of the regex crate. Disabling these features can reduce compile time and/or binary size for use cases where these features are not needed. (All features remain enabled by default.)

    opened by mbrubeck 5
  • Add replace*() methods similar to regex crate

    Closes #49

    Most of this code is a verbatim copy from the regex crate.

    License: MIT OR Apache-2.0
    Author: Andrew Gallant [email protected]
    Author: Alex Crichton [email protected]
    Author: Lapinot [email protected]
    Author: tom [email protected]
    Author: Matt Brubeck [email protected]
    Author: Marti Raudsepp [email protected]

    Methods:

    • Regex::replace() -- single replacement
    • Regex::replace_all() -- replace all non-overlapping matches
    • Regex::replacen() -- configurable number of replacements

    Traits:

    • Replacer -- describes logic for replacing matches

      Trait implementations included for &str, String, NoExpand, FnMut(&Captures)->str

    Structs:

    • NoExpand -- Replacer that performs no substitutions
    • ReplacerRef -- by-reference adaptor
    opened by intgr 5
  • Update metadata for repo move

    We're moving the repo to the new "fancy-regex" organization. This patch updates the copyright headers and other metadata.

    The contribution guidelines have been updated (adopting the Rust Code of Conduct), and the README freshened up just a bit.

    opened by raphlinus 5
  • Add Regex::captures_iter

    See https://docs.rs/regex/1.3.9/regex/struct.Regex.html#method.captures_iter

    Currently, you have to call captures_from_pos repeatedly to get this. It would be good to add this as an API. Other things such as the replace API would also make use of it (#49).

    enhancement help wanted 
    opened by robinst 4
  • One match being returned twice instead of two different matches

    I'm trying to find the matches for the following regular expression (basically everything between {# and }):

    (?<={#)(.*?)(?=})
    

    However, instead of getting both matches for the example below, I get two matches with the content of the first match.

    let input = String::from("{#test1} {#test2}");
    let re = Regex::new("(?<={#)(.*?)(?=})").unwrap();
    
    let input_string_slice = &input.clone();
    let result = re.captures(input_string_slice);
    
    let captures = result.expect("Error during regex parsing").expect("No match found");
    let first_match = captures.get(1).expect("No group");
    let second_match = captures.get(2).expect("No group"); // panics
    println!("{} {}", first_match.as_str(), second_match.as_str());
    

    Here is a question on StackOverflow where user Kitsu also posted the code which proves this. At first I thought I forgot to put global match so I'm only getting the first match, but by browsing the documentation I realized that's not the case.
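    For what it's worth, with a fancy-regex version that provides captures_iter (see the Add Regex::captures_iter issue above), each {#...} block comes back as its own match; a hedged sketch:

    use fancy_regex::Regex;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let re = Regex::new(r"(?<={#)(.*?)(?=})")?;
        for caps in re.captures_iter("{#test1} {#test2}") {
            // Each iteration is one match; group 1 is the text between {# and }.
            println!("{}", caps?.get(1).unwrap().as_str());
        }
        // prints "test1" then "test2"
        Ok(())
    }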

    opened by panther99 4
  • Panic on not common input

    Hi! Thank you for this library!

    While fuzzing the jsonschema crate, I found that fancy-regex panics on \\u (and a few more similar patterns like \\U and \\x), with:

    thread 'main' panicked at 'index out of bounds: the len is 2 but the index is 2', /home/stranger6667/.cargo/registry/src/github.com-1ecc6299db9ec823/fancy-regex-0.6.0/src/parse.rs:394:17
    

    Code to reproduce:

    fn main() {
        fancy_regex::Regex::new("\\u");
    }
    

    Though it is an incomplete escape sequence, it would be nice to return an Err in such cases, as regex does:

    regex parse error:
        \u
          ^
    error: incomplete escape sequence, reached end of pattern prematurely
    

    Probably, the solution would be to check the bounds before indexing bytes[ix] inside Parser::parse_hex. Let me know what you think; I'll be happy to submit a PR to fix this :)
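    Not the crate's actual parser code, but a minimal self-contained sketch of the kind of bounds check being suggested (names are made up):

    fn parse_hex_digits(bytes: &[u8], ix: usize, count: usize) -> Result<&[u8], String> {
        // Check the bounds before indexing, so an incomplete escape like "\u"
        // yields an error instead of an out-of-bounds panic.
        if ix + count > bytes.len() {
            return Err("incomplete escape sequence, reached end of pattern prematurely".to_string());
        }
        Ok(&bytes[ix..ix + count])
    }

    fn main() {
        // b"\\u" has length 2, so asking for 4 hex digits at index 2 errors out.
        assert!(parse_hex_digits(b"\\u", 2, 4).is_err());
    }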

    opened by Stranger6667 3
  • split function missing

    I wanted to split a string with a regex and was wondering if there's a reason that function is missing? It's implemented in the rust-regex crate, and their Regex also implements Pattern, so the standard library's split is an option as well. I am interested in adding it, but wondering if it was purposely left out for some reason?

    enhancement help wanted 
    opened by priyankat99 1
  • panic when evaluating a specific regex

    Using the regex & text used in benches, but with the public Regex::new API instead of the private one, leads to a panic in the eval VM:

    use fancy_regex::Regex;
    
    fn main() {
        let regex = Regex::new("^.*?(([ab]+)\\1b)").unwrap();
        eprintln!("match: {:?}", regex.is_match("babab"));
    }
    

    gives this:

    thread 'main' panicked at 'byte index 18446744073709551615 is out of bounds of `babab`', library/core/src/str/mod.rs:107:9
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/panicking.rs:584:5
       1: core::panicking::panic_fmt
                 at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/panicking.rs:142:14
       2: core::str::slice_error_fail_rt
       3: core::ops::function::FnOnce::call_once
                 at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/ops/function.rs:248:5
       4: core::intrinsics::const_eval_select
                 at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/intrinsics.rs:2696:5
       5: core::str::slice_error_fail
                 at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/core/src/str/mod.rs:86:9
       6: fancy_regex::vm::run
       7: fancy_regex::Regex::is_match
       8: bug::main
    

    This is present both on 0.10.0 and on current main (f594a9ba1f675fce87b1f24a26d95d448afacc75)

    bug 
    opened by vthib 1
  • Bad(?) performance with a specific regex with backreferences

    Consider this regular expression:

    (..)\1\1\1$
    

    This regular expression should match 4 identical groups of 2 characters at the end of the input string (e.g. 123456789ABABABAB should match).

    When using fancy-regex to match this regular expression against a 40-character string, a single iteration takes almost 6 µs using the normal Regex::is_match() interface, whereas manually compiling and running an Expr (as is done in the benchmarks) takes ~300 ns.

    For comparison, an equivalent RE using only regex syntax ((00000000|...|FFFFFFFF)$) takes ~170 ns to run against the same 40-character string.

    I understand that Regex wraps the input RE in a super-expression implementing the "match anywhere" semantics, which probably explains the 20x performance difference, but I'm wondering if this is expected (and optimizing further is a non-goal or a wishlist-level goal) or if this is unexpected/a bug.

    bench.rs

    #[macro_use]
    extern crate criterion;
    extern crate rand;
    
    use criterion::Criterion;
    use std::time::Duration;
    
    use fancy_regex::internal::{analyze, compile, run_default};
    use fancy_regex::Expr;
    
    use rand::Rng;
    
    struct Hexadecimal;
    impl rand::distributions::Distribution<u8> for Hexadecimal {
        fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> u8 {
            const RANGE: u32 = 16;
            const GEN_ASCII_STR_CHARSET: &[u8] = b"0123456789ABCDEF";
            // Pick one of the 16 hex characters. Use a simple bitshift and
            // rejection sampling (sampler adapted from rand's Alphanumeric
            // distribution, which picks from 62 characters).
            loop {
                let var = rng.next_u32() >> (32 - 4);
                if var < RANGE {
                    return GEN_ASCII_STR_CHARSET[var as usize];
                }
            }
        }
    }
    
    fn check_normal_expr(c: &mut Criterion) {
        let re_str = "(00000000|01010101|02020202|03030303|04040404|05050505|06060606|07070707|08080808|09090909|0A0A0A0A|0B0B0B0B|0C0C0C0C|0D0D0D0D|0E0E0E0E|0F0F0F0F|10101010|11111111|12121212|13131313|14141414|15151515|16161616|17171717|18181818|19191919|1A1A1A1A|1B1B1B1B|1C1C1C1C|1D1D1D1D|1E1E1E1E|1F1F1F1F|20202020|21212121|22222222|23232323|24242424|25252525|26262626|27272727|28282828|29292929|2A2A2A2A|2B2B2B2B|2C2C2C2C|2D2D2D2D|2E2E2E2E|2F2F2F2F|30303030|31313131|32323232|33333333|34343434|35353535|36363636|37373737|38383838|39393939|3A3A3A3A|3B3B3B3B|3C3C3C3C|3D3D3D3D|3E3E3E3E|3F3F3F3F|40404040|41414141|42424242|43434343|44444444|45454545|46464646|47474747|48484848|49494949|4A4A4A4A|4B4B4B4B|4C4C4C4C|4D4D4D4D|4E4E4E4E|4F4F4F4F|50505050|51515151|52525252|53535353|54545454|55555555|56565656|57575757|58585858|59595959|5A5A5A5A|5B5B5B5B|5C5C5C5C|5D5D5D5D|5E5E5E5E|5F5F5F5F|60606060|61616161|62626262|63636363|64646464|65656565|66666666|67676767|68686868|69696969|6A6A6A6A|6B6B6B6B|6C6C6C6C|6D6D6D6D|6E6E6E6E|6F6F6F6F|70707070|71717171|72727272|73737373|74747474|75757575|76767676|77777777|78787878|79797979|7A7A7A7A|7B7B7B7B|7C7C7C7C|7D7D7D7D|7E7E7E7E|7F7F7F7F|80808080|81818181|82828282|83838383|84848484|85858585|86868686|87878787|88888888|89898989|8A8A8A8A|8B8B8B8B|8C8C8C8C|8D8D8D8D|8E8E8E8E|8F8F8F8F|90909090|91919191|92929292|93939393|94949494|95959595|96969696|97979797|98989898|99999999|9A9A9A9A|9B9B9B9B|9C9C9C9C|9D9D9D9D|9E9E9E9E|9F9F9F9F|A0A0A0A0|A1A1A1A1|A2A2A2A2|A3A3A3A3|A4A4A4A4|A5A5A5A5|A6A6A6A6|A7A7A7A7|A8A8A8A8|A9A9A9A9|AAAAAAAA|ABABABAB|ACACACAC|ADADADAD|AEAEAEAE|AFAFAFAF|B0B0B0B0|B1B1B1B1|B2B2B2B2|B3B3B3B3|B4B4B4B4|B5B5B5B5|B6B6B6B6|B7B7B7B7|B8B8B8B8|B9B9B9B9|BABABABA|BBBBBBBB|BCBCBCBC|BDBDBDBD|BEBEBEBE|BFBFBFBF|C0C0C0C0|C1C1C1C1|C2C2C2C2|C3C3C3C3|C4C4C4C4|C5C5C5C5|C6C6C6C6|C7C7C7C7|C8C8C8C8|C9C9C9C9|CACACACA|CBCBCBCB|CCCCCCCC|CDCDCDCD|CECECECE|CFCFCFCF|D0D0D0D0|D1D1D1D1|D2D2D2D2|D3D3D3D3|D4D4D4D4|D5D5D5D5|D6D6D6D6|D7D7D7D7|D8D8D8D8|D9D9D9D9|DADADADA|DBDBDBDB|DCDCDCDC|DDDDDDDD|DEDEDEDE|DFDFDFDF|E0E0E0E0|E1E1E1E1|E2E2E2E2|E3E3E3E3|E4E4E4E4|E5E5E5E5|E6E6E6E6|E7E7E7E7|E8E8E8E8|E9E9E9E9|EAEAEAEA|EBEBEBEB|ECECECEC|EDEDEDED|EEEEEEEE|EFEFEFEF|F0F0F0F0|F1F1F1F1|F2F2F2F2|F3F3F3F3|F4F4F4F4|F5F5F5F5|F6F6F6F6|F7F7F7F7|F8F8F8F8|F9F9F9F9|FAFAFAFA|FBFBFBFB|FCFCFCFC|FDFDFDFD|FEFEFEFE|FFFFFFFF)$";
        let tree = Expr::parse_tree(re_str).unwrap();
        let a = analyze(&tree).unwrap();
        let p = compile(&a).unwrap();
        c.bench_function("check_normal_expr", |b| {
            b.iter(|| {
                let rng = rand::thread_rng();
                let string: String = rng.sample_iter(Hexadecimal).take(40).map(char::from).collect();
                run_default(&p, &string, 0).unwrap()
            });
        });
    }
    
    fn check_normal_regex(c: &mut Criterion) {
        let re_str = "(00000000|01010101|02020202|03030303|04040404|05050505|06060606|07070707|08080808|09090909|0A0A0A0A|0B0B0B0B|0C0C0C0C|0D0D0D0D|0E0E0E0E|0F0F0F0F|10101010|11111111|12121212|13131313|14141414|15151515|16161616|17171717|18181818|19191919|1A1A1A1A|1B1B1B1B|1C1C1C1C|1D1D1D1D|1E1E1E1E|1F1F1F1F|20202020|21212121|22222222|23232323|24242424|25252525|26262626|27272727|28282828|29292929|2A2A2A2A|2B2B2B2B|2C2C2C2C|2D2D2D2D|2E2E2E2E|2F2F2F2F|30303030|31313131|32323232|33333333|34343434|35353535|36363636|37373737|38383838|39393939|3A3A3A3A|3B3B3B3B|3C3C3C3C|3D3D3D3D|3E3E3E3E|3F3F3F3F|40404040|41414141|42424242|43434343|44444444|45454545|46464646|47474747|48484848|49494949|4A4A4A4A|4B4B4B4B|4C4C4C4C|4D4D4D4D|4E4E4E4E|4F4F4F4F|50505050|51515151|52525252|53535353|54545454|55555555|56565656|57575757|58585858|59595959|5A5A5A5A|5B5B5B5B|5C5C5C5C|5D5D5D5D|5E5E5E5E|5F5F5F5F|60606060|61616161|62626262|63636363|64646464|65656565|66666666|67676767|68686868|69696969|6A6A6A6A|6B6B6B6B|6C6C6C6C|6D6D6D6D|6E6E6E6E|6F6F6F6F|70707070|71717171|72727272|73737373|74747474|75757575|76767676|77777777|78787878|79797979|7A7A7A7A|7B7B7B7B|7C7C7C7C|7D7D7D7D|7E7E7E7E|7F7F7F7F|80808080|81818181|82828282|83838383|84848484|85858585|86868686|87878787|88888888|89898989|8A8A8A8A|8B8B8B8B|8C8C8C8C|8D8D8D8D|8E8E8E8E|8F8F8F8F|90909090|91919191|92929292|93939393|94949494|95959595|96969696|97979797|98989898|99999999|9A9A9A9A|9B9B9B9B|9C9C9C9C|9D9D9D9D|9E9E9E9E|9F9F9F9F|A0A0A0A0|A1A1A1A1|A2A2A2A2|A3A3A3A3|A4A4A4A4|A5A5A5A5|A6A6A6A6|A7A7A7A7|A8A8A8A8|A9A9A9A9|AAAAAAAA|ABABABAB|ACACACAC|ADADADAD|AEAEAEAE|AFAFAFAF|B0B0B0B0|B1B1B1B1|B2B2B2B2|B3B3B3B3|B4B4B4B4|B5B5B5B5|B6B6B6B6|B7B7B7B7|B8B8B8B8|B9B9B9B9|BABABABA|BBBBBBBB|BCBCBCBC|BDBDBDBD|BEBEBEBE|BFBFBFBF|C0C0C0C0|C1C1C1C1|C2C2C2C2|C3C3C3C3|C4C4C4C4|C5C5C5C5|C6C6C6C6|C7C7C7C7|C8C8C8C8|C9C9C9C9|CACACACA|CBCBCBCB|CCCCCCCC|CDCDCDCD|CECECECE|CFCFCFCF|D0D0D0D0|D1D1D1D1|D2D2D2D2|D3D3D3D3|D4D4D4D4|D5D5D5D5|D6D6D6D6|D7D7D7D7|D8D8D8D8|D9D9D9D9|DADADADA|DBDBDBDB|DCDCDCDC|DDDDDDDD|DEDEDEDE|DFDFDFDF|E0E0E0E0|E1E1E1E1|E2E2E2E2|E3E3E3E3|E4E4E4E4|E5E5E5E5|E6E6E6E6|E7E7E7E7|E8E8E8E8|E9E9E9E9|EAEAEAEA|EBEBEBEB|ECECECEC|EDEDEDED|EEEEEEEE|EFEFEFEF|F0F0F0F0|F1F1F1F1|F2F2F2F2|F3F3F3F3|F4F4F4F4|F5F5F5F5|F6F6F6F6|F7F7F7F7|F8F8F8F8|F9F9F9F9|FAFAFAFA|FBFBFBFB|FCFCFCFC|FDFDFDFD|FEFEFEFE|FFFFFFFF)$";
        let regex = fancy_regex::Regex::new(re_str).unwrap();
        c.bench_function("check_normal_regex", |b| {
            b.iter(|| {
                let rng = rand::thread_rng();
                let string: String = rng.sample_iter(Hexadecimal).take(40).map(char::from).collect();
                regex.is_match(&string).unwrap()
            });
        });
    }
    
    // The following regex is a pathological case for backtracking
    // implementations, see README.md:
    fn check_backref_expr(c: &mut Criterion) {
        let tree = Expr::parse_tree("((..)\\1\\1\\1$)").unwrap();
        let a = analyze(&tree).unwrap();
        let p = compile(&a).unwrap();
        c.bench_function("check_backref_expr", |b| {
            b.iter(|| {
                let rng = rand::thread_rng();
                let string: String = rng.sample_iter(Hexadecimal).take(40).map(char::from).collect();
                run_default(&p, &string, 0).unwrap()
            });
        });
    }
    
    fn check_backref_regex(c: &mut Criterion) {
        let regex = fancy_regex::Regex::new("(..)\\1\\1\\1$").unwrap();
        c.bench_function("check_backref_regex", |b| {
            b.iter(|| {
                let rng = rand::thread_rng();
                let string: String = rng.sample_iter(Hexadecimal).take(40).map(char::from).collect();
                regex.is_match(&string).unwrap()
            });
        });
    }
    
    criterion_group!(
        name = checks;
        config = Criterion::default().warm_up_time(Duration::from_secs(10));
        targets =
        check_normal_expr,
        check_normal_regex,
        check_backref_expr,
        check_backref_regex
    );
    
    criterion_main!(checks);
    

    `cargo bench` output

         Running benches/bench.rs (target/release/deps/bench-8117c8c31bf52a35)
    WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
    This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.
    
    Gnuplot not found, using plotters backend
    check_normal_expr       time:   [2.6787 us 2.7315 us 2.7891 us]
    Found 5 outliers among 100 measurements (5.00%)
      4 (4.00%) high mild
      1 (1.00%) high severe
    
    check_normal_regex      time:   [168.07 ns 169.70 ns 171.54 ns]
    Found 2 outliers among 100 measurements (2.00%)
      2 (2.00%) high mild
    
    check_backref_expr      time:   [304.59 ns 306.39 ns 308.37 ns]
    Found 5 outliers among 100 measurements (5.00%)
      5 (5.00%) high mild
    
    check_backref_regex     time:   [6.4816 us 6.8658 us 7.3006 us]
    Found 15 outliers among 100 measurements (15.00%)
      4 (4.00%) high mild
      11 (11.00%) high severe
    

    opened by intelfx 0
  • Add support for conditionals

    This PR adds support for Oniguruma's "backreference validity checker" functionality, plus conditionals - basically if/then/else expressions.

    To achieve this, I ~~only needed to add~~ added one additional Expr variant for the parser and a corresponding Insn variant for the VM. I have called them BackrefExistsCondition, which takes the capture group number as a parameter. (Named capture groups are automatically converted to numbered ones in the parser. Meanwhile, I noticed there was an Expr::NamedBackref which was never used, so I took the opportunity to remove it.)

    This instruction is marked as hard, and as having constant size (always zero). I added some comments in the parser to hopefully explain how it works, but happy to clarify anything here if needed. :)


    EDIT to add:

    After some investigation, I realized I also needed to add an Expr variant for the conditional expression.
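    A small usage sketch of the resulting syntax (assuming a fancy-regex release that includes this PR; the pattern itself is made up): if capture group 1 participated in the match, require a closing paren, otherwise require a semicolon.

    use fancy_regex::Regex;

    fn main() -> Result<(), fancy_regex::Error> {
        let re = Regex::new(r"^(\()?\w+(?(1)\)|;)$")?;
        assert!(re.is_match("(word)")?);
        assert!(re.is_match("word;")?);
        assert!(!re.is_match("(word;")?);
        Ok(())
    }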

    enhancement 
    opened by keith-hall 0
  • Add support for multi_line builder option

    regex has support for RegexBuilder::multi_line() (source here), would it be possible to add that here?

    I think the implementation can be done with a simple transformation if full PCRE is enabled - ^ to (?<=^|\r\n|\n|\x0b|\f|\r|\x85) and $ to (?=$|\r\n|\n|\x0b|\f|\r|\x85)

    opened by tgross35 0
  • Add an option to support only subset of regex available to Python

    Documentation says:

    (?<name>exp) : match exp, creating capture group named name
    \k<name> : match the exact string that the capture group named name matched
    (?P<name>exp) : same as (?<name>exp) for compatibility with Python, etc.
    (?P=name) : same as \k<name> for compatibility with Python, etc.
    

    Can we have an option to allow only the latter syntax?

    opened by stepancheg 3