A command-line tool and library for generating regular expressions from user-provided test cases

Overview

grex



Linux Download MacOS Download Windows Download

Table of Contents

  1. What does this tool do?
  2. Do I still need to learn to write regexes then?
  3. Current features
  4. How to install?
    4.1 The command-line tool
    4.2 The library
  5. How to use?
    5.1 The command-line tool
    5.2 The library
    5.3 Examples
  6. How to build?
  7. How does it work?
  8. Do you want to contribute?

1. What does this tool do?

grex is a library as well as a command-line utility that is meant to simplify the often complicated and tedious task of creating regular expressions. It does so by automatically generating a single regular expression from user-provided test cases. The resulting expression is guaranteed to match the test cases which it was generated from.

This project started as a Rust port of the JavaScript tool regexgen written by Devon Govett. Although a lot of further useful features could be added to it, its development apparently ceased several years ago. The plan is now to add these new features to grex, as Rust really shines when it comes to command-line tools. grex offers all features that regexgen provides, and more.

The philosophy of this project is to generate the most specific regular expression possible by default which exactly matches the given input only and nothing else. With the use of command-line flags (in the CLI tool) or preprocessing methods (in the library), more generalized expressions can be created.

The produced expressions are Perl-compatible regular expressions which are also compatible with the regular expression parser in Rust's regex crate. Other regular expression parsers or respective libraries from other programming languages have not been tested so far, but they ought to be mostly compatible as well.

2. Do I still need to learn to write regexes then?

Definitely, yes! Using the standard settings, grex produces a regular expression that is guaranteed to match only the test cases given as input and nothing else. This has been verified by property tests. However, if the conversion to shorthand character classes such as \w is enabled, the resulting regex matches a much wider scope of test cases. Knowledge about the consequences of this conversion is essential for finding a correct regular expression for your business domain.

grex uses an algorithm that tries to find the shortest possible regex for the given test cases. Very often though, the resulting expression is still longer or more complex than it needs to be. In such cases, a more compact or elegant regex can be created only by hand. Also, every regular expression engine has different built-in optimizations. grex does not know anything about those and therefore cannot optimize its regexes for a specific engine.

So, please learn how to write regular expressions! Currently, the best use case for grex is to find an initial correct regex, which should then be inspected by hand to see whether further optimizations are possible.

3. Current Features

  • literals
  • character classes
  • detection of common prefixes and suffixes
  • detection of repeated substrings and conversion to {min,max} quantifier notation
  • alternation using | operator
  • optionality using ? quantifier
  • escaping of non-ascii characters, with optional conversion of astral code points to surrogate pairs
  • case-sensitive or case-insensitive matching
  • capturing or non-capturing groups
  • fully compliant with Unicode Standard 13.0
  • fully compatible with regex crate 1.3.5+
  • correctly handles graphemes consisting of multiple Unicode symbols
  • reads input strings from the command-line or from a file
  • optional syntax highlighting for nicer output in supported terminals

4. How to install?

4.1 The command-line tool

You can download the self-contained executable for your platform above and put it in a place of your choice. Alternatively, pre-compiled 64-bit binaries are available from the package managers Scoop (for Windows) and Homebrew (for macOS and Linux).

grex is also hosted on crates.io, the official Rust package registry. If you are a Rust developer and already have the Rust toolchain installed, you can install by compiling from source using cargo, the Rust package manager. So the summary of your installation options is:

( scoop | brew | cargo ) install grex

4.2 The library

In order to use grex as a library, simply add it as a dependency to your Cargo.toml file:

[dependencies]
grex = "1.1.0"

5. How to use?

Detailed explanations of the available settings are provided in the library section. All settings can be freely combined with each other.

5.1 The command-line tool

$ grex -h

grex 1.1.0
© 2019-2020 Peter M. Stahl <[email protected]>
Licensed under the Apache License, Version 2.0
Downloadable from https://crates.io/crates/grex
Source code at https://github.com/pemistahl/grex

grex generates regular expressions from user-provided test cases.

USAGE:
    grex [FLAGS] [OPTIONS] <INPUT>... --file <FILE>

FLAGS:
    -d, --digits             Converts any Unicode decimal digit to \d
    -D, --non-digits         Converts any character which is not a Unicode decimal digit to \D
    -s, --spaces             Converts any Unicode whitespace character to \s
    -S, --non-spaces         Converts any character which is not a Unicode whitespace character to \S
    -w, --words              Converts any Unicode word character to \w
    -W, --non-words          Converts any character which is not a Unicode word character to \W
    -r, --repetitions        Detects repeated non-overlapping substrings and
                             converts them to {min,max} quantifier notation
    -e, --escape             Replaces all non-ASCII characters with unicode escape sequences
        --with-surrogates    Converts astral code points to surrogate pairs if --escape is set
    -i, --ignore-case        Performs case-insensitive matching, letters match both upper and lower case
    -g, --capture-groups     Replaces non-capturing groups by capturing ones
    -c, --colorize           Provides syntax highlighting for the resulting regular expression
    -h, --help               Prints help information
    -v, --version            Prints version information

OPTIONS:
    -f, --file <FILE>                      Reads test cases on separate lines from a file
        --min-repetitions <QUANTITY>       Specifies the minimum quantity of substring repetitions
                                           to be converted if --repetitions is set [default: 1]
        --min-substring-length <LENGTH>    Specifies the minimum length a repeated substring must have
                                           in order to be converted if --repetitions is set [default: 1]

ARGS:
    <INPUT>...    One or more test cases separated by blank space 

5.2 The library

5.2.1 Default settings

Test cases are passed either from a collection via RegExpBuilder::from() or from a file via RegExpBuilder::from_file(). If read from a file, each test case must be on a separate line. Lines may end with either a newline \n or a carriage return with a line feed \r\n.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"]).build();
assert_eq!(regexp, "^a(?:aa?)?$");
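The line-ending rule for input files can be illustrated without grex itself. The following is a minimal sketch that uses only the Rust standard library. The helper name parse_test_cases is hypothetical and not part of grex's API, and the sketch does not claim to mirror how grex reads files internally. It relies on str::lines(), which accepts both \n and \r\n as line terminators.

```rust
/// Hypothetical helper (not part of grex): splits file content into
/// test cases, one per line.
fn parse_test_cases(content: &str) -> Vec<String> {
    // `str::lines()` strips a trailing "\n" as well as a trailing "\r\n"
    // from every line, so both line-ending styles give the same result.
    content.lines().map(|line| line.to_string()).collect()
}
```

With this helper, the content "a\r\naa\naaa" is split into the three test cases a, aa and aaa, regardless of the mixed line endings.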

5.2.2 Convert to character classes

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_conversion_of(&[Feature::Digit, Feature::Word])
    .build();
assert_eq!(regexp, "^(?:\\d\\d\\d|\\w(?:\\w)?)$");

5.2.3 Convert repeated substrings

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .build();
assert_eq!(regexp, "^(?:a{2}|(?:bc){2}|(?:def){3})$");

By default, grex converts every substring that is at least a single character long and that is subsequently repeated at least once. You can customize these two parameters if you like.

In the following example, the test case aa is not converted to a{2} because the repeated substring a has a length of 1, but the minimum substring length has been set to 2.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .with_minimum_substring_length(2)
    .build();
assert_eq!(regexp, "^(?:aa|(?:bc){2}|(?:def){3})$");

In the next example, the minimum number of repetitions is set to 2. Only the test case defdefdef is converted because it is the only one that is repeated twice.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .with_minimum_repetitions(2)
    .build();
assert_eq!(regexp, "^(?:bcbc|aa|(?:def){3})$");
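How the two parameters interact can be shown with a small standalone sketch. It is a simplification under stated assumptions and not grex's actual algorithm: it only checks whether a whole string is a repetition of one unit, it works on chars instead of graphemes, and the function name whole_string_repetition is hypothetical. As in the text above, "repeated n times" means that the unit occurs n + 1 times in total.

```rust
/// Hypothetical helper (not part of grex): if `input` consists solely of
/// one repeated unit, returns the shortest such unit and its total number
/// of occurrences. `min_len` is the minimum unit length and
/// `min_repetitions` the minimum number of repetitions (occurrences - 1).
fn whole_string_repetition(
    input: &str,
    min_len: usize,
    min_repetitions: usize,
) -> Option<(String, usize)> {
    let chars: Vec<char> = input.chars().collect();
    let n = chars.len();
    // A unit can be at most half as long as the input,
    // otherwise it could not occur twice.
    for unit_len in min_len.max(1)..=n / 2 {
        if n % unit_len != 0 {
            continue;
        }
        let count = n / unit_len;
        let unit = &chars[..unit_len];
        let is_repetition = chars.chunks(unit_len).all(|chunk| chunk == unit);
        if is_repetition && count - 1 >= min_repetitions {
            return Some((unit.iter().collect(), count));
        }
    }
    None
}
```

With a minimum substring length of 2, the input aa yields None while bcbc yields the unit bc with 2 occurrences. With a minimum of 2 repetitions, only defdefdef is recognized, with the unit def occurring 3 times.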

5.2.4 Escape non-ascii characters

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(false)
    .build();
assert_eq!(regexp, "^You smell like \\u{1f4a9}\\.$");

Old versions of JavaScript do not support unicode escape sequences for the astral code planes (range U+010000 to U+10FFFF). In order to support these symbols in JavaScript regular expressions, the conversion to surrogate pairs is necessary. More information on that matter can be found here.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(true)
    .build();
assert_eq!(regexp, "^You smell like \\u{d83d}\\u{dca9}\\.$");
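Where the two code units in the surrogate pair come from can be reproduced with the Rust standard library alone. The sketch below is illustrative only: the function name escape_with_surrogates is hypothetical and not part of grex's API. It uses char::encode_utf16, which yields one UTF-16 code unit for characters in the Basic Multilingual Plane and a surrogate pair for astral code points.

```rust
/// Hypothetical helper (not part of grex): escapes a single character
/// as one or two `\u{...}` sequences based on its UTF-16 encoding.
fn escape_with_surrogates(c: char) -> String {
    // A char needs at most two UTF-16 code units.
    let mut buffer = [0u16; 2];
    c.encode_utf16(&mut buffer)
        .iter()
        // `{{` and `}}` are literal braces, `{:x}` is lowercase hex.
        .map(|unit| format!("\\u{{{:x}}}", unit))
        .collect()
}
```

For 💩 (U+1F4A9) this produces \u{d83d}\u{dca9}, the same pair as in the expected regex above. For ♥ (U+2665) it produces the single sequence \u{2665}.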

5.2.5 Case-insensitive matching

The regular expressions that grex generates are case-sensitive by default. Case-insensitive matching can be enabled like so:

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_conversion_of(&[Feature::CaseInsensitivity])
    .build();
assert_eq!(regexp, "(?i)^big(?:ger)?$");

5.2.6 Capturing Groups

Non-capturing groups are used by default. Extending the previous example, you can switch to capturing groups instead.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_conversion_of(&[Feature::CaseInsensitivity, Feature::CapturingGroup])
    .build();
assert_eq!(regexp, "(?i)^big(ger)?$");

5.2.7 Syntax highlighting

The method with_syntax_highlighting() may only be used if the resulting regular expression is meant to be printed to the console. The regex string representation returned from enabling this setting cannot be fed into the regex crate.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_syntax_highlighting()
    .build();

5.3 Examples

The following examples show the various supported regex syntax features:

$ grex a b c
^[a-c]$

$ grex a c d e f
^[ac-f]$

$ grex a b x de
^(?:de|[abx])$

$ grex abc bc
^a?bc$

$ grex a b bc
^(?:bc?|a)$

$ grex [a-z]
^\[a\-z\]$

$ grex -r b ba baa baaa
^b(?:a{1,3})?$

$ grex -r b ba baa baaaa
^b(?:a{1,2}|a{4})?$

$ grex y̆ a z
^(?:y̆|[az])$
Note: 
Grapheme y̆ consists of two Unicode symbols:
U+0079 (Latin Small Letter Y)
U+0306 (Combining Breve)

$ grex "I ♥ cake" "I ♥ cookies"
^I ♥ c(?:ookies|ake)$
Note:
Input containing blank space must be 
surrounded by quotation marks.

The string "I ♥♥♥ 36 and ٣ and 💩💩." serves as input for the following examples using the command-line notation:

$ grex <INPUT>
^I ♥♥♥ 36 and ٣ and 💩💩\.$

$ grex -e <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{1f4a9}\u{1f4a9}\.$

$ grex -e --with-surrogates <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{d83d}\u{dca9}\u{d83d}\u{dca9}\.$

$ grex -d <INPUT>
^I ♥♥♥ \d\d and \d and 💩💩\.$

$ grex -s <INPUT>
^I\s♥♥♥\s36\sand\s٣\sand\s💩💩\.$

$ grex -w <INPUT>
^\w ♥♥♥ \w\w \w\w\w \w \w\w\w 💩💩\.$

$ grex -D <INPUT>
^\D\D\D\D\D\D36\D\D\D\D\D٣\D\D\D\D\D\D\D\D$

$ grex -S <INPUT>
^\S \S\S\S \S\S \S\S\S \S \S\S\S \S\S\S$

$ grex -dsw <INPUT>
^\w\s♥♥♥\s\d\d\s\w\w\w\s\d\s\w\w\w\s💩💩\.$

$ grex -dswW <INPUT>
^\w\s\W\W\W\s\d\d\s\w\w\w\s\d\s\w\w\w\s\W\W\W$

$ grex -r <INPUT>
^I ♥{3} 36 and ٣ and 💩{2}\.$

$ grex -er <INPUT>
^I \u{2665}{3} 36 and \u{663} and \u{1f4a9}{2}\.$

$ grex -er --with-surrogates <INPUT>
^I \u{2665}{3} 36 and \u{663} and (?:\u{d83d}\u{dca9}){2}\.$

$ grex -dgr <INPUT>
^I ♥{3} \d(\d and ){2}💩{2}\.$

$ grex -rs <INPUT>
^I\s♥{3}\s36\sand\s٣\sand\s💩{2}\.$

$ grex -rw <INPUT>
^\w ♥{3} \w(?:\w \w{3} ){2}💩{2}\.$

$ grex -Dr <INPUT>
^\D{6}36\D{5}٣\D{8}$

$ grex -rS <INPUT>
^\S \S(?:\S{2} ){2}\S{3} \S \S{3} \S{3}$

$ grex -rW <INPUT>
^I\W{5}36\Wand\W٣\Wand\W{4}$

$ grex -drsw <INPUT>
^\w\s♥{3}\s\d(?:\d\s\w{3}\s){2}💩{2}\.$

$ grex -drswW <INPUT>
^\w\s\W{3}\s\d(?:\d\s\w{3}\s){2}\W{3}$

6. How to build?

In order to build the source code yourself, you need the stable Rust toolchain installed on your machine so that cargo, the Rust package manager, is available.

git clone https://github.com/pemistahl/grex.git
cd grex
cargo build

The source code is accompanied by an extensive test suite consisting of unit tests, integration tests and property tests. For running the unit and integration tests, simply say:

cargo test

Property tests are disabled by default with the #[ignore] annotation because they are very long-running. They are used for automatically generating test cases for regular expression conversion. If a test case is found that produces a wrong conversion, it is shrunk to the shortest test case possible that still produces a wrong result. This is a very useful tool for finding bugs. If you want to run these tests, say:

cargo test -- --ignored

7. How does it work?

  1. A deterministic finite automaton (DFA) is created from the input strings.

  2. The number of states and transitions between states in the DFA is reduced by applying Hopcroft's DFA minimization algorithm.

  3. The minimized DFA is expressed as a system of linear equations which are solved with Brzozowski's algebraic method, resulting in the final regular expression.
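Step 1 can be sketched with a trie-shaped automaton. This is a minimal illustration under stated assumptions, not grex's actual implementation: the type Dfa and its methods from_test_cases and accepts are hypothetical names, the sketch works on chars instead of graphemes, and steps 2 and 3 (Hopcroft's minimization and Brzozowski's algebraic method) are omitted entirely.

```rust
use std::collections::BTreeMap;

/// Hypothetical trie-shaped DFA (not grex's data structure).
/// State 0 is the start state.
struct Dfa {
    transitions: Vec<BTreeMap<char, usize>>,
    accepting: Vec<bool>,
}

impl Dfa {
    /// Builds an unminimized DFA that accepts exactly the given test cases.
    fn from_test_cases(test_cases: &[&str]) -> Self {
        let mut dfa = Dfa {
            transitions: vec![BTreeMap::new()],
            accepting: vec![false],
        };
        for case in test_cases {
            let mut state = 0;
            for c in case.chars() {
                let existing = dfa.transitions[state].get(&c).copied();
                state = match existing {
                    // Reuse the transition shared with an earlier test case.
                    Some(next) => next,
                    // Otherwise create a fresh state and link it.
                    None => {
                        let next = dfa.transitions.len();
                        dfa.transitions.push(BTreeMap::new());
                        dfa.accepting.push(false);
                        dfa.transitions[state].insert(c, next);
                        next
                    }
                };
            }
            // The state reached at the end of a test case is accepting.
            dfa.accepting[state] = true;
        }
        dfa
    }

    /// Returns true if `input` is one of the original test cases.
    fn accepts(&self, input: &str) -> bool {
        let mut state = 0;
        for c in input.chars() {
            match self.transitions[state].get(&c) {
                Some(&next) => state = next,
                None => return false,
            }
        }
        self.accepting[state]
    }
}
```

For the test cases a, aa and aaa, this automaton has four states and accepts exactly those three strings. A trie shares common prefixes but not common suffixes, which is one reason why a minimization step like step 2 is worthwhile.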

8. Do you want to contribute?

If you want to contribute something to grex even though it is in a very early stage of development, I encourage you to do so. Do you have ideas for cool features? Or have you found any bugs so far? Feel free to open an issue or send a pull request. It's very much appreciated. :-)

Comments
  • Add option to exclude test cases

    This adds a new option --file-negative, which contains a list of negative test cases. The resulting regex will strictly not match any of these test cases. This fixes #16.


    To support negation, a second DFA is built of the negative cases, and then subtracted from the positive case DFA, using the standard DFA combination algorithm. To limit the number of nodes generated, combinations of nodes in the two DFAs are visited in depth-first order. Nodes that only occur in the negative match DFA are not visited.

    Because the repetition feature can produce grapheme transitions in the DFA that are variable length, code is added to calculate the overlap of two grapheme ranges.

    The generated graphs can contain 'dead ends' so some code is added to remove those. Some bug fixes for corner cases that were previously not hit were also necessary in the recreate_graph function. Also find_next_state was written to use the new grapheme overlapping function, to prevent sometimes creating multiple conflicting edges out of a node.

    As part of this, a bug was fixed that previously caused blank lines in the input to not be considered in the final regex, because the "initial" state could never be considered an accept state.

    I got rid of final_state_indices and moved that information into the node label. I also added descriptive labels to nodes to aid debugging.

    Adds appropriate tests. All pass. Ran through cargo fmt and cargo clippy.

    I haven't written much rust before so please let me know if there are any issues.

    opened by allanlw 10
  • Problems to consider when making anchors optional

    It seems grex inherited this bug from regexgen: https://github.com/devongovett/regexgen/issues/31

    Repro:

    $ cat input
    AGBHD
    EIBCD
    EGBCD
    FBJBF
    AGBH
    EIBC
    EGBC
    EBC
    FBC
    CD
    F
    C
    ABCD
    EBCD
    FBCD
    
    $ # note the last entry to be matched, i.e. "FBCD"
    
    $ grex --file input
    ^(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))$
    

    After removing ^ and $ (see #30), this generated pattern does not match "FBCD" despite it being one of the input strings:

    'FBCD'.match(/(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))/g);
    // → ['F', 'CD']
    

    Here’s what I think the bug is: within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match.

    See https://github.com/devongovett/regexgen/issues/31#issuecomment-801380409 for some more details.

    enhancement 
    opened by mathiasbynens 6
  • Add optional CLI feature if using grex as a library

    Hi @pemistahl, first off, thanks for this wonderful library!

    But would it be possible to have an optional CLI feature in cargo.toml?

    In that way, if I'm using grex as a library, I don't need to get dependencies like structopt included in my project.

    enhancement 
    opened by jqnatividad 5
  • Grex hash for v1.2.0 release fails to verify in scoop

    Simple as that:

    λ scoop install grex
    Installing 'grex' (1.2.0) [64bit]
    grex-v1.2.0-x86_64-pc-windows-msvc.zip (792,6 KB) [==========================================================================================================================] 100%
    Checking hash of grex-v1.2.0-x86_64-pc-windows-msvc.zip ... ERROR Hash check failed!
    App:         main/grex
    URL:         https://github.com/pemistahl/grex/releases/download/v1.2.0/grex-v1.2.0-x86_64-pc-windows-msvc.zip
    First bytes: 50 4B 03 04 14 00 00 00
    Expected:    d075efdbccb01c8b093b6c5120d064cc5ead534dec483c1a3d43cc4543d940ea
    Actual:      da9c50a4e19cbf7b1c4a001a9252c1a097b8eebbb9ec0bbf3f88bc79030e7d73
    

    Creating this issue as this may be an overlooked thing. Another option is that I'm facing a MITM attack, which would be the worst-case scenario ;)

    opened by piaseckim 5
  • Make anchors "^" and "$" optional

    Additional options:

    -B, --match-beginning - Match the beginning of the string (prepend ^)
    -E, --match-end - Match the end of the string (append $)
    -X, --match-line - Match the whole string (as a shorthand for -B -E)

    It's the result of the discussion in the issue pemistahl/grex#30. Sorry if some of my modifications look silly. It's my attempt to understand Rust from scratch.

    opened by ildar-shaimordanov 4
  • Overly complex regex with input containing several common parts

    While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in https://github.com/spdx/license-list-XML/issues/988), I noticed that grex produces a more complex regex than what the input requires.

    Here's what I provided:

    grex \
      "This work is free of known copyright restrictions." \
      "This work (WWW) is free of known copyright restrictions." \
      "This work (by AAA) is free of known copyright restrictions." \
      "This work, identified by CCC, is free of known copyright restrictions." \
      "This work (WWW, by AAA) is free of known copyright restrictions." \
      "This work (WWW), identified by CCC, is free of known copyright restrictions." \
      "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions." \
      "This work (by AAA), identified by CCC, is free of known copyright restrictions."
    

    The result was (after manually making groups non-capturing):

    ^This work(?:(?: \((?:(?:WWW, b|b)y AAA|WWW)\),|,) identified by CCC, |(?: \((?:(?:WWW, b|b)y AAA|WWW)\) | ))is free of known copyright restrictions\.$
    

    Visualized as a Debuggex diagram:

    Screenshot 2020-03-10 at 14 20 39

    A regex produced by hand to match the same input shows that this could be simplified:

    ^This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\.$
    

    Debuggex diagram:

    Screenshot 2020-03-10 at 12 22 50

    wontfix 
    opened by waldyrious 4
  • Add feature for disabling capturing groups

    grex produces regular expressions with capturing groups by default. Some users might prefer to create regexes with non-capturing groups instead, so I will add a new library method and a new command-line flag for handling this use case.

    enhancement 
    opened by pemistahl 4
  • Optional anchors "^" and "$"

    Added options to suppress anchors:

    -B, --no-match-beginning - Match the beginning of the string (prepend ^)
    -E, --no-match-end - Match the end of the string (append $)
    -X, --no-match-line - Match the whole string (as a shorthand for -B -E)

    This PR is intended to close the issue pemistahl/grex#30 and my previous GH-39 as this one covers the requirements to keep anchors by default.

    opened by ildar-shaimordanov 3
  • Couldn't compile ndarray

    Hey I'm on Debian bullseye and cargo install grex won't succeed :

    [lots of errors]
    error: aborting due to 204 previous errors
    
    For more information about this error, try `rustc --explain E0277`.
    error: could not compile `ndarray`.
    

    ndarray version = 0.15.1
    cargo version = 1.47

    has someone experienced the same issue ?

    opened by 0-Kala-0 3
  • Installation problem

    When installing grex on Debian Linux, I get 365 syntax errors. They seem to be many repetitions of:

    the trait data_traits::RawDataSubst<u128> is not implemented for <S as data_traits::DataOwned>::MaybeUninit
    348 | impl_scalar_lhs_op!(Complex, Ordered, /, Div, div, "division");
        | -------------------------------------------------------------------- in this macro invocation
        |
       ::: /home/greg/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.15.0/src/data_traits.rs:411:1
        |
    411 | pub unsafe trait DataOwned: Data {
        | -------------------------------- required by data_traits::DataOwned

    I seem to get the same blast whether I run cargo from the command-line or in vscode. I installed with:

    $ git clone https://github.com/pemistahl/grex.git
    $ cd grex
    $ cargo build

    and

    $ cargo install grex

    Since I don't see any other complaints perhaps my distribution is to blame:

    $ uname -a
    Linux debian-dell-desktop 5.10.0-4-amd64 #1 SMP Debian 5.10.19-1 (2021-03-02) x86_64 GNU/Linux

    Creating an empty project with grex as the only dependency also fails. Version 1.1 of grex seems to run fine. I've only been programming rust for about a year, so I haven't gotten to writing macros yet, but I'll try to dig deeper.

    -- Greg

    opened by GregLawson 3
  • Inserting a character breaks repetition detection (sometimes)

    I have been looking for a way to find repeated substrings. I think I can parse grex results to find repetitions, and given that my strings are rather short, I could then compare group contents to find non-contiguous repetitions.

    I did some quick tests and I may have chanced upon a problem:

    • grex -dsr -c 'heeelooo world lalala lalala foo foo xalxalxal xalxalxal'

      gives ^he{3}lo{3}\sworld(?:\s(?:la){3}){2}(?:\sfo{2}){2}(?:\s(?:xal){3}){2}$

    • grex -dsr -c 'heeelooo world lalala lalala foo foo xalxalxal i xalxalxal'

      gives ^he{3}lo{3}\sworld\s(?:(?:la){3}\s){2}(?:fo{2}\s){2}(?:xal){3}\si\s(?:xal){3}$

    • grex -dsr -c 'heeelooo world lalala k lalala foo foo xalxalxal i xalxalxal'

      gives ^he{3}lo{3}\sworld\slalala\sk\slalala\s(?:fo{2}\s){2}(?:xal){3}\si\s(?:xal){3}$

    In the last probe, neither of the two lalala was detected as repetitious when a k was inserted, although xalxalxal was treated as expected. Any thoughts?

    bug 
    opened by loveencounterflow 3
  • Treat diffs as separate groups

    For example:

    
    <iframe src="//player.bilibili.com/player.html?aid=303065226&bvid=BV1dP411n7bc&cid=833485551&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
    
    <iframe src="//player.bilibili.com/player.html?aid=261233537&bvid=BV1xe411j7EQ&cid=851171461&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
    
    <iframe src="//player.bilibili.com/player.html?aid=558528772&bvid=BV1Ee4y1r7wX&cid=848823074&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
    
    
    <iframe src="//player.bilibili.com/player.html?aid=455751094&bvid=BV1U5411s7RU&cid=383073940&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
    
    

    diff:

    aid=303065226&bvid=BV1dP411n7bc&cid=833485551
    aid=261233537&bvid=BV1xe411j7EQ&cid=851171461
    aid=558528772&bvid=BV1Ee4y1r7wX&cid=848823074
    aid=455751094&bvid=BV1U5411s7RU&cid=383073940
    

    regex:

    aid=([0-9]+)&bvid=([0-9a-zA-Z]+)&cid=([0-9]+)
    

    current output grex -f grex.txt -g:

    <iframe src="//player\.bilibili\.com/player\.html\?aid=(((261233537&bvid=BV1xe411j7EQ&cid=85117146|303065226&bvid=BV1dP411n7bc&cid=83348555)1|455751094&bvid=BV1U5411s7RU&cid=383073940)|558528772&bvid=BV1Ee4y1r7wX&cid=848823074)&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>$
    suliveevil@swy-M1 ~ % grex -f grex.txt -r
    ^<iframe src="/{2}player\.(?:bili){2}\.com/player\.html\?aid=(?:(?:45{2}751094&bvid=BV1U541{2}s7RU&cid=383073940|(?:26123{2}537&bvid=BV1xe41{2}j7EQ&cid=851{2}7146|(?:30){2}652{2}6&bvid=BV1dP41{2}n7bc&cid=83{2}485{3})1)|5{2}85287{2}2&bvid=BV1Ee4y1r7wX&cid=848{2}23074)&page=1" scrol{2}ing="no" border="0" frameborder="no" framespacing="0" al{2}owful{2}scre{2}n="true"> </iframe>
    
    Screenshot 2022-10-04 05 23 50

    expected output:

    <iframe src="//player\.bilibili\.com/player\.html\?aid=([0-9]+)&bvid=([0-9a-zA-Z]+)&cid=([0-9]+)&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>
    
    Screenshot 2022-10-04 05 24 20

    https://github.com/pemistahl/grex/issues/48

    new feature 
    opened by suliveevil 2
  • Allow to specify characters that have to be converted to character class

    First of all thank you for this great tool. When using, I often need to convert text into more detailed character classes, not just non-digits or non-blank characters. Is it possible to customize the range of characters to be converted into character classes, like [a-e\d], [①-⑨⒈-⒙] or specific languages such as Chinese and Japanese. For example, if the source text is 我的名字是Tom, I hope to get the regular expression [\u{4e00}-\u{9fa5}]{5}\w{3} instead of \w{8}, by specifying character class [\u{4e00}-\u{9fa5}]. And I want to specify the maximum and minimum length of repeated substrings. Sometimes I get results like (\w{5}|\w{7,8}|\w{10,17}), but the regular expression I expected is (\w{3,20}). So I hope to be able to specify the minimum and maximum repetition times of the substring, or combine the repetition times into an interval instead of multiple branches. I think these two points can be specified together, using multiple formats similar to \w{3,20} to specify characters that must be converted into character classes.

    new feature 
    opened by NightWatch0 2
  • Provide more installation options

    Judging by the description, I think this is a great tool. But I use Ubuntu and I don't have Homebrew (and I don't want to install it only for this tool). I cannot try this tool :( I think a wget one-liner install script would be a good idea, or a step-by-step description, or apt, or anything that a typical Ubuntu laptop can do without installing another package manager.

    enhancement 
    opened by sarkiroka 7
  • Allow to provide test cases that must not be matched

    Currently, only test cases that must be matched by the generated regular expression can be provided. It would be useful to additionally provide test cases that must not be matched by the generated expression. In combination with shorthand character classes this would allow for more specific and versatile regular expressions.

    new feature 
    opened by pemistahl 2
Releases(v1.4.1)
Owner
Peter M. Stahl
Computational linguist, Rust enthusiast, green IT advocate