A command-line tool and library for generating regular expressions from user-provided test cases

Peter M. Stahl

Last update: Dec 30, 2022

Overview

1. What does this tool do?
2. Do I still need to learn to write regexes then?
3. Current features
4. How to install?
4.1 The command-line tool
4.2 The library
5. How to use?
5.1 The command-line tool
5.2 The library
5.3 Examples
6. How to build?
7. How does it work?
8. Do you want to contribute?

1. What does this tool do?

grex is a library as well as a command-line utility that is meant to simplify the often complicated and tedious task of creating regular expressions. It does so by automatically generating a single regular expression from user-provided test cases. The resulting expression is guaranteed to match the test cases which it was generated from.

This project started as a Rust port of the JavaScript tool regexgen written by Devon Govett. Although a lot of further useful features could be added to it, its development apparently ceased several years ago. The plan is now to add these new features to grex, as Rust really shines when it comes to command-line tools. grex offers all features that regexgen provides, and more.

The philosophy of this project is to generate the most specific regular expression possible by default which exactly matches the given input only and nothing else. With the use of command-line flags (in the CLI tool) or preprocessing methods (in the library), more generalized expressions can be created.

The produced expressions are Perl-compatible regular expressions which are also compatible with the regular expression parser in Rust's regex crate. Other regular expression parsers or respective libraries from other programming languages have not been tested so far, but they ought to be mostly compatible as well.

2. Do I still need to learn to write regexes then?

Definitely, yes! Using the standard settings, grex produces a regular expression that is guaranteed to match only the test cases given as input and nothing else. This has been verified by property tests. However, if the conversion to shorthand character classes such as \w is enabled, the resulting regex matches a much wider scope of test cases. Knowledge about the consequences of this conversion is essential for finding a correct regular expression for your business domain.
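To make the widening effect of shorthand classes more concrete, here is a minimal sketch that does not use grex at all. It assumes that a test case such as abc is converted to three consecutive \w tokens (as the examples with the -w flag suggest) and approximates \w with Rust's char::is_alphanumeric plus the underscore. The function names is_word_char and matches_word_run are hypothetical, and the approximation is not identical to the Unicode definition of \w.

```rust
/// Rough approximation of the Unicode-aware `\w` class (assumption:
/// alphanumeric characters plus the underscore).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Checks whether `candidate` would be accepted by a pattern made of exactly
/// `length` consecutive `\w` tokens, e.g. `^\w\w\w$` for `length == 3`.
fn matches_word_run(candidate: &str, length: usize) -> bool {
    candidate.chars().count() == length && candidate.chars().all(is_word_char)
}
```

With length 3, the original test case abc is accepted, but so are strings such as x_9 that were never part of the input, while a-c and abcd are rejected. This is the kind of consequence that has to be checked against the business domain.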

grex uses an algorithm that tries to find the shortest possible regex for the given test cases. Very often though, the resulting expression is still longer or more complex than it needs to be. In such cases, a more compact or elegant regex can be created only by hand. Also, every regular expression engine has different built-in optimizations. grex does not know anything about those and therefore cannot optimize its regexes for a specific engine.

So, please learn how to write regular expressions! The currently best use case for grex is to find an initial correct regex which should be inspected by hand if further optimizations are possible.

3. Current Features

literals
character classes
detection of common prefixes and suffixes
detection of repeated substrings and conversion to {min,max} quantifier notation
alternation using | operator
optionality using ? quantifier
escaping of non-ascii characters, with optional conversion of astral code points to surrogate pairs
case-sensitive or case-insensitive matching
capturing or non-capturing groups
fully compliant with Unicode Standard 13.0
fully compatible with regex crate 1.3.5+
correctly handles graphemes consisting of multiple Unicode symbols
reads input strings from the command-line or from a file
optional syntax highlighting for nicer output in supported terminals

4. How to install?

4.1 The command-line tool

You can download the self-contained executable for your platform above and put it in a place of your choice. Alternatively, pre-compiled 64-bit binaries are available from the package managers Scoop (for Windows) and Homebrew (for macOS and Linux).

grex is also hosted on crates.io, the official Rust package registry. If you are a Rust developer and already have the Rust toolchain installed, you can install by compiling from source using cargo, the Rust package manager. So the summary of your installation options is:

( scoop | brew | cargo ) install grex

4.2 The library

In order to use grex as a library, simply add it as a dependency to your Cargo.toml file:

[dependencies]
grex = "1.1.0"

5. How to use?

Detailed explanations of the available settings are provided in the library section. All settings can be freely combined with each other.

5.1 The command-line tool

$ grex -h

grex 1.1.0
© 2019-2020 Peter M. Stahl
Licensed under the Apache License, Version 2.0
Downloadable from https://crates.io/crates/grex
Source code at https://github.com/pemistahl/grex

grex generates regular expressions from user-provided test cases.

USAGE:
    grex [FLAGS] [OPTIONS] <INPUT>... --file <FILE>

FLAGS:
    -d, --digits             Converts any Unicode decimal digit to \d
    -D, --non-digits         Converts any character which is not a Unicode decimal digit to \D
    -s, --spaces             Converts any Unicode whitespace character to \s
    -S, --non-spaces         Converts any character which is not a Unicode whitespace character to \S
    -w, --words              Converts any Unicode word character to \w
    -W, --non-words          Converts any character which is not a Unicode word character to \W
    -r, --repetitions        Detects repeated non-overlapping substrings and
                             converts them to {min,max} quantifier notation
    -e, --escape             Replaces all non-ASCII characters with unicode escape sequences
        --with-surrogates    Converts astral code points to surrogate pairs if --escape is set
    -i, --ignore-case        Performs case-insensitive matching, letters match both upper and lower case
    -g, --capture-groups     Replaces non-capturing groups by capturing ones
    -c, --colorize           Provides syntax highlighting for the resulting regular expression
    -h, --help               Prints help information
    -v, --version            Prints version information

OPTIONS:
    -f, --file <FILE>                      Reads test cases on separate lines from a file
        --min-repetitions <QUANTITY>       Specifies the minimum quantity of substring repetitions
                                           to be converted if --repetitions is set [default: 1]
        --min-substring-length <LENGTH>    Specifies the minimum length a repeated substring must have
                                           in order to be converted if --repetitions is set [default: 1]

ARGS:
    <INPUT>...    One or more test cases separated by blank space

5.2 The library

5.2.1 Default settings

Test cases are passed either from a collection via RegExpBuilder::from() or from a file via RegExpBuilder::from_file(). If read from a file, each test case must be on a separate line. Lines may end with either a newline \n or a carriage return followed by a line feed \r\n.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "aaa"]).build();
assert_eq!(regexp, "^a(?:aa?)?$");
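The line-ending rule for files described above can be illustrated with the standard library alone. The following is only a sketch of how such input could be split, not grex's actual implementation; the function name split_test_cases is hypothetical. It relies on str::lines, which treats both \n and \r\n as line terminators.

```rust
/// Splits file content into test cases, one per line.
/// `str::lines` handles both `\n` and `\r\n` line endings.
fn split_test_cases(content: &str) -> Vec<&str> {
    content.lines().collect()
}
```

For the content "a\naa\r\naaa" this yields the three test cases a, aa and aaa, regardless of the mixed line endings.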

5.2.2 Convert to character classes

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_conversion_of(&[Feature::Digit, Feature::Word])
    .build();
assert_eq!(regexp, "^(?:\\d\\d\\d|\\w(?:\\w)?)$");

5.2.3 Convert repeated substrings

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .build();
assert_eq!(regexp, "^(?:a{2}|(?:bc){2}|(?:def){3})$");

By default, grex converts every substring that is at least one character long and is subsequently repeated at least once. You can customize these two parameters if you like.

In the following example, the test case aa is not converted to a{2} because the repeated substring a has a length of 1, but the minimum substring length has been set to 2.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .with_minimum_substring_length(2)
    .build();
assert_eq!(regexp, "^(?:aa|(?:bc){2}|(?:def){3})$");

With a minimum of 2 repetitions set in the next example, only the test case defdefdef is converted because it is the only one whose substring is repeated twice.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["aa", "bcbc", "defdefdef"])
    .with_conversion_of(&[Feature::Repetition])
    .with_minimum_repetitions(2)
    .build();
assert_eq!(regexp, "^(?:bcbc|aa|(?:def){3})$");
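The interplay of the two lower bounds in this section can be sketched without grex. The code below is a simplified illustration under stated assumptions: it only checks whether a whole test case consists of one repeated unit (grex also detects repetitions inside longer strings), it works on chars rather than graphemes, and the function name detect_repetition is hypothetical. As in the text above, a "repetition" counts the occurrences after the first one.

```rust
/// Returns the shortest repeated unit and its number of occurrences if the
/// whole input consists of that unit and both lower bounds are satisfied.
fn detect_repetition(
    input: &str,
    min_repetitions: usize,
    min_substring_length: usize,
) -> Option<(String, usize)> {
    let chars: Vec<char> = input.chars().collect();
    let n = chars.len();
    // A unit can be at most half as long as the input, otherwise it is not repeated.
    for unit_len in min_substring_length.max(1)..=n / 2 {
        if n % unit_len != 0 {
            continue;
        }
        let unit = &chars[..unit_len];
        if chars.chunks(unit_len).all(|chunk| chunk == unit) {
            let occurrences = n / unit_len;
            // Repetitions are the occurrences following the first one.
            if occurrences - 1 >= min_repetitions {
                return Some((unit.iter().collect(), occurrences));
            }
        }
    }
    None
}
```

With the default bounds (1, 1), aa is reported as a occurring 2 times. With a minimum substring length of 2, aa is no longer reported but bcbc still is. With a minimum of 2 repetitions, only defdefdef qualifies, which mirrors the three examples above.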

5.2.4 Escape non-ascii characters

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(false)
    .build();
assert_eq!(regexp, "^You smell like \\u{1f4a9}\\.$");

Old versions of JavaScript do not support unicode escape sequences for the astral code planes (range U+010000 to U+10FFFF). In order to support these symbols in JavaScript regular expressions, the conversion to surrogate pairs is necessary. More information on that matter can be found here.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["You smell like 💩."])
    .with_escaping_of_non_ascii_chars(true)
    .build();
assert_eq!(regexp, "^You smell like \\u{d83d}\\u{dca9}\\.$");
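How the surrogate pair in the example above comes about can be shown with the standard library. This is a minimal sketch, not grex's implementation: the function name escape_non_ascii is hypothetical, it does not escape regex metacharacters such as the dot, and it simply uses char::encode_utf16 to obtain the UTF-16 code units of each non-ASCII character.

```rust
/// Replaces every non-ASCII character with `\u{...}` escapes. If
/// `use_surrogate_pairs` is true, each character is written as its UTF-16
/// code units, so astral code points become two escapes.
fn escape_non_ascii(input: &str, use_surrogate_pairs: bool) -> String {
    let mut out = String::new();
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c);
        } else if use_surrogate_pairs {
            let mut buf = [0u16; 2];
            for unit in c.encode_utf16(&mut buf).iter() {
                out.push_str(&format!("\\u{{{:x}}}", unit));
            }
        } else {
            out.push_str(&format!("\\u{{{:x}}}", c as u32));
        }
    }
    out
}
```

For 💩 (U+1F4A9) this produces \u{1f4a9} without surrogates and \u{d83d}\u{dca9} with surrogates. Characters from the Basic Multilingual Plane such as ♥ (U+2665) consist of a single UTF-16 code unit and are therefore escaped identically in both modes.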

5.2.5 Case-insensitive matching

The regular expressions that grex generates are case-sensitive by default. Case-insensitive matching can be enabled like so:

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_conversion_of(&[Feature::CaseInsensitivity])
    .build();
assert_eq!(regexp, "(?i)^big(?:ger)?$");

5.2.6 Capturing Groups

Non-capturing groups are used by default. Extending the previous example, you can switch to capturing groups instead.

use grex::{Feature, RegExpBuilder};

let regexp = RegExpBuilder::from(&["big", "BIGGER"])
    .with_conversion_of(&[Feature::CaseInsensitivity, Feature::CapturingGroup])
    .build();
assert_eq!(regexp, "(?i)^big(ger)?$");

5.2.7 Syntax highlighting

⚠ The method with_syntax_highlighting() may only be used if the resulting regular expression is meant to be printed to the console. The regex string representation returned from enabling this setting cannot be fed into the regex crate.

use grex::RegExpBuilder;

let regexp = RegExpBuilder::from(&["a", "aa", "123"])
    .with_syntax_highlighting()
    .build();

5.3 Examples

The following examples show the various supported regex syntax features:

$ grex a b c
^[a-c]$

$ grex a c d e f
^[ac-f]$

$ grex a b x de
^(?:de|[abx])$

$ grex abc bc
^a?bc$

$ grex a b bc
^(?:bc?|a)$

$ grex [a-z]
^\[a\-z\]$

$ grex -r b ba baa baaa
^b(?:a{1,3})?$

$ grex -r b ba baa baaaa
^b(?:a{1,2}|a{4})?$

$ grex y̆ a z
^(?:y̆|[az])$
Note: Grapheme y̆ consists of two Unicode symbols: U+0079 (Latin Small Letter Y) and U+0306 (Combining Breve).

$ grex "I ♥ cake" "I ♥ cookies"
^I ♥ c(?:ookies|ake)$
Note: Input containing blank space must be surrounded by quotation marks.

The string "I ♥♥♥ 36 and ٣ and 💩💩." serves as input for the following examples using the command-line notation:

$ grex <INPUT>
^I ♥♥♥ 36 and ٣ and 💩💩\.$

$ grex -e <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{1f4a9}\u{1f4a9}\.$

$ grex -e --with-surrogates <INPUT>
^I \u{2665}\u{2665}\u{2665} 36 and \u{663} and \u{d83d}\u{dca9}\u{d83d}\u{dca9}\.$

$ grex -d <INPUT>
^I ♥♥♥ \d\d and \d and 💩💩\.$

$ grex -s <INPUT>
^I\s♥♥♥\s36\sand\s٣\sand\s💩💩\.$

$ grex -w <INPUT>
^\w ♥♥♥ \w\w \w\w\w \w \w\w\w 💩💩\.$

$ grex -D <INPUT>
^\D\D\D\D\D\D36\D\D\D\D\D٣\D\D\D\D\D\D\D\D$

$ grex -S <INPUT>
^\S \S\S\S \S\S \S\S\S \S \S\S\S \S\S\S$

$ grex -dsw <INPUT>
^\w\s♥♥♥\s\d\d\s\w\w\w\s\d\s\w\w\w\s💩💩\.$

$ grex -dswW <INPUT>
^\w\s\W\W\W\s\d\d\s\w\w\w\s\d\s\w\w\w\s\W\W\W$

$ grex -r <INPUT>
^I ♥{3} 36 and ٣ and 💩{2}\.$

$ grex -er <INPUT>
^I \u{2665}{3} 36 and \u{663} and \u{1f4a9}{2}\.$

$ grex -er --with-surrogates <INPUT>
^I \u{2665}{3} 36 and \u{663} and (?:\u{d83d}\u{dca9}){2}\.$

$ grex -dgr <INPUT>
^I ♥{3} \d(\d and ){2}💩{2}\.$

$ grex -rs <INPUT>
^I\s♥{3}\s36\sand\s٣\sand\s💩{2}\.$

$ grex -rw <INPUT>
^\w ♥{3} \w(?:\w \w{3} ){2}💩{2}\.$

$ grex -Dr <INPUT>
^\D{6}36\D{5}٣\D{8}$

$ grex -rS <INPUT>
^\S \S(?:\S{2} ){2}\S{3} \S \S{3} \S{3}$

$ grex -rW <INPUT>
^I\W{5}36\Wand\W٣\Wand\W{4}$

$ grex -drsw <INPUT>
^\w\s♥{3}\s\d(?:\d\s\w{3}\s){2}💩{2}\.$

$ grex -drswW <INPUT>
^\w\s\W{3}\s\d(?:\d\s\w{3}\s){2}\W{3}$
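The character-class examples at the start of this section (a b c becomes [a-c], a c d e f becomes [ac-f], and a, b, x stay as [abx]) suggest that runs of three or more consecutive characters are written as ranges. The sketch below reproduces that behavior under this assumption. It is not taken from grex's source, the function name to_char_class is hypothetical, and it does not escape characters that are special inside a character class.

```rust
/// Builds a character class from single characters, collapsing runs of at
/// least three consecutive code points into a range such as `c-f`.
fn to_char_class(chars: &[char]) -> String {
    let mut sorted = chars.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut out = String::from("[");
    let mut i = 0;
    while i < sorted.len() {
        let start = sorted[i];
        let mut end = start;
        // Extend the run while the next character directly follows the current one.
        while i + 1 < sorted.len() && sorted[i + 1] as u32 == end as u32 + 1 {
            i += 1;
            end = sorted[i];
        }
        let run_len = end as u32 - start as u32 + 1;
        out.push(start);
        if run_len >= 3 {
            out.push('-');
        }
        if run_len >= 2 {
            out.push(end);
        }
        i += 1;
    }
    out.push(']');
    out
}
```

Calling to_char_class with a, c, d, e, f yields [ac-f], matching the second example of this section.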

6. How to build?

In order to build the source code yourself, you need the stable Rust toolchain installed on your machine so that cargo, the Rust package manager, is available.

git clone https://github.com/pemistahl/grex.git
cd grex
cargo build

The source code is accompanied by an extensive test suite consisting of unit tests, integration tests and property tests. For running the unit and integration tests, simply say:

cargo test

Property tests are disabled by default with the #[ignore] annotation because they are very long-running. They are used for automatically generating test cases for regular expression conversion. If a test case is found that produces a wrong conversion, it is shrunk to the shortest test case possible that still produces a wrong result. This is a very useful tool for finding bugs. If you want to run these tests, say:

cargo test -- --ignored

7. How does it work?

1. A deterministic finite automaton (DFA) is created from the input strings.
2. The number of states and transitions between states in the DFA is reduced by applying Hopcroft's DFA minimization algorithm.
3. The minimized DFA is expressed as a system of linear equations which are solved with Brzozowski's algebraic method, resulting in the final regular expression.
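The first of these steps can be sketched in a few lines. The code below is only an illustration under simplifying assumptions and is not grex's implementation: it builds an unminimized, trie-shaped DFA, it uses chars instead of graphemes as transition labels, and the type Dfa with its methods is hypothetical. Hopcroft's minimization and Brzozowski's method are left out.

```rust
use std::collections::BTreeMap;

/// A DFA whose states are indices; state 0 is the initial state.
struct Dfa {
    transitions: Vec<BTreeMap<char, usize>>,
    accepting: Vec<bool>,
}

impl Dfa {
    /// Builds a trie-shaped DFA that accepts exactly the given test cases.
    fn from_test_cases(test_cases: &[&str]) -> Self {
        let mut dfa = Dfa {
            transitions: vec![BTreeMap::new()],
            accepting: vec![false],
        };
        for case in test_cases {
            let mut state = 0;
            for c in case.chars() {
                let existing = dfa.transitions[state].get(&c).copied();
                state = match existing {
                    Some(next) => next,
                    None => {
                        // No transition for this character yet: add a new state.
                        let next = dfa.transitions.len();
                        dfa.transitions.push(BTreeMap::new());
                        dfa.accepting.push(false);
                        dfa.transitions[state].insert(c, next);
                        next
                    }
                };
            }
            dfa.accepting[state] = true;
        }
        dfa
    }

    /// Follows the transitions for `input` and reports whether the run ends
    /// in an accepting state.
    fn accepts(&self, input: &str) -> bool {
        let mut state = 0;
        for c in input.chars() {
            match self.transitions[state].get(&c) {
                Some(&next) => state = next,
                None => return false,
            }
        }
        self.accepting[state]
    }
}
```

For the test cases a, aa and aaa this produces four states, three of them accepting. The automaton accepts aa but rejects aaaa and the empty string, which corresponds to the default behavior of matching only the given input.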

8. Do you want to contribute?

In case you want to contribute something to grex even though it's in a very early stage of development, then I encourage you to do so nevertheless. Do you have ideas for cool features? Or have you found any bugs so far? Feel free to open an issue or send a pull request. It's very much appreciated. :-)

Comments

Add option to exclude test cases

This adds a new option --file-negative, which contains a list of negative test cases. The resulting regex will strictly not match any of these test cases. This fixes #16.

To support negation, a second DFA is built of the negative cases, and then subtracted from the positive case DFA, using the standard DFA combination algorithm. To limit the number of nodes generated, combinations of nodes in the two DFAs are visited in depth-first order. Nodes that only occur in the negative match DFA are not visited.

Because the repetition feature can produce grapheme transitions in the DFA that are variable length, code is added to calculate the overlap of two grapheme ranges.

The generated graphs can contain 'dead ends', so some code is added to remove those. Some bug fixes for corner cases in the recreate_graph function that were previously not hit were also necessary. Also, find_next_state was written to use the new grapheme overlapping function, to prevent sometimes creating multiple conflicting edges out of a node.

As part of this, a bug was fixed that previously caused blank lines in the input to not be considered in the final regex, because the "initial" state could never be considered an accept state.

I got rid of final_state_indices and moved that information into the node label. I also added descriptive labels to nodes to aid debugging.

Adds appropriate tests. All pass. Ran through cargo fmt and cargo clippy.

I haven't written much Rust before, so please let me know if there are any issues.

opened by allanlw 10
Problems to consider when making anchors optional
It seems grex inherited this bug from regexgen: https://github.com/devongovett/regexgen/issues/31

Repro:

$ cat input
AGBHD
EIBCD
EGBCD
FBJBF
AGBH
EIBC
EGBC
EBC
FBC
CD
F
C
ABCD
EBCD
FBCD
$ # note the last entry to be matched, i.e. "FBCD"
$ grex --file input
^(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))$

After removing ^ and $ (see #30), this generated pattern does not match "FBCD" despite it being one of the input strings:

'FBCD'.match(/(?:F(?:BJBF)?|(?:E(?:[GI])?BC|(?:FB)?C)D?|A(?:GBHD?|BCD))/g); // → ['F', 'CD']

Here’s what I think the bug is: within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match.

See https://github.com/devongovett/regexgen/issues/31#issuecomment-801380409 for some more details.
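The condition described in this report can be illustrated for the simplest case, an alternation of plain literals. The following sketch is an assumption-laden simplification and not part of grex or regexgen: the function name shadowed_alternatives is hypothetical, and it only handles literal branches, whereas the pattern in the report contains nested groups and optional parts. It lists every pair in which an earlier branch is a proper prefix of a later one, so that an unanchored, leftmost-first alternation would stop at the earlier branch.

```rust
/// Returns all pairs (earlier, later) where the earlier literal alternative
/// is a proper prefix of a later one and would therefore win the match
/// in an unanchored leftmost-first alternation.
fn shadowed_alternatives<'a>(alternatives: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let mut shadowed = Vec::new();
    for (i, earlier) in alternatives.iter().enumerate() {
        for later in &alternatives[i + 1..] {
            if later.len() > earlier.len() && later.starts_with(*earlier) {
                shadowed.push((*earlier, *later));
            }
        }
    }
    shadowed
}
```

For the branches F and FBCD in this order, the function reports the pair (F, FBCD). With the order reversed, nothing is reported, which hints at one possible remedy: placing longer alternatives before their prefixes.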
enhancement
opened by mathiasbynens 6
Add optional CLI feature if using grex as a library

Hi @pemistahl, first off, thanks for this wonderful library!

But would it be possible to have an optional CLI feature in cargo.toml?

In that way, if I'm using grex as a library, I don't need to get dependencies like structopt included in my project.
enhancement

opened by jqnatividad 5

Grex hash for v1.2.0 release fails to verify in scoop

Simple as that:

λ scoop install grex
Installing 'grex' (1.2.0) [64bit]
grex-v1.2.0-x86_64-pc-windows-msvc.zip (792,6 KB) [==========================================================================================================================] 100%
Checking hash of grex-v1.2.0-x86_64-pc-windows-msvc.zip ... ERROR Hash check failed!
App:         main/grex
URL:         https://github.com/pemistahl/grex/releases/download/v1.2.0/grex-v1.2.0-x86_64-pc-windows-msvc.zip
First bytes: 50 4B 03 04 14 00 00 00
Expected:    d075efdbccb01c8b093b6c5120d064cc5ead534dec483c1a3d43cc4543d940ea
Actual:      da9c50a4e19cbf7b1c4a001a9252c1a097b8eebbb9ec0bbf3f88bc79030e7d73

Creating this issue as this may be an overlooked thing. Another option is that I'm facing a MITM attack, which would be the worst-case scenario ;)

opened by piaseckim 5

Make anchors "^" and "$" optional

Additional options:
-B, --match-beginning - Match the beginning of the string (prepend ^)
-E, --match-end - Match the end of the string (append $)
-X, --match-line - Match the whole string (as a shorthand for -B -E)

It's the result of the discussion in the issue pemistahl/grex#30. Sorry if some of my modifications look silly. It's my attempt to understand Rust from scratch.

opened by ildar-shaimordanov 4

Overly complex regex with input containing several common parts

While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in https://github.com/spdx/license-list-XML/issues/988), I noticed that grex produces a more complex regex than what the input requires.

Here's what I provided:

grex \
  "This work is free of known copyright restrictions." \
  "This work (WWW) is free of known copyright restrictions." \
  "This work (by AAA) is free of known copyright restrictions." \
  "This work, identified by CCC, is free of known copyright restrictions." \
  "This work (WWW, by AAA) is free of known copyright restrictions." \
  "This work (WWW), identified by CCC, is free of known copyright restrictions." \
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions." \
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."

The result was (after manually making groups non-capturing):

^This work(?:(?: \((?:(?:WWW, b|b)y AAA|WWW)\),|,) identified by CCC, |(?: \((?:(?:WWW, b|b)y AAA|WWW)\) | ))is free of known copyright restrictions\.$

Visualized as a Debuggex diagram:

[Image: Debuggex diagram of the generated regex]

A regex produced by hand to match the same input shows that this could be simplified:

^This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\.$

Debuggex diagram:

[Image: Debuggex diagram of the hand-written regex]

wontfix

opened by waldyrious 4

Add feature for disabling capturing groups

grex produces regular expressions with capturing groups by default. Some users might prefer to create regexes with non-capturing groups instead, so I will add a new library method and a new command-line flag for handling this use case.
enhancement

opened by pemistahl 4
Optional anchors "^" and "$"

Added options to suppress anchors:
-B, --no-match-beginning - Match the beginning of the string (prepend ^)
-E, --no-match-end - Match the end of the string (append $)
-X, --no-match-line - Match the whole string (as a shorthand for -B -E)

This PR is intended to close the issue pemistahl/grex#30 and my previous GH-39 as this one covers the requirements to keep anchors by default.

opened by ildar-shaimordanov 3
Couldn't compile ndarray
Hey, I'm on Debian bullseye and cargo install grex won't succeed:

[lots of errors]
error: aborting due to 204 previous errors
For more information about this error, try `rustc --explain E0277`.
error: could not compile `ndarray`.

ndarray version = 0.15.1
cargo version = 1.47

Has anyone experienced the same issue?
opened by 0-Kala-0 3
Installation problem

When installing grex on Debian Linux, I get 365 syntax errors. They seem to be many repetitions of:

the trait data_traits::RawDataSubst<u128> is not implemented for <S as data_traits::DataOwned>::MaybeUninit
348 | impl_scalar_lhs_op!(Complex, Ordered, /, Div, div, "division");
    | -------------------------------------------------------------------- in this macro invocation
    |
   ::: /home/greg/.cargo/registry/src/github.com-1ecc6299db9ec823/ndarray-0.15.0/src/data_traits.rs:411:1
    |
411 | pub unsafe trait DataOwned: Data {
    | -------------------------------- required by data_traits::DataOwned

I seem to get the same blast whether I run cargo from the command-line or in vscode. I installed with:

$ git clone https://github.com/pemistahl/grex.git
$ cd grex
$ cargo build

and

$ cargo install grex

Since I don't see any other complaints perhaps my distribution is to blame:

$ uname -a
Linux debian-dell-desktop 5.10.0-4-amd64 #1 SMP Debian 5.10.19-1 (2021-03-02) x86_64 GNU/Linux

Creating an empty project with grex as the only dependency also fails. Version 1.1 of grex seems to run fine. I've only been programming Rust for about a year, so I haven't gotten to writing macros yet, but I'll try to dig deeper.

-- Greg

opened by GregLawson 3
Inserting a character breaks repetition detection (sometimes)
I have been looking for a way to find repeated substrings. I think I can parse grex results to find repetitions, and given that my strings are rather short, I could then compare group contents to find non-contiguous repetitions.

I did some quick tests and I may have chanced upon a problem:

grex -dsr -c 'heeelooo world lalala lalala foo foo xalxalxal xalxalxal'

gives ^he{3}lo{3}\sworld(?:\s(?:la){3}){2}(?:\sfo{2}){2}(?:\s(?:xal){3}){2}$

grex -dsr -c 'heeelooo world lalala lalala foo foo xalxalxal i xalxalxal'

gives ^he{3}lo{3}\sworld\s(?:(?:la){3}\s){2}(?:fo{2}\s){2}(?:xal){3}\si\s(?:xal){3}$

grex -dsr -c 'heeelooo world lalala k lalala foo foo xalxalxal i xalxalxal'

gives ^he{3}lo{3}\sworld\slalala\sk\slalala\s(?:fo{2}\s){2}(?:xal){3}\si\s(?:xal){3}$

In the last probe, neither of the two lalala was detected as repetitious when a k was inserted, although xalxalxal was treated as expected. Any thoughts?
bug
opened by loveencounterflow 3

Treat diffs as separate groups

For example:


<iframe src="//player.bilibili.com/player.html?aid=303065226&bvid=BV1dP411n7bc&cid=833485551&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

<iframe src="//player.bilibili.com/player.html?aid=261233537&bvid=BV1xe411j7EQ&cid=851171461&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

<iframe src="//player.bilibili.com/player.html?aid=558528772&bvid=BV1Ee4y1r7wX&cid=848823074&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>


<iframe src="//player.bilibili.com/player.html?aid=455751094&bvid=BV1U5411s7RU&cid=383073940&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

diff:

aid=303065226&bvid=BV1dP411n7bc&cid=833485551
aid=261233537&bvid=BV1xe411j7EQ&cid=851171461
aid=558528772&bvid=BV1Ee4y1r7wX&cid=848823074
aid=455751094&bvid=BV1U5411s7RU&cid=383073940

regex:

aid=([0-9]+)&bvid=([0-9a-zA-Z]+)&cid=([0-9]+)

current output grex -f grex.txt -g:

<iframe src="//player\.bilibili\.com/player\.html\?aid=(((261233537&bvid=BV1xe411j7EQ&cid=85117146|303065226&bvid=BV1dP411n7bc&cid=83348555)1|455751094&bvid=BV1U5411s7RU&cid=383073940)|558528772&bvid=BV1Ee4y1r7wX&cid=848823074)&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>$
suliveevil@swy-M1 ~ % grex -f grex.txt -r
^<iframe src="/{2}player\.(?:bili){2}\.com/player\.html\?aid=(?:(?:45{2}751094&bvid=BV1U541{2}s7RU&cid=383073940|(?:26123{2}537&bvid=BV1xe41{2}j7EQ&cid=851{2}7146|(?:30){2}652{2}6&bvid=BV1dP41{2}n7bc&cid=83{2}485{3})1)|5{2}85287{2}2&bvid=BV1Ee4y1r7wX&cid=848{2}23074)&page=1" scrol{2}ing="no" border="0" frameborder="no" framespacing="0" al{2}owful{2}scre{2}n="true"> </iframe>

expected output:

<iframe src="//player\.bilibili\.com/player\.html\?aid=([0-9]+)&bvid=([0-9a-zA-Z]+)&cid=([0-9]+)&page=1" scrolling="no" border="0" frameborder="no" framespacing="0" allowfullscreen="true"> </iframe>

https://github.com/pemistahl/grex/issues/48

new feature

opened by suliveevil 2

Allow to specify characters that have to be converted to character class

First of all, thank you for this great tool. When using it, I often need to convert text into more detailed character classes, not just non-digits or non-blank characters. Is it possible to customize the range of characters to be converted into character classes, like [a-e\d], [①-⑨⒈-⒙] or specific languages such as Chinese and Japanese? For example, if the source text is 我的名字是Tom, I hope to get the regular expression [\u{4e00}-\u{9fa5}]{5}\w{3} instead of \w{8}, by specifying the character class [\u{4e00}-\u{9fa5}].

I would also like to specify the maximum and minimum length of repeated substrings. Sometimes I get results like (\w{5}|\w{7,8}|\w{10,17}), but the regular expression I expected is (\w{3,20}). So I hope to be able to specify the minimum and maximum repetition count of the substring, or to combine the repetition counts into one interval instead of multiple branches. I think these two points could be specified together, using formats similar to \w{3,20} to specify characters that must be converted into character classes.
new feature

opened by NightWatch0 2
Provide more installation options

I think this is a great tool by description. But I use Ubuntu and I don't have Homebrew (and I don't want to install it only for this tool), so I cannot try this tool :( I think a good idea would be a wget one-liner install script, a step-by-step description, apt, or anything that a typical Ubuntu laptop can do without installing another package manager.
enhancement

opened by sarkiroka 7
Allow to provide test cases that must not be matched

Currently, only test cases that must be matched by the generated regular expression can be provided. It would be useful to additionally provide test cases that must not be matched by the generated expression. In combination with shorthand character classes this would allow for more specific and versatile regular expressions.
new feature

opened by pemistahl 2

Releases(v1.4.1)

v1.4.1(Oct 21, 2022)
Changes

clap has been updated to version 4.0. The help output of grex -h now looks a little different.

Bug Fixes

A bug in the grapheme segmentation was fixed that caused test cases which contain backslashes to produce incorrect regular expressions.

Source code(tar.gz)
Source code(zip)
grex-v1.4.1-aarch64-apple-darwin.tar.gz(906.31 KB)
grex-v1.4.1-x86_64-apple-darwin.tar.gz(981.98 KB)
grex-v1.4.1-x86_64-pc-windows-msvc.zip(814.96 KB)
grex-v1.4.1-x86_64-unknown-linux-musl.tar.gz(1.85 MB)
v1.4.0(Jul 26, 2022)
Features

The library can now be compiled to WebAssembly and be used in any JavaScript project. (#82)

The supported character set for regular expression generation has been updated to the current Unicode Standard 14.0.

structopt has been replaced with clap providing much nicer help output for the command-line tool.

Improvements

The regular expression generation performance has been significantly improved, especially for generating very long expressions from a large set of test cases. This has been accomplished by reducing the number of memory allocations, removing deprecated code and applying several minor optimizations.

Bug Fixes

Several bugs have been fixed that caused incorrect expressions to be generated in rare cases.

Source code(tar.gz)
Source code(zip)
grex-v1.4.0-aarch64-apple-darwin.tar.gz(926.58 KB)
grex-v1.4.0-x86_64-apple-darwin.tar.gz(1002.17 KB)
grex-v1.4.0-x86_64-pc-windows-msvc.zip(823.85 KB)
grex-v1.4.0-x86_64-unknown-linux-musl.tar.gz(1.87 MB)
v1.3.0(Sep 15, 2021)
Features

anchors can now be disabled so that the generated expression can be used as part of a larger one (#30)

the command-line tool can now be used within Unix pipelines (#45)

Changes

Additional methods have been added to RegExpBuilder in order to replace the enum Feature and make the library API more consistent. (#47)

Bug Fixes

Under rare circumstances, the conversion of repetitions did not work. This has been fixed. (#36)

Source code(tar.gz)
Source code(zip)
grex-v1.3.0-x86_64-apple-darwin.tar.gz(944.02 KB)
grex-v1.3.0-x86_64-pc-windows-msvc.zip(771.58 KB)
grex-v1.3.0-x86_64-unknown-linux-musl.tar.gz(1.77 MB)
v1.2.0(Mar 28, 2021)
Features

verbose mode is now supported with the --verbose flag to produce regular expressions which are easier to read (#17)

Source code(tar.gz)
Source code(zip)
grex-v1.2.0-x86_64-apple-darwin.tar.gz(944.22 KB)
grex-v1.2.0-x86_64-pc-windows-msvc.zip(792.63 KB)
grex-v1.2.0-x86_64-unknown-linux-musl.tar.gz(1.78 MB)
v1.1.0(Apr 17, 2020)
Features

case-insensitive matching regexes are now supported with the --ignore-case command-line flag or with Feature::CaseInsensitivity in the library (#23)

non-capturing groups are now the default; capturing groups can be enabled with the --capture-groups command-line flag or with Feature::CapturingGroup in the library (#15)

a lower bound for the conversion of repeated substrings can now be set by specifying --min-repetitions and --min-substring-length or using the library methods RegExpBuilder.with_minimum_repetitions() and RegExpBuilder.with_minimum_substring_length() (#10)

test cases can now be passed from a file within the library as well using RegExpBuilder::from_file() (#13)

Changes

the rules for the conversion of test cases to shorthand character classes have been updated to be compliant to the newest Unicode Standard 13.0 (#21)

the dependency on the unmaintained linked-list crate has been removed (#24)

Bug Fixes

test cases starting with a hyphen are now correctly parsed on the command-line (#12)

the common substring detection algorithm now uses optionality expressions where possible instead of redundant union operations (#22)

Test Coverage

new unit tests, integration tests and property tests have been added

Source code(tar.gz)
Source code(zip)
grex-v1.1.0-x86_64-apple-darwin.tar.gz(518.70 KB)
grex-v1.1.0-x86_64-pc-windows-msvc.zip(431.49 KB)
grex-v1.1.0-x86_64-unknown-linux-musl.tar.gz(1.15 MB)
v1.0.0(Feb 2, 2020)
Finally, the first stable release 1.0.0 is here. :-)

Features

conversion to character classes \d, \D, \s, \S, \w, \W is now supported

repetition detection now works with arbitrarily nested expressions. Input strings such as aaabaaab which were previously converted to ^(aaab){2}$ are now converted to ^(a{3}b){2}$.

optional syntax highlighting for the produced regular expressions can now be enabled using the --colorize command-line flag or with the library method RegExpBuilder.with_syntax_highlighting()
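The nested repetition conversion listed among the features above (aaabaaab becoming ^(a{3}b){2}$) can be illustrated with a small greedy sketch. This is not grex's actual algorithm, which operates on the expression derived from a DFA; the function names compress and compress_chars are hypothetical, and the input is assumed to contain no regex metacharacters that would need escaping.

```rust
/// Hypothetical helper: rewrites repeated substrings as quantified groups.
/// Assumes `s` contains no regex metacharacters (nothing is escaped here).
fn compress(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    compress_chars(&chars)
}

fn compress_chars(chars: &[char]) -> String {
    let n = chars.len();
    let mut out = String::new();
    let mut i = 0;
    while i < n {
        // Find the unit starting at `i` whose consecutive repetitions
        // cover the most characters.
        let (mut best_len, mut best_count) = (1, 1);
        for len in 1..=(n - i) / 2 {
            let unit = &chars[i..i + len];
            let mut count = 1;
            while i + (count + 1) * len <= n
                && &chars[i + count * len..i + (count + 1) * len] == unit
            {
                count += 1;
            }
            if count > 1 && count * len > best_count * best_len {
                best_len = len;
                best_count = count;
            }
        }
        if best_count == 1 {
            // No repetition here: emit the literal character.
            out.push(chars[i]);
            i += 1;
        } else {
            // Recurse into the unit so that nested repetitions are found too.
            let unit = compress_chars(&chars[i..i + best_len]);
            if best_len == 1 {
                out.push_str(&format!("{}{{{}}}", unit, best_count));
            } else {
                out.push_str(&format!("({}){{{}}}", unit, best_count));
            }
            i += best_len * best_count;
        }
    }
    out
}
```

Calling compress("aaabaaab") first picks the unit aaab repeated twice, then recurses into that unit and rewrites aaa as a{3}, which yields (a{3}b){2}. Without the recursive call the result would stay at (aaab){2}.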

Test Coverage

new unit tests, integration tests and property tests have been added

v0.3.2 (Jan 12, 2020)
Test Coverage

new property tests have been added that revealed new bugs

Bug Fixes

the repetition detection algorithm has been rewritten entirely

the former algorithm produced wrong regular expressions or even panicked for certain test cases (#9)

v0.3.1 (Jan 6, 2020)
Test Coverage

property tests have been added using the proptest crate

big thanks go to Christophe Biocca for pointing me to the concept of property tests in the first place and for writing an initial implementation of these tests

Bug Fixes

some regular expression specific characters were not escaped correctly in the generated expression

expressions consisting of a single alternation such as ^(abc|xyz)$ were missing the outer parentheses. This caused an erroneous match of strings such as abc123 or 456xyz because of precedence rules.

the created DFA was wrong for repetition conversion in some corner cases. The input a, aa, aaa, aaaa, aaab previously returned the expression ^a{1,4}b?$ which erroneously matches aaaab. Now the correct expression ^(a{3}b|a{1,4})$ is returned.
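The two matching bugs described above can be checked by hand. The sketch below models each regular expression as a hand-written predicate using only the standard library, so it does not depend on any regex engine; all function names are hypothetical and chosen for illustration only.

```rust
/// Models `^abc|xyz$`: without the group, `^` binds only to `abc`
/// and `$` binds only to `xyz`.
fn matches_without_group(s: &str) -> bool {
    s.starts_with("abc") || s.ends_with("xyz")
}

/// Models `^(abc|xyz)$`: both anchors apply to the whole alternation.
fn matches_with_group(s: &str) -> bool {
    s == "abc" || s == "xyz"
}

/// True for a run of one to four `a` characters, i.e. `^a{1,4}$`.
fn is_a_run(s: &str) -> bool {
    (1..=4).contains(&s.len()) && s.bytes().all(|b| b == b'a')
}

/// Models the old, too general result `^a{1,4}b?$`.
fn matches_old_repetition(s: &str) -> bool {
    // Drop one optional trailing `b`, then require one to four `a`s.
    let body = s.strip_suffix('b').unwrap_or(s);
    is_a_run(body)
}

/// Models the corrected result `^(a{3}b|a{1,4})$`.
fn matches_fixed_repetition(s: &str) -> bool {
    s == "aaab" || is_a_run(s)
}
```

Both repetition predicates accept all five test cases a, aa, aaa, aaaa and aaab, but only the old one also accepts aaaab. Likewise, only the ungrouped alternation accepts abc123 and 456xyz.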

Documentation

some minor documentation updates

v0.3.0 (Dec 24, 2019)
Features

grex is now also available as a library

escaping of non-ASCII characters is now supported with the -e flag

astral code points can be converted to surrogate pairs with the --with-surrogates flag

repeated non-overlapping substrings can be converted to {min,max} quantifier notation using the -r flag
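To make the escaping and surrogate features above concrete, here is a minimal sketch of what such a conversion involves. It is not taken from grex's source: the function name escape_non_ascii is hypothetical, and the \u{...} output notation is an assumption made for this example. The surrogate pairs themselves come from the standard library's char::encode_utf16.

```rust
/// Hypothetical helper: replaces every non-ASCII character with a
/// `\u{...}` escape. With `use_surrogates`, code points above U+FFFF
/// are written as their UTF-16 surrogate pair instead of one escape.
fn escape_non_ascii(s: &str, use_surrogates: bool) -> String {
    let mut out = String::new();
    for c in s.chars() {
        if c.is_ascii() {
            out.push(c);
        } else if use_surrogates {
            // encode_utf16 yields one unit for BMP characters and
            // two units (a surrogate pair) for astral code points.
            let mut buf = [0u16; 2];
            for unit in c.encode_utf16(&mut buf).iter() {
                out.push_str(&format!("\\u{{{:x}}}", unit));
            }
        } else {
            out.push_str(&format!("\\u{{{:x}}}", c as u32));
        }
    }
    out
}
```

For the astral code point U+1F4A9, the plain escape is \u{1f4a9}, while the surrogate form is \u{d83d}\u{dca9}. Characters inside the Basic Multilingual Plane, such as ü (U+00FC), come out the same either way.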

Bug Fixes

many many many bug fixes :-O

v0.2.0 (Oct 19, 2019)
Features

character classes are now supported

input strings can now be read from a text file

Changes

Unicode characters are no longer escaped by default

the performance of the DFA minimization algorithm has been improved for large DFAs

regular expressions are now always surrounded by anchors ^ and $

Bug Fixes

fixed a bug that caused a panic when giving an empty string as input

v0.1.0 (Oct 6, 2019)
This is the very first release of grex. It aims at simplifying the construction of regular expressions based on matching example input.

Features

literals

detection of common prefixes and suffixes

alternation using | operator

optionality using ? quantifier

concatenation of all of the former
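The features above (common prefixes, alternation with |, optionality with ?, and concatenation) can be combined in a short sketch. It only factors out a shared prefix on plain strings and is far simpler than grex's DFA-based approach; the function name factor_common_prefix is hypothetical, and test cases are assumed to be free of regex metacharacters.

```rust
/// Hypothetical helper: factors the longest common prefix out of the
/// test cases and joins the remainders with `|`. An empty remainder
/// makes the whole group optional. Nothing is escaped.
fn factor_common_prefix(cases: &[&str]) -> String {
    let first: Vec<char> = cases
        .first()
        .map(|s| s.chars().collect::<Vec<char>>())
        .unwrap_or_default();

    // Length (in characters) of the prefix shared by all test cases.
    let mut len = first.len();
    for case in cases {
        let shared = first
            .iter()
            .zip(case.chars())
            .take_while(|(a, b)| **a == *b)
            .count();
        len = len.min(shared);
    }
    let prefix: String = first[..len].iter().collect();

    // Collect the distinct remainders that follow the prefix.
    let mut alternatives: Vec<String> = Vec::new();
    let mut optional = false;
    for case in cases {
        let rest: String = case.chars().skip(len).collect();
        if rest.is_empty() {
            optional = true;
        } else if !alternatives.contains(&rest) {
            alternatives.push(rest);
        }
    }

    let tail = if alternatives.is_empty() {
        String::new()
    } else {
        let quantifier = if optional { "?" } else { "" };
        format!("({}){}", alternatives.join("|"), quantifier)
    };
    format!("^{}{}$", prefix, tail)
}
```

For the test cases abc and abd this produces ^ab(c|d)$, and for ab and abc it produces ^ab(c)?$. Common suffixes would need a second pass over the remainders, which is omitted here.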

