A new, portable, regular expression language

rulex

Read the book to get started!

Examples

On the left are rulex expressions (rulexes for short); on the right is the compiled regex:

# String
'hello world'                 # hello world

# Greedy repetition
'hello'{1,5}                  # (?:hello){1,5}
'hello'*                      # (?:hello)*
'hello'+                      # (?:hello)+

# Lazy repetition
'hello'{1,5} lazy             # (?:hello){1,5}?
'hello'* lazy                 # (?:hello)*?
'hello'+ lazy                 # (?:hello)+?

# Alternation
'hello' | 'world'             # hello|world

# Character classes
['aeiou']                     # [aeiou]
['p'-'s']                     # [p-s]

# Named character classes
[.] [w] [s] [n]               # .\w\s\n

# Combined
[w 'a' 't'-'z' U+15]          # [\wat-z\x15]

# Negated character classes
!['a' 't'-'z']                # [^at-z]

# Unicode
[Greek] U+30F Grapheme        # \p{Greek}\u030F\X

# Boundaries
<% %>                         # ^$
% 'hello' !%                  # \bhello\B

# Non-capturing groups
'terri' ('fic' | 'ble')       # terri(?:fic|ble)

# Capturing groups
:('test')                     # (test)
:name('test')                 # (?P<name>test)

# Lookahead/lookbehind
>> 'foo' | 'bar'              # (?=foo|bar)
<< 'foo' | 'bar'              # (?<=foo|bar)
!>> 'foo' | 'bar'             # (?!foo|bar)
!<< 'foo' | 'bar'             # (?<!foo|bar)

# Backreferences
:('test') ::1                 # (test)\1
:name('test') ::name          # (?P<name>test)\1

# Ranges
range '0'-'999'               # 0|[1-9][0-9]{0,2}
range '0'-'255'               # 0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?

Variables

let operator = '+' | '-' | '*' | '/';
let number = '-'? [digit]+;

number (operator number)*

Usage

Read the book to get started, or check out the CLI program, the Rust library and the procedural macro.

Why use this instead of normal regexes?

Normal regexes are very concise, but when they get longer, they get increasingly difficult to understand. By default, they don't have comments, and whitespace is significant. Then there's the plethora of sigils and backslash escapes that follow no discernible system: (?<=) (?P<>) .?? \N \p{} \k<> \g'' and so on. And with various inconsistencies between regex implementations, it's the perfect recipe for confusion.

Rulex solves these problems with a new, simpler but also more powerful syntax:

  • It's not whitespace sensitive and allows comments
  • Text must appear in quotes. This makes expressions longer, but also much easier to read
  • Non-capturing groups are the default
  • More intuitive, consistent syntax
  • Variables to make expressions DRY

Compatibility

Rulex is currently compatible with PCRE, JavaScript, Java, .NET, Python, Ruby and Rust. The regex flavor must be specified during compilation, so rulex can ensure that the produced regex works as desired on the targeted regex engine.

Important note for JavaScript users: Don't forget to enable the u flag. This is required for Unicode support. All other major regex engines support Unicode by default.

Diagnostics

Rulex looks for mistakes and displays helpful diagnostics:

  • It shows an error if you use a feature not supported by the targeted regex flavor
  • It detects syntax errors and suggests how to resolve them
  • It parses backslash escapes (which are not allowed in a rulex) and explains what to write instead
  • It looks for likely mistakes and displays warnings
  • It looks for patterns that can be very slow for certain inputs and are susceptible to Denial-of-Service attacks (coming soon)

Roadmap

You can find the Roadmap here.

Contributing

You can contribute by using rulex and providing feedback. If you find a bug or have a question, please create an issue.

I also gladly accept code contributions. To make sure that CI succeeds, please run cargo fmt, cargo clippy and cargo test before creating a pull request.

License

Dual-licensed under the MIT license or the Apache 2.0 license.

Comments
  • FYI: Now installable from AUR

    Arch Linux users can install Rulex from AUR:

    • https://aur.archlinux.org/packages/rulex-rs-bin

    E.g.: yay -S rulex-rs-bin

    It will fetch the latest binary release from GitHub and install it along with the CHANGELOG, the README and the licenses.

    opened by kseistrup 8
  • feat: remove thiserror dependency

    Description

    This PR removes the dependency on thiserror, which can be replaced by a manual implementation of std::error::Error and core::fmt::Display. This makes pomsky and pomsky-macro external dependency free! :tada:

    I made these changes to avoid adding proc-macro dependencies to a small project where I used pomsky-macro. I used Copilot to copy most of the error messages; they should be checked again to ensure I didn't change anything.

    ~~Fixes # (issue)~~ No issue was created.

    Checklist:

    • [x] My code is formatted with cargo fmt
    • [x] My code compiles with the latest stable Rust toolchain
    • [x] All tests pass with cargo test
    • [x] My changes generate no new warnings with cargo clippy
    • [ ] ~~I have commented my code, particularly in hard-to-understand areas~~
    • [ ] ~~My changes are covered by tests~~
    opened by baptiste0928 2
  • Rename rulex

    As explained in the blog post, we need to rename the project to pomsky.

    What needs to be done

    • [x] announce the new name
    • [x] rename the repository
    • [x] rename the types in the source code containing Rulex
    • [x] adjust the README
    • [x] adjust the documentation on the website
    • [x] adjust the source code documentation
    • [x] contact the maintainer of the AUR package
      • [x] rename the AUR package or publish a new one and delete the existing one
      • [ ] announce the AUR package again
    • [x] publish all the crates under their new names
    • [x] publish minor versions of the old crates that warn the user in the README and the documentation that they are unmaintained and should be replaced
    • [ ] rename the GitHub organization
      • this won't be done until we're sure that the old URLs will continue to work, so we can redirect them to the new website
    meta 
    opened by Aloso 2
  • Deprecate `<%` and `%>`

    Status Quo

    The meaning of <% and %> is not intuitively clear, and even confusing to some: Someone suggested that <% should indicate the end of the string, because the < angle should point towards the string, and vice versa for %>. Even worse, in right-to-left (RTL) languages the directions are reversed. Therefore it doesn't make sense to speak about the "left" and "right" end of a string, only the start and end of the string.

    In the latest version, two built-in variables, Start and End, were added that are currently aliases for <% and %>.

    Solution

    Deprecate <% and %>. Suggest using Start and End instead.

    Deprecation schedule

    • [x] Add the built-ins Start and End
    • [x] Update the documentation to recommend these built-ins
    • [x] Emit a deprecation warning in the CLI and the playground when the old syntax is used
    • [x] Remove documentation for the old syntax
    • [ ] Remove the old syntax in the next breaking release [target: 0.7]
    C-language 
    opened by Aloso 2
  • Add `no-new-line` flag, suppressing a new-line after compilation

    I opened issue #17, but decided to just write a PR myself.
    This adds a new flag, no-new-line, to rulex-bin. It suppresses the trailing newline that is currently printed by println!("{compiled}");
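
The effect of such a flag can be sketched as follows. This is only an illustration, not the code from the PR: the function `write_compiled` and its `no_new_line` parameter are hypothetical names, and the output goes to any `Write` implementor so the behavior is easy to check.

```rust
use std::io::{self, Write};

/// Writes the compiled regex to `out`, followed by a newline unless
/// `no_new_line` is set. Hypothetical helper, not part of rulex-bin.
fn write_compiled<W: Write>(out: &mut W, compiled: &str, no_new_line: bool) -> io::Result<()> {
    if no_new_line {
        // Same effect as print!("{compiled}") on stdout
        write!(out, "{compiled}")
    } else {
        // Same effect as println!("{compiled}") on stdout
        writeln!(out, "{compiled}")
    }
}
```

Passing `&mut std::io::stdout()` as `out` would reproduce the CLI behavior; omitting the newline is mainly useful when the output is piped into another program.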

    opened by N4tus 2
  • Compile `[d]` to `\d`

    Is your feature request related to a problem? Please describe.

    When I typed [d], I thought rulex would output \d, but it output \p{Nd}.

    Describe the solution you'd like

    1. output \d
    2. Prompt users to use Unicode /\p{Nd}/u

    Describe alternatives you've considered

    Additional context

    enhancement wontfix 
    opened by LuckyHookin 1
  • fix: prevent indexing strs within utf8 codepoints

    Description

    I discovered three panic scenarios while fuzzing the rulex parser, all related to indexing (by byte offset) into a str and landing within a multi-byte utf8 codepoint. The following queries trigger panics (where MB is a single multibyte codepoint):

    • "MB": MB within a quoted section
    • \MB: an escaped MB
    • "\MB: an escaped MB within a quoted section
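
For readers unfamiliar with this class of bug, here is a minimal sketch of how byte-offset slicing can be made panic-free in Rust. It is not the fix from this PR; `prefix_checked` and `first_char_len` are hypothetical helpers. Only standard library methods are used: `str::get` returns `None` instead of panicking when an offset is out of bounds or inside a code point, and `char::len_utf8` gives an offset that is always a valid boundary.

```rust
/// Returns the first `len` bytes of `input` as a string slice, or `None`
/// if `len` is out of bounds or falls inside a multi-byte code point.
/// In those cases `&input[..len]` would panic instead.
fn prefix_checked(input: &str, len: usize) -> Option<&str> {
    input.get(..len)
}

/// Returns the byte length of the first character (0 for an empty string).
/// Slicing at this offset is always valid.
fn first_char_len(input: &str) -> usize {
    input.chars().next().map_or(0, char::len_utf8)
}
```

With the input `"💩x"`, `prefix_checked(input, 1)` returns `None` because offset 1 lies inside the 4-byte code point, while `first_char_len(input)` returns 4, which is a safe place to split.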

    Checklist:

    • [x] My code is formatted with cargo fmt
    • [x] My code compiles with the latest stable Rust toolchain
    • [x] All tests pass with cargo test
    • [x] My changes generate no new warnings with cargo clippy
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [x] My changes are covered by tests
    opened by evanrichter 1
  • No help message when trying to write regex lookbehind

    Describe the bug

    When trying to write a regex lookahead, you get a nice help message:

    (?=Test)
    
    This syntax is not supported
    
    Help: Lookahead uses the `>>` syntax. For example, `>> 'bob'` matches if the position is followed by bob.
    

    However, no help message is shown for lookbehind (positive or negative).

    To Reproduce

    (?<=Test)
    

    or

    (?<!Test)
    

    Expected behavior

    A help message appears along the lines of

    Help: Lookbehind uses the `<<` syntax. For example, `<< 'bob'` matches if the position is preceded by bob.
    

    Additional context

    Tried it in the playground

    bug 
    opened by Aloso 1
  • fix: ignore repetitions for groups with empty string literals

    Description

    Fixes #24

    Checklist:

    • [x] My code is formatted with cargo fmt
    • [x] My code compiles with the latest stable Rust toolchain
    • [x] All tests pass with cargo test
    • [x] My changes generate no new warnings with cargo clippy
    • [x] I have commented my code, particularly in hard-to-understand areas
    • [x] My changes are covered by tests
    opened by sebastiantoh 1
  • Back references

    Support regex back and forward references.

    A reference is a number or identifier prefixed with two colons:

    :('hello world') '!'* ::1
    :name('hello world') '!'* ::name
    

    This matches the string hello world!!!hello world.

    In repetitions, some regex flavors allow forward references, i.e. references to a group that appears later in the regex. They match the text that was matched by that group in the previous repetition:

    ( ::1* '.' :('a') 'b' )+
    

    This matches the string .ab, but also .abaaa.ab.

    JavaScript supports backreferences, but only numbered. It doesn't support forward references.

    opened by Aloso 1
  • Inline regex

    Is your feature request related to a problem? Please describe.

    Pomsky doesn't implement all regex features, and probably never will, given the vast number of regex engines.

    Describe the solution you'd like

    A special syntax to embed a regex in a pomsky expression, which is emitted unchanged. It looks like this:

    regex "Hello [world]"
    regex '\bHello\B'
    

    Describe alternatives you've considered

    Instead of a keyword, backticks could be used like in melody. I'm not a fan of this, because this feature should be available, but not encouraged. There should be "syntactic salt" to make it less convenient.

    Another, perhaps more fitting name for the keyword would be "raw", since the string is not escaped. However, the meaning of "raw" strings in PL is usually inverted: It refers to escaping in the input, not the compiled output. The keyword regex better describes what use cases this syntax enables.

    Roadmap

    • [x] Reserve keyword for 0.7
    • [x] Implement syntax for 0.8
    enhancement 
    opened by Aloso 0
  • Code of Conduct

    We need a code of conduct that requires participants in discussions to adhere to certain standards. The goal is to foster a friendly and welcoming community where anyone can feel safe to participate.

    Most importantly, the following behaviors are prohibited:

    • harassment and violent language
    • racism, sexism, and discrimination based on gender, sexual orientation, disability, religion, etc.

    This is not in response to any specific incident. My discussions so far have been pleasant, and I want them to stay that way.

    I'm inclined to copy the CoC from Rust:

    • We are committed to providing a friendly, safe and welcoming environment for all, regardless of level of experience, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, religion, nationality, or other similar characteristic.
    • Please avoid using overtly sexual aliases or other nicknames that might detract from a friendly, safe and welcoming environment for all.
    • Please be kind and courteous. There’s no need to be mean or rude.
    • Respect that people have differences of opinion and that every design or implementation choice carries a trade-off and numerous costs. There is seldom a right answer.
    • Please keep unstructured critique to a minimum. If you have solid ideas you want to experiment with, make a fork and see how it works.
    • We will exclude you from interaction if you insult, demean or harass anyone. That is not welcome behavior. We interpret the term “harassment” as including the definition in the Citizen Code of Conduct; if you have any lack of clarity about what might be included in that concept, please read their definition. In particular, we don’t tolerate behavior that excludes people in socially marginalized groups.
    • Private harassment is also unacceptable. No matter who you are, if you feel you have been or are being harassed or made uncomfortable by a community member, please contact one of the channel ops or any of the Rust moderation team immediately. Whether you’re a regular contributor or a newcomer, we care about making this community a safe place for you and we’ve got your back.
    • Likewise any spamming, trolling, flaming, baiting or other attention-stealing behavior is not welcome.

    A page with this code of conduct (adapted to Pomsky) will be added to the website footer and linked from the README. As (currently) the only maintainer I'm responsible for moderating discussions. This means that people who violate the code of conduct will face consequences, ranging from a stern warning to a permanent ban, depending on the severity.

    opened by Aloso 0
  • JSON output

    Is your feature request related to a problem? Please describe.

    Tools such as IDE plugins need Pomsky's output in a machine-readable format. There are multiple ways to achieve this:

    1. Publish a native library that other tools can dynamically link to
    2. Let tool authors create the bindings they need themselves by using the pomsky crate and distribute it as part of their tool
    3. Make the CLI more powerful, so it can be used by tools more effectively

    The solution I'm proposing is number 3 since I think it offers the best experience both for users and tool authors: Users need to have the CLI installed, but they don't have to download anything else, and tool authors don't need to update their tools every time a new Pomsky version is released.

    Describe the solution you'd like

    Add a --json flag to print all compilation results as JSON rather than plain text. The JSON is written to stdout on a single line. It's an object with the following structure:

    {
      version: "1"  // schema version
      success: bool  // true if no errors occurred
      output? : string  // compiled regex, only present if the compilation was successful
      diagnostics: object[]  // array of errors and warnings
        severity: "error" | "warning"
        kind: string  // e.g. "parse" for parse errors or "compat" for compatibility warnings
        spans: object[]  // source code locations that should be underlined
          // initially this array will always contain exactly one object
          start: int  // start byte offset, inclusive
          end: int  // end byte offset, exclusive
          label? : string  // optional additional information to this span
        description: string  // explanation of the error or warning
        help: string[]  // additional information to help the user fix the issue
        fixes: object[]  // "quick fixes", automatic source code transformations; may be displayed as a light bulb
          // since this feature isn't implemented in Pomsky yet, this array will always be empty initially
          description: string  // text to display 
          replacements: object[]  // array of source code locations that need to be modified
            start: int  // start byte offset, inclusive
            end: int  // end byte offset, exclusive
            insert: string  // text to replace the source code location with
    }
    

    Example:

    {
      "version": "1",
      "success": false,
      "diagnostics": [
        {
          "severity": "error",
          "kind": "lexer",
          "spans": [
            { "start": 3, "end": 4 }
          ],
          "description": "unexpected `$`",
          "help": ["literal text must be wrapped in quotes, i.e. '$' or \"$\""],
          "fixes": []
        }
      ]
    }
    

    Byte offsets are zero-indexed, e.g. the first byte has the span { "start": 0, "end": 1 }.

    The input is always treated as UTF-8, so the characters in the string a💩ø have the spans { "start": 0, "end": 1 }, { "start": 1, "end": 5 } and { "start": 5, "end": 7 }.
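
A tool that wants to map characters to these byte spans could compute them as in the sketch below. The function `char_spans` is a hypothetical helper for illustration, not part of Pomsky; it relies only on `str::char_indices` and `char::len_utf8` from the standard library.

```rust
/// Returns the (start, end) byte offsets of every character in `input`,
/// with `start` inclusive and `end` exclusive, matching the span
/// convention described above.
fn char_spans(input: &str) -> Vec<(usize, usize)> {
    input
        .char_indices()
        .map(|(start, c)| (start, start + c.len_utf8()))
        .collect()
}
```

Editors that count positions in UTF-16 code units or in grapheme clusters would need an additional conversion step on top of this.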

    Describe alternatives you've considered

    There are other formats than JSON, like XML and MessagePack, but they're less widely supported, and JSON should be fast enough.

    Instead of reporting byte offsets, Pomsky could return row and column positions. However, that requires more work, and may not even work for every tool, since different IDEs have different plugin interfaces. Furthermore, UTF-8 has an exact definition, whereas rows and columns aren't well defined in the context of Unicode.

    Future possibilities

    We can keep the CLI running in the background, so tools can write to stdin and get a response at stdout, without having to spawn a process each time.

    We can add a --check flag that returns diagnostics but doesn't actually compile the expression, which should be faster. This could then be used to report errors on every keystroke.

    enhancement 
    opened by Aloso 2
  • Distribute Pomsky as a native library

    Distributing Pomsky not only in the CLI format, but also as a native library (.so, .dll), allows tools to compile and receive an output in a standardized format which can be interpreted through interfaces like JNI or JNA. Example usage: compilation and error reporting inside of an IDE.

    I think a project like https://github.com/Dushistov/flapigen-rs might come in handy.

    opened by lppedd 5
  • Optimization: Deduplicate alternatives

    Is your feature request related to a problem? Please describe.

    Duplicate alternatives can happen when combining different variables, or simply by accident.

    Example: A subset of C keywords, with while appearing twice:

    % (
      | "break"
      | "case"
      | "char"
      | "const"
      | "continue"
      | "default"
      | "do"
      | "double"
      | "else"
      | "float"
      | "for"
      | "if"
      | "int"
      | "long"
      | "return"
      | "short"
      | "sizeof"
      | "struct"
      | "switch"
      | "void"
      | "while"
      | "while"
    ) %
    

    Describe the solution you'd like

    Assuming no side effects (e.g. capturing groups), duplicate alternatives can be removed without affecting the behavior of the pattern (negatively).

    More specifically, given an alternation, let index(x) -> int be a function that returns the index of a given alternative of the alternation in its list of alternatives. If there exist two equal alternatives a and b such that index(a) < index(b), then b can be removed without affecting the pattern.

    This optimization might even prevent very simple cases of exponential backtracking. E.g. ( "a" | "a" )+ "b".
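
The rule can be sketched as follows, under a simplifying assumption: alternatives are modeled as plain string literals, whereas a real implementation would compare syntax tree nodes and skip alternatives with side effects such as capturing groups. The function `dedupe_alternatives` is a hypothetical name, not part of Pomsky.

```rust
use std::collections::HashSet;

/// Removes every alternative that is equal to an earlier one, keeping the
/// first occurrence and preserving the original order.
fn dedupe_alternatives(alternatives: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    alternatives
        .iter()
        // `insert` returns false if the value was already present,
        // so later duplicates are filtered out.
        .filter(|alt| seen.insert(**alt))
        .map(|alt| alt.to_string())
        .collect()
}
```

Keeping the first occurrence matters because backtracking regex engines try alternatives from left to right, so the earlier copy is the one that determines which match is found.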

    Describe alternatives you've considered

    A more complete solution would be a DFA-based check that determines whether a given alternative is a subset of the union of all previous alternatives in the alternation. However, this is significantly more computationally expensive and does not support irregular features such as assertions.

    Additional context

    The regexp/no-dupe-disjunctions rule is an implementation of this optimization. This rule uses the DFA approach along with a simpler regex-comparison approach to determine duplicate alternatives.

    enhancement C-optimize 
    opened by RunDevelopment 2
  • Optimization: Remove unnecessary elements in character classes

    Is your feature request related to a problem? Please describe.

    Pomsky currently does not remove unnecessary elements in character classes. E.g. [ w "abc" ] compiles to [\wabc] (Java). However, the abc is unnecessary because [\wabc] == \w.

    Describe the solution you'd like

    Remove unnecessary elements in character classes to optimize and simplify them.

    Additional context

    This requires knowing the precise set of characters accepted by each character class element. For an example implementation of this, check out the regexp/no-dupe-characters-character-class rule.
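
A deliberately conservative sketch of the idea is shown below. It is not Pomsky's implementation: the `ClassItem` enum and the `simplify` function are hypothetical, and only ASCII letters, digits and the underscore are treated as covered by `\w`, since those are matched by `\w` regardless of whether an engine uses Unicode-aware classes. Anything else is kept.

```rust
/// Hypothetical, simplified model of character class elements.
#[derive(Debug, Clone, PartialEq)]
enum ClassItem {
    /// The `w` shorthand, compiled to `\w`.
    Word,
    /// A single literal character.
    Char(char),
}

/// Characters assumed to be matched by `\w` in every flavor.
fn in_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Drops literal characters already covered by `\w` (if present)
/// as well as exact duplicates, preserving the order of the rest.
fn simplify(items: &[ClassItem]) -> Vec<ClassItem> {
    let has_word = items.contains(&ClassItem::Word);
    let mut result: Vec<ClassItem> = Vec::new();
    for item in items {
        let redundant = match item {
            ClassItem::Char(c) => has_word && in_ascii_word(*c),
            ClassItem::Word => false,
        };
        if !redundant && !result.contains(item) {
            result.push(item.clone());
        }
    }
    result
}
```

Under these assumptions, the elements of `[ w "abc" ]` reduce to just `Word`, while a character such as `-` is kept because `\w` does not match it.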

    enhancement C-optimize 
    opened by RunDevelopment 4
Releases(v0.8)
Owner
Ludwig Stecher
I'm currently a software developer trainee at Interhyp in Munich. Rustacean. he/him