lexgen: A fully-featured lexer generator, implemented as a proc macro

lexer! {
    // First line specifies name of the lexer and the token type returned by
    // user actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in user actions.
    rule Init {
        // Rules without a right-hand side can be used for skipping whitespace, comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* =>
            |lexer| {
                let token = Token::Id(lexer.match_().to_owned());
                lexer.return_(token)
            },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item=Result<(usize, T, usize), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and the `usize`s are the byte indices of the beginning and end
// of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((1, Token::Id("abc123Q-t".to_owned()), 10)))
);
assert_eq!(
    lexer.next(),
    Some(Ok((12, Token::Id("z9_9".to_owned()), 16)))
);
assert_eq!(lexer.next(), None);

You can see more examples in the lexgen repository, including a full Lua 5.1 lexer.

Motivation

Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy to use lexer generator.

Usage

lexgen doesn't require a build step. Just add it as a dependency in your Cargo.toml.
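
For example (the version below is a placeholder; check crates.io for the latest release):

[dependencies]
lexgen = "<version>"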

Lexer syntax

A lexgen lexer definition starts with the type of the generated lexer struct, an optional user state part, and the token type (the type of values returned by user actions). Example:

lexer! {
    Lexer(LexerState) -> Token;
    ...
}

Here the lexer struct is named Lexer. The user state type is LexerState (this type should be defined by the user). The token type is Token.

Next come let bindings for regular expressions. These are optional. The syntax is let <id> = <regex>;, where <id> is a Rust identifier and <regex> is as described below.

let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

Finally we define the lexer rules:

rule Init {
    ...
}

rule SomeOtherRule {
    ...
}

The first rule set defines the initial state of the lexer and needs to be named Init.

In the body of a rule block we define the rules for that lexer state. The syntax for a rule is <regex> => <user action>,. Regex syntax is described below. A user action is any Rust code with type fn(LexerHandle) -> LexerAction, where LexerHandle and LexerAction are generated types whose names are derived from the lexer name (Lexer). More on these types below.

You can omit the rule Init { ... } part and have all of your rules at the top level if you don't need rule sets.
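
For example, here is a minimal sketch of a lexer with all rules at the top level (reusing the Token type from the overview above):

lexer! {
    Lexer -> Token;

    // Skip whitespace
    [' ' '\t' '\n']+,

    // Match lowercase identifiers
    ['a'-'z']+ =>
        |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
}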

In summary:

  • The first line is of the form <lexer name>(<user state type>) -> <token type name>. The (<user state type>) part can be omitted for stateless lexers.

  • Next we have let bindings for regexes. This part is optional.

  • Next come the rule sets. There should be at least one rule set named Init, which is the initial state.

Regex syntax

Regex syntax can be used on the right-hand side of let bindings and on the left-hand side of rules. The syntax is:

  • $var for variables defined in the let binding section. Variables need to be defined before use.

  • $$var for built-in regexes (see "Built-in regular expressions" section below).

  • Rust character syntax for characters, e.g. 'a'.

  • Rust string syntax for strings, e.g. "abc".

  • [...] for character sets. Inside the brackets you can have one or more of:

    • Characters
    • Character ranges: e.g. 'a'-'z'

    Here's an example character set for ASCII alphanumerics: ['a'-'z' 'A'-'Z' '0'-'9']

  • <regex>* for zero or more repetitions of <regex>

  • <regex>+ for one or more repetitions of <regex>

  • <regex>? for zero or one repetition of <regex>

  • <regex> <regex> for concatenation

  • <regex> | <regex> for alternation (match the first one, or the second one)

  • _ can only appear at the top-level (in the LHS of a rule) and matches when none of the other rules match.

*, +, and ? have the same binding power. | has the least binding power. You can use parentheses for grouping, e.g. ('a' | 'b')*.
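
As an example that combines these operators, here is a hypothetical let binding for a signed decimal number:

let digit = ['0'-'9'];
let number = ('-' | '+')? $digit+ ('.' $digit+)?;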

Built-in regular expressions

lexgen comes with a set of built-in regular expressions. Regular expressions listed below match the same set of characters as their Rust counterparts. For example, $$alphabetic matches the same set of characters as Rust's char::is_alphabetic:

  • $$alphabetic
  • $$alphanumeric
  • $$ascii
  • $$ascii_alphabetic
  • $$ascii_alphanumeric
  • $$ascii_control
  • $$ascii_digit
  • $$ascii_graphic
  • $$ascii_hexdigit
  • $$ascii_lowercase
  • $$ascii_punctuation
  • $$ascii_uppercase
  • $$ascii_whitespace
  • $$control
  • $$lowercase
  • $$numeric
  • $$uppercase
  • $$whitespace

(Note that in the generated code we don't use Rust char methods. For simple cases like $$ascii we generate simple range checks. For more complicated cases like $$lowercase we generate a binary search table and run a binary search when checking a character.)
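
To illustrate, here is a minimal, self-contained sketch of such a range check (the two ranges below are toy examples, not the real generated tables):

// Inclusive (start, end) code point pairs, sorted by start.
static RANGES: [(u32, u32); 2] = [(0x61, 0x7A), (0xDF, 0xF6)]; // 'a'-'z', 'ß'-'ö'

fn in_ranges(c: char) -> bool {
    let c = c as u32;
    // Find the last range starting at or before `c`, then check that `c`
    // falls inside it.
    match RANGES.binary_search_by_key(&c, |&(start, _)| start) {
        Ok(_) => true,   // `c` is exactly a range start
        Err(0) => false, // `c` is before the first range
        Err(i) => c <= RANGES[i - 1].1,
    }
}

fn main() {
    assert!(in_ranges('z') && !in_ranges('A'));
}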

In addition, these two built-in regular expressions match Unicode XID_Start and XID_Continue:

  • $$XID_Start
  • $$XID_Continue

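For example, a Unicode-aware identifier rule can be written as follows (a sketch, reusing the Token type from the overview):

$$XID_Start $$XID_Continue* =>
    |lexer| {
        let token = Token::Id(lexer.match_().to_owned());
        lexer.return_(token)
    },
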
Rule syntax

A rule is just a <regex> => <user action>,. <regex> is as described above. <user action> is any Rust code with type fn(LexerHandle) -> LexerAction. More on these types below.

An alternative syntax without a right-hand side, <regex>,, can be used when the user action is just "continue" (more on user actions below).

Handle, rule, and action types

The lexer macro generates three types with names derived from the lexer name specified by the user. If the lexer name is Lexer, then these types are:

  • LexerAction: this is the type returned by user actions. You don't need to worry about the details of this type, as the handle type has methods for generating LexerActions.

  • LexerRule: see the LexerHandle::switch method below.

  • LexerHandle: this type is the argument type of user actions. It provides methods for manipulating user and lexer states, and getting the current match. The API is:

    • fn match_(&self) -> &str: returns the current match
    • fn peek(&mut self) -> Option<char>: looks ahead one character
    • fn state(&mut self) -> &mut <user state type>: returns a mutable reference to the user state
    • fn return_(self, token: <user token type>) -> LexerAction: returns the passed token as a match.
    • fn continue_(self) -> LexerAction: ignores the current match and continues lexing in the same lexer state. Useful for skipping whitespace and comments (see the sketch after this list).
    • fn switch(self, rule: LexerRule) -> LexerAction: used for switching between lexer states. The LexerRule is an enum with a variant for each rule set name, for example, LexerRule::Init. See the stateful lexer example below.
    • fn switch_and_return(self, rule: LexerRule, token: <user token type>) -> LexerAction: switches to the given lexer state and returns the given token.
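
For example, skipping whitespace can also be written explicitly with continue_; this sketch is equivalent to the right-hand-side-less whitespace rule in the overview:

rule Init {
    [' ' '\t' '\n']+ =>
        |lexer| lexer.continue_(),
}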

Stateful lexer example

Here's an example lexer that counts the number of =s that appear between two [s:

lexer! {
    Lexer(usize) -> usize;

    rule Init {
        ' ',

        '[' =>
            |mut lexer| {
                *lexer.state() = 0;                     // line 9
                lexer.switch(LexerRule::Count)          // line 10
            },
    }

    rule Count {
        '=' =>
            |mut lexer| {
                let n = *lexer.state();
                *lexer.state() = n + 1;                 // line 18
                lexer.continue_()                       // line 19
            },

        '[' =>
            |mut lexer| {
                let n = *lexer.state();
                lexer.switch_and_return(LexerRule::Init, n) // line 25
            },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(lexer.next(), Some(Ok((0, 0, 2))));
assert_eq!(lexer.next(), Some(Ok((3, 1, 6))));
assert_eq!(lexer.next(), Some(Ok((7, 2, 11))));
assert_eq!(lexer.next(), None);

Initially (the Init rule set) we skip spaces. When we see a [ we initialize the user state (line 9) and switch to the Count state (line 10). In Count, each = increments the user state by one (line 18) and skips the match (line 19). A [ in the Count state returns the current number and switches to the Init state (line 25).

Implementation details

lexgen's implementation should be fairly standard. Each rule set is compiled to a separate NFA. NFAs are then compiled to DFAs. DFAs are added to the same DFA type but there are no transitions between nodes of different DFAs: transitions between DFAs are done by user action, using the switch method of lexer handles, as described above.

The generated code for a DFA is basically a loop that iterates over the characters of the input string:

loop {
    match <lexer state> {
        S1 => {
            match <next character> {
                C1 => ...                  // transition to next state

                ...                        // other characters expected in this state

                _ => ...                   // for an accepting state, run the user
                                           // action; for a non-accepting state, fail
            }
        },
        ...                                // same stuff for other DFA states
    }
}
Issues
  • Windows-style line endings cause panic on Lua 5.1 example

    I haven't experimented much with the core package yet, but right off the bat I noticed that the Lua example works fine with Unix-style line endings; with Windows-style line endings, however, it panics.

    opened by fuwn 2
  • Overlapping ranges break lexing

    Here's the test:

    lexer! {
        Lexer -> usize;
    
        ['a'-'b'] '1' => 1,
        ['a'-'c'] '2' => 2,
    }
    
    let mut lexer = Lexer::new("a1");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
    assert_eq!(lexer.next(), None);
    
    let mut lexer = Lexer::new("a2");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
    assert_eq!(lexer.next(), None);
    

    Lexing a1 works fine, but we can't lex a2. I don't remember the code for adding ranges now and I haven't checked yet, but I think the bug is in the handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider the existing ranges in the same state ('a'-'b') and make sure that the overlapping parts of the ranges non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches, we need to go to the states for both '1' and '2'.

    bug 
    opened by osa1 2
  • Compile DFAs as lookup tables instead of linear if-elses

    It's not uncommon for a lexer state to be compiled to dozens of states. Currently we generate code like this:

    match self.state {
        0 => ...,
        1 => ...,
        ...
        _ => ...,
    }
    

    For N states this means N-1 comparisons in the worst case to take the last branch.

    Instead we should compile each state to a function, then put them all in an array:

    static STATES: [fn(...) -> ...; N] = [...];
    

    Then just do STATES[self.state](...);.
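
    A self-contained sketch of the idea (toy states and transitions, not lexgen's actual generated code):

    type StateFn = fn(char) -> Option<usize>;

    // State 0: initial; an ASCII letter moves to state 1.
    fn state0(c: char) -> Option<usize> {
        if c.is_ascii_alphabetic() { Some(1) } else { None }
    }

    // State 1: letters or digits keep the machine in state 1.
    fn state1(c: char) -> Option<usize> {
        if c.is_ascii_alphanumeric() { Some(1) } else { None }
    }

    static STATES: [StateFn; 2] = [state0, state1];

    fn main() {
        let mut state = 0;
        for c in "a1b2".chars() {
            match STATES[state](c) {
                // A single indexed call per character replaces the N-way match.
                Some(next) => state = next,
                None => break,
            }
        }
        assert_eq!(state, 1);
    }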

    perf 
    opened by osa1 2
  • Consider inlining accepting states without transitions

    I will come up with a minimal repro, but here's an example for now:

    match self.state {
        ...
        56usize => match self.iter.peek().copied() {
            ...
            Some((char_idx, char)) => match char {
                '\u{5c}' => {
                    self.current_match_end += char.len_utf8();
                    let _ = self.iter.next();
                    self.state = 58usize;
                }
                ...
            },
        },
        ...
        58usize => {
            let rhs: fn(LexerHandle<'_, 'input>) -> LexerAction<Token> =
                |lexer| lexer.switch(LexerRule::StringEscape);
            let str = &self.input[self.current_match_start..self.current_match_end];
            let handle = LexerHandle {
                iter: &mut self.iter,
                match_: str,
                user_state: &mut self.user_state,
            };
            match rhs(handle) {
                LexerAction::Continue => {
                    self.state = self.initial_state;
                    continue;
                }
                LexerAction::Return(tok) => {
                    self.state = self.initial_state;
                    return Some(Ok((
                        self.current_match_start,
                        tok,
                        self.current_match_end,
                    )));
                }
                LexerAction::Switch(rule_set) => {
                    self.switch(rule_set);
                    continue;
                }
                LexerAction::SwitchAndReturn(tok, rule_set) => {
                    self.switch(rule_set);
                    return Some(Ok((
                        self.current_match_start,
                        tok,
                        self.current_match_end,
                    )));
                }
            }
        }
        ...
    }
    

    State 56 has a transition to 58. State 58 is an accepting state and it doesn't have any transitions.

    In general, when a state is an accepting state and doesn't have any transitions, we can "inline" it and avoid (1) extra states and (2) redundant looping (no need to update the state and continue).

    In the example, state 58 is only referred to by 56. In general I'm guessing this doesn't have to be the case, but I suspect it may still be OK to inline accepting states.

    perf 
    opened by osa1 1
  • Handle overlapping ranges

    Previously we implicitly assumed that character ranges in an NFA/DFA state cannot overlap, so, for example, we could not have the rules '0'-'9' and '5'-'9' in the same NFA/DFA state. This is obviously incorrect, as reported in #2.

    This patch introduces a "range map" that maps u32 ranges to values. A range can have multiple values (used in NFA-to-DFA conversion). When two ranges overlap, the overlapping part has the union of the values of the overlapping ranges.

    Using this type instead of HashMap<(char, char), HashSet<NfaStateIdx>> automatically fixes the issue, as the handling of overlapping ranges effectively implements switching to the states of all of the overlapping ranges in the NFA. So most of the code is just implementing the RangeMap type. The rest is tests and replacing hash maps with RangeMap.
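
    To illustrate the semantics (a naive stand-in, not the actual RangeMap implementation), the sketch below splits the input ranges at all boundary points and takes the union of the values of the covering ranges. For the example from #2 it produces 'a'-'b' => {1, 2} and 'c'-'c' => {2}:

    use std::collections::BTreeSet;

    // Naive range-map construction: for every elementary interval between
    // consecutive boundaries, collect the values of all covering ranges.
    fn range_map(ranges: &[(u32, u32, usize)]) -> Vec<(u32, u32, BTreeSet<usize>)> {
        // Boundaries: every range start, and one past every range end.
        let mut bounds = BTreeSet::new();
        for &(start, end, _) in ranges {
            bounds.insert(start);
            bounds.insert(end + 1);
        }
        let bounds: Vec<u32> = bounds.into_iter().collect();

        let mut result = Vec::new();
        for w in bounds.windows(2) {
            let (lo, hi) = (w[0], w[1] - 1);
            let values: BTreeSet<usize> = ranges
                .iter()
                .filter(|&&(start, end, _)| start <= lo && hi <= end)
                .map(|&(_, _, v)| v)
                .collect();
            if !values.is_empty() {
                result.push((lo, hi, values));
            }
        }
        result
    }

    fn main() {
        // 'a'-'b' maps to state 1, 'a'-'c' maps to state 2.
        let m = range_map(&[('a' as u32, 'b' as u32, 1), ('a' as u32, 'c' as u32, 2)]);
        assert_eq!(m.len(), 2); // 'a'-'b' => {1, 2} and 'c'-'c' => {2}
    }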

    I also benchmarked this patch. For the Lua 5.1 lexer, this PR reduces compile times by 9%. The generated code is identical except for the order of alternatives in or-patterns for ranges.

    Fixes #2

    opened by osa1 0
  • Implement an easy way to get the matched char in a wildcard rule

    Currently getting the matched character in a wildcard rule is quite verbose (and probably also inefficient):

    _ => |mut lexer| {
        let char = lexer.match_().chars().next_back().unwrap();
        ...
    }
    

    One easy fix would be to add a char method to the lexer handle that returns the last matched character.

    Alternatively with #9 we could allow <char:_> => ... syntax.

    enhancement 
    opened by osa1 0
  • Implement a way to bind regexes in patterns

    Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

    "\\x" $hex_digit $hex_digit => |mut lexer| {
        let match_ = lexer.match_();
        let bytes = match_.as_bytes();
    
        let digit1 = bytes[bytes.len() - 2];
        let digit2 = bytes[bytes.len() - 1];
    
        ...
    },
    

    The problem is that the match will contain the input since the beginning of the current match, so, for example, if this is in a string lexing rule and the string is "blah blah \xAA", then the match will be the entire string except the closing ".

    One workaround is to add another rule for the hex digits and push them to a buffer in each match, but that's also verbose.
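
    A sketch of that workaround (assuming a user state with a String field named buffer, and the $hex_digit binding from above):

    rule HexEscape {
        $hex_digit =>
            |mut lexer| {
                // next_back() gets the last matched character regardless of
                // where the current match started.
                let digit = lexer.match_().chars().next_back().unwrap();
                lexer.state().buffer.push(digit);
                lexer.continue_()
            },
    }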

    If we had a way to bind regexes inside patterns we could do:

    "\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
        ...
    },
    

    where d1 and d2 would be chars.

    enhancement 
    opened by osa1 0
  • Consider moving semantic actions to functions, leave inlining decisions to LLVM/rustc

    With #7 we now generate much more efficient code, but in principle it could lead to code size explosion when some semantic actions have large bodies and lots of incoming edges in the DFA. I haven't seen this happen in practice yet, but I think it could.

    If we introduce a top-level function for semantic actions, then duplicated code will just be a call expression to the semantic action function. Inlining the semantic action code is then left to LLVM/rustc. I think this would give us more or less the same result for small semantic actions and avoid code size blowup in the worst case.

    perf code size 
    opened by osa1 0
  • Implement Logos benchmarks in lexgen

    https://github.com/maciejhirsz/logos/blob/master/tests/benches/bench.rs

    We don't have any runtime perf benchmarks yet; it would be good to port these to lexgen and compare.

    opened by osa1 0
  • Implement fast paths for ASCII in unicode regexes

    Some of the search tables for built-in unicode regular expressions are quite large, but I think 99.9999% of the time they will match ASCII characters, so we should implement a fast path for that. For example, XID_Start currently makes 9 range comparisons to match 'a'.

    opened by osa1 0
Owner
Ömer Sinan Ağacan