lexgen: A fully-featured lexer generator, implemented as a proc macro

lexer! {
    // First line specifies name of the lexer and the token type returned by
    // user actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in user actions.
    rule Init {
        // Rules without a right-hand side can be used for skipping whitespace, comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* =>
            |lexer| {
                let token = Token::Id(lexer.match_().to_owned());
                lexer.return_(token)
            },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item=Result<(usize, T, usize), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and the `usize`s are the byte indices of the beginning and end
// of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((1, Token::Id("abc123Q-t".to_owned()), 10)))
);
assert_eq!(
    lexer.next(),
    Some(Ok((12, Token::Id("z9_9".to_owned()), 16)))
);
assert_eq!(lexer.next(), None);

You can see more examples in the lexgen repository, including a full Lua 5.1 lexer.

Motivation

Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy to use lexer generator.

Usage

lexgen doesn't require a build step. Just add it as a dependency in your Cargo.toml.
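
For example (the version below is a placeholder; check crates.io for the latest release):

[dependencies]
lexgen = "<version>"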

Lexer syntax

A lexgen lexer definition starts with the type of the generated lexer struct, an optional user state part, and the token type (the type of values returned by user actions). Example:

lexer! {
    Lexer(LexerState) -> Token;
    ...
}

Here the lexer struct is named Lexer. The user state type is LexerState (this type should be defined by the user). The token type is Token.

Next come let bindings for regular expressions. These are optional. The syntax is let <id> = <regex>;, where <id> is a Rust identifier and <regex> is as described below.

let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

Finally we define the lexer rules:

rule Init {
    ...
}

rule SomeOtherRule {
    ...
}

The first rule set defines the initial state of the lexer and needs to be named Init.

In the body of a rule block we define the rules for that lexer state. The syntax for a rule is <regex> => <user action>,. Regex syntax is described below. A user action is any Rust code with type fn(LexerHandle) -> LexerAction, where LexerHandle and LexerAction are generated types whose names are derived from the lexer name (Lexer). More on these types below.

You can omit the rule Init { ... } part and have all of your rules at the top level if you don't need rule sets.
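
For example, here is a minimal sketch of a lexer with all rules at the top level (reusing the Token type from the overview above):

lexer! {
    Lexer -> Token;

    // Skip whitespace
    [' ' '\t' '\n']+,

    // Match lowercase identifiers
    ['a'-'z']+ =>
        |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
}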

In summary:

  • The first line is of the form <lexer name>(<user state type>) -> <token type name>. The (<user state type>) part can be omitted for stateless lexers.

  • Next we have let bindings for regexes. This part is optional.

  • Next come the rule sets. There should be at least one rule set named Init, which is the initial state.

Regex syntax

Regex syntax can be used on the right-hand side of let bindings and on the left-hand side of rules. The syntax is:

  • $var for variables defined in the let binding section. Variables need to be defined before use.

  • $$var for built-in regexes (see "Built-in regular expressions" section below).

  • Rust character syntax for characters, e.g. 'a'.

  • Rust string syntax for strings, e.g. "abc".

  • [...] for character sets. Inside the brackets you can have one or more of:

    • Characters
    • Character ranges: e.g. 'a'-'z'

    Here's an example character set for ASCII alphanumerics: ['a'-'z' 'A'-'Z' '0'-'9']

  • <regex>* for zero or more repetitions of <regex>

  • <regex>+ for one or more repetitions of <regex>

  • <regex>? for zero or one repetition of <regex>

  • <regex> <regex> for concatenation

  • <regex> | <regex> for alternation (match the first one, or the second one)

  • _ can only appear at the top-level (in the LHS of a rule) and matches when none of the other rules match.

*, +, and ? have the same binding power. | has the least binding power. You can use parentheses for grouping, e.g. ('a' | 'b')*.
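
As an example that combines these operators, here is a hypothetical let binding for a signed decimal number:

let digit = ['0'-'9'];
let number = ('-' | '+')? $digit+ ('.' $digit+)?;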

Built-in regular expressions

lexgen comes with a set of built-in regular expressions. Regular expressions listed below match the same set of characters as their Rust counterparts. For example, $$alphabetic matches the same set of characters as Rust's char::is_alphabetic:

  • $$alphabetic
  • $$alphanumeric
  • $$ascii
  • $$ascii_alphabetic
  • $$ascii_alphanumeric
  • $$ascii_control
  • $$ascii_digit
  • $$ascii_graphic
  • $$ascii_hexdigit
  • $$ascii_lowercase
  • $$ascii_punctuation
  • $$ascii_uppercase
  • $$ascii_whitespace
  • $$control
  • $$lowercase
  • $$numeric
  • $$uppercase
  • $$whitespace

(Note that in the generated code we don't use Rust char methods. For simple cases like $$ascii we generate simple range checks. For more complicated cases like $$lowercase we generate a binary search table and run a binary search when checking a character.)
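
To illustrate, here is a minimal, self-contained sketch of such a range check (the two ranges below are toy examples, not the real generated tables):

// Inclusive (start, end) code point pairs, sorted by start.
static RANGES: [(u32, u32); 2] = [(0x61, 0x7A), (0xDF, 0xF6)]; // 'a'-'z', 'ß'-'ö'

fn in_ranges(c: char) -> bool {
    let c = c as u32;
    // Find the last range starting at or before `c`, then check that `c`
    // falls inside it.
    match RANGES.binary_search_by_key(&c, |&(start, _)| start) {
        Ok(_) => true,   // `c` is exactly a range start
        Err(0) => false, // `c` is before the first range
        Err(i) => c <= RANGES[i - 1].1,
    }
}

fn main() {
    assert!(in_ranges('z') && !in_ranges('A'));
}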

In addition, these two built-in regular expressions match Unicode XID_Start and XID_Continue:

  • $$XID_Start
  • $$XID_Continue

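For example, a Unicode-aware identifier rule can be written as follows (a sketch, reusing the Token type from the overview):

$$XID_Start $$XID_Continue* =>
    |lexer| {
        let token = Token::Id(lexer.match_().to_owned());
        lexer.return_(token)
    },
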
Rule syntax

A rule is just a <regex> => <user action>,. <regex> is as described above. <user action> is any Rust code with type fn(LexerHandle) -> LexerAction. More on these types below.

An alternative syntax without a right-hand side, <regex>,, can be used when the user action is just "continue" (more on user actions below).

Handle, rule, and action types

The lexer macro generates three types with names derived from the lexer name specified by the user. If the lexer name is Lexer, then these types are:

  • LexerAction: this is the type returned by user actions. You don't need to worry about the details of this type, as the handle type has methods for generating LexerActions.

  • LexerRule: see the LexerHandle::switch method below.

  • LexerHandle: this type is the argument type of user actions. It provides methods for manipulating user and lexer states, and getting the current match. The API is:

    • fn match_(&self) -> &str: returns the current match
    • fn peek(&mut self) -> Option<char>: looks ahead one character
    • fn state(&mut self) -> &mut <user state type>: returns a mutable reference to the user state
    • fn return_(self, token: <user token type>) -> LexerAction: returns the passed token as a match.
    • fn continue_(self) -> LexerAction: ignores the current match and continues lexing in the same lexer state. Useful for skipping whitespace and comments (see the sketch after this list).
    • fn switch(self, rule: LexerRule) -> LexerAction: used for switching between lexer states. The LexerRule is an enum with a variant for each rule set name, for example, LexerRule::Init. See the stateful lexer example below.
    • fn switch_and_return(self, rule: LexerRule, token: <user token type>) -> LexerAction: switches to the given lexer state and returns the given token.
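
For example, skipping whitespace can also be written explicitly with continue_; this sketch is equivalent to the right-hand-side-less whitespace rule in the overview:

rule Init {
    [' ' '\t' '\n']+ =>
        |lexer| lexer.continue_(),
}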

Stateful lexer example

Here's an example lexer that counts the number of =s that appear between two [s:

lexer! {
    Lexer(usize) -> usize;

    rule Init {
        ' ',

        '[' =>
            |mut lexer| {
                *lexer.state() = 0;                     // line 9
                lexer.switch(LexerRule::Count)          // line 10
            },
    }

    rule Count {
        '=' =>
            |mut lexer| {
                let n = *lexer.state();
                *lexer.state() = n + 1;                 // line 18
                lexer.continue_()                       // line 19
            },

        '[' =>
            |mut lexer| {
                let n = *lexer.state();
                lexer.switch_and_return(LexerRule::Init, n) // line 25
            },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(lexer.next(), Some(Ok((0, 0, 2))));
assert_eq!(lexer.next(), Some(Ok((3, 1, 6))));
assert_eq!(lexer.next(), Some(Ok((7, 2, 11))));
assert_eq!(lexer.next(), None);

Initially (the Init rule set) we skip spaces. When we see a [ we initialize the user state (line 9) and switch to the Count state (line 10). In Count, each = increments the user state by one (line 18) and skips the match (line 19). A [ in the Count state returns the current number and switches to the Init state (line 25).

Implementation details

lexgen's implementation should be fairly standard. Each rule set is compiled to a separate NFA. NFAs are then compiled to DFAs. DFAs are added to the same DFA type but there are no transitions between nodes of different DFAs: transitions between DFAs are done by user action, using the switch method of lexer handles, as described above.

The generated code for a DFA is basically a loop that iterates over the characters of the input string:

loop {
    match <lexer state> {
        S1 => {
            match <next character> {
                C1 => ...                  // transition to next state

                ...                        // other characters expected in this state

                _ => ...                   // for an accepting state, run the user
                                           // action; for a non-accepting state, fail
            }
        },
        ...                                // same stuff for other DFA states
    }
}
Issues
  • Windows-style line endings cause panic on Lua 5.1 example

    I haven't experimented much with the core package yet, but right off the bat I noticed that the Lua example works fine with Unix-style line endings; with Windows-style line endings, however, it panics.

    opened by fuwn 2
  • Overlapping ranges break lexing

    Here's the test:

    lexer! {
        Lexer -> usize;
    
        ['a'-'b'] '1' => 1,
        ['a'-'c'] '2' => 2,
    }
    
    let mut lexer = Lexer::new("a1");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
    assert_eq!(lexer.next(), None);
    
    let mut lexer = Lexer::new("a2");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
    assert_eq!(lexer.next(), None);
    

    Lexing a1 works fine, but we can't lex a2. I don't remember the code for adding ranges now and I haven't checked yet, but I think the bug is in the handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider the existing ranges in the same state ('a'-'b') and make sure that the overlapping parts of the ranges non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches, we need to go to the states for both '1' and '2'.

    bug 
    opened by osa1 2
  • Compile DFAs as lookup tables instead of linear if-elses

    It's not uncommon for a lexer state to be compiled to dozens of states. Currently we generate code like this:

    match self.state {
        0 => ...,
        1 => ...,
        ...
        _ => ...,
    }
    

    For N states this means N-1 comparisons in the worst case to take the last branch.

    Instead we should compile each state to a function, then put them all in an array:

    static STATES: [fn(...) -> ...; N] = [...];
    

    Then just do STATES[self.state](...);.
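
    A self-contained sketch of the idea (toy states and transitions, not lexgen's actual generated code):

    type StateFn = fn(char) -> Option<usize>;

    // State 0: initial; an ASCII letter moves to state 1.
    fn state0(c: char) -> Option<usize> {
        if c.is_ascii_alphabetic() { Some(1) } else { None }
    }

    // State 1: letters or digits keep the machine in state 1.
    fn state1(c: char) -> Option<usize> {
        if c.is_ascii_alphanumeric() { Some(1) } else { None }
    }

    static STATES: [StateFn; 2] = [state0, state1];

    fn main() {
        let mut state = 0;
        for c in "a1b2".chars() {
            match STATES[state](c) {
                // A single indexed call per character replaces the N-way match.
                Some(next) => state = next,
                None => break,
            }
        }
        assert_eq!(state, 1);
    }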

    perf 
    opened by osa1 2
  • Consider inlining accepting states without transitions

    I will come up with a minimal repro, but here's an example for now:

    match self.state {
        ...
        56usize => match self.iter.peek().copied() {
            ...
            Some((char_idx, char)) => match char {
                '\u{5c}' => {
                    self.current_match_end += char.len_utf8();
                    let _ = self.iter.next();
                    self.state = 58usize;
                }
                ...
            },
        },
        ...
        58usize => {
            let rhs: fn(LexerHandle<'_, 'input>) -> LexerAction<Token> =
                |lexer| lexer.switch(LexerRule::StringEscape);
            let str = &self.input[self.current_match_start..self.current_match_end];
            let handle = LexerHandle {
                iter: &mut self.iter,
                match_: str,
                user_state: &mut self.user_state,
            };
            match rhs(handle) {
                LexerAction::Continue => {
                    self.state = self.initial_state;
                    continue;
                }
                LexerAction::Return(tok) => {
                    self.state = self.initial_state;
                    return Some(Ok((
                        self.current_match_start,
                        tok,
                        self.current_match_end,
                    )));
                }
                LexerAction::Switch(rule_set) => {
                    self.switch(rule_set);
                    continue;
                }
                LexerAction::SwitchAndReturn(tok, rule_set) => {
                    self.switch(rule_set);
                    return Some(Ok((
                        self.current_match_start,
                        tok,
                        self.current_match_end,
                    )));
                }
            }
        }
        ...
    }
    

    State 56 has a transition to 58. State 58 is an accepting state and it doesn't have any transitions.

    In general, when a state is an accepting state and doesn't have any transitions, we can "inline" it and avoid (1) extra states and (2) redundant looping (no need to update the state and continue).

    In the example, state 58 is only referred to by 56. In general I'm guessing this doesn't have to be the case, but I suspect it may still be OK to inline accepting states.

    perf 
    opened by osa1 1
  • Handle overlapping ranges

    Previously we implicitly assumed that character ranges in an NFA/DFA state cannot overlap, so, for example, we could not have the rules '0'-'9' and '5'-'9' in the same NFA/DFA state. This is obviously incorrect, as reported in #2.

    This patch introduces a "range map" that maps u32 ranges to values. A range can have multiple values (used in NFA-to-DFA conversion). When two ranges overlap, the overlapping part has the union of the values of the overlapping ranges.

    Using this type instead of HashMap<(char, char), HashSet<NfaStateIdx>> automatically fixes the issue, as the handling of overlapping ranges effectively implements switching to the states of all of the overlapping ranges in the NFA. So most of the code is just implementing the RangeMap type. The rest is tests and replacing hash maps with RangeMap.
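
    To illustrate the semantics (a naive stand-in, not the actual RangeMap implementation), the sketch below splits the input ranges at all boundary points and takes the union of the values of the covering ranges. For the example from #2 it produces 'a'-'b' => {1, 2} and 'c'-'c' => {2}:

    use std::collections::BTreeSet;

    // Naive range-map construction: for every elementary interval between
    // consecutive boundaries, collect the values of all covering ranges.
    fn range_map(ranges: &[(u32, u32, usize)]) -> Vec<(u32, u32, BTreeSet<usize>)> {
        // Boundaries: every range start, and one past every range end.
        let mut bounds = BTreeSet::new();
        for &(start, end, _) in ranges {
            bounds.insert(start);
            bounds.insert(end + 1);
        }
        let bounds: Vec<u32> = bounds.into_iter().collect();

        let mut result = Vec::new();
        for w in bounds.windows(2) {
            let (lo, hi) = (w[0], w[1] - 1);
            let values: BTreeSet<usize> = ranges
                .iter()
                .filter(|&&(start, end, _)| start <= lo && hi <= end)
                .map(|&(_, _, v)| v)
                .collect();
            if !values.is_empty() {
                result.push((lo, hi, values));
            }
        }
        result
    }

    fn main() {
        // 'a'-'b' maps to state 1, 'a'-'c' maps to state 2.
        let m = range_map(&[('a' as u32, 'b' as u32, 1), ('a' as u32, 'c' as u32, 2)]);
        assert_eq!(m.len(), 2); // 'a'-'b' => {1, 2} and 'c'-'c' => {2}
    }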

    I also benchmarked this patch. For the Lua 5.1 lexer, this PR reduces compile times by 9%. The generated code is identical except for the order of alternatives in or-patterns for ranges.

    Fixes #2

    opened by osa1 0
  • Implement an easy way to get the matched char in a wildcard rule

    Currently getting the matched character in a wildcard rule is quite verbose (and probably also inefficient):

    _ => |mut lexer| {
        let char = lexer.match_().chars().next_back().unwrap();
        ...
    }
    

    One easy fix would be to add a char method to the lexer handle that returns the last matched character.

    Alternatively with #9 we could allow <char:_> => ... syntax.

    enhancement 
    opened by osa1 0
  • Implement a way to bind regexes in patterns

    Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

    "\\x" $hex_digit $hex_digit => |mut lexer| {
        let match_ = lexer.match_();
        let bytes = match_.as_bytes();
    
        let digit1 = bytes[bytes.len() - 2];
        let digit2 = bytes[bytes.len() - 1];
    
        ...
    },
    

    The problem is that the match will contain the input since the beginning of the current match, so, for example, if this is in a string lexing rule and the string is "blah blah \xAA", then the match will be the entire string except the closing ".

    One workaround is to add another rule for the hex digits and push them to a buffer in each match, but that's also verbose.
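
    A sketch of that workaround (assuming a user state with a String field named buffer, and the $hex_digit binding from above):

    rule HexEscape {
        $hex_digit =>
            |mut lexer| {
                // next_back() gets the last matched character regardless of
                // where the current match started.
                let digit = lexer.match_().chars().next_back().unwrap();
                lexer.state().buffer.push(digit);
                lexer.continue_()
            },
    }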

    If we had a way to bind regexes inside patterns we could do:

    "\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
        ...
    },
    

    where d1 and d2 would be chars.

    enhancement 
    opened by osa1 0
  • Consider moving semantic actions to functions, leave inlining decisions to LLVM/rustc

    With #7 we now generate much more efficient code, but in principle it could lead to code size explosion when some semantic actions have large bodies and lots of incoming edges in the DFA. I haven't seen this happen in practice yet, but I think it could.

    If we introduce a top-level function for semantic actions, then duplicated code will just be a call expression to the semantic action function. Inlining the semantic action code is then left to LLVM/rustc. I think this would give us more or less the same result for small semantic actions and avoid code size blowup in the worst case.

    perf code size 
    opened by osa1 0
  • Implement Logos benchmarks in lexgen

    https://github.com/maciejhirsz/logos/blob/master/tests/benches/bench.rs

    We don't have any runtime perf benchmarks yet; it would be good to port these to lexgen and compare.

    opened by osa1 0
  • Implement fast paths for ASCII in unicode regexes

    Some of the search tables for built-in unicode regular expressions are quite large, but I think 99.9999% of the time they will match ASCII characters, so we should implement a fast path for that. For example, XID_Start currently makes 9 range comparisons to match 'a'.

    opened by osa1 0
Owner
Ömer Sinan Ağacan