Implement a way to bind match results in patterns #9

Open
osa1 opened this issue Jun 19, 2021 · 3 comments
Labels
feature New feature or request

Comments

osa1 (Owner) commented Jun 19, 2021

Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

"\\x" $hex_digit $hex_digit => |mut lexer| {
    let match_ = lexer.match_();
    let bytes = match_.as_bytes();

    let digit1 = bytes[bytes.len() - 2];
    let digit2 = bytes[bytes.len() - 1];

    ...
},

The problem is that the match spans everything from the beginning of the current match, so, for example, if this pattern is in a string lexing rule and the string is "blah blah \xAA", the match will be the entire string except the closing ".

One workaround is to add another rule for the hex digits and push them to a buffer on each match (see the sketch below), but that's also verbose.

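Roughly, the workaround looks like this (only a sketch: it assumes a user state with a hex_digits: Vec<char> buffer, and the rule names here are made up):

rule String {
    ...

    "\\x" => |mut lexer| {
        lexer.switch(LexerRule::HexEscape)
    },
}

rule HexEscape {
    $hex_digit => |mut lexer| {
        // Take the digit we just matched and push it into the user-state buffer
        let digit = lexer.match_().chars().next_back().unwrap();
        lexer.state().hex_digits.push(digit);
        if lexer.state().hex_digits.len() == 2 {
            // Both digits collected; return to the string rule
            lexer.switch(LexerRule::String)
        } else {
            lexer.continue_()
        }
    },
}
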
If we had a way to bind regexes inside patterns we could do:

"\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
    ...
},

where d1 and d2 would be chars.

osa1 added the feature label on Jun 19, 2021
osa1 changed the title from "Implement a way to bind regexes in patterns" to "Implement a way to bind match results in patterns" on Oct 22, 2021
osa1 (Owner) commented Oct 30, 2021

Here's another use case. Rust integers can be lexed like this:

rule Init {
    ...

    ("0b" | "0o" | "0x")? ($digit | '_')* $id? =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(match_))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> { ... }

With this implementation, check_int needs to split the string into prefix, digits, and suffix, even though we already did that in the lexer.

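For illustration, check_int has to redo that splitting, roughly like this (just a sketch; Token and CustomError are the user's types as above, and the actual validation is elided):

fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> {
    // Re-derive the prefix the lexer already recognized
    let (prefix, rest) = if match_.starts_with("0b") || match_.starts_with("0o") || match_.starts_with("0x") {
        match_.split_at(2)
    } else {
        ("", match_)
    };

    // Re-derive where the digits end and the suffix begins
    let digits_end = rest
        .find(|c: char| !c.is_ascii_digit() && c != '_')
        .unwrap_or(rest.len());
    let (digits, suffix) = rest.split_at(digits_end);

    // ... check `digits` against `prefix`, check that there is at least one digit,
    // and check that `suffix` is a valid integer suffix ("i32", "u64", ...) ...
    todo!()
}
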
With the proposed feature, it could be implemented as:

rule Init {
    ...

    <prefix:("0b" | "0o" | "0x")?> <digits:($digit | '_')*> <suffix:$id?> =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(prefix, digits, suffix))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(prefix: &'input str, digits: &'input str, suffix: &'input str) -> Result<Token, CustomError> { ... }

osa1 (Owner) commented Nov 1, 2021

I'm not sure how to implement this feature yet, but here's an idea. Suppose I have a rule with these regexes:

<a:"aaa"> <b:"a"> <c:"x"> => ...,
<a:"a"> <b:"aaa"> <c:"y"> => ...,

In the generated code, after seeing four 'a's, we will need a "match stack" with these contents:

[
    [(0, 3), (3, 4)], // bound matches for the first rule
    [(0, 1), (1, 4)], // bound matches for the second rule
]

After four 'a's, if the next character is x, we extend the first match stack with (4, 5) and run the semantic action with the ranges in that list bound. If the next character is y, we do the same using the second list.

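A sketch of what the runtime side could look like (all names here are made up; this is not lexgen's actual code):

// Purely illustrative: one stack of byte ranges per regex that binds variables.
// `usize::MAX` plays the role of the "?" above: a range that has started but
// whose end position is not known yet.
struct MatchStacks {
    stacks: Vec<Vec<(usize, usize)>>,
}

impl MatchStacks {
    // A new bound match starts at input position `pos` for regex `regex_idx`
    fn start(&mut self, regex_idx: usize, pos: usize) {
        self.stacks[regex_idx].push((pos, usize::MAX));
    }

    // The bound match that is currently open for regex `regex_idx` ends at `pos`
    fn end(&mut self, regex_idx: usize, pos: usize) {
        self.stacks[regex_idx].last_mut().unwrap().1 = pos;
    }
}
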
Conceptually, it seems like we could annotate NFA nodes with "start match" and "end match". In the example above, initially we will have 2 NFAs:

regex 1: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(x)-> 5 (accept)
regex 2: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(y)-> 5 (accept)

We add the captures to these NFAs like this:

regex 1: 0 (init, start match) -(a)-> 1 -(a)-> 2 -(a)-> 3 (end match, start match) -(a)-> 4 (end match, start match) -(x)-> 5 (end match, accept)
regex 2: 0 (init, start match) -(a)-> 1 (end match, start match) -(a)-> 2 -(a)-> 3 -(a)-> 4 (end match, start match) -(y)-> 5 (end match, accept)

We will also need the regex number for these "start match" and "end match" annotations, to be able to update the right match stack in the match stack list when starting/ending a match.

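Conceptually, an NFA state annotated this way could look something like the following (purely illustrative; lexgen's actual NFA representation may differ):

// An NFA state that carries capture annotations. The regex indices say
// which match stack to update when we pass through this state.
struct NfaState {
    transitions: Vec<(char, usize)>, // (input char, target state)
    start_match: Vec<usize>,         // regexes whose bound match starts here
    end_match: Vec<usize>,           // regexes whose bound match ends here
    accepting: Option<usize>,        // which regex's semantic action to run, if any
}
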
Once we have these, we can generate the DFA like this:

    0 (init, start match 1, start match 2)      match stacks = [(0, ?)], [(0, ?)]
-(a)->
    1 (end match 2, start match 2)              match stacks = [(0, ?)], [(0, 1), (1, ?)]
-(a)->
    2
-(a)->
    3 (end match 1, start match 1)              match stacks = [(0, 3), (3, ?)], [(0, 1), (1, ?)]
-(a)->
    4 (end match 1 2, start match 1 2)          match stacks = [(0, 3), (3, 4), (4, ?)], [(0, 1), (1, 4), (4, ?)]

(Notation is a bit strange, but hopefully conveys the idea)

Now in state 4 if we see an x:

-(x)->
    5 (end match 1, accept)                     match stacks = [(0, 3), (3, 4), (4, 5)], [(0, 1), (1, 4), (4, ?)]

Now we bind the variables in the first regex to the ranges in the first match stack and run its semantic action. The same happens if we see a y, except we bind the second regex's variables using the second match stack.

Regexes that don't bind variables do not need a match stack, so the maximum number of stacks will be quite small for real-world examples. I doubt we will see more than 5-6 stacks at a time.

We know the size of each match stack at compile time, so we can allocate each stack with the right capacity.

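For the two regexes above, for instance, that could simply be (illustrative only):

// Each regex binds three ranges (a, b, c), so each stack needs capacity 3
let mut match_stacks: Vec<Vec<(usize, usize)>> = vec![
    Vec::with_capacity(3), // stack for regex 1
    Vec::with_capacity(3), // stack for regex 2
];
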
Lexers that don't use this feature will still be allocation-free. Others will allocate a vector every time we start matching a regex that uses this feature. In that sense this feature is "zero cost" (i.e. you don't pay a price if you don't use it).

Match stacks are reset when we run a semantic action. We remove the match stack used in the semantic action and clear the list.

TODO: Need to figure out how to implement backtracking.

osa1 (Owner) commented Nov 1, 2021

It seems difficult to support this syntax in full generality. For example:

('a' <x:'b'>? 'c')+ => ...

Here the type of x is Vec<Option<&'input str>>. To maintain this match stack, we need to push it when we enter the state for this regex (instead of when entering a captured regex, as in the example in my previous comment above), and then every time we see a 'b' we push a Some(...), and every time we see a 'c' without a preceding 'b' we push a None.

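To make the difficulty concrete (purely illustrative):

// For the input "abcacabc" the group repeats three times and the optional 'b'
// is only present in the first and third iterations, so the binding would have to be:
let x: Vec<Option<&str>> = vec![Some("b"), None, Some("b")];
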
No idea how to generalize the idea in my previous comment to handle this case.
