Implement a way to bind match results in patterns #9

Open
osa1 opened this issue Jun 19, 2021 · 3 comments
Labels
feature New feature or request

Comments

osa1 (Owner) commented Jun 19, 2021

Suppose I want to parse \xHH syntax in a rule (H is a hex digit). Currently it's quite painful to get the chars for those digits in a larger rule:

"\\x" $hex_digit $hex_digit => |mut lexer| {
    let match_ = lexer.match_();
    let bytes = match_.as_bytes();

    let digit1 = bytes[bytes.len() - 2];
    let digit2 = bytes[bytes.len() - 1];

    ...
},

The problem is that the match spans everything from the beginning of the current match, so, for example, if this pattern is in a string lexing rule and the string is "blah blah \xAA", the match will be the entire string except the closing ".

One workaround is to add another rule for the hex digits and push them to a buffer on each match (see the sketch below), but that's also verbose.

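Roughly, the workaround looks like this (only a sketch: it assumes a user state with a hex_digits: Vec<char> buffer, and the rule names here are made up):

rule String {
    ...

    "\\x" => |mut lexer| {
        lexer.switch(LexerRule::HexEscape)
    },
}

rule HexEscape {
    $hex_digit => |mut lexer| {
        // Take the digit we just matched and push it into the user-state buffer
        let digit = lexer.match_().chars().next_back().unwrap();
        lexer.state().hex_digits.push(digit);
        if lexer.state().hex_digits.len() == 2 {
            // Both digits collected; return to the string rule
            lexer.switch(LexerRule::String)
        } else {
            lexer.continue_()
        }
    },
}
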
If we had a way to bind regexes inside patterns we could do:

"\\x" <d1:$hex_digit> <d2:$hex_digit> => |mut lexer| {
    ...
},

where d1 and d2 would be chars.

osa1 added the feature label on Jun 19, 2021
osa1 changed the title from "Implement a way to bind regexes in patterns" to "Implement a way to bind match results in patterns" on Oct 22, 2021
osa1 (Owner) commented Oct 30, 2021

Here's another use case. Rust integers can be lexed like this:

rule Init {
    ...

    ("0b" | "0o" | "0x")? ($digit | '_')* $id? =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(match_))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> { ... }

With this implementation, check_int needs to split the string into prefix, digits, and suffix, even though we already did that in the lexer.

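For illustration, check_int has to redo that splitting, roughly like this (just a sketch; Token and CustomError are the user's types as above, and the actual validation is elided):

fn check_int<'input>(match_: &'input str) -> Result<Token, CustomError> {
    // Re-derive the prefix the lexer already recognized
    let (prefix, rest) = if match_.starts_with("0b") || match_.starts_with("0o") || match_.starts_with("0x") {
        match_.split_at(2)
    } else {
        ("", match_)
    };

    // Re-derive where the digits end and the suffix begins
    let digits_end = rest
        .find(|c: char| !c.is_ascii_digit() && c != '_')
        .unwrap_or(rest.len());
    let (digits, suffix) = rest.split_at(digits_end);

    // ... check `digits` against `prefix`, check that there is at least one digit,
    // and check that `suffix` is a valid integer suffix ("i32", "u64", ...) ...
    todo!()
}
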
With the proposed feature, it could be implemented as:

rule Init {
    ...

    <prefix:("0b" | "0o" | "0x")?> <digits:($digit | '_')*> <suffix:$id?> =? |lexer| {
        let match_ = lexer.match_();
        lexer.return_(check_int(prefix, digits, suffix))
    },
}

// - Check that digits match the prefix (e.g. if the match starts with "0b", digits must be 1 or 0)
// - Check that there is at least one digit
// - Check that the suffix is valid ("i32", "u64", etc.)
fn check_int<'input>(prefix: &'input str, digits: &'input str, suffix: &'input str) -> Result<Token, CustomError> { ... }

osa1 (Owner) commented Nov 1, 2021

I'm not sure how to implement this feature yet, but here's an idea. Suppose I have a rule with these regexes:

<a:"aaa"> <b:"a"> <c:"x"> => ...,
<a:"a"> <b:"aaa"> <c:"y"> => ...,

In the generated code, after seeing four 'a's, we will need a "match stack" with these contents:

[
    [(0, 3), (3, 4)], // bound matches for the first rule
    [(0, 1), (1, 4)], // bound matches for the second rule
]

After four 'a's, if the next character is x, we extend the first match stack with (4, 5) and run the semantic action with the ranges in that list bound. If the next character is y, we do the same using the second list.

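A sketch of what the runtime side could look like (all names here are made up; this is not lexgen's actual code):

// Purely illustrative: one stack of byte ranges per regex that binds variables.
// `usize::MAX` plays the role of the "?" above: a range that has started but
// whose end position is not known yet.
struct MatchStacks {
    stacks: Vec<Vec<(usize, usize)>>,
}

impl MatchStacks {
    // A new bound match starts at input position `pos` for regex `regex_idx`
    fn start(&mut self, regex_idx: usize, pos: usize) {
        self.stacks[regex_idx].push((pos, usize::MAX));
    }

    // The bound match that is currently open for regex `regex_idx` ends at `pos`
    fn end(&mut self, regex_idx: usize, pos: usize) {
        self.stacks[regex_idx].last_mut().unwrap().1 = pos;
    }
}
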
Conceptually, it seems like we could annotate NFA nodes with "start match" and "end match". In the example above, initially we will have 2 NFAs:

regex 1: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(x)-> 5 (accept)
regex 2: 0 (init) -(a)-> 1 -(a)-> 2 -(a)-> 3 -(a)-> 4 -(y)-> 5 (accept)

We add the captures to these NFAs like this:

regex 1: 0 (init, start match) -(a)-> 1 -(a)-> 2 -(a)-> 3 (end match, start match) -(a)-> 4 (end match, start match) -(x)-> 5 (end match, accept)
regex 2: 0 (init, start match) -(a)-> 1 (end match, start match) -(a)-> 2 -(a)-> 3 -(a)-> 4 (end match, start match) -(y)-> 5 (end match, accept)

We will also need the regex number for these "start match" and "end match" annotations, to be able to update the right match stack in the match stack list when starting/ending a match.

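Conceptually, an NFA state annotated this way could look something like the following (purely illustrative; lexgen's actual NFA representation may differ):

// An NFA state that carries capture annotations. The regex indices say
// which match stack to update when we pass through this state.
struct NfaState {
    transitions: Vec<(char, usize)>, // (input char, target state)
    start_match: Vec<usize>,         // regexes whose bound match starts here
    end_match: Vec<usize>,           // regexes whose bound match ends here
    accepting: Option<usize>,        // which regex's semantic action to run, if any
}
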
Once we have these, we can generate the DFA like this:

    0 (init, start match 1, start match 2)      match stacks = [(0, ?)], [(0, ?)]
-(a)->
    1 (end match 2, start match 2)              match stacks = [(0, ?)], [(0, 1), (1, ?)]
-(a)->
    2
-(a)->
    3 (end match 1, start match 1)              match stacks = [(0, 3), (3, ?)], [(0, 1), (1, ?)]
-(a)->
    4 (end match 1 2, start match 1 2)          match stacks = [(0, 3), (3, 4), (4, ?)], [(0, 1), (1, 4), (4, ?)]

(Notation is a bit strange, but hopefully conveys the idea)

Now in state 4 if we see an x:

-(x)->
    5 (end match 1, accept)                     match stacks = [(0, 3), (3, 4), (4, 5)], [(0, 1), (1, 4), (4, ?)]

Now we bind the variables in the first regex to the ranges in the first match stack and run its semantic action. The same happens if we see a y, except we bind the second regex's variables using the second match stack.

Regexes that don't bind variables do not need a match stack, so the maximum number of stacks will be quite small for real-world examples. I doubt we will see more than 5-6 stacks at a time.

We know the size of each match stack at compile time, so we can allocate each stack with the right capacity.

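For the two regexes above, for instance, that could simply be (illustrative only):

// Each regex binds three ranges (a, b, c), so each stack needs capacity 3
let mut match_stacks: Vec<Vec<(usize, usize)>> = vec![
    Vec::with_capacity(3), // stack for regex 1
    Vec::with_capacity(3), // stack for regex 2
];
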
Lexers that don't use this feature will still be allocation-free. Others will allocate a vector every time we start matching a regex that uses this feature. In that sense this feature is "zero cost" (i.e. you don't pay a price if you don't use it).

Match stacks are reset when we run a semantic action. We remove the match stack used in the semantic action and clear the list.

TODO: Need to figure out how to implement backtracking.

osa1 (Owner) commented Nov 1, 2021

It seems difficult to support this syntax in full generality. For example:

('a' <x:'b'>? 'c')+ => ...

Here the type of x is Vec<Option<&'input str>>. To maintain this match stack, we need to push it when we enter the state for this regex (instead of when entering a captured regex, as in the example in my previous comment above), and then every time we see a 'b' we push a Some(...), and every time we see a 'c' without a preceding 'b' we push a None.

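To make the difficulty concrete (purely illustrative):

// For the input "abcacabc" the group repeats three times and the optional 'b'
// is only present in the first and third iterations, so the binding would have to be:
let x: Vec<Option<&str>> = vec![Some("b"), None, Some("b")];
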
No idea how to generalize the idea in my previous comment to handle this case.
