Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlapping ranges break lexing #2

Closed
osa1 opened this issue May 26, 2021 · 2 comments · Fixed by #3
Closed

Overlapping ranges break lexing #2

osa1 opened this issue May 26, 2021 · 2 comments · Fixed by #3
Labels
bug Something isn't working

Comments

@osa1
Copy link
Owner

osa1 commented May 26, 2021

Here's the test:

lexer! {
    Lexer -> usize;

    ['a'-'b'] '1' = 1,
    ['a'-'c'] '2' => 2,
}

let mut lexer = Lexer::new("a1");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
assert_eq!(lexer.next(), None);

let mut lexer = Lexer::new("a2");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
assert_eq!(lexer.next(), None);

Lexing a1 works fine but we can't lex a2. I don't remember the code for adding ranges now and I didn't check yet, but I think bug is in handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider existing ranges at the same state ('a'-'b'), and make sure that the overlapping parts of the ranges will non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches we need to go to states for both '1' and '2'.

@osa1 osa1 added the bug Something isn't working label May 26, 2021
@osa1
Copy link
Owner Author

osa1 commented May 26, 2021

A simple way to fix this would be to "desugar" ranges into character sets. So if I have ['a'-'c'] that would be desugared to ['a' 'b' 'c']. Now we don't have to deal with ranges and range overlap checks anymore.

If we do this we will have to be smarter in the backend, as a naive implementation will crate a branch in the generated code for each character in the range. So if I have ['a'-z'], assuming no overlaps, what would previously be one range test (char >= 'a' && char <= 'z') will now be 26 tests. Also, the intermediate representation of the NFAs and DFAs will be much larger for the same reason.

@osa1
Copy link
Owner Author

osa1 commented May 28, 2021

The issue is fixed in the range_set branch. I implemented a simple "range map" that maps ranges into values. Insertion allows overlaps, so if I have a range (10, 20) mapped to x and add another range (15, 25) to y, I get three ranges:

  • (10, 14) -> x
  • (15, 20) -> x, y
  • (21, 25) -> y

Using this type instead of the previous FxHashMap<(char, char), StateSet> automatically fixes the issue as the new type handles overlapping ranges by taking union of the overlapping range's values as the value of the overlap.

Only concern is the code is fairly complicated so I'm not that sure about the correctness. I have a few unit tests and they all pass, but ideally I'd like some kind of proptest tests.

@osa1 osa1 closed this as completed in #3 May 28, 2021
osa1 added a commit that referenced this issue May 28, 2021
Previously we implicitly assumed that character ranges in a NFA/DFA state
cannot overlap, so for example, I cannot have rules `'0'-'9'` and `'5'-'9'` in
the NFA/DFA state. This is obviously incorrect, as reported in #2.

This patch introduces a "range map" that maps `u32` ranges to values. A range
can have multiple values (used in NFA-to-DFA conversion). When two ranges
overlap, the overlapping part has union of values of the overlapping ranges.

Using this type instead of `HashMap<(char, char), HashSet<NfaStateIdx>>`
automatically fixes the issue, as the handling of overlapping ranges
effectively implements switching to states of all of the overlapping ranges in
the NFA. So most of the code is just implementing the `RangeMap` type. Rest are
tests and replacing hash maps with `RangeMap`.

I also benchmarked this patch. For the Lua 5.1 lexer, this PR reduces compile
times 9%. Generated code is identical except orders of alternatives in or
patterns for ranges.

Fixes #2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant