New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overlapping ranges break lexing #2
Comments
A simple way to fix this would be to "desugar" ranges into character sets. So if I have If we do this we will have to be smarter in the backend, as a naive implementation will crate a branch in the generated code for each character in the range. So if I have |
The issue is fixed in the
Using this type instead of the previous Only concern is the code is fairly complicated so I'm not that sure about the correctness. I have a few unit tests and they all pass, but ideally I'd like some kind of proptest tests. |
Previously we implicitly assumed that character ranges in a NFA/DFA state cannot overlap, so for example, I cannot have rules `'0'-'9'` and `'5'-'9'` in the NFA/DFA state. This is obviously incorrect, as reported in #2. This patch introduces a "range map" that maps `u32` ranges to values. A range can have multiple values (used in NFA-to-DFA conversion). When two ranges overlap, the overlapping part has union of values of the overlapping ranges. Using this type instead of `HashMap<(char, char), HashSet<NfaStateIdx>>` automatically fixes the issue, as the handling of overlapping ranges effectively implements switching to states of all of the overlapping ranges in the NFA. So most of the code is just implementing the `RangeMap` type. Rest are tests and replacing hash maps with `RangeMap`. I also benchmarked this patch. For the Lua 5.1 lexer, this PR reduces compile times 9%. Generated code is identical except orders of alternatives in or patterns for ranges. Fixes #2
Here's the test:
Lexing
a1
works fine but we can't lexa2
. I don't remember the code for adding ranges now and I didn't check yet, but I think bug is in handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range'a'-'c'
, we need to consider existing ranges at the same state ('a'-'b'
), and make sure that the overlapping parts of the ranges will non-deterministically switch to the states of all of the overlapping ranges. So when'a'-'b'
matches we need to go to states for both'1'
and'2'
.The text was updated successfully, but these errors were encountered: