Overlapping ranges break lexing #2

osa1 · 2021-05-26T10:14:04Z

Here's the test:

lexer! {
    Lexer -> usize;

    ['a'-'b'] '1' = 1,
    ['a'-'c'] '2' => 2,
}

let mut lexer = Lexer::new("a1");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(1)));
assert_eq!(lexer.next(), None);

let mut lexer = Lexer::new("a2");
assert_eq!(ignore_pos(lexer.next()), Some(Ok(2)));
assert_eq!(lexer.next(), None);

Lexing a1 works fine but we can't lex a2. I don't remember the code for adding ranges now and I didn't check yet, but I think bug is in handling of ranges: we implicitly assume character ranges don't overlap. In the example above, when adding the range 'a'-'c', we need to consider existing ranges at the same state ('a'-'b'), and make sure that the overlapping parts of the ranges will non-deterministically switch to the states of all of the overlapping ranges. So when 'a'-'b' matches we need to go to states for both '1' and '2'.

The text was updated successfully, but these errors were encountered:

osa1 · 2021-05-26T10:29:51Z

A simple way to fix this would be to "desugar" ranges into character sets. So if I have ['a'-'c'] that would be desugared to ['a' 'b' 'c']. Now we don't have to deal with ranges and range overlap checks anymore.

If we do this we will have to be smarter in the backend, as a naive implementation will crate a branch in the generated code for each character in the range. So if I have ['a'-z'], assuming no overlaps, what would previously be one range test (char >= 'a' && char <= 'z') will now be 26 tests. Also, the intermediate representation of the NFAs and DFAs will be much larger for the same reason.

osa1 · 2021-05-28T08:51:10Z

The issue is fixed in the range_set branch. I implemented a simple "range map" that maps ranges into values. Insertion allows overlaps, so if I have a range (10, 20) mapped to x and add another range (15, 25) to y, I get three ranges:

(10, 14) -> x
(15, 20) -> x, y
(21, 25) -> y

Using this type instead of the previous FxHashMap<(char, char), StateSet> automatically fixes the issue as the new type handles overlapping ranges by taking union of the overlapping range's values as the value of the overlap.

Only concern is the code is fairly complicated so I'm not that sure about the correctness. I have a few unit tests and they all pass, but ideally I'd like some kind of proptest tests.

Previously we implicitly assumed that character ranges in a NFA/DFA state cannot overlap, so for example, I cannot have rules `'0'-'9'` and `'5'-'9'` in the NFA/DFA state. This is obviously incorrect, as reported in #2. This patch introduces a "range map" that maps `u32` ranges to values. A range can have multiple values (used in NFA-to-DFA conversion). When two ranges overlap, the overlapping part has union of values of the overlapping ranges. Using this type instead of `HashMap<(char, char), HashSet<NfaStateIdx>>` automatically fixes the issue, as the handling of overlapping ranges effectively implements switching to states of all of the overlapping ranges in the NFA. So most of the code is just implementing the `RangeMap` type. Rest are tests and replacing hash maps with `RangeMap`. I also benchmarked this patch. For the Lua 5.1 lexer, this PR reduces compile times 9%. Generated code is identical except orders of alternatives in or patterns for ranges. Fixes #2

osa1 added the bug Something isn't working label May 26, 2021

osa1 mentioned this issue May 28, 2021

Handle overlapping ranges #3

Merged

osa1 closed this as completed in #3 May 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overlapping ranges break lexing #2

Overlapping ranges break lexing #2

osa1 commented May 26, 2021 •

edited

osa1 commented May 26, 2021

osa1 commented May 28, 2021 •

edited

Overlapping ranges break lexing #2

Overlapping ranges break lexing #2

Comments

osa1 commented May 26, 2021 • edited

osa1 commented May 26, 2021

osa1 commented May 28, 2021 • edited

osa1 commented May 26, 2021 •

edited

osa1 commented May 28, 2021 •

edited