Here's an idea that I think will generate LALR(1) parsers from LR(0) states in a simple way. This idea may be similar to the first step of lane table methods for generating LR(1) parsers from LR(0) states (e.g. phase 1 in The Lane Table Method Of Constructing LR(1) parsers) or may even be identical. I can't tell because my attention span was not enough to decipher that paper.
We start with the LR(0) automaton. A reduce/reduce conflict is when in a state we see two or more items of the form `[T1 -> ... |]`, `[T2 -> ... |]` (T1 and T2 may be the same). A shift/reduce conflict is when in a state we see an item of the form `[T1 -> ... |]` (i.e. a possible reduction) and another item of the form `[T2 -> ... | t ...]` (i.e. a possible shift, where `t` is a terminal; T1 and T2 may be the same).
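Both kinds of conflicts can be detected with a simple scan over the items of a state. A minimal sketch in Python, assuming items are `(lhs, rhs, dot)` triples and a hypothetical `TERMINALS` set (neither name comes from a real implementation):

```python
TERMINALS = {"+", "n"}  # hypothetical terminal alphabet

def conflicts(state):
    """Return the kinds of conflicts present in one LR(0) state."""
    # Reduce items: the dot is at the end, i.e. [T -> ... |].
    reduces = [(lhs, rhs, dot) for (lhs, rhs, dot) in state
               if dot == len(rhs)]
    # Shift items: the dot is before a terminal, i.e. [T -> ... | t ...].
    shifts = [(lhs, rhs, dot) for (lhs, rhs, dot) in state
              if dot < len(rhs) and rhs[dot] in TERMINALS]
    found = []
    if len(reduces) >= 2:
        found.append("reduce/reduce")
    if reduces and shifts:
        found.append("shift/reduce")
    return found

# A state with [E -> E + E |] and [E -> E | + E] has a shift/reduce conflict:
state = [("E", ("E", "+", "E"), 3), ("E", ("E", "+", "E"), 1)]
```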
In these cases we want to know the lookaheads of the reduction items, as that will potentially allow us to resolve some of the conflicts. For example, if we have a reduction that only makes sense when the next token is `t1` (e.g. the LR(1) item `[T1 -> ... |, t1]`) and a shift of another terminal in the same state (e.g. `[T2 -> ... | t2 ...]`, where `t2` is not the same as `t1`), then this is not a shift/reduce conflict anymore. Similarly for reduce/reduce conflicts: when the lookaheads are different the conflict is resolved.
How do we generate these lookaheads for the reduce items that are involved in a shift/reduce or reduce/reduce conflict?
Let's consider how lookaheads of LR(1) reduce items are generated:
- The initial item will have `$` (end-of-input) as the lookahead.
- Any other item is generated in one of these two ways:
  1. By a shift from a previous state. E.g. when in a state we have `[T1 -> ... | t1 ...]`, we generate a new state by "shifting" `t1`, with initial item `[T1 -> ... t1 | ...]`.
  2. By an "expansion" of a non-terminal in the current state. E.g. when we have `[T1 -> ... | T2 ...]` in the current state, we add an item `[T2 -> | ...]` to the current state for each production of T2.
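The expansion step is the usual LR(0) closure. A small Python sketch, with a hypothetical `GRAMMAR` table mapping each non-terminal to its productions and items as `(lhs, rhs, dot)` triples:

```python
# Hypothetical toy grammar: E -> E + T | T,  T -> n
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("n",)],
}

def closure(items):
    """Expand every item whose dot is before a non-terminal."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        # [T1 -> ... | T2 ...]: add [T2 -> | ...] for each production of T2.
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                new_item = (rhs[dot], prod, 0)
                if new_item not in result:
                    result.add(new_item)
                    work.append(new_item)
    return result
```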
In LR(1), an item generated with step (1) will have the same lookahead as the "parent" item. (code)
In step (2), the lookahead of the new item will be the "first" set of the parent item's continuation after the expanded non-terminal. Note that when computing this "first" set, we add the parent item's lookahead to the end of the continuation. E.g. in `[T1 -> ... | <rest>, x]` the "continuation" is `<rest> x`. (code)
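The step (2) rule can be sketched like this, assuming hypothetical `FIRST_SETS` and `NULLABLE` tables for single symbols (all names here are illustrative, not from a real implementation):

```python
NULLABLE = {"Opt"}  # non-terminals that can derive the empty string
FIRST_SETS = {"Opt": {"x"}, "n": {"n"}, "+": {"+"}}

def first_of_sequence(symbols):
    """FIRST set of a sequence; stops at the first non-nullable symbol."""
    result = set()
    for sym in symbols:
        result |= FIRST_SETS.get(sym, {sym})
        if sym not in NULLABLE:
            return result
    return result

def expansion_lookahead(continuation, parent_lookahead):
    # For [T1 -> ... | T2 <rest>, x], the new [T2 -> | ...] item's
    # lookahead set is FIRST(<rest> x): the parent's lookahead is
    # appended to the continuation before computing FIRST.
    return first_of_sequence(list(continuation) + [parent_lookahead])
```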
So for each item, we know its dependencies:

- If an item is in the form `[T -> ... t | ...]`, then it's generated by step (1) from an item in the previous state.
- If an item is in the form `[T -> | ...]`, then it's generated by step (2) from an item in the current state.
It's easy to attach the parent item to an item when generating it. (TODO: There should be cycles in some cases, find an example?)
With this dependency graph, when we want to know the lookahead of a reduction item (i.e. when it's involved in a shift/reduce or reduce/reduce conflict), we go backwards and try to obtain its lookahead:
- If the current item is generated by step (1), we recurse to find the parent's lookahead.
- If the current item is generated by step (2), then we compute the "first" set of the parent's continuation after the current item's non-terminal. When that first set contains the empty string, we also recurse to find the parent's lookahead (which may in turn look at the parent's parent).
I think we might want to add lookaheads to items as we visit them during this recursive process, as the lookahead of an item may be needed to resolve multiple conflicts. (TODO: Find an example?)
Note that this recursive process returns a set of lookaheads, not just a single lookahead. Also, because the dependency graph may have cycles, we need to keep track of visited items to avoid looping. The lookahead set at the beginning of this process is `{}`; at the end of the process the set should be non-empty.
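The whole backward search can be sketched as follows. Each item carries a link to its parent and a tag for how it was generated; the `Item` shape, the tables, and all names are illustrative assumptions, not a real implementation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NULLABLE = set()                  # assume no nullable symbols here
FIRST = {"+": {"+"}, "n": {"n"}}  # FIRST sets for single symbols

def first_of(syms):
    """FIRST set of a sequence; stops at the first non-nullable symbol."""
    out = set()
    for s in syms:
        out |= FIRST.get(s, {s})
        if s not in NULLABLE:
            return out
    return out

@dataclass
class Item:
    kind: str                           # "initial", "shift", or "expand"
    parent: Optional["Item"] = None
    continuation: Tuple[str, ...] = ()  # parent's symbols after the
                                        # expanded non-terminal (step 2)

def lookaheads(item, visited=None):
    """Collect the lookahead set of an item by walking its parents."""
    if visited is None:
        visited = set()
    if id(item) in visited:             # the dependency graph may have cycles
        return set()
    visited.add(id(item))
    if item.kind == "initial":          # the initial item: lookahead is $
        return {"$"}
    if item.kind == "shift":            # step (1): same lookahead as parent
        return lookaheads(item.parent, visited)
    # Step (2): FIRST of the parent's continuation; if the continuation
    # can derive the empty string, the parent's own lookaheads join in.
    out = first_of(item.continuation)
    if all(s in NULLABLE for s in item.continuation):
        out |= lookaheads(item.parent, visited)
    return out
```

Memoizing the result on each item (as suggested above) would avoid recomputing it when several conflicts involve the same items.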
That's it. I think this is probably the same as the lane table method's phase 1. Since the number of states is the same as the number of LR(0) sets of the grammar, this algorithm should give us LALR(1). The lane table method's phase 2 is about splitting some of the states/items to recover LR(1).
One of the things I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).