Here's an idea that I think will generate LALR(1) parsers from LR(0) states in a simple way. This idea may be similar to the first step of lane table methods for generating LR(1) parsers from LR(0) states (e.g. phase 1 in The Lane Table Method Of Constructing LR(1) parsers) or may even be identical. I can't tell because my attention span was not enough to decipher that paper.
We start with the LR(0) automaton. A reduce/reduce conflict is when in a state we see two or more items of the form `[T1 -> ... |]`, `[T2 -> ... |]` (T1 and T2 may be the same). A shift/reduce conflict is when in a state we see an item of the form `[T1 -> ... |]` (i.e. a possible reduction) and another item of the form `[T2 -> ... | t ...]` (i.e. a possible shift, where `t` is a terminal; T1 and T2 may be the same).
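Both kinds of conflicts can be detected with a simple scan over the items of a state. A minimal sketch in Python, assuming items are `(lhs, rhs, dot)` triples and a hypothetical `TERMINALS` set (neither name comes from a real implementation):

```python
TERMINALS = {"+", "n"}  # hypothetical terminal alphabet

def conflicts(state):
    """Return the kinds of conflicts present in one LR(0) state."""
    # Reduce items: the dot is at the end, i.e. [T -> ... |].
    reduces = [(lhs, rhs, dot) for (lhs, rhs, dot) in state
               if dot == len(rhs)]
    # Shift items: the dot is before a terminal, i.e. [T -> ... | t ...].
    shifts = [(lhs, rhs, dot) for (lhs, rhs, dot) in state
              if dot < len(rhs) and rhs[dot] in TERMINALS]
    found = []
    if len(reduces) >= 2:
        found.append("reduce/reduce")
    if reduces and shifts:
        found.append("shift/reduce")
    return found

# A state with [E -> E + E |] and [E -> E | + E] has a shift/reduce conflict:
state = [("E", ("E", "+", "E"), 3), ("E", ("E", "+", "E"), 1)]
```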
In these cases we want to know the lookaheads of the reduction items, as that will potentially allow us to resolve some of the conflicts. For example, if we have a reduction that only makes sense when the next token is `t1` (e.g. the LR(1) item `[T1 -> ... |, t1]`) and a shift of another terminal in the same state (e.g. `[T2 -> ... | t2 ...]`, where `t2` is not the same as `t1`), then this is not a shift/reduce conflict anymore. Similarly for reduce/reduce conflicts: when the lookaheads are different the conflict is resolved.
How do we generate these lookaheads for the reduce items that are involved in a shift/reduce or reduce/reduce conflict?
Let's consider how lookaheads of LR(1) reduce items are generated:
- The initial item will have `$` (end-of-input) as the lookahead.
- Any other item is generated in one of these two ways:
  1. By a shift from a previous state. E.g. when in a state we have `[T1 -> ... | t1 ...]`, we generate a new state by "shifting" `t1`, with initial item `[T1 -> ... t1 | ...]`.
  2. By an "expansion" of a non-terminal in the current state. E.g. when we have `[T1 -> ... | T2 ...]` in the current state, we add an item `[T2 -> | ...]` to the current state for each production of T2.
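The expansion step is the usual LR(0) closure. A small Python sketch, with a hypothetical `GRAMMAR` table mapping each non-terminal to its productions and items as `(lhs, rhs, dot)` triples:

```python
# Hypothetical toy grammar: E -> E + T | T,  T -> n
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("n",)],
}

def closure(items):
    """Expand every item whose dot is before a non-terminal."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        # [T1 -> ... | T2 ...]: add [T2 -> | ...] for each production of T2.
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                new_item = (rhs[dot], prod, 0)
                if new_item not in result:
                    result.add(new_item)
                    work.append(new_item)
    return result
```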
In LR(1), an item generated with step (1) will have the same lookahead as the "parent" item. (code)
In step (2), the lookahead of the new item will be the "first" set of the parent item's continuation after the expanded non-terminal. Note that when computing this "first" set, we add the parent item's lookahead to the end of the continuation. E.g. in `[T1 -> ... | <rest>, x]` the "continuation" is `<rest> x`. (code)
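The step (2) rule can be sketched like this, assuming hypothetical `FIRST_SETS` and `NULLABLE` tables for single symbols (all names here are illustrative, not from a real implementation):

```python
NULLABLE = {"Opt"}  # non-terminals that can derive the empty string
FIRST_SETS = {"Opt": {"x"}, "n": {"n"}, "+": {"+"}}

def first_of_sequence(symbols):
    """FIRST set of a sequence; stops at the first non-nullable symbol."""
    result = set()
    for sym in symbols:
        result |= FIRST_SETS.get(sym, {sym})
        if sym not in NULLABLE:
            return result
    return result

def expansion_lookahead(continuation, parent_lookahead):
    # For [T1 -> ... | T2 <rest>, x], the new [T2 -> | ...] item's
    # lookahead set is FIRST(<rest> x): the parent's lookahead is
    # appended to the continuation before computing FIRST.
    return first_of_sequence(list(continuation) + [parent_lookahead])
```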
So for each item, we know its dependencies:

- If an item is in the form `[T -> ... t | ...]`, then it's generated by step (1) from an item in the previous state.
- If an item is in the form `[T -> | ...]`, then it's generated by step (2) from an item in the current state.
It's easy to attach the parent item to an item when generating it. (TODO: There should be cycles in some cases, find an example?)
With this dependency graph, when we want to know the lookahead of a reduction item (i.e. when it's involved in a shift/reduce or reduce/reduce conflict), we go backwards and try to obtain its lookahead:
- If the current item is generated by step (1), we recurse to find the parent's lookahead.
- If the current item is generated by step (2), then we compute the "first" set of the parent's continuation after the current item's non-terminal. When that first set contains the empty string, we also recurse to find the parent's lookahead (which may in turn look at the parent's parent).
I think we might want to add lookaheads to items as we visit them during this recursive process, as the lookahead of an item may be needed to resolve multiple conflicts. (TODO: Find an example?)
Note that this recursive process returns a set of lookaheads, not just a single lookahead. Also, because the dependency graph may have cycles, we need to keep track of visited items to avoid looping. The lookahead set at the beginning of this process is `{}`; at the end of the process the set should be non-empty.
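The whole backward search can be sketched as follows. Each item carries a link to its parent and a tag for how it was generated; the `Item` shape, the tables, and all names are illustrative assumptions, not a real implementation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NULLABLE = set()                  # assume no nullable symbols here
FIRST = {"+": {"+"}, "n": {"n"}}  # FIRST sets for single symbols

def first_of(syms):
    """FIRST set of a sequence; stops at the first non-nullable symbol."""
    out = set()
    for s in syms:
        out |= FIRST.get(s, {s})
        if s not in NULLABLE:
            return out
    return out

@dataclass
class Item:
    kind: str                           # "initial", "shift", or "expand"
    parent: Optional["Item"] = None
    continuation: Tuple[str, ...] = ()  # parent's symbols after the
                                        # expanded non-terminal (step 2)

def lookaheads(item, visited=None):
    """Collect the lookahead set of an item by walking its parents."""
    if visited is None:
        visited = set()
    if id(item) in visited:             # the dependency graph may have cycles
        return set()
    visited.add(id(item))
    if item.kind == "initial":          # the initial item: lookahead is $
        return {"$"}
    if item.kind == "shift":            # step (1): same lookahead as parent
        return lookaheads(item.parent, visited)
    # Step (2): FIRST of the parent's continuation; if the continuation
    # can derive the empty string, the parent's own lookaheads join in.
    out = first_of(item.continuation)
    if all(s in NULLABLE for s in item.continuation):
        out |= lookaheads(item.parent, visited)
    return out
```

Memoizing the result on each item (as suggested above) would avoid recomputing it when several conflicts involve the same items.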
That's it. I think this is probably the same as the lane table method's phase 1. Since the number of states is the same as the number of LR(0) sets of the grammar, this algorithm should give us LALR(1). The lane table method's phase 2 is about splitting some of the states/items to recover LR(1).
One of the things I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).