An LR parser generator, implemented as a proc macro

Overview

parsegen

parsegen is an LR parser generator, similar to happy, ocamlyacc, and lalrpop.

It currently generates canonical LR(1) parsers, but LALR(1) and GLR are planned.

It is currently under development: expect bugs, slow compile times, and missing documentation. All of these will be fixed.

See my MinCaml parser for an example parser implemented using parsegen.

Comments
  • Implement a CLI

    In the lane_table branch I'm using an executable to debug the algorithm and print a graphviz graph for the state machines.

    Outside of debugging, a CLI can be used to build parsers in a build script. This can help avoid redundant recompilations until we make the proc macro cache its outputs.

    Here's the crate structure I have in mind:

    • parsegen_lib will be the main thing. This is basically the current parsegen crate, except it shouldn't be marked as a proc macro.
    • parsegen_cli should provide a parsegen executable that can compile parsegen inputs to Rust using parsegen_lib. It should have flags for debug outputs, generating graphviz etc.
    • parsegen: A proc_macro wrapper around parsegen_lib.

    To allow debugging proc macro usage, parsegen_cli should have a simple parser to extract the parser! { ... } part of a file and pass it to parsegen_lib. This way we don't have to add debugging syntax to the proc macro: we can pass debug flags to parsegen_cli along with the file that contains the parsegen parser(s).
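    A hypothetical sketch of that extraction step (the function name and the brace-matching approach are my assumptions, not parsegen's actual code; a real version would also need to skip braces inside string literals and comments):

```rust
// Hypothetical sketch, not parsegen's actual code: find the `parser!`
// invocation in a source file and return the brace-balanced body.
fn extract_parser_block(src: &str) -> Option<&str> {
    let start = src.find("parser!")?;
    let open = start + src[start..].find('{')?;
    let mut depth = 0usize;
    for (i, c) in src[open..].char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    // Contents between the outermost braces.
                    return Some(&src[open + 1..open + i]);
                }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    let src = "use parsegen::parser;\n\nparser! {\n    Expr: i32 = { ... }\n}\n";
    println!("{:?}", extract_parser_block(src));
}
```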

    opened by osa1 0
  • Avoid generating unused GOTO entries

    See https://github.com/lalrpop/lalrpop/commit/688b91930a6473098ea92f1b08b1f47d2838f67a

    The idea is that GOTO[state] is only used when state has a reduce action.

    If we re-order the states so that states with reduce actions come first, we can generate smaller GOTO tables by simply omitting the trailing states with no reduce actions.
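    The re-ordering step could be sketched roughly like this (the table representation and names are made up for illustration; parsegen's actual ACTION/GOTO types differ):

```rust
// Made-up table representation for illustration.
fn compress_goto(
    has_reduce: &[bool],         // per state: does the state have a reduce action?
    goto: &[Vec<Option<usize>>], // full GOTO table: state -> non-terminal -> next state
) -> (Vec<usize>, Vec<Vec<Option<usize>>>) {
    // Re-order the states: states with reduce actions first (stable sort
    // keeps the relative order within each group).
    let mut order: Vec<usize> = (0..has_reduce.len()).collect();
    order.sort_by_key(|&s| !has_reduce[s]); // `false` sorts before `true`
    // Map old state index -> new state index, needed to remap all
    // ACTION/GOTO targets after the re-ordering.
    let mut new_idx = vec![0usize; order.len()];
    for (new, &old) in order.iter().enumerate() {
        new_idx[old] = new;
    }
    // Keep GOTO rows only for the reduce-state prefix; rows for the
    // remaining states are omitted.
    let n_reduce = has_reduce.iter().filter(|&&r| r).count();
    let rows = order[..n_reduce]
        .iter()
        .map(|&old| goto[old].iter().map(|t| t.map(|s| new_idx[s])).collect())
        .collect();
    (new_idx, rows)
}

fn main() {
    // Two states; only state 1 has a reduce action, so only its GOTO row
    // survives (with targets remapped to the new numbering).
    let (new_idx, rows) = compress_goto(&[false, true], &[vec![Some(1)], vec![Some(0)]]);
    println!("{:?} {:?}", new_idx, rows);
}
```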

    opened by osa1 0
  • A simple (?) way of generating LALR(1)

    Here's an idea that I think will generate LALR(1) parsers from LR(0) states in a simple way. This idea may be similar to the first step of lane table methods for generating LR(1) parsers from LR(0) states (e.g. phase 1 in The Lane Table Method Of Constructing LR(1) parsers) or may even be identical. I can't tell because my attention span was not enough to decipher that paper.

    We start with the LR(0) automaton. A reduce/reduce conflict is when, in a state, we see two or more items of the form [T1 -> ... |], [T2 -> ... |] (T1 and T2 may be the same). A shift/reduce conflict is when, in a state, we see an item of the form [T1 -> ...|] (i.e. a possible reduction) and another item of the form [T2 -> ... | t ...] (i.e. a possible shift, where t is a terminal; T1 and T2 may be the same).

    In these cases we want to know the lookaheads of the reduction items, as that will potentially allow us to resolve some of the conflicts. For example, if we have a reduction that only makes sense when the next token is t1 (e.g. the LR(1) item [T1 -> ...|, t1]) and a shift of another terminal in the same state (e.g. [T2 -> ... | t2 ...], where t2 is not the same as t1), then this is not a shift/reduce conflict anymore. Similarly for reduce/reduce conflicts: when the lookaheads are different, the conflict is resolved.

    How do we generate lookaheads for the reduce items involved in a shift/reduce or reduce/reduce conflict?

    Let's consider how lookaheads of LR(1) reduce items are generated:

    • The initial item will have $ (end-of-input) as the lookahead
    • An item is generated in one of these two ways:
      1. By a shift from a previous state. E.g. when in a state we have [T1 -> ... | t1 ...], we generate a new state by "shifting" t1, whose initial item is [T1 -> ... t1 | ...].
      2. By an "expansion" of a non-terminal in the current state. E.g. when we have [T1 -> ... | T2 ...] in the current state, for each production of T2 we add an item [T2 -> | ...] to the current state.

    In LR(1), an item generated with step (1) will have the same lookahead as its "parent" item. (code)

    In step (2), the lookahead of the new item will be the "first" set of the parent item's continuation after the expanded non-terminal. Note that when computing this "first" set, we add the parent item's lookahead to the end of the item's continuation. E.g. in [T1 -> ... | <rest>, x] the "continuation" is <rest> x. (code)
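    This "first of the continuation" computation could be sketched as follows (the symbol type and the shape of the FIRST-set lookup are illustrative assumptions, not parsegen's real representation):

```rust
use std::collections::HashSet;

type Terminal = u32;

// Hypothetical symbol type for illustration.
enum Symbol {
    Terminal(Terminal),
    NonTerminal(u32),
}

// FIRST of a continuation `rest` with the parent item's lookahead
// appended: walk the symbols left to right, collecting FIRST of each,
// and stop at the first symbol that cannot derive the empty string.
// If all of `rest` is nullable, the parent's lookahead itself is a
// possible lookahead.
fn first_of_continuation(
    rest: &[Symbol],
    parent_lookahead: Terminal,
    first: impl Fn(u32) -> (HashSet<Terminal>, bool), // (FIRST(nt), nullable?)
) -> HashSet<Terminal> {
    let mut out = HashSet::new();
    for sym in rest {
        match sym {
            Symbol::Terminal(t) => {
                out.insert(*t); // a terminal is never nullable
                return out;
            }
            Symbol::NonTerminal(nt) => {
                let (f, nullable) = first(*nt);
                out.extend(f);
                if !nullable {
                    return out;
                }
            }
        }
    }
    out.insert(parent_lookahead); // all of `rest` was nullable
    out
}

fn main() {
    // FIRST(nt 0) = {7}, not nullable, so the walk stops there.
    let first = |_: u32| (HashSet::from([7u32]), false);
    let las = first_of_continuation(&[Symbol::NonTerminal(0), Symbol::Terminal(3)], 9, first);
    println!("{:?}", las);
}
```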

    So for each item, we know its dependencies:

    • If an item is in form [T -> ... t | ...], then it's generated by step (1) from an item in the previous state.
    • If an item is in form [T -> | ...], then it's generated by step (2) from an item in the current state.

    It's easy to record an item's parent when generating it. (TODO: There should be cycles in some cases, find an example?)

    With this dependency graph, when we want to know lookahead of a reduction item (i.e. when it's involved in a shift/reduce or reduce/reduce conflict), we go backwards and try to obtain its lookahead.

    • If the current item is generated by step (1), we recurse to find the parent's lookahead
    • If the current item is generated by step (2), then we find the parent's "first" set after the current item's non-terminal. When that first set contains the empty string (i.e. the continuation is nullable), we also recurse to the parent's parent to find its lookahead.

    I think we might want to add lookaheads to items as we visit them during this recursive process, as the lookahead of an item may be needed to resolve multiple conflicts. (TODO: Find an example?)

    Note that this recursive process will return a set of lookaheads, not just a single lookahead. Also, because the dependency graph may have cycles, we need to keep track of visited items to avoid looping. The lookahead set starts out empty ({}) at the beginning of this process and should be non-empty at the end.
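    The backward walk could be sketched like this. Everything below is illustrative: it assumes each item records a single parent (in practice an item can be generated from several parents, so the parent field would be a set) and that the per-expansion FIRST information has been precomputed:

```rust
use std::collections::HashSet;

type ItemIdx = usize;
type Terminal = u32;
const EOF: Terminal = u32::MAX; // stand-in index for $ (end-of-input)

// Why each item exists; step numbers refer to the two cases above.
enum Parent {
    Initial,        // the start item; its lookahead is $
    Shift(ItemIdx), // step (1): same lookahead as the parent item
    Expansion {
        parent: ItemIdx,
        first_after: HashSet<Terminal>, // FIRST of the parent's continuation after the NT
        nullable: bool,                 // can that continuation derive the empty string?
    },
}

// Walk the dependency graph backwards, accumulating a lookahead set.
// `visited` breaks cycles in the graph.
fn lookaheads(item: ItemIdx, parents: &[Parent], visited: &mut HashSet<ItemIdx>) -> HashSet<Terminal> {
    let mut out = HashSet::new();
    if !visited.insert(item) {
        return out; // already visited: stop instead of looping
    }
    match &parents[item] {
        Parent::Initial => {
            out.insert(EOF);
        }
        Parent::Shift(p) => {
            out.extend(lookaheads(*p, parents, visited));
        }
        Parent::Expansion { parent, first_after, nullable } => {
            out.extend(first_after.iter().copied());
            if *nullable {
                out.extend(lookaheads(*parent, parents, visited));
            }
        }
    }
    out
}

fn main() {
    // Item 1 was generated by shifting from item 0 (the initial item),
    // so its lookahead set is {$}.
    let parents = vec![Parent::Initial, Parent::Shift(0)];
    println!("{:?}", lookaheads(1, &parents, &mut HashSet::new()));
}
```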

    That's it. I think this is probably the same as lane table method phase 1. Since the number of states will be the same as the number of LR(0) sets of the grammar, this algorithm should give us LALR(1). Lane table method phase 2 is about splitting some of the states/items to recover LR(1).

    One of the things that I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).

    opened by osa1 6
  • Combine LR1 items with same productions

    LR1 items: https://github.com/osa1/parsegen/blob/f368fac8b352f01358dd7ad40152da14983019f9/src/lr1.rs#L10-L17

    LR1 sets: https://github.com/osa1/parsegen/blob/f368fac8b352f01358dd7ad40152da14983019f9/src/lr1.rs#L80-L84

    This representation generates a lot of redundant (duplicate) info in states:

    1571: {
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "("]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "bool"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "-"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "-."]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "+"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "+."]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "*."]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "/."]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "="]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "<>"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "<="]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "<"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, ">="]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, ">"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, ","]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, ";"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "id"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "int"]
      [LetExpr -> "let" "(" Binder CommaBinder1_Rev ")" "=" SeqExpr "in" SeqExpr|, "float"]
    }
    

    All of the items in this state have the same production. We can combine all of these items into a single one with a lookahead set:

    struct LR1Item {
        non_terminal_idx: NonTerminalIdx,  // the non-terminal being parsed
        production_idx: ProductionIdx,     // which production of that non-terminal
        cursor: usize,                     // position of the "|" in the production
        lookahead: FxHashSet<TerminalIdx>, // lookahead *set* instead of a single terminal
    }
    
    

    We can either add one more field for whether EOF is a valid lookahead, or we can assign EOF a TerminalIdx.

    See #1 for optimizing terminal sets. (FxHashSet<TerminalIdx> in the code above)
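    A minimal sketch of the merging step (using std's HashMap/HashSet in place of FxHashSet, with the newtype indices simplified to plain integers for illustration):

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for the newtype indices in the issue.
type NonTerminalIdx = u32;
type ProductionIdx = u32;
type TerminalIdx = u32;

// The part of an item that identifies the production and cursor position.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Core {
    non_terminal_idx: NonTerminalIdx,
    production_idx: ProductionIdx,
    cursor: usize,
}

struct LR1Item {
    core: Core,
    lookahead: HashSet<TerminalIdx>,
}

// Merge single-lookahead items sharing the same core into one item
// with a lookahead set.
fn combine(items: Vec<(Core, TerminalIdx)>) -> Vec<LR1Item> {
    let mut grouped: HashMap<Core, HashSet<TerminalIdx>> = HashMap::new();
    for (core, la) in items {
        grouped.entry(core).or_default().insert(la);
    }
    grouped
        .into_iter()
        .map(|(core, lookahead)| LR1Item { core, lookahead })
        .collect()
}

fn main() {
    let core = Core { non_terminal_idx: 0, production_idx: 0, cursor: 9 };
    // Three items differing only in their lookahead collapse into one.
    let merged = combine(vec![(core.clone(), 1), (core.clone(), 2), (core, 3)]);
    println!("{} item(s), {} lookahead(s)", merged.len(), merged[0].lookahead.len());
}
```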

    opened by osa1 0
  • Implement associativity attributes

    It's tedious to parse expressions without associativity attributes.

    Left associative = reduce in a shift/reduce conflict
    Right associative = shift in a shift/reduce conflict

    I don't like ocamlyacc/menhir/happy style associativity lines before non-terminals/productions. I think we should probably use Rust attribute syntax and annotate productions with the associativity attributes. Examples:

    #[assoc(left)]
    <e1:Expr> "+" <e2:Expr> => ...
    
    #[assoc(right)]
    <e1:Expr> "?" <e2:Expr> ":" <e3:Expr> => ...
    
    opened by osa1 1
  • Remove (or make optional) semicolons after non-terminals

    I don't remember why I required a semicolon after a non-terminal (maybe I just wanted the syntax to be compatible with LALRPOP?), but I think it shouldn't be necessary.

    We currently have two syntaxes for non-terminals:

    • With single production:

      <non-terminal name> : <non-terminal type> = <production>,
      
    • With multiple productions:

      <non-terminal name> : <non-terminal type> = {
        <production>,*
      };
      

    The semicolon in the second syntax can be removed.

    opened by osa1 0
Owner
Ömer Sinan Ağacan