Rust grammar tool libraries and binaries

Software Development Team

Last update: Dec 26, 2022

Related tags

Parsing rust parser generator grammar lex lexer yacc error-recovery lr

Overview

Grammar and parsing libraries for Rust

grmtools is a suite of Rust libraries and binaries for parsing text, both at compile-time and at run-time. Most users will probably be interested in the compile-time Yacc feature, which allows traditional .y files to be used (mostly) unchanged in Rust.

Quickstart

A minimal example using this library consists of two files (in addition to the grammar and lexing definitions). First we need to create a file build.rs in the root of our project with the following content:

use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")?
        .build()?;
    Ok(())
}

This will generate and compile a parser and lexer, where the definitions for the lexer can be found in src/calc.l:

%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[\t ]+ ;

and where the definitions for the parser can be found in src/calc.y:

%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Term '*' Factor { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      '(' Expr ')' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.span_str(v.span()))
      }
    ;
%%
// Any functions here are in scope for all the grammar actions above.

fn parse_int(s: &str) -> Result<u64, ()> {
    match s.parse::<u64>() {
        Ok(val) => Ok(val),
        Err(_) => {
            eprintln!("{} cannot be represented as a u64", s);
            Err(())
        }
    }
}

We can then use the generated lexer and parser within our src/main.rs file as follows:

use std::env;

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");

fn main() {
    // Get the `LexerDef` for the `calc` language.
    let lexerdef = calc_l::lexerdef();
    let args: Vec<String> = env::args().collect();
    // Now we create a lexer with the `lexer` method with which we can lex an
    // input.
    let lexer = lexerdef.lexer(&args[1]);
    // Pass the lexer to the parser and lex and parse the input.
    let (res, errs) = calc_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &calc_y::token_epp));
    }
    match res {
        Some(r) => println!("Result: {:?}", r),
        _ => eprintln!("Unable to evaluate expression.")
    }
}

For more information on how to use this library please refer to the grmtools book, which also includes a more detailed quickstart guide.

Examples

lrpar contains several examples on how to use the lrpar/lrlex libraries, showing how to generate parse trees and ASTs, or execute code while parsing.

Documentation

Latest release	master
grmtools book	grmtools book
cfgrammar	cfgrammar
lrpar	lrpar
lrlex	lrlex
lrtable	lrtable

Documentation for all past and present releases

Comments

Unused symbol
I wasn't having much luck with Unused due to conflicts, and I believe I'm starting to understand why bison limits that warning (to Shift/Reduce conflicts). So here is a prototype for warnings, which adds an unused symbol warning for yacc, covering unused symbols that are not due to conflicts.

It tries to include a mechanism through which we can also produce warnings as errors, and largely follows from all the YaccGrammarError work that has been done, minus the Error trait.

A few things are noteworthy: spanskind wanted to be on YaccGrammarWarningKind as well as YaccGrammarWarning, so that YaccGrammarError could leverage it when treating warnings as errors. SpansKind::Error is also an awkward choice of naming here.

There are some subsequent patches left to do:

Integrating this into various tools

~~I noticed that complete_and_validate is still returning a YaccGrammarError, and that we should probably make it that Vec<YaccGrammarError>.~~ Realized this is just pub(crate) not pub, so can do this whenever.

How or if we should pass a warnings_as_errors flag to new_with_storaget, and whether the place I implemented warnings() on YaccGrammarError is actually the right place, or if this is something that callers should do using the From impl given.

When we complete_and_validate() and there are errors, this code is only callable too late to actually have both warnings and errors, but implementing it to be callable earlier seemed tougher since it relies on PIdx's a bit.

As such I'm going to mark this as draft pre-emptively, since it seems like there is some thought that needs to go into what the best thing to do here is, and whether this is even worth the additional types/impls?
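To make the idea of an unused symbol warning concrete, here is a minimal sketch of one plausible check: a reachability walk from the start rule that reports any rule never reached. The function name unused_rules and the map-of-productions representation are assumptions for illustration; they are not cfgrammar's types, and this is not necessarily how the prototype computes its warning.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns the names of rules that cannot be reached from `start`, sorted.
/// `rules` maps each rule name to its productions; each production is a
/// list of symbol names. Symbols with no entry in `rules` are tokens.
/// (Hypothetical representation, chosen only for this sketch.)
pub fn unused_rules<'a>(
    start: &'a str,
    rules: &BTreeMap<&'a str, Vec<Vec<&'a str>>>,
) -> Vec<&'a str> {
    let mut seen = BTreeSet::new();
    let mut todo = vec![start];
    while let Some(name) = todo.pop() {
        // Skip symbols we have already visited.
        if !seen.insert(name) {
            continue;
        }
        // Follow every symbol on the right-hand side of every production.
        if let Some(prods) = rules.get(name) {
            for sym in prods.iter().flatten() {
                todo.push(*sym);
            }
        }
    }
    rules.keys().copied().filter(|r| !seen.contains(r)).collect()
}
```

A tool could turn each returned name into a warning, or into an error when a warnings-as-errors flag is set.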
opened by ratmice 53

Implement actions

This PR is under construction and should not be merged in its current state. The goal of this PR is to implement grammar actions similar to YACC, for example:

%start Expr
%%
Expr: Term 'PLUS' Expr { add($0, $2) }
    | Term { $0 }
    ;
Term: Factor 'MUL' Term { mul ($0, $2) }
    | Factor { $0 }
    ;
Factor: 'LBRACK' Expr 'RBRACK' { $1 }
      | 'INT' { $0 }
      ;
%%
type TYPE = u64;

fn add(arg1: u64, arg2: u64) -> u64 {
    arg1 + arg2
}
fn mul(arg1: u64, arg2: u64) -> u64 {
    arg1 * arg2
}

opened by ptersilie 38

Started work on conflict API

** PR in progress **

The goal of this PR is to provide a nice API for the user (i.e. the language developer) to inspect shift/reduce and reduce/reduce conflicts and handle them manually.

At the moment the API is very basic, but I thought I'd get some early thoughts on the direction before continuing. An example usage of the API can be found in nimbleparse, where it just pretty-prints the conflicts to the terminal.

My main question would be, where the user would be expected to handle these errors. For example within the calc examples, would that be in main.rs or build.rs? Or am I off track again, and we want to allow the user to handle the conflict as it happens and they then can manually change how the parser generator resolves those conflicts?

opened by ptersilie 37
Add a name_span field to Rule.

This adds a name_span field to lexerdef's rule. I assume it is okay to depend on span from lrpar in lrlex here, since the dependency is already there.

Notes about the impl: the span refers to the inside of the quotes of the quoted name. In the case of an anonymous rule with no name, it points to the empty string at the semicolon.

This is why name is an option but span is not. This still needs to be documented and tested, but I wanted to post it for comments before writing docs/tests, in case feedback on the impl leads to different behavior. So for now I have merely checked it manually with iter_rules().

It doesn't add spans for re_str (re_span?), which is private. I'm not sure it would be alright to have a public span for that private field, and I'm not sure whether I would actually need it: I still need to play around with adding this info to diagnostics to get a feel for that.

opened by ratmice 34
Use packed vectors to store state tables

This PR replaces the state table HashMaps (goto and actions) with packed vectors. This sacrifices memory usage for performance, as it makes table lookups O(1).

Previously, states were stored as (stidx, ridx) => stidx, where stidx is a state id and ridx is a rule id. Now, lookups are done by calculating an index into the vector from stidx and ridx, which yields the target state, e.g. goto[stidx * nt_len + ridx], where nt_len is the number of nonterminals in the grammar.

We then use https://github.com/gabi-250/packed-vec to pack the vectors to reduce memory usage.
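As a rough illustration of the indexing scheme described above (before any packing), the sketch below stores gotos in a flat Vec and computes the index as stidx * nt_len + ridx. The GotoTable type and its methods are hypothetical names, not lrtable's API, and the packing step with packed-vec is not shown.

```rust
/// A dense goto table: one slot per (state, rule) pair, stored row-major.
/// `None` means there is no goto for that pair.
/// (Hypothetical type for illustration only.)
pub struct GotoTable {
    nt_len: usize,
    cells: Vec<Option<u32>>,
}

impl GotoTable {
    /// Build a table for `states_len` states and `nt_len` nonterminals from
    /// sparse `(stidx, ridx, target)` entries. Entries are assumed in range.
    pub fn new(states_len: usize, nt_len: usize, entries: &[(usize, usize, u32)]) -> Self {
        let mut cells = vec![None; states_len * nt_len];
        for &(stidx, ridx, target) in entries {
            cells[stidx * nt_len + ridx] = Some(target);
        }
        GotoTable { nt_len, cells }
    }

    /// Constant-time lookup: compute the flat index from `stidx` and `ridx`.
    pub fn goto(&self, stidx: usize, ridx: usize) -> Option<u32> {
        if ridx >= self.nt_len {
            return None;
        }
        self.cells.get(stidx * self.nt_len + ridx).copied().flatten()
    }
}
```

The dense layout reserves a slot for every pair, including the many pairs with no goto, which is why a packing step is attractive afterwards.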

opened by ptersilie 33
WIP: %parse-param

This merely adds some basic scaffolding for %parse-param (issue #212). It contains some ugly hacks and commented-out code to get tests compiling with the unimplemented feature. I mostly wanted to ask if this looks like an OK syntax for lifetimes.

opened by ratmice 30
Add a span field to LexBuildError

This currently uses the offset given to build an empty span at (off, off). I've commented in the tests, some spans that I believe might be the right thing, but without really digging into the parser code for each case and checking out the results, I'm not certain of the accuracy of these comments.

I mainly wanted to push this before starting any work on that, because that will probably take a bit of work, increase the size of the patch, and that work should be isolated private API.

My feeling is that the next step would be to remove the line/col fields from the structure and the error text, moving that to callers using the span. In this patch we still need those fields because, by the time we format the message, we no longer have the text to count newlines. Does that sound reasonable?
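For illustration, this is roughly what a caller would have to do once line/col are gone from the error: count newlines in the original text up to the span's start offset. The function line_col is a hypothetical helper, not part of lrlex; it assumes byte offsets and 1-based lines and columns.

```rust
/// Convert a byte offset into a 1-based (line, column) pair by counting
/// newlines in `text` before `off`. Columns count characters, not bytes.
/// Returns `None` if `off` is out of range or not on a char boundary.
/// (Hypothetical helper for illustration only.)
pub fn line_col(text: &str, off: usize) -> Option<(usize, usize)> {
    if !text.is_char_boundary(off) {
        return None;
    }
    let before = &text[..off];
    // Every newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line starts just after the last newline, if any.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count() + 1;
    Some((line, col))
}
```

With an empty span at (off, off), as this patch produces, the caller would pass off as the offset.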

opened by ratmice 29
Allow different rules to have different action types.
This falls under the "why didn't this occur to me earlier?" heading.

Our previous %actiontype solution is, more-or-less, useless in practice. It works in C because the union type that's created allows C programmers to treat chunks of memory as dynamically typed (happily seg-faulting if the programmer fished the wrong thing out of the union). Consider this (highly contrived) grammar:

%start S
%%
S: A | S A;
A: 'a';

i.e. match one or more "a" characters. Let's assume that, for each "a" we match we want to produce some value. How should we write this? Our only option is "%actiontype Vec<...>" because "S" must return a Vec, even though "A" only sensibly returns a singular value. In this case, this feels inelegant, but it works.

However, what happens if you've got a real programming language building up an AST? Consider a simple language which allows assignments and expressions:

Assignments: Assignment | Assignments Assignment;
Assignment: "ID" "=" Expr
Expr: "INT" "+" Expr | "INT"

"Assignments" should return "Vec<Assign>" and "Expr" should return an "Expr". What should our %actiontype be now? Well, we can make an enum:

enum AssignsOrExprs {
    Assigns(Vec<Assign>),
    Expr(Expr)
}

and then we can have "%actiontype AssignsOrExprs". But we now have to scale this up to every AST type in our entire program which is horrible, and also turns our seemingly statically typed Rust into dynamically typed code: the number of "match" statements doesn't bear thinking about!

How do other grammar generators deal with this? Well they do the obvious (in retrospect) thing of allowing each different rule's actions to return different types (note: each production in a rule must return the same type!). That's what this commit does. Interestingly, this requires a simple tweak to cfgrammar; a big change to ctbuilder.rs; and not much else.

First we add YaccKind::Grmtools, a new variant on Yacc syntax. %actiontype is not valid in this grammar type. Instead, each rule in this grammar type must have a type after its name:

%start S
%%
S::Vec<A>: A { vec![$1] }
         | S A { $1.push($2); $1 }
         ;
A::A: 'a' { A } ;
%%
pub struct A;

How does this work? Before we translated each production's action into a rule which (for the first production in the grammar above) looked roughly like this:

fn action_prod0(..., args: Drain<...>) -> Vec<A> {
    let arg1 = match args.next().unwrap() {
        AStackType::ActionType(x) => x,
        _ => unreachable!()
    };
    vec![arg1]
}

The crucial detail now is that we know the types of all rules in advance. So we generate an enum of all action types, and translate each action into a wrapper and an action:

enum ActionsKind {
    AKS(Vec<A>),
    AKA(A)
}

fn wrapper_prod0(..., args: Drain<...>) -> Vec<A> {
    let arg1 = match args.next().unwrap() {
        AStackType::ActionType(x) => x,
        _ => unreachable!()
    };
    action_prod0(arg1)
}

fn action_prod0(..., mut arg1: A) -> Vec<A> {
    vec![arg1]
}

fn wrapper_prod1(..., args: Drain<...>) -> Vec<A> {
    let arg1 = match args.next().unwrap() {
        AStackType::ActionType(x) => x,
        _ => unreachable!()
    };
    let arg2 = match args.next().unwrap() {
        AStackType::ActionType(x) => x,
        _ => unreachable!()
    };
    // Call the action for this production (production 1).
    action_prod1(arg1, arg2)
}

fn action_prod1(..., mut arg1: Vec<A>, mut arg2: A) -> Vec<A> {
    arg1.push(arg2);
    arg1
}

fn wrapper_prod2(..., args: Drain<...>) -> Vec<A> {
    action_prod2()
}

fn action_prod2(...) -> A {
    A
}

The cunning thing about this is that we don't have to change the way the parser works: it receives "action" functions that are wrappers, all of which have the same function signature (i.e. from the parser's perspective it's a bit like we wrote %actiontype ActionsKind). The wrapper functions then unpack the Drain, and call the "actual" action functions which contain user code. The actual action functions have their function's arguments and return types statically typed, so Rust statically guarantees that you can't, for example, mix your Classes and Imports. This does incur a mild additional overhead because of the ActionsKind enum (one extra machine word, at worst, per ActionType we're holding), but not enough to hugely worry us. And now we can write grammars which generate ASTs with ease!

As a happy bonus, I realised that we can make arguments to action functions be mutable (hence the mut arg1: Vec<A> above), which makes doing things like flattening lists a lot more ergonomic.
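The wrapper/action split can be sketched in standalone Rust. This is a hand-written approximation of the generated code for the S/A grammar above, under simplifying assumptions: the elided parameters are dropped, the stack holds ActionsKind directly rather than AStackType, and each wrapper re-wraps the action's result in ActionsKind so that all wrappers share one signature.

```rust
use std::vec::Drain;

#[derive(Debug, PartialEq)]
pub struct A;

/// One variant per rule type; the parser stack only ever sees this enum.
#[derive(Debug, PartialEq)]
pub enum ActionsKind {
    AKS(Vec<A>),
    AKA(A),
}

// Typed actions: these stand in for the user's action code.
fn action_prod0(arg1: A) -> Vec<A> {
    vec![arg1]
}
fn action_prod1(mut arg1: Vec<A>, arg2: A) -> Vec<A> {
    arg1.push(arg2);
    arg1
}
fn action_prod2() -> A {
    A
}

// Wrappers: identical signatures, so the parser can treat them uniformly.
// Each one unpacks the drained stack entries and calls the typed action.
pub fn wrapper_prod0(mut args: Drain<'_, ActionsKind>) -> ActionsKind {
    let arg1 = match args.next().unwrap() {
        ActionsKind::AKA(x) => x,
        _ => unreachable!(),
    };
    ActionsKind::AKS(action_prod0(arg1))
}

pub fn wrapper_prod1(mut args: Drain<'_, ActionsKind>) -> ActionsKind {
    let arg1 = match args.next().unwrap() {
        ActionsKind::AKS(x) => x,
        _ => unreachable!(),
    };
    let arg2 = match args.next().unwrap() {
        ActionsKind::AKA(x) => x,
        _ => unreachable!(),
    };
    ActionsKind::AKS(action_prod1(arg1, arg2))
}

pub fn wrapper_prod2(_args: Drain<'_, ActionsKind>) -> ActionsKind {
    ActionsKind::AKA(action_prod2())
}
```

The unreachable!() arms are only safe if the caller pops entries in the order the grammar dictates, which in the real tool is the parser's job; the typed action functions are where the static checking happens.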
opened by ltratt 25
Tentatively add a $span pseudo-variable
This allows you to tell how much input the current production has matched, which can be useful for giving better debugging information to users. Its type is (usize, usize) where these represent (start, end) offsets in the input.

For example if you have a rule:

Expr: Term '+' Expr { println!("{:?}", $span); } ;

and input along the lines of "2+3", you will get the output "(0, 3)".

Interestingly, users can mostly calculate this same information themselves (by inspecting tokens' start/end positions), except for epsilon rules, where there's no way of knowing where in the input we are. So this production can't be made to work sensibly except with $span:

R: { println!("{:?}", $span); }
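A sketch of why epsilon rules need parser support: for a non-empty production the span can be derived from the first and last symbols' spans, but for an empty one the only source of a position is the parser itself. The helper production_span and its cur parameter are hypothetical, and treating an epsilon production's span as the empty range (cur, cur) is an assumption, not something this PR states.

```rust
/// Span of a production, given the (start, end) spans of its right-hand-side
/// symbols and the parser's current input offset `cur`.
/// (Hypothetical helper; the epsilon behaviour is an assumption.)
pub fn production_span(children: &[(usize, usize)], cur: usize) -> (usize, usize) {
    match (children.first(), children.last()) {
        // Non-empty production: from the first symbol's start to the
        // last symbol's end.
        (Some(&(start, _)), Some(&(_, end))) => (start, end),
        // Epsilon production: no symbols to inspect, so fall back to the
        // parser's position.
        _ => (cur, cur),
    }
}
```

For the "2+3" example, the three symbols have spans (0, 1), (1, 2) and (2, 3), which combine to (0, 3).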

This is a bit tentative, because I haven't used this enough yet to know if it's the right design: feedback is welcome! The major commit is https://github.com/softdevteam/grmtools/commit/2b64530eb381c3f009d78aaf18b49c739ac43bff with documentation in https://github.com/softdevteam/grmtools/commit/b81b095a0901b6953a4ebe03637926cd3b11825a; the other commits are mostly shuffling a few things around.
opened by ltratt 23
More flexible input lifetime
This is based on https://github.com/softdevteam/grmtools/pull/174: it tries to decouple the lifetime of a lexer from its input. That PR is a work of near genius: I simply would not have imagined that the outcome it achieves is possible without the PR as a proof-of-existence. The only minor problem was that I couldn't work out how it achieved its effect. I therefore tried to simplify the PR, but didn't get very far. I then tried reimplementing it, and didn't get very far with that either.

This PR is the result of me taking a different approach. First I simplified and unified the existing lifetimes in grmtools, because there were several inconsistencies, which I thought might be responsible for some of the pain in #174. Once that was done I could then add an explicit 'input lifetime in https://github.com/softdevteam/grmtools/commit/91dab3fd5f21347e87441bb7c1b34493d519ac58 which I think achieves the same effect as #174. Certainly it is enough to allow this program to now compile:

fn main() {
    let lexerdef = t_l::lexerdef();
    let input = "a";
    let t = {
        let lexer = lexerdef.lexer(&input);
        let lx = lexer.iter().next().unwrap().unwrap();
        lexer.span_str(lx.span())
    };
}

where previously rustc would complain:

error[E0597]: `lexer` does not live long enough
  --> src/main.rs:12:9
   |
9  |     let t = {
   |         - borrow later stored here
...
12 |         lexer.span_str(lx.span())
   |         ^^^^^ borrowed value does not live long enough
13 |     };
   |     - `lexer` dropped here while still borrowed

error: aborting due to previous error

However, I am not sure if this PR is able to handle all the same cases as #174. I'm hoping that @valarauca will be able to let me know if this solves the problem that led him to create #174.
opened by ltratt 21
Add badges linking to crates.io
#133

The crates.io badges are practical in that they:

make it obvious at first glance which components are contained in this project

identify the current version numbers

link to the documentation of the individual projects and their READMEs
opened by pablosichert 20
Adding a %grammar-kind declaration?
Before I try to come up with a patch, I figured it would be good to discuss this in an issue. I was considering adding a declaration such as %grammar-kind Original(NoAction).

One of the problems with this is that we would likely want to parse the value by just using Deserialize on the YaccGrammarKind, which would at least be the easiest way. But it brings about a few issues:

cfgrammar has optional deserialization support, so if we deserialized that way, %grammar-kind would only work with the serde feature enabled. Alternatively, we could just implement this by hand instead of using serde?

Some declarations depend upon a specific %grammar-kind, we may have to move some checks from parse_declarations to ast.complete_and_validate.

But it could potentially reduce the number of places that YaccGrammarKind needs to be specified (build.rs, nimbleparse command line, etc). So it seemed like it might be worth considering.
enhancement
opened by ratmice 0
Permit stack operations on start conditions

In #318, start state logic was added for start states defined by name. In the POSIX lex standard, start states can be used by numeric id.

Q: Should this include support for expanding the target start state logic to support incrementing and decrementing the current start state, as well as setting to an explicit target?

opened by SMartinScottLogic 13
Remove debug formatting in non-debug locations

In a couple of places (e.g. https://github.com/softdevteam/grmtools/blob/master/lrlex/src/lib/ctbuilder.rs#L419) we use debug formatting in a non-debug location. This feels somewhat unsatisfactory, particularly as there are fewer guarantees about stability.

opened by ltratt 2
Error span improvements
In PR #299, which adds spans to various Error types, the Spans returned are based on the existing offset data from which we can derive a line & column. As it is, we currently always return a span where start == end, since that PR is just getting us to the desired semver ABI.

SemVer compatible changes (after we add Spans to Errors): After that PR we could include more information from the parse functions in YaccParserError and LexBuildError. This may require some reorganization of the various private parse functions.

Potential SemVer incompatible changes (after we add Spans to Errors): YaccErrorKind and LexErrorKind could sometimes have useful additional spans. For instance, LexErrorKind::DuplicateName could have a span pointing to the first occurrence of the duplicate entry.

[ ] SemVer compatible improvements

[x] SemVer incompatible improvements
opened by ratmice 0
Apparently infinite recursive rule
One of the "fun" things about my project is running the parser on strange, half edited/incomplete changes. Here is one such case I encountered that way, and have minimized.

Given the input character a, this will cause an infinite loop pushing to pstack between https://github.com/softdevteam/grmtools/blob/master/lrpar/src/lib/parser.rs#L297 and https://github.com/softdevteam/grmtools/blob/master/lrtable/src/lib/statetable.rs#L461-L466. Adding a case like: Some(i) if i == usize::from(stidx) + 1 => None, to goto fixes it (i.e. when the return value of goto == prior).

Filing this as a bug report rather than sending a PR though, because I haven't yet tested it against valid parsers, or as of yet tried to surmise if this case can only and always lead to infinite recursion or if it ever actually comes up in a valid way.

%%
a "a"
[\t\n ] ;

%%
Start: Bar;
Foo: "a" | ;
Bar: Foo | Foo Bar;
opened by ratmice 5
Expose more than one rule?
Question / Feature Request: Is there any way to parse a specific rule as the starting parser? For example, if I have:

%start Expr
%%
Expr -> ...;
Int -> ...;
%%

I also want to be able to parse a string as Int, not just Expr.

(I'm trying to port my parser from LALRPOP to lrpar (mainly because of the operator precedence feature) which exposes a parser for any rule prefixed with the keyword pub.)
enhancement
opened by utkarshkukreti 3

Parse BNF grammar definitions

bnf A library for parsing Backus–Naur form context-free grammars. What does a parsable BNF grammar look like? The following grammar from the Wikipedia

188 Dec 26, 2022

LR(1) grammar parser of simple expression

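LR(1) parser. Lab content: write an LR(1) parser that performs syntax analysis of arithmetic expressions. The arithmetic expressions to be analyzed must be generated by the following grammar: E -> E+T | E-T | T T -> T*F | T/F | F F -> (E) | num Program design and implementation. Usage: run .\lr1-parser.exe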

1 Nov 24, 2021

Pure, simple and elegant HTML parser and editor.

HTML Editor Pure, simple and elegant HTML parser and editor. Examples Parse HTML segment/document let document = parse("<!doctype html><html><head></h

16 Nov 8, 2022

A native Rust port of Google's robots.txt parser and matcher C++ library.

robotstxt A native Rust port of Google's robots.txt parser and matcher C++ library. Native Rust port, no third-part crate dependency Zero unsafe code

72 Dec 11, 2022

JsonPath engine written in Rust. Webassembly and Javascript support too

jsonpath_lib A Rust implementation of JsonPath that also provides a similar API interface for Webassembly and Javascript. It is JsonPath JsonPath engine written in Rust. it provide a simil

95 Dec 29, 2022

Parsing and inspecting Rust literals (particularly useful for proc macros)

litrs: parsing and inspecting Rust literals litrs offers functionality to parse Rust literals, i.e. tokens in the Rust programming language that repre

31 Dec 26, 2022

A Rust crate for RDF parsing and inferencing.

RDF-rs This crate provides the tools necessary to parse RDF graphs. It currently contains a full (with very few exceptions) Turtle parser that can par