Left Recursive PEG for rust

Sean Young

Last update: Jun 13, 2022

Related tags

Cryptography lrpeg

Overview

Left Recursive Parsing Expression Grammar (PEG)

lrpeg allows left recursive rules, and uses ratpack parsing for speed. I wrote a blog post to introduce the ideas of lrpeg.

The existing PEG parser generators for rust do not allow left recursion, which makes it very awkward to write grammars. It is possible to write a PEG parser generator which allows for left recursion, just as python now uses.

See IRP Grammar for a complete lrpeg grammar and handling for IRP.

How to use lrpeg

Add lrpeg to your Cargo.toml in build-dependencies:

[build-dependencies]
lrpeg = "0"

[dependencies]
regex = "1"
unicode-xid = "0.2"

Now add a build.rs to the root of your project, containing:

use std::path::PathBuf;

fn main() {
    lrpeg::process_files(&PathBuf::from("src"));
}

Write your peg grammar, and put it in a file which ends with .peg, for example src/calculator.peg:

calculator <- expr EOI;

expr <- expr ("+" / "-") WHITESPACE term
    / term;

term <- term ("*" / "/" / "%") WHITESPACE  factor
    / factor;

factor <- "(" WHITESPACE  expr ")" WHITESPACE
    / num;

num <- r"[0-9]+" WHITESPACE;

When your run cargo build, src/calculator.rs will be generated. You need to include this module in your project, and then you can instantiate the PEG like so:

mod calculator;

use calculator::{Node, Rule};

fn main() {
    let mut parser = calculator::PEG::new();

    let args: Vec<String> = std::env::args().collect();

    if args.len() != 2 {
        eprintln!("Usage {} EXPRESSION", &args[0]);
        std::process::exit(2);
    }

    let input = &args[1];

    println!("parsing: {}", input);

    match parser.parse(input) {
        Ok(node) => {
            fn walk(node: &Node, input: &str) -> u64 {
                match node.rule {
                    Rule::num => u64::from_str_radix(node.children[0].as_str(input), 10).unwrap(),
                    Rule::expr => {
                        if node.alternative == Some(0) {
                            let left = walk(&node.children[0], input);
                            let right = walk(&node.children[3], input);

                            match node.children[1].as_str(input) {
                                "+" => left + right,
                                "-" => left - right,
                                _ => unreachable!(),
                            }
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::term => {
                        if node.alternative == Some(0) {
                            let left = walk(&node.children[0], input);
                            let right = walk(&node.children[3], input);

                            match node.children[1].as_str(input) {
                                "*" => left * right,
                                "/" => left / right,
                                "%" => left % right,
                                _ => unreachable!(),
                            }
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::factor => {
                        if node.alternative == Some(0) {
                            walk(&node.children[2], input)
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::calculator => walk(&node.children[0], input),
                    _ => {
                        unreachable!()
                    }
                }
            }

            println!("result: {}", walk(&node, input));
        }
        Err(pos) => {
            eprintln!("parser error at offset {}", pos + 1);
        }
    }
}

This example is available in lrpeg-example.

How to write grammar

PEG grammars are a set of rules. In lrpeg, each rule must end with a ";". Parsing starts at the top rule.

Each rule must start with an identifier, followed by <- and then the terms and terminated by ;
A term can be repeated with *, + or optional ?.
The list of terms cannot be empty (but a term can be optional)
A text literal can be encoded in single or double quotes ("foo" or 'bar')
A regex must be written as r"[0-9]+". Backslashed must be escaped r"\\d+" (this really needs fixing)
Alternatives are denoted with /. For example foo <- "a" / "b";
There are special terms WHITESPACE, EOI, and XID_IDENTIFIER (for unicode identifiers)

How lrpeg is bootstrapped

First of all, we need to bootstrap the PEG. We do this with a very simple lalrpop LR(1) PEG grammar. This grammar has some limitations, but it is good enough.

TODO

More tests
Better parse error information (now only the error offset is returned)
Better documentation
Make generator into rust macro
Detect unused rules
Detect unreachable alternatives

Comments

ambiguous tokens

since the operator "~" doesn't seem to be supported and the regex crate can't do look-ahead operations what is the best way to disambiguate between identifiers and keywordsI have tried to reorder them but it's not ideal because it will either fail to parse or end up in the wrong branch.

also thank you for making lrpeg available!

opened by advancedtw 7

Idea: Support `node_tag` and `branch_tag`

Add two fields to identify nodes

#[derive(Clone, Debug)]
pub struct Node<'i> {
    pub rule: Rule,
    pub start: usize,
    pub end: usize,
    pub node_tag: Option<&'i str>,   // The lifetime is the same as the input text
    pub branch_tag: Option<&'i str>, // The lifetime is the same as the input text
    pub children: Vec<Node<'i>>,
    pub alternative: Option<u16>,
}

Basically like this:

expr <-
    "(" expr ")"            #Priority
  / lhs=expr "*" rhs=expr   #Mul
  / lhs=expr "/" rhs=expr   #Div
  / lhs=expr "+" rhs=expr   #Add
  / lhs=expr "-" rhs=expr   #Sub
  / num                     #Atom
  ;
num <- re#[0-9]+#;

PEG.js has this feature

enhancement

opened by oovm 6

Make 'ast' and 'parser' modules public

Hi Sean,

I need to access the internal modules 'ast' and 'parser' so I've added optional features to expose them.

Please let me know if you prefer to handle this in a different way.

Thanks, Tal

opened by talliberman 5

Are rules related to order?

// everything ok
IGNORE <- WHITESPACE / COMMENT;

COMMENT <- "//" re#[^\r\n]# .;

If you change the order, it will go wrong:

// failed to run custom build command
COMMENT <- "//" re#[^\r\n]# .;

IGNORE <- WHITESPACE / COMMENT;

error: failed to run custom build command for `lrpeg-test v0.1.0 (D:\Rust\lrpeg\lrpeg-test)`

opened by oovm 4

Optional `;`
https://github.com/seanyoung/lrpeg/blob/01baa55940033bafb2e68d40b76ab11d81a6ec11/lrpeg/src/peg.peg#L3

Here I can see ; should be optional, but in fact it is not, ; is required.

https://github.com/seanyoung/lrpeg/blob/01baa55940033bafb2e68d40b76ab11d81a6ec11/lrpeg/src/peg.peg#L7

It seems that sequence will eat the beginning of next rule.

x <- a b c y <- d // equivalent to x <- a b c y <- d // should equivalent to x <- a b c; y <- d

In principle, can this problem be solved?
opened by oovm 1

Bad case report of indirect left recursion

Input text

x = a | b | c

    Finished test [unoptimized + debuginfo] target(s) in 0.10s
     Running tests\main.rs (target\debug\deps\main-c0d26e7fa1d49033.exe)

called `Result::unwrap()` on an `Err` value: (2, 6)
thread 'peg::assign' panicked at 'called `Result::unwrap()` on an `Err` value: (2, 6)', projects\ygg-bootstrap\tests\peg\mod.rs:5:38
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\std\src\panicking.rs:517
   1: core::panicking::panic_fmt
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\core\src\panicking.rs:100
   2: core::result::unwrap_failed
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\core\src\result.rs:1616
   3: enum$<core::result::Result<yggdrasil_bootstrap::ygg::ygg::Node,tuple$<usize,usize> > >::unwrap<yggdrasil_bootstrap::ygg::ygg::Node,tuple$<usize,usize> >
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\result.rs:1298
   4: main::peg::peg_assert
             at .\tests\peg\mod.rs:5
   5: main::peg::assign
             at .\tests\peg\mod.rs:47
   6: main::peg::assign::closure$0
             at .\tests\peg\mod.rs:42
   7: core::ops::function::FnOnce::call_once<main::peg::assign::closure$0,tuple$<> >
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\ops\function.rs:227
   8: core::ops::function::FnOnce::call_once
             at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\ops\function.rs:227
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

error: test failed, to rerun pass '--test main'

Process finished with exit code 101

Bad PEG

Forgive me for not minimizing the problem, there are some other problems I am figuring out.

program  <- IGNORE (statement IGNORE)* EOI;

statement  <- empty_statement
  / grammar_statement IGNORE eos?
  / fragment_statement IGNORE eos?
  / import_statement IGNORE eos?
  / ignore_statement IGNORE eos?
  / assign_statement IGNORE eos?
  / macro_define IGNORE eos?
;

empty_statement <- eos;

eos  <- ";";


grammar_statement  <-
    grammar IGNORE symbol IGNORE "{" IGNORE (string IGNORE ("," IGNORE string)* IGNORE ","?)? IGNORE "}"
  / grammar IGNORE symbol IGNORE string?
;
grammar  <- "grammar!";

fragment_statement <- fragment IGNORE symbol;
fragment <- "fragment!";


import_statement  <-
    import IGNORE string IGNORE "{" IGNORE (symbol_alias IGNORE ("," IGNORE symbol_alias)* IGNORE ","?)? IGNORE "}"
  / import IGNORE string
;
import  <- "import!";

symbol_alias <- (symbol IGNORE "as" IGNORE)? symbol;

ignore_statement  <-
    ignore IGNORE symbol
  / ignore IGNORE "{" IGNORE (symbol IGNORE ("," IGNORE symbol)* IGNORE ","?)? IGNORE "}"
;
ignore  <- "ignore!";



assign_statement  <- symbol IGNORE assign_kind IGNORE ("|"/"/")? IGNORE expr;
assign_kind  <- ("^"/"_"/"@")*  "=";

expr  <- expr0;
expr0 <- symbol IGNORE mark_type? IGNORE "<-" IGNORE expr1
    / expr1;
expr1 <- choice_expr IGNORE ("|"/"/") IGNORE choice_expr
    / expr2;
expr2 <- expr2 IGNORE "~" IGNORE expr3
    / expr3;
expr3 <- expr3 IGNORE slice
    / expr4;
expr4 <- expr4 IGNORE suffix
    / expr5;
expr5 <- prefix IGNORE expr5
    / expr6;
expr6 <- "(" IGNORE "/"? IGNORE expr IGNORE ")"
    / data;

choice_expr  <- expr1 IGNORE (mark_branch IGNORE mark_type?)?;
mark_branch  <- ("^"/"!"/"<"/">")? "#" symbol;
mark_type  <- ":" IGNORE symbol;
prefix  <- "!" / "&" / "^" / "*" / "%";
suffix  <- "?" / "+" / "-"/ "*";

data   <- macro_call
  / regex_range
  / list
  / symbol_path
  / integer
  / special
  ;

list   <- "{" IGNORE (data IGNORE ("," IGNORE data)* IGNORE ","?)? IGNORE "}";
slice  <- "{" IGNORE integer IGNORE "," IGNORE integer IGNORE  "}";

regex_range  <- "[" IGNORE (!"]" IGNORE . / "\\" IGNORE .)* IGNORE "]";


macro_call  <-
    "@" symbol_path IGNORE "(" IGNORE (macro_kv IGNORE ("," IGNORE macro_kv)* IGNORE ","?)? IGNORE")"
;
macro_kv  <-  symbol IGNORE " <-" IGNORE expr / expr;



macro_define  <-
    "macro!" IGNORE symbol IGNORE  "(" IGNORE(macro_arg IGNORE ("," IGNORE macro_arg)* IGNORE ","?)? IGNORE ")" IGNORE block
;
macro_arg  <- symbol IGNORE (":" IGNORE symbol)? IGNORE ("=" IGNORE expr)?;
block  <- "{" IGNORE "return"? IGNORE expr IGNORE "}";


string  <-
    re#'([^\\']|\\.)*'#
  / re#"([^\\"]|\\.)*"#
;

integer  <- "0" / re#[1-9]# ("_"? re#[0-9]#)*;

special  <- "true"/"false"/"null";

symbol_path  <-
    symbol IGNORE (("::"/".") IGNORE symbol)*
;
symbol  <- XID_IDENTIFIER;


IGNORE <-  (SPACE / NEWLINE / COMMENT)?;

SPACE <- re#[\s]+#;
NEWLINE <- re#[\r\n]+#;

COMMENT <- "//" re#[^\r\n]*#

Expected behavior

Inline choice_expr into expr1 then everything is ok

expr1 <- expr1 IGNORE (mark_branch IGNORE mark_type?)? IGNORE ("|"/"/") IGNORE choice_expr
    / expr2;

It should be handled correctly without inline here.

opened by oovm 1

Syntax highlighting

I wrote a very basic VSCode extension to enable syntax highlighting for the project:

https://github.com/sheganinans/lrpeg-rust-vsc/

https://marketplace.visualstudio.com/items?itemName=sheganinans.lrpeg-rust

Would be good if it was mentioned in the docs, thanks!

opened by sheganinans 0
Stack based matching
I have encountered some difficulties, and it doesn’t seem to be easy when pair matching is required.

Such as these situations:

HTML/XML

<a> push into the stack , wait until </a> pop out the stack

TeX

\begin{name} push into the stack , wait until \end{name} pop out the stack.

Markdown

`×4 push into the stack ````markdown do something ```` wait until `×4 pop out the stack
opened by oovm 0

Bad case report of missing label

Simplified Example

calculator <- <atom:atom> EOI;

atom <-
    Num:/ <num:num>
    Str:/ <str:str>;

num <- re#0|[1-9][0-9]*#;
str <- re#"([^\\"]|\\.)*"#;

Bad Result

(omit EOI)

Node {
    rule: calculator,
    start: 0,
    end: 1,
    children: [
        Node {
            rule: atom,
            start: 0,
            end: 1,
            children: [
                Node {
                    rule: num,
                    start: 0,
                    end: 1,
                    children: [],
                    label: None,
                    alternative: None,
                },
            ],
            label: Some("atom"),
            alternative: Some("Num"),
        },
    ],
    label: None,
    alternative: None,
}

Excepted Result

(omit EOI)

Node {
    rule: calculator,
    start: 0,
    end: 1,
    children: [
        Node {
            rule: atom,
            start: 0,
            end: 1,
            children: [
                Node {
                    rule: num,
                    start: 0,
                    end: 1,
                    children: [],
                    label: Some("num"),
                    alternative: None,
                },
            ],
            label: Some("atom"),
            alternative: Some("Num"),
        },
    ],
    label: None,
    alternative: None,
}

opened by oovm 0

NodeAPI suggestion

This is a relatively private request, only for my project.

All the interfaces I need are listed here

/// It's a node contained in the Lossless Concrete Syntax Tree
/// All subsequent required information will be retained
/// Including spaces, line breaks, and comments or other semantically irrelevant content.
/// Macros and formatting can start at this level
pub trait CSTNode where Self: Sized {
    /// get str of the node
    fn get_str(&self) -> &str;
    /// Provide basic location information
    /// (start_offset, end_offset)
    fn get_span(&self) -> (usize, usize);
    /// Provide detailed row and column ranges
    /// (start_line, start_col, end_line, end_col)
    fn get_range(&self) -> (usize, usize, usize, usize);
    /// Get the tag of the current node
    fn get_node_tag(&self) -> Option<&'static str>;
    /// Get the tag of the current branch
    fn get_branch_tag(&self) -> Option<&'static str>;
    /// Find node tags in all of the children
    /// Then collect then into the vec, and store in hashmap with tag name
    fn get_tag_map(&self) -> HashMap<String, Vec<Self>>;
}

This version is used in pest, lrpeg itself does not store the input reference, so it may look like this

pub trait CSTNode where Self: Sized {
    fn get_str<'i>(&self,input: &'i str) -> &'i str;
    fn get_span(&self, input:&str) -> (usize, usize);
    fn get_range(&self, input:&str) -> (usize, usize, usize, usize);
    fn get_node_tag(&self) -> Option<&'static str>;
    fn get_branch_tag(&self) -> Option<&'static str>;
    fn get_tag_map(&self) -> HashMap<String, Vec<Self>>;
}

Hope this helps refer to which field in Node should keep.

Why do I need two ranges (span, range)?

because the language server protocol is like this, most interfaces are rows and columns, and a few cases are offsets.

enhancement

opened by oovm 1

Suppress output of unnecessary nodes

This is my grammar file,

https://github.com/oovm/lrpeg-test/blob/master/projects/lrpeg/src/ygg.peg

This is my parsing output of x=0

https://github.com/oovm/lrpeg-test/blob/95b032c9ca0fd46c39ea31edf480d41eaecdec1b/projects/lrpeg/tests/assign.yaml#L23-L34

I don't understand why there are two Terminals here, and the statement nodes I need are wrapped in children.

In my understanding, Terminal is ε or string or regex, it should have no children.
enhancement

opened by oovm 10

Owner

Sean Young

Compiler engineer and linux kernel infrared maintainer. Working on the Solang Solidity Compiler.

GitHub

STARK - SNARK recursive zero knowledge proofs, combinaison of the Winterfell library and the Circom language

STARK - SNARK recursive proofs The point of this library is to combine the SNARK and STARK computation arguments of knowledge, namely the Winterfell l

68 Dec 5, 2022

Uses Plonky2 proof system to build recursive circuits for Merkle Trees.

ProvableMerkleTrees Introduction This repo provides Rust code to build Merkle Trees, equipped with a Provable interface to generate Zero Knowledge pro

5 Aug 18, 2023

Package used by the cosmos-rust-interface. Makes direct use of cosmos-rust.

Package used by the cosmos-rust-interface. Makes direct use of cosmos-rust (cosmos‑sdk‑proto, osmosis-proto, cosmrs).

4 Dec 26, 2022

Rust project for working with ETH - Ethereum transactions with Rust on Ganache and also deploy smart contracts :)

Just a test project to work with Ethereum but using Rust. I'm using plain Rust here, not Foundry. In future we will use Foundry. Hope you're already f

2 Dec 20, 2022

An open source Rust high performance cryptocurrency trading API with support for multiple exchanges and language wrappers. written in rust(🦀) with ❤️

Les.rs - Rust Cryptocurrency Exchange Library An open source Rust high performance cryptocurrency trading API with support for multiple exchanges and

4 Jan 9, 2023

Simple node and rust script to achieve an easy to use bridge between rust and node.js

Node-Rust Bridge Simple rust and node.js script to achieve a bridge between them. Only 1 bridge can be initialized per rust program. But node.js can h

5 Apr 30, 2023

Marvin-Blockchain-Rust: A Rust-based blockchain implementation, part of the Marvin blockchain project.

Marvin Blockchain - Rust Implementation Welcome to the Rust implementation of the Marvin Blockchain. This project is part of a comparative study on bu

3 Sep 6, 2024

A Rust library for working with Bitcoin SV

Rust-SV A library to build Bitcoin SV applications in Rust. Documentation Features P2P protocol messages (construction and serialization) Address enco

51 Oct 13, 2022

Coinbase pro client for Rust

Coinbase pro client for Rust Supports SYNC/ASYNC/Websocket-feed data support Features private and public API sync and async support websocket-feed sup

126 Dec 30, 2022

Custom Ethereum vanity address generator made in Rust

ethaddrgen Custom Ethereum address generator Get a shiny ethereum address and stand out from the crowd! Disclaimer: Do not use the private key shown i

153 Dec 27, 2022

The new, performant, and simplified version of Holochain on Rust (sometimes called Holochain RSM for Refactored State Model)

Holochain License: This repository contains the core Holochain libraries and binaries. This is the most recent and well maintained version of Holochai

741 Jan 5, 2023

Left Recursive PEG for rust

Related tags

Overview

Left Recursive Parsing Expression Grammar (PEG)

How to use lrpeg

How to write grammar

How lrpeg is bootstrapped

TODO

Comments

Input text

Bad PEG

Expected behavior

HTML/XML

TeX

Markdown

Simplified Example

Bad Result

Excepted Result

Why do I need two ranges (span, range)?

Owner

Sean Young

STARK - SNARK recursive zero knowledge proofs, combinaison of the Winterfell library and the Circom language

Uses Plonky2 proof system to build recursive circuits for Merkle Trees.

Package used by the cosmos-rust-interface. Makes direct use of cosmos-rust.

Rust project for working with ETH - Ethereum transactions with Rust on Ganache and also deploy smart contracts :)

An open source Rust high performance cryptocurrency trading API with support for multiple exchanges and language wrappers. written in rust(🦀) with ❤️

Simple node and rust script to achieve an easy to use bridge between rust and node.js

Marvin-Blockchain-Rust: A Rust-based blockchain implementation, part of the Marvin blockchain project.

A Rust library for working with Bitcoin SV

Coinbase pro client for Rust

Custom Ethereum vanity address generator made in Rust

The new, performant, and simplified version of Holochain on Rust (sometimes called Holochain RSM for Refactored State Model)

IBC modules and relayer - Formal specifications and Rust implementation

A Rust implementation of BIP-0039

Rust Ethereum 2.0 Client

Official Rust implementation of the Nimiq protocol

Rust implementation of Zcash protocol

rust client libraries to deal with the current cardano mainnet (byron / cardano-sl)

Tendermint in Rust!

A Rust library for generating cryptocurrency wallets