Left Recursive PEG for rust

Related tags

Cryptography lrpeg
Overview

crates.io CI license

Left Recursive Parsing Expression Grammar (PEG)

lrpeg allows left recursive rules, and uses ratpack parsing for speed. I wrote a blog post to introduce the ideas of lrpeg.

The existing PEG parser generators for rust do not allow left recursion, which makes it very awkward to write grammars. It is possible to write a PEG parser generator which allows for left recursion, just as python now uses.

See IRP Grammar for a complete lrpeg grammar and handling for IRP.

How to use lrpeg

Add lrpeg to your Cargo.toml in build-dependencies:

[build-dependencies]
lrpeg = "0"

[dependencies]
regex = "1"
unicode-xid = "0.2"

Now add a build.rs to the root of your project, containing:

use std::path::PathBuf;

fn main() {
    lrpeg::process_files(&PathBuf::from("src"));
}

Write your peg grammar, and put it in a file which ends with .peg, for example src/calculator.peg:

calculator <- expr EOI;

expr <- expr ("+" / "-") WHITESPACE term
    / term;

term <- term ("*" / "/" / "%") WHITESPACE  factor
    / factor;

factor <- "(" WHITESPACE  expr ")" WHITESPACE
    / num;

num <- r"[0-9]+" WHITESPACE;

When your run cargo build, src/calculator.rs will be generated. You need to include this module in your project, and then you can instantiate the PEG like so:

mod calculator;

use calculator::{Node, Rule};

fn main() {
    let mut parser = calculator::PEG::new();

    let args: Vec<String> = std::env::args().collect();

    if args.len() != 2 {
        eprintln!("Usage {} EXPRESSION", &args[0]);
        std::process::exit(2);
    }

    let input = &args[1];

    println!("parsing: {}", input);

    match parser.parse(input) {
        Ok(node) => {
            fn walk(node: &Node, input: &str) -> u64 {
                match node.rule {
                    Rule::num => u64::from_str_radix(node.children[0].as_str(input), 10).unwrap(),
                    Rule::expr => {
                        if node.alternative == Some(0) {
                            let left = walk(&node.children[0], input);
                            let right = walk(&node.children[3], input);

                            match node.children[1].as_str(input) {
                                "+" => left + right,
                                "-" => left - right,
                                _ => unreachable!(),
                            }
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::term => {
                        if node.alternative == Some(0) {
                            let left = walk(&node.children[0], input);
                            let right = walk(&node.children[3], input);

                            match node.children[1].as_str(input) {
                                "*" => left * right,
                                "/" => left / right,
                                "%" => left % right,
                                _ => unreachable!(),
                            }
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::factor => {
                        if node.alternative == Some(0) {
                            walk(&node.children[2], input)
                        } else {
                            walk(&node.children[0], input)
                        }
                    }
                    Rule::calculator => walk(&node.children[0], input),
                    _ => {
                        unreachable!()
                    }
                }
            }

            println!("result: {}", walk(&node, input));
        }
        Err(pos) => {
            eprintln!("parser error at offset {}", pos + 1);
        }
    }
}

This example is available in lrpeg-example.

How to write grammar

PEG grammars are a set of rules. In lrpeg, each rule must end with a ";". Parsing starts at the top rule.

  • Each rule must start with an identifier, followed by <- and then the terms and terminated by ;
  • A term can be repeated with *, + or optional ?.
  • The list of terms cannot be empty (but a term can be optional)
  • A text literal can be encoded in single or double quotes ("foo" or 'bar')
  • A regex must be written as r"[0-9]+". Backslashed must be escaped r"\\d+" (this really needs fixing)
  • Alternatives are denoted with /. For example foo <- "a" / "b";
  • There are special terms WHITESPACE, EOI, and XID_IDENTIFIER (for unicode identifiers)

How lrpeg is bootstrapped

First of all, we need to bootstrap the PEG. We do this with a very simple lalrpop LR(1) PEG grammar. This grammar has some limitations, but it is good enough.

TODO

  • More tests
  • Better parse error information (now only the error offset is returned)
  • Better documentation
  • Make generator into rust macro
  • Detect unused rules
  • Detect unreachable alternatives
Comments
  • ambiguous tokens

    ambiguous tokens

    since the operator "~" doesn't seem to be supported and the regex crate can't do look-ahead operations what is the best way to disambiguate between identifiers and keywordsI have tried to reorder them but it's not ideal because it will either fail to parse or end up in the wrong branch.

    also thank you for making lrpeg available!

    opened by advancedtw 7
  • Idea: Support `node_tag` and `branch_tag`

    Idea: Support `node_tag` and `branch_tag`

    Add two fields to identify nodes

    #[derive(Clone, Debug)]
    pub struct Node<'i> {
        pub rule: Rule,
        pub start: usize,
        pub end: usize,
        pub node_tag: Option<&'i str>,   // The lifetime is the same as the input text
        pub branch_tag: Option<&'i str>, // The lifetime is the same as the input text
        pub children: Vec<Node<'i>>,
        pub alternative: Option<u16>,
    }
    

    Basically like this:

    expr <-
        "(" expr ")"            #Priority
      / lhs=expr "*" rhs=expr   #Mul
      / lhs=expr "/" rhs=expr   #Div
      / lhs=expr "+" rhs=expr   #Add
      / lhs=expr "-" rhs=expr   #Sub
      / num                     #Atom
      ;
    num <- re#[0-9]+#;
    

    PEG.js has this feature

    enhancement 
    opened by oovm 6
  • Make 'ast' and 'parser' modules public

    Make 'ast' and 'parser' modules public

    Hi Sean,

    I need to access the internal modules 'ast' and 'parser' so I've added optional features to expose them.

    Please let me know if you prefer to handle this in a different way.

    Thanks, Tal

    opened by talliberman 5
  • Are rules related to order?

    Are rules related to order?

    // everything ok
    IGNORE <- WHITESPACE / COMMENT;
    
    COMMENT <- "//" re#[^\r\n]# .;
    

    If you change the order, it will go wrong:

    // failed to run custom build command
    COMMENT <- "//" re#[^\r\n]# .;
    
    IGNORE <- WHITESPACE / COMMENT;
    
    error: failed to run custom build command for `lrpeg-test v0.1.0 (D:\Rust\lrpeg\lrpeg-test)`
    
    opened by oovm 4
  • Optional `;`

    Optional `;`

    https://github.com/seanyoung/lrpeg/blob/01baa55940033bafb2e68d40b76ab11d81a6ec11/lrpeg/src/peg.peg#L3

    Here I can see ; should be optional, but in fact it is not, ; is required.

    https://github.com/seanyoung/lrpeg/blob/01baa55940033bafb2e68d40b76ab11d81a6ec11/lrpeg/src/peg.peg#L7

    It seems that sequence will eat the beginning of next rule.

    x <- a b c 
    y <- d
    // equivalent to
    x <- a b c y <- d
    // should equivalent to
    x <- a b c; y <- d
    

    In principle, can this problem be solved?

    opened by oovm 1
  • Bad case report of indirect left recursion

    Bad case report of indirect left recursion

    Input text

    x = a | b | c
    
        Finished test [unoptimized + debuginfo] target(s) in 0.10s
         Running tests\main.rs (target\debug\deps\main-c0d26e7fa1d49033.exe)
    
    called `Result::unwrap()` on an `Err` value: (2, 6)
    thread 'peg::assign' panicked at 'called `Result::unwrap()` on an `Err` value: (2, 6)', projects\ygg-bootstrap\tests\peg\mod.rs:5:38
    stack backtrace:
       0: std::panicking::begin_panic_handler
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\std\src\panicking.rs:517
       1: core::panicking::panic_fmt
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\core\src\panicking.rs:100
       2: core::result::unwrap_failed
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\/library\core\src\result.rs:1616
       3: enum$<core::result::Result<yggdrasil_bootstrap::ygg::ygg::Node,tuple$<usize,usize> > >::unwrap<yggdrasil_bootstrap::ygg::ygg::Node,tuple$<usize,usize> >
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\result.rs:1298
       4: main::peg::peg_assert
                 at .\tests\peg\mod.rs:5
       5: main::peg::assign
                 at .\tests\peg\mod.rs:47
       6: main::peg::assign::closure$0
                 at .\tests\peg\mod.rs:42
       7: core::ops::function::FnOnce::call_once<main::peg::assign::closure$0,tuple$<> >
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\ops\function.rs:227
       8: core::ops::function::FnOnce::call_once
                 at /rustc/d7c97a02d1215e4ef26c31cb72dbaf16fd548b2c\library\core\src\ops\function.rs:227
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    
    error: test failed, to rerun pass '--test main'
    
    Process finished with exit code 101
    

    Bad PEG

    Forgive me for not minimizing the problem, there are some other problems I am figuring out.

    program  <- IGNORE (statement IGNORE)* EOI;
    
    statement  <- empty_statement
      / grammar_statement IGNORE eos?
      / fragment_statement IGNORE eos?
      / import_statement IGNORE eos?
      / ignore_statement IGNORE eos?
      / assign_statement IGNORE eos?
      / macro_define IGNORE eos?
    ;
    
    empty_statement <- eos;
    
    eos  <- ";";
    
    
    grammar_statement  <-
        grammar IGNORE symbol IGNORE "{" IGNORE (string IGNORE ("," IGNORE string)* IGNORE ","?)? IGNORE "}"
      / grammar IGNORE symbol IGNORE string?
    ;
    grammar  <- "grammar!";
    
    fragment_statement <- fragment IGNORE symbol;
    fragment <- "fragment!";
    
    
    import_statement  <-
        import IGNORE string IGNORE "{" IGNORE (symbol_alias IGNORE ("," IGNORE symbol_alias)* IGNORE ","?)? IGNORE "}"
      / import IGNORE string
    ;
    import  <- "import!";
    
    symbol_alias <- (symbol IGNORE "as" IGNORE)? symbol;
    
    ignore_statement  <-
        ignore IGNORE symbol
      / ignore IGNORE "{" IGNORE (symbol IGNORE ("," IGNORE symbol)* IGNORE ","?)? IGNORE "}"
    ;
    ignore  <- "ignore!";
    
    
    
    assign_statement  <- symbol IGNORE assign_kind IGNORE ("|"/"/")? IGNORE expr;
    assign_kind  <- ("^"/"_"/"@")*  "=";
    
    expr  <- expr0;
    expr0 <- symbol IGNORE mark_type? IGNORE "<-" IGNORE expr1
        / expr1;
    expr1 <- choice_expr IGNORE ("|"/"/") IGNORE choice_expr
        / expr2;
    expr2 <- expr2 IGNORE "~" IGNORE expr3
        / expr3;
    expr3 <- expr3 IGNORE slice
        / expr4;
    expr4 <- expr4 IGNORE suffix
        / expr5;
    expr5 <- prefix IGNORE expr5
        / expr6;
    expr6 <- "(" IGNORE "/"? IGNORE expr IGNORE ")"
        / data;
    
    choice_expr  <- expr1 IGNORE (mark_branch IGNORE mark_type?)?;
    mark_branch  <- ("^"/"!"/"<"/">")? "#" symbol;
    mark_type  <- ":" IGNORE symbol;
    prefix  <- "!" / "&" / "^" / "*" / "%";
    suffix  <- "?" / "+" / "-"/ "*";
    
    data   <- macro_call
      / regex_range
      / list
      / symbol_path
      / integer
      / special
      ;
    
    list   <- "{" IGNORE (data IGNORE ("," IGNORE data)* IGNORE ","?)? IGNORE "}";
    slice  <- "{" IGNORE integer IGNORE "," IGNORE integer IGNORE  "}";
    
    regex_range  <- "[" IGNORE (!"]" IGNORE . / "\\" IGNORE .)* IGNORE "]";
    
    
    macro_call  <-
        "@" symbol_path IGNORE "(" IGNORE (macro_kv IGNORE ("," IGNORE macro_kv)* IGNORE ","?)? IGNORE")"
    ;
    macro_kv  <-  symbol IGNORE " <-" IGNORE expr / expr;
    
    
    
    macro_define  <-
        "macro!" IGNORE symbol IGNORE  "(" IGNORE(macro_arg IGNORE ("," IGNORE macro_arg)* IGNORE ","?)? IGNORE ")" IGNORE block
    ;
    macro_arg  <- symbol IGNORE (":" IGNORE symbol)? IGNORE ("=" IGNORE expr)?;
    block  <- "{" IGNORE "return"? IGNORE expr IGNORE "}";
    
    
    string  <-
        re#'([^\\']|\\.)*'#
      / re#"([^\\"]|\\.)*"#
    ;
    
    integer  <- "0" / re#[1-9]# ("_"? re#[0-9]#)*;
    
    special  <- "true"/"false"/"null";
    
    symbol_path  <-
        symbol IGNORE (("::"/".") IGNORE symbol)*
    ;
    symbol  <- XID_IDENTIFIER;
    
    
    IGNORE <-  (SPACE / NEWLINE / COMMENT)?;
    
    SPACE <- re#[\s]+#;
    NEWLINE <- re#[\r\n]+#;
    
    COMMENT <- "//" re#[^\r\n]*#
    

    Expected behavior

    Inline choice_expr into expr1 then everything is ok

    expr1 <- expr1 IGNORE (mark_branch IGNORE mark_type?)? IGNORE ("|"/"/") IGNORE choice_expr
        / expr2;
    

    It should be handled correctly without inline here.

    opened by oovm 1
  • Syntax highlighting

    Syntax highlighting

    I wrote a very basic VSCode extension to enable syntax highlighting for the project:

    https://github.com/sheganinans/lrpeg-rust-vsc/

    https://marketplace.visualstudio.com/items?itemName=sheganinans.lrpeg-rust

    Would be good if it was mentioned in the docs, thanks!

    opened by sheganinans 0
  • Stack based matching

    Stack based matching

    I have encountered some difficulties, and it doesn’t seem to be easy when pair matching is required.

    Such as these situations:

    HTML/XML

    <a> push into the stack , wait until </a> pop out the stack

    TeX

    \begin{name} push into the stack , wait until \end{name} pop out the stack.

    Markdown

    `×4 push into the stack
    ````markdown
    do something
    ````
    wait until  `×4 pop out the stack
    
    opened by oovm 0
  • Bad case report of missing label

    Bad case report of missing label

    Simplified Example

    calculator <- <atom:atom> EOI;
    
    atom <-
        Num:/ <num:num>
        Str:/ <str:str>;
    
    num <- re#0|[1-9][0-9]*#;
    str <- re#"([^\\"]|\\.)*"#;
    

    Bad Result

    (omit EOI)

    Node {
        rule: calculator,
        start: 0,
        end: 1,
        children: [
            Node {
                rule: atom,
                start: 0,
                end: 1,
                children: [
                    Node {
                        rule: num,
                        start: 0,
                        end: 1,
                        children: [],
                        label: None,
                        alternative: None,
                    },
                ],
                label: Some("atom"),
                alternative: Some("Num"),
            },
        ],
        label: None,
        alternative: None,
    }
    

    Excepted Result

    (omit EOI)

    Node {
        rule: calculator,
        start: 0,
        end: 1,
        children: [
            Node {
                rule: atom,
                start: 0,
                end: 1,
                children: [
                    Node {
                        rule: num,
                        start: 0,
                        end: 1,
                        children: [],
                        label: Some("num"),
                        alternative: None,
                    },
                ],
                label: Some("atom"),
                alternative: Some("Num"),
            },
        ],
        label: None,
        alternative: None,
    }
    
    opened by oovm 0
  • NodeAPI suggestion

    NodeAPI suggestion

    This is a relatively private request, only for my project.

    All the interfaces I need are listed here

    /// It's a node contained in the Lossless Concrete Syntax Tree
    /// All subsequent required information will be retained
    /// Including spaces, line breaks, and comments or other semantically irrelevant content.
    /// Macros and formatting can start at this level
    pub trait CSTNode where Self: Sized {
        /// get str of the node
        fn get_str(&self) -> &str;
        /// Provide basic location information
        /// (start_offset, end_offset)
        fn get_span(&self) -> (usize, usize);
        /// Provide detailed row and column ranges
        /// (start_line, start_col, end_line, end_col)
        fn get_range(&self) -> (usize, usize, usize, usize);
        /// Get the tag of the current node
        fn get_node_tag(&self) -> Option<&'static str>;
        /// Get the tag of the current branch
        fn get_branch_tag(&self) -> Option<&'static str>;
        /// Find node tags in all of the children
        /// Then collect then into the vec, and store in hashmap with tag name
        fn get_tag_map(&self) -> HashMap<String, Vec<Self>>;
    }
    

    This version is used in pest, lrpeg itself does not store the input reference, so it may look like this

    pub trait CSTNode where Self: Sized {
        fn get_str<'i>(&self,input: &'i str) -> &'i str;
        fn get_span(&self, input:&str) -> (usize, usize);
        fn get_range(&self, input:&str) -> (usize, usize, usize, usize);
        fn get_node_tag(&self) -> Option<&'static str>;
        fn get_branch_tag(&self) -> Option<&'static str>;
        fn get_tag_map(&self) -> HashMap<String, Vec<Self>>;
    }
    

    Hope this helps refer to which field in Node should keep.


    Why do I need two ranges (span, range)?

    because the language server protocol is like this, most interfaces are rows and columns, and a few cases are offsets.

    enhancement 
    opened by oovm 1
  • Suppress output of unnecessary nodes

    Suppress output of unnecessary nodes

    This is my grammar file,

    https://github.com/oovm/lrpeg-test/blob/master/projects/lrpeg/src/ygg.peg

    This is my parsing output of x=0

    https://github.com/oovm/lrpeg-test/blob/95b032c9ca0fd46c39ea31edf480d41eaecdec1b/projects/lrpeg/tests/assign.yaml#L23-L34

    I don't understand why there are two Terminals here, and the statement nodes I need are wrapped in children.

    In my understanding, Terminal is ε or string or regex, it should have no children.

    enhancement 
    opened by oovm 10
Owner
Sean Young
Compiler engineer and linux kernel infrared maintainer. Working on the Solang Solidity Compiler.
Sean Young
STARK - SNARK recursive zero knowledge proofs, combinaison of the Winterfell library and the Circom language

STARK - SNARK recursive proofs The point of this library is to combine the SNARK and STARK computation arguments of knowledge, namely the Winterfell l

Victor Colomb 68 Dec 5, 2022
Uses Plonky2 proof system to build recursive circuits for Merkle Trees.

ProvableMerkleTrees Introduction This repo provides Rust code to build Merkle Trees, equipped with a Provable interface to generate Zero Knowledge pro

null 5 Aug 18, 2023
Package used by the cosmos-rust-interface. Makes direct use of cosmos-rust.

Package used by the cosmos-rust-interface. Makes direct use of cosmos-rust (cosmos‑sdk‑proto, osmosis-proto, cosmrs).

Philipp 4 Dec 26, 2022
Rust project for working with ETH - Ethereum transactions with Rust on Ganache and also deploy smart contracts :)

Just a test project to work with Ethereum but using Rust. I'm using plain Rust here, not Foundry. In future we will use Foundry. Hope you're already f

Akhil Sharma 2 Dec 20, 2022
An open source Rust high performance cryptocurrency trading API with support for multiple exchanges and language wrappers. written in rust(🦀) with ❤️

Les.rs - Rust Cryptocurrency Exchange Library An open source Rust high performance cryptocurrency trading API with support for multiple exchanges and

Crabby AI 4 Jan 9, 2023
Simple node and rust script to achieve an easy to use bridge between rust and node.js

Node-Rust Bridge Simple rust and node.js script to achieve a bridge between them. Only 1 bridge can be initialized per rust program. But node.js can h

Pure 5 Apr 30, 2023
A Rust library for working with Bitcoin SV

Rust-SV A library to build Bitcoin SV applications in Rust. Documentation Features P2P protocol messages (construction and serialization) Address enco

Brenton Gunning 51 Oct 13, 2022
Coinbase pro client for Rust

Coinbase pro client for Rust Supports SYNC/ASYNC/Websocket-feed data support Features private and public API sync and async support websocket-feed sup

null 126 Dec 30, 2022
Custom Ethereum vanity address generator made in Rust

ethaddrgen Custom Ethereum address generator Get a shiny ethereum address and stand out from the crowd! Disclaimer: Do not use the private key shown i

Jakub Hlusička 153 Dec 27, 2022
The new, performant, and simplified version of Holochain on Rust (sometimes called Holochain RSM for Refactored State Model)

Holochain License: This repository contains the core Holochain libraries and binaries. This is the most recent and well maintained version of Holochai

Holochain 741 Jan 5, 2023
IBC modules and relayer - Formal specifications and Rust implementation

ibc-rs Rust implementation of the Inter-Blockchain Communication (IBC) protocol. This project comprises primarily four crates: The ibc crate defines t

Informal Systems 296 Dec 31, 2022
A Rust implementation of BIP-0039

bip39-rs A Rust implementation of BIP0039 Changes See the changelog file, or the Github releases for specific tags. Documentation Add bip39 to your Ca

Infincia LLC 49 Dec 9, 2022
Rust Ethereum 2.0 Client

Lighthouse: Ethereum 2.0 An open-source Ethereum 2.0 client, written in Rust and maintained by Sigma Prime. Documentation Overview Lighthouse is: Read

Sigma Prime 2.1k Jan 6, 2023
Official Rust implementation of the Nimiq protocol

Nimiq Core implementation in Rust (core-rs) Rust implementation of the Nimiq Blockchain Core Nimiq is a frictionless payment protocol for the web. Thi

Nimiq 72 Sep 23, 2022
Rust implementation of Zcash protocol

The Parity Zcash client. Gitter Blog: Parity teams up with Zcash Foundation for Parity Zcash client Installing from source Installing the snap Running

Parity Technologies 183 Sep 8, 2022
rust client libraries to deal with the current cardano mainnet (byron / cardano-sl)

Rust implementation of Cardano primitives, helpers, and related applications Cardano Rust is a modular toolbox of Cardano’s cryptographic primitives,

Input Output 275 Oct 9, 2022
Tendermint in Rust!

tendermint.rs Tendermint in Rust with TLA+ specifications. Tendermint is a high-performance blockchain consensus engine for Byzantine fault tolerant a

Informal Systems 447 Jan 7, 2023
A Rust library for generating cryptocurrency wallets

Table of Contents 1. Overview 2. Build Guide 2.1 Install Rust 2.2a Build from Homebrew 2.2b Build from Crates.io 2.2c Build from Source Code 3. Usage

Aleo 552 Dec 29, 2022
Safe, fast, small crypto using Rust

THE SOFTWARE IS PROVIDED "AS IS" AND BRIAN SMITH AND THE AUTHORS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES

Brian Smith 3k Jan 2, 2023