
pest. The Elegant Parser


pest is a general purpose parser written in Rust with a focus on accessibility, correctness, and performance. It uses parsing expression grammars (or PEG) as input, which are similar in spirit to regular expressions, but which offer the enhanced expressivity needed to parse complex languages.

Getting started

The recommended way to start parsing with pest is to read the official book.

Other helpful resources:

  • API reference on docs.rs
  • play with grammars and share them on our fiddle
  • leave feedback, ask questions, or greet us on Gitter

Example

The following is an example of a grammar for a list of alphanumeric identifiers where the first identifier does not start with a digit:

alpha = { 'a'..'z' | 'A'..'Z' }
digit = { '0'..'9' }

ident = { (alpha | digit)+ }

ident_list = _{ !digit ~ ident ~ (" " ~ ident)+ }
          // ^
          // ident_list rule is silent which means it produces no tokens

Grammars are saved in separate .pest files which are never mixed with procedural code. This results in an always up-to-date formalization of a language that is easy to read and maintain.

Meaningful error reporting

Based on the grammar definition, the parser also includes automatic error reporting. For the example above, the input "123" will result in:

thread 'main' panicked at ' --> 1:1
  |
1 | 123
  | ^---
  |
  = unexpected digit', src/main.rs:12

while "ab *" will result in:

thread 'main' panicked at ' --> 1:4
  |
1 | ab *
  |    ^---
  |
  = expected ident', src/main.rs:12
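
The panics above come from the unwrap_or_else call in the example in the next section. As a rough sketch (assuming the IdentParser and Rule items derived below), the same pest::error::Error value can also be handled without panicking; its Display output is the annotated snippet shown above:

use pest::Parser;

// Assumes the `IdentParser` and `Rule` items derived in the next section.
fn parse_or_report(input: &str) {
    match IdentParser::parse(Rule::ident_list, input) {
        Ok(pairs) => println!("parsed: {:?}", pairs),
        // `e` is a pest::error::Error<Rule>; its Display impl renders the
        // line, caret, and expected/unexpected rules shown above.
        Err(e) => eprintln!("{}", e),
    }
}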

Pairs API

The grammar can be used to derive a Parser implementation automatically. Parsing returns an iterator of nested token pairs:

extern crate pest;
#[macro_use]
extern crate pest_derive;

use pest::Parser;

#[derive(Parser)]
#[grammar = "ident.pest"]
struct IdentParser;

fn main() {
    let pairs = IdentParser::parse(Rule::ident_list, "a1 b2").unwrap_or_else(|e| panic!("{}", e));

    // Because ident_list is silent, the iterator will contain idents
    for pair in pairs {
        // A pair is a combination of the rule which matched and a span of input
        println!("Rule:    {:?}", pair.as_rule());
        println!("Span:    {:?}", pair.as_span());
        println!("Text:    {}", pair.as_str());

        // A pair can be converted to an iterator of the tokens which make it up:
        for inner_pair in pair.into_inner() {
            match inner_pair.as_rule() {
                Rule::alpha => println!("Letter:  {}", inner_pair.as_str()),
                Rule::digit => println!("Digit:   {}", inner_pair.as_str()),
                _ => unreachable!()
            };
        }
    }
}

This produces the following output:

Rule:    ident
Span:    Span { start: 0, end: 2 }
Text:    a1
Letter:  a
Digit:   1
Rule:    ident
Span:    Span { start: 3, end: 5 }
Text:    b2
Letter:  b
Digit:   2

Other features

  • Precedence climbing
  • Input handling
  • Custom errors
  • Runs on stable Rust

Projects using pest

Special thanks

A special round of applause goes to prof. Marius Minea for his guidance and to all pest contributors, some of whom are none other than my friends.

Issues
  • Extract the `parser` and `validator` from the `pest_derive` crate into `pest_meta`

    This is the first step towards #158. It provides the new crate pest_meta, which grants access to the implementations of the parser and validator used by pest_derive. However, before they can be declared usable, there needs to be some experimentation with them outside their originally intended use.

    This PR is a WIP mainly to get early feedback and to prevent working inside the proverbial cave for too long.

    Open questions:

    • [x] I am not sure how versioning should work. I just put the same version for pest_meta as for all the other crates in the pest workspace but I am not sure if that's what I was supposed to do :smiley:

    Once the design & implementation kinks are ironed out and any new ideas are implemented (or postponed, dropped), this PR still needs:

    • [ ] documentation for pest_meta targeted at people who want to use it for building tools for pest (syntax highlighting, linting, completion, etc.)
    opened by Victor-Savu 42
  • Add lifetimes — refs into StrInput<'i> are bound by &'i

    Here's a preliminary attempt at resolving #141. We bind the Input type by a lifetime; for StrInput, that's the lifetime of the string it references.

    It works, happily! If you have pair: Pair<'i, Rule, StrInput<'i>>, then pair.as_str() correctly returns a &'i str (rather than a &str bound by pair's lifetime). There isn't actually a test case added yet that demonstrates this, but I'd add one if we were merging. (Right now I'm testing against a local library that I'm using pest with.)
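
    A rough sketch of the property described above, written against the 2.x-style API where these input lifetimes eventually landed (Rule stands in for any derived rule enum):

    // The &'i str returned by as_str() borrows from the original input,
    // not from the Pair, so it can outlive the Pair itself.
    fn keep_text<'i>(pair: pest::iterators::Pair<'i, Rule>) -> &'i str {
        let text: &'i str = pair.as_str();
        drop(pair); // the Pair can go away...
        text        // ...while the matched slice stays usable for 'i
    }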

    There's one problem I haven't resolved and without which this cannot be merged: what to do about StringInput? Right now I've added two hacky transmute calls just so it'd compile and I could get on with the work ([1], [2]), but this needs to be resolved as it's currently super-unsound.

    Thoughts welcome! If this isn't the direction you'd like to go (SO MANY LIFETIME REFERENCES), that's of course understandable; I just wanted to give this a hack, and no hard feelings if you don't merge!

    (Fixes #141. Closes #6.)

    /cc @sunng87 @dragostis

    opened by kivikakk 36
  • add some basic types to pest so it improves grammar readability

    I think having some predefined types would help enhance readability: alpha, decdigit, hexdigit, octdigit, alphanumeric, space, signedint, unsignedint, signedfloat, unsignedfloat, singlequoted, doublequoted

    and maybe a couple others, but these should cover most common needs.

    in-progress 
    opened by lwandrebeck 29
  • Make pest no_std compatible.

    This is another attempt to fix #240 and make pest no_std compatible, allowing it to be used for example in web applications via WASM. I started slowly by converting the core pest crate only. All tests pass.

    There is however one breaking change: In my new version, pest's Error type does not implement std::error::Error, because this trait is part of std. If implementing this trait is important, it could be implemented behind a feature flag.

    opened by 01mf02 27
  • Removed position::new and replaced it with position::Position::new,

    Addresses issue #301. Replaced unsafe code with Options and unwrapping where appropriate (which is everywhere unsafe code was being used).

    I've also changed the position::new function to position::Position::new, since it is now safe. Moreover, I've encountered an issue in my own parser project - I wanted to use the position API to find the line + column of an error but couldn't because I can't create my own position object. This pull request should fix that.

    opened by jkarns275 25
  • Handling indentation

    Hi! I don’t know if this is a question or a feature request, but I want to parse reStructuredText. It allows nesting blocks, and indentation works in a special way:

    Several constructs begin with a marker, and the body of the construct must be indented relative to the marker. For constructs using simple markers (bullet lists, enumerated lists, footnotes, citations, hyperlink targets, directives, and comments), the level of indentation of the body is determined by the position of the first line of text, which begins on the same line as the marker. For example, bullet list bodies must be indented by at least two columns relative to the left edge of the bullet:

    - This is the first line of a bullet list
      item's paragraph.  All lines must align
      relative to the first line.  [1]_
    
          This indented paragraph is interpreted
          as a block quote.
    
    Because it is not sufficiently indented,
    this paragraph does not belong to the list
    item.
    
    .. [1] Here's a footnote.  The second line is aligned
       with the beginning of the footnote label.  The ".."
       marker is what determines the indentation.
    

    For constructs using complex markers (field lists and option lists), where the marker may contain arbitrary text, the indentation of the first line after the marker determines the left edge of the body. For example, field lists may have very long markers (containing the field names):

    :Hello: This field has a short field name, so aligning the field
            body with the first line is feasible.
    
    :Number-of-African-swallows-required-to-carry-a-coconut: It would
        be very difficult to align the field body with the left edge
        of the first line.  It may even be preferable not to begin the
        body on the same line as the marker.
    

    How can I parse this? If I were parsing it manually, I'd define an "indent stack" (like [2, 4] for the first example). Is the best idea to create an iterator that handles this indent stack and only feeds the blocks it finds into pest? Or is there another way?
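
    A rough sketch in plain Rust of the manual indent-stack pass described above (nothing pest-specific; the Event names are made up): it turns physical indentation into explicit events that a later parsing stage, pest-based or not, could consume block by block.

    #[derive(Debug)]
    enum Event<'a> {
        Indent,
        Dedent,
        Line(&'a str),
    }

    fn indent_events(input: &str) -> Vec<Event<'_>> {
        // The indent stack, e.g. [0, 2, 4] while inside a nested block.
        let mut stack: Vec<usize> = vec![0];
        let mut events = Vec::new();
        for line in input.lines() {
            if line.trim().is_empty() {
                continue; // blank lines don't change the indentation level
            }
            let indent = line.len() - line.trim_start().len();
            if indent > *stack.last().unwrap() {
                stack.push(indent);
                events.push(Event::Indent);
            } else {
                while indent < *stack.last().unwrap() {
                    stack.pop();
                    events.push(Event::Dedent);
                }
            }
            events.push(Event::Line(line.trim_start()));
        }
        events
    }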

    enhancement grammar 
    opened by flying-sheep 25
  • Factor out `.pest` parser for use by other tools

    I was wondering if it would make sense to factor out the part of the code which (bootstrappingly?) parses the .pest file so that other tools such as syntax highlighters and completion engines could use it.

    maintenance 
    opened by Victor-Savu 21
  • New Travis matrix

    This PR has been limited to the Travis matrix update, with most jobs allowed to fail temporarily. I will resubmit the fmt, clippy, and miscellaneous test-improvement changes removed from this PR as a new PR momentarily.


    Original text:

    Closes #294, closes #295, supersedes #310

    This has a rather large diff since I did a cargo fmt across the whole workspace, but it should enable mindless rustfmt from here on. I can tweak the rustfmt.toml as well if there are other configurations we want to change; I've deliberately pinned rustfmt to a nightly version.

    Clippy is pinned to 1.29.0 as that's the first rust toolchain to contain clippy-preview. As we'd like to take advantage of the new edition eventually, #300, I'm thinking that we won't officially commit to old compiler versions until 1.31 and edition 2018. At that point we can probably pin both rustfmt and clippy to the final 1.31 nightly to reduce the number of build jobs.

    opened by CAD97 20
  • Case insensitive keywords

    Is it possible to handle case-insensitive keywords (like in SQL)? Thanks.

    enhancement 
    opened by gwenn 19
  • Unmatched expressions modify the stack

    While working with pest, I discovered that expressions that aren't matched still can modify the stack with push or pop, which seems like a bug. Consider:

    grammar.pest

    foo = ${ soi ~ push("") ~ (aba | a) ~ eoi }
    a = @{ peek ~ "a" }
    aba = @{ push("a") ~ "b" ~ pop }
    

    main.rs

    #[macro_use]
    extern crate pest_derive;
    extern crate pest;
    
    use pest::Parser;
    
    #[cfg(debug_assertions)]
    const _GRAMMAR: &'static str = include_str!("grammar.pest");
    
    #[derive(Parser)]
    #[grammar = "grammar.pest"]
    pub struct MyParser;
    
    fn parse_string(s: &str) {
        match MyParser::parse_str(Rule::foo, s) {
            Ok(o) => println!("ok: {:?}", o),
            Err(e) => println!("err: {}", e),
        }
    }
    
    fn main() {
        parse_string("a");
        parse_string("aa");
    }
    

    The output when running this is

    err:  --> 1:1
      |
    1 | a
      | ^---
      |
      = expected foo
    ok: Pairs { pairs: [Pair { rule: foo, span: Span { start: 0, end: 2 }, inner: Pairs { pairs: [Pair { rule: a, span: Span { start: 0, end: 2 }, inner: Pairs { pairs: [] } }] } }] }
    

    I would expect the string "a" to parse correctly and the string "aa" to fail parsing, but the parser does the opposite. What I think happens is this:

    1. Starting with the foo rule, soi ~ push("") matches, so the stack is now [""].
    2. The next step is ~ (aba | a), so the parser tries rule aba first.
    3. The parser sees that push("a") matches, so the stack is now ["", "a"].
    4. The parser sees that ~ "b" doesn't match, so aba doesn't match, so it goes back up to ~ (aba | a) and tries a next. Note that even though aba didn't match, it modified the stack.
    5. We're in rule a now. Since the top of the stack is "a", peek ~ "a" can only match the string "aa", not "a".
    6. In the "aa" example, the final part is ~ eoi, which matches.

    The surprising part is that even though aba didn't match, it still modified the stack. This seems like a bug. I would expect the stack to be left unmodified by expressions that don't match (i.e. restore the stack after the expression fails to match).

    In addition to (aba | a), aba in other expressions exhibits the same behavior, such as (aba? ~ a) and (!aba ~ a).

    Assuming this is a bug, here are a couple of strategies to fix this:

    1. One solution is to clone the stack at the beginning of each expression in a branch (expr?, expr1 | expr2, etc.) or !expr in the parser, and then if the expression fails to match, restore the stack from the clone. I wouldn't expect the clone to be very expensive because the stack only contains Span items (which are small), unless you're parsing a very deeply nested document which builds a large stack.

    2. Instead of cloning, another solution is to use a persistent stack that makes it possible to store past versions of the stack without cloning the whole thing. This is often implemented as a singly linked list, which comes with its own issues but would work.

    If the existing behavior is intentional, it should be better documented, and I would consider adding push_maybe and pop_maybe commands that only modify the stack if the expressions containing them match.

    I can spend some time on a fix or better documentation.
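
    A minimal illustration of strategy 1 above (not pest's actual internals): clone the stack before trying an alternative and restore the clone if the alternative fails to match.

    fn try_alternative<T: Clone, R>(
        stack: &mut Vec<T>,
        attempt: impl FnOnce(&mut Vec<T>) -> Option<R>,
    ) -> Option<R> {
        let snapshot = stack.clone(); // cheap-ish: the stack holds small Span values
        match attempt(stack) {
            matched @ Some(_) => matched, // keep any pushes/pops on success
            None => {
                *stack = snapshot; // roll the stack back on failure
                None
            }
        }
    }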

    opened by jturner314 17
  • Inconsistent grabbing of whitespace

    Consider the following grammar:

    integer = { ASCII_DIGIT+ }
    d = { "d" }
    WHITESPACE = _{ " " }
    diceroll = { integer ~ d ~ integer }
    

    When parsing the input 2 d 3 as a diceroll, its bottom-level tokens are integer 2, the d d, and the integer 3. Note how spaces are captured in the integers but not in the d.

    If the d rule is changed to d = { "d"+ }, suddenly it starts capturing the space after the d too. The same doesn't happen for d = { "d"* }, which continues to capture only d.

    If you increase the number of spaces in a given position, they'll all be captured; so, given the input:

    2     d 3
    

    The first integer will be captured as:

    2     
    

    (Apologies for the full-line code blocks there; inline ones collapse multiple spaces down to just one in traditional HTML fashion, whereas full-line ones don't.)

    If the integers are more than one digit long, spaces are no longer captured after them. 22 d 3's first integer is 22, no space included. This holds irrespective of how many spaces are present in the source text; the number of captured spaces always drops to zero, it doesn't just decrease by one per digit or anything along those lines.

    This behavior all seems to ultimately flow down from the WHITESPACE rule; with the rules as defined here, parsing 2 followed by a space as an integer captures the trailing space along with the 2, but with the WHITESPACE rule removed, it instead yields just 2.

    I'm not at all sure what the intended behavior here is—whether the intention is that the whitespace be consistently captured in the token to its left, or that it be consistently not-captured—but I'm almost certain this isn't the intended behavior. There are too many weird inconsistencies, with the spaces only being captured after one-character inputs whose rules use plus signs. But I figure it's worth raising this issue as an alert of "things are probably not working the way they're supposed to", even in the absence of knowledge of exactly how they are supposed to work.

    opened by LunarTulip 0
  • Parse inner text with compulsory surrounding brackets

    Is it possible to make the closing bracket required if we have a starting bracket before a new line?

    Grammar:

    ident = { (!("[" | "]") ~ ANY)+ }
    entity = { ("[" ~ ident ~ "]") }
    

    Input:

    [Person]
    skiptext
    

    Output:

    - entity > ident: "Person"
    

    If I remove closing bracket:

    [Person
    skiptext
    

    the output error shows starting bracket:

     --> 1:1
      |
    1 | [Person␊
      | ^---
      |
      = expected entity
    

    I'd like to show the error at the end of the line; is that possible with pest? Can I get entity > "Person" directly in the output when the syntax is correct?

    opened by bartekupartek 1
  • Add input() method to get the input field of a Span

    Hi! Thanks for making this great tool.

    I'm writing a compiler, and I use Spans to keep track of error messages, warnings, source trees, etc. When it comes to rendering those messages, being able to get the reference to the original source directly from the Span is a really great feature -- otherwise I have to keep a mapping from Span to the original text and manage the lifetime differently.

    If there is not a reason to exclude this method, could it be included in Span's API?

    Thanks, Alex

    opened by sezna 1
  • Capture groups?

    Is there an elegant way to capture a small snippet in a rule without having to create an entire new rule? Example:

    foo_bar = { "\\baz{" ~ ASCII_ALPHANUMERIC+ ~ "}" ~ ("\r" | "\n") }
    

    I want to grab whatever is represented by ASCII_ALPHANUMERIC+ without creating a whole new rule.

    question 
    opened by vedantroy 1
  • Make pest_generator / pest_derive no_std compatible.

    This is the follow-up pull request to #498. It adds "std" feature flags for both pest_generator and pest_derive. Currently, the correct function of this can only be tested by running cargo test --no-default-features in the derive directory. Running cargo test --no-default-features in the root directory unfortunately does not run the tests in no_std mode.

    opened by 01mf02 4
  • Using PrecClimber requires grammar modifications

    Right now, PrecClimber can only determine operator precedence based on Rule enums. This partly defeats the point of PrecClimber, because ideally adding an operator shouldn't require changing the grammar, but as it stands at least one new rule needs to be added to the grammar for each operator.
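
    For reference, a minimal sketch of the Rule-based usage described here, using pest 2's pest::prec_climber module (the arithmetic rule names and the f64 primary are assumptions for illustration, not from this issue):

    use pest::iterators::{Pair, Pairs};
    use pest::prec_climber::{Assoc, Operator, PrecClimber};

    // Assumes a derived Rule enum with add, subtract, multiply, and divide
    // rules, plus a numeric primary.
    fn eval(expression: Pairs<Rule>) -> f64 {
        // Operators are listed from lowest to highest precedence; note that
        // every operator needs its own Rule variant in the grammar.
        let climber = PrecClimber::new(vec![
            Operator::new(Rule::add, Assoc::Left) | Operator::new(Rule::subtract, Assoc::Left),
            Operator::new(Rule::multiply, Assoc::Left) | Operator::new(Rule::divide, Assoc::Left),
        ]);

        climber.climb(
            expression,
            |pair: Pair<Rule>| pair.as_str().parse::<f64>().unwrap(),
            |lhs: f64, op: Pair<Rule>, rhs: f64| match op.as_rule() {
                Rule::add => lhs + rhs,
                Rule::subtract => lhs - rhs,
                Rule::multiply => lhs * rhs,
                Rule::divide => lhs / rhs,
                _ => unreachable!(),
            },
        )
    }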

    Instead of PrecClimber taking a Vec<Operator<Rule>>, it could take a function from operator pair to precedence, FnMut(Pair<Rule>) -> Option<Precedence>, where struct Precedence(i32, Assoc).

    This would allow having a rule in the grammar like operator = {"+" | "-" | "*" | "/" }, and a corresponding function:

    fn prec(op: Pair<Rule>) -> Option<Precedence> {
        match op.as_str() {
            "+" | "-" => Some(Precedence(1, Assoc::Left)),
            "*" | "/" => Some(Precedence(2, Assoc::Left)),
            _ => None,
        }
    }
    
    opened by GrantMoyer 0
  • Does pest support multiple files?

    question 
    opened by zsytssk 7
  • unable to bypass negatives

    I have the following grammar

    sequence_body = {sequence_begin ~ NEWLINE+ ~ (!sequence_end ~ sequence_line ~ NEWLINE+)* ~ sequence_end}
    

    I expect it to match the following:

    sequence_begin
    sequence_line
    sequence_line
    sequence_line
    sequence_line
    ...
    sequence_end
    

    Because sequence_line starts with ASCII_ALPHA+ and sequence_end starts with "end" (which also matches ASCII_ALPHA+), I added !sequence_end.

    And I got the error:

    unexpected sequence_end

    it refers to the actual sequence_end

    Error { variant: ParsingError { positives: [], negatives: [sequence_group_end] }, location: Pos(317), line_col: Pos((15, 9)), path: None, line: "        end␊", continued_line: None }'
    

    (!sequence_end ~ line) * ~ sequence_end

    If the parser hits the first !sequence_end, I would expect it to finish the ()* part and then match the second, positive sequence_end.

    But instead, it just generates an error.

    plantumlparser.zip

    the reproducible project is attached too.

    opened by shi-yan 1
  • RFC: Random input generator valid for a grammar defined in .pest file

    TL;DR

    A domain-aware input generator would help fuzzing tests for programs that use .pest files. Same idea as Testing Random, Valid SQL in CockroachDB.

    Motivation

    I'm developing an RDBMS whose query language is similar to a subset of SQL. The grammar of the query language is defined in a .pest file.

    My RDBMS has fuzzing tests to check crash-safety with random text inputs, although most inputs are syntactically invalid. In other words, the test cases rarely pass the lexer, so the parser and backend components are not fully tested.

    A fuzzer generating syntactically (and hopefully semantically) valid SQL would be highly useful for finding edge cases a human being may not come up with.

    Theoretically, such a fuzzer (a random, grammatically valid text generator) could be implemented given a .pest file. My implementation idea is almost the same as this article: Testing Random, Valid SQL in CockroachDB.
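
    A toy sketch of the idea: walk a grammar's expression tree and emit a random string that matches it. The Expr type and the tiny PRNG below are purely illustrative, not pest_meta's real representation.

    enum Expr {
        Literal(&'static str),
        Seq(Vec<Expr>),
        Choice(Vec<Expr>),
        Repeat(Box<Expr>, usize), // up to n repetitions
    }

    // Tiny xorshift PRNG (seed with any nonzero value) so the sketch has no
    // external dependencies.
    struct Prng(u64);
    impl Prng {
        fn pick(&mut self, bound: usize) -> usize {
            self.0 ^= self.0 << 13;
            self.0 ^= self.0 >> 7;
            self.0 ^= self.0 << 17;
            (self.0 % bound as u64) as usize
        }
    }

    fn generate(expr: &Expr, rng: &mut Prng, out: &mut String) {
        match expr {
            Expr::Literal(s) => out.push_str(s),
            Expr::Seq(items) => {
                for item in items {
                    generate(item, rng, out);
                }
            }
            Expr::Choice(options) => {
                let choice = rng.pick(options.len());
                generate(&options[choice], rng, out);
            }
            Expr::Repeat(inner, max) => {
                for _ in 0..rng.pick(max + 1) {
                    generate(inner, rng, out);
                }
            }
        }
    }

    // For example, ident = { (alpha | digit)+ } could be approximated as
    // Expr::Repeat(Box::new(Expr::Choice(vec![Expr::Literal("a"), Expr::Literal("1")])), 8).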

    RFC (Request For Comments)

    • Agree with my motivation?
    • If such random input generator would be useful, what crate should have the feature?
      • New crate inside this repository?
      • New crate outside of this repository (depending on pest crate)?

    Thanks,

    opened by laysakura 1
  • how to require at least one space between 2 tokens?

    I want to match

    "aaa bbb"

    there must be at least one space between the two tokens "aaa" and "bbb"

    the following works,

    WHITESPACE = _{ " " | "\t" }
    accept = ${"aaa" ~ WHITESPACE+  ~ "bbb" ~ NEWLINE}
    

    but it is very inconvenient, because $ disables implicit whitespace, so I have to manually add WHITESPACE+ between all tokens. (This is fine, but I would also need to handle comments /**/ manually, since according to the doc, $ disables both.)

    Then I tried this,

    WHITESPACE = _{ " " | "\t" }
    accept = {"aaa"  ~ "bbb" ~ NEWLINE}
    

    This works too; however, it also matches "aaabbb". I want to make sure there is at least one space.

    In my opinion, the correct way should be

    WHITESPACE = _{ " " | "\t" }
    accept = {"aaa" ~ WHITESPACE  ~ "bbb" ~ NEWLINE}
    

    I thought this would enforce at least one space between "aaa" and "bbb".

    however I got parsing error:

    let pairs = IdentParser::parse(Rule::accept, "aaa bbb
    ").unwrap_or_else(|e| panic!("{}", e));
    
    thread 'main' panicked at ' --> 1:1
      |
    1 | aaa bbb␊
      | ^---
      |
      = expected accept', src/main.rs:29:23
    

    This I don't understand.

    opened by shi-yan 4