Provably optimal zero-copy parsers using nondeterministic finite automata.

Will Sturgeon

Last update: Oct 4, 2023

Related tags

Miscellaneous inator

Overview

`inator`: an evil parsing library

You supply the evil plan; we supply the -inator!

or, Provably Optimal Zero-Copy Parsers with Nondeterministic Finite Automata

🚧 Development still ongoing! 🚧

Why?

Computability theory has known for ages that a certain class of languages have a unique and provably optimal representation as a graph. This is the insight behind regular-expression searchers like grep, but anything more than accepting or rejecting a string is difficult to reason about and impossible to optimize.

How?

This library uses the same high-level machinery as tools like grep but augmented with some extra goods:

Given a specification like "parse A after B" or "put this thing in parentheses" (and infinite combinations of similar things), we write a graph with a set of initial nodes, a set of accepting nodes, and some directed edges marked with input characters. From this, the computability theory stuff kicks in: we can minimize this to a graph with the provably minimal number of states that will tell us "is this valid syntax or not?"

Then, although we can't provably optimize your code, we let you write functions that effectively hitch a ride on the optimal decision graph we just made. Specifically, each edge—that is, each time you said "parse this token"—you can call a function on the data so far, using that token, that returns the next state.

Last of all, you can convert this optimal graph to a Rust source file: not only can we prove the implementation is optimal, we can leverage Rust's binary optimizer to make it run literally as fast as possible.

What does this whole process look like?

Surprisingly, it looks a lot like just writing down what you want. Here's the definition for "put this thing in parentheses":

pub fn parenthesized(p: Parser<char>) -> Parser<char> {
    ignore('(') + p + ignore(')')
}

So, if p is a parser that accepts ABC, then parenthesized(p) will accept (ABC). Simple as that.

The key idea here is that parsers are data, and you can pass them around, modify them, and combine them just like anything else.

See examples/tuple for a well-annotated crate that reads characters from a tuple representation ((), (A,), or (A, B, C, ...)), ignoring whitespace. In that example, the entire parser could be defined in a single line, but I split it up to illustrate, first, that you can do that—you don't have to have a once-and-for-all megaparser—and, two, to explain as much about using the library as I can in detail.

Other crates in examples extend the same technique to more complex parsers (e.g. phone numbers and email addresses).

Anything else cool you can do?

Yes! Since we're really just riding on top of a decision-problem automaton, you can take a specification and invert it to produce an infinite stream of random strings that are all guaranteed to be parsed correctly. If you're writing a language, this means automatically generating all possible valid source files.

And, you guessed it, this gets compiled down to Rust source as well, so your property-tests can be ridiculously effective.

Why not other parsing libraries?

Please try other parsing libraries! This is my favorite, mostly becuase it's pure Rust with zero macros, no new syntax, zero input copying, parsers as data, automatic input generation, and—well–I wrote it, but I'm not too familiar with other libraries, so I can't in good faith recommend this one too highly.

My primary goal was a personal tool, but it turned out much better than I expected, so I'd love to see you use it and get your feedback!

Acknowledgments

First off, to Haskell's parsing libraries (and UPenn's Haskell course) for showing me that parsers can even work this way. Most of the syntax is inspired by Haskell.

Second, to Rajeev Alur and Penn's CIS 262 for formally introducing me to nondeterministic finite automata.

Third, to Rust.

Private swaps for Secret Network using a private entropy pool & differential privacy.

WIP SecretSwap: Anon Edition Private swaps for Secret Network! Uses private entropy pool for differential privacy when reporting pools sizes. Swap amo

5 Apr 5, 2022

A template for kick starting a Cloudflare worker project using workers-rs.

Getting Started A template for kick starting a Cloudflare worker project using workers-rs. This template is designed for compiling Rust to WebAssembly

1 Oct 13, 2021

twilight-interactions is a set of macros and utilities to work with Discord Interactions using twilight.

Twilight interactions twilight-interactions is a set of macros and utilities to work with Discord Interactions using twilight. Note: This crate is not

24 Dec 26, 2022

Modrinth API is a simple library for using, you guessed it, the Modrinth API in Rust projects

Modrinth API is a simple library for using, you guessed it, the Modrinth API in Rust projects. It uses reqwest as its HTTP(S) client and deserialises responses to typed structs using serde.

21 Jan 1, 2023

A translation of Brendan Galea's Vulkan tutorial into Rust using the ash crate

Rust Light Vulkan Engine This is a translation of Brendan Galea's Vulkan tutorial into rust using the Ash crate. Original tutorial: Brendan Galea's Yo

6 Dec 25, 2022

Some tools for streaming frames to rpi-rgb-led-matrix using ZeroMQ, written in Rust.

led_matrix_zmq Some tools for streaming frames to rpi-rgb-led-matrix using ZeroMQ, written in Rust. This repository includes: Rust client and server l

2 Sep 6, 2022

Cross-platform GUI written in Rust using ADB to debloat non-rooted android devices

Cross-platform GUI written in Rust using ADB to debloat non-rooted android devices. Improve your privacy, the security and battery life of your device.

6.8k Jan 2, 2023

This contract implements simple vote for the best coffe in indonesia using near protocol.

vote-coffe-near Description This contract implements simple vote for the best coffe in indonesia using near protocol. Contract in contract/src/lib.rs

1 Nov 15, 2021

Telegram bot for searching in Arch User Repository ( AUR ); Implemented using rust.

AurSearchBot A Telegram Inline Search Bot Written in Rust Introduction Telegram Bot that can search AUR ( Arch User Repository ) in inline mode. This

3 Feb 15, 2022

Comments

Remove UB from Lazy::PostponedReference

I know that you already reverted the commits, but if you want to try again this would probably be a better starting point. This is pretty much the same as your earlier approach, but without using raw pointers.

opened by Seppel3210 1

Provably optimal zero-copy parsers using nondeterministic finite automata.

Related tags

Overview

`inator`: an evil parsing library

You supply the evil plan; we supply the -inator!

or, Provably Optimal Zero-Copy Parsers with Nondeterministic Finite Automata

Why?

How?

What does this whole process look like?

Anything else cool you can do?

Why not other parsing libraries?

Acknowledgments

You might also like...

Private swaps for Secret Network using a private entropy pool & differential privacy.

A template for kick starting a Cloudflare worker project using workers-rs.

twilight-interactions is a set of macros and utilities to work with Discord Interactions using twilight.

Modrinth API is a simple library for using, you guessed it, the Modrinth API in Rust projects

A translation of Brendan Galea's Vulkan tutorial into Rust using the ash crate

Some tools for streaming frames to rpi-rgb-led-matrix using ZeroMQ, written in Rust.

Cross-platform GUI written in Rust using ADB to debloat non-rooted android devices

This contract implements simple vote for the best coffe in indonesia using near protocol.

Telegram bot for searching in Arch User Repository ( AUR ); Implemented using rust.

Comments

Remove UB from Lazy::PostponedReference

Owner

Will Sturgeon

Frame is a markdown language for creating state machines (automata) in 7 programming languages as well as generating UML documentation.

A massively parallel, optimal functional runtime in Rust

A zero-copy parser for the contents of the __unwind_info section of a mach-O binary.

Parsers based on lady-deirdre project

The most primitive and the fastest implementation of a fixed-size last-in-first-out stack on stack in Rust, for Copy-implementing types

A following of the book: Zero to Production In Rust

dm-jitaux is a Rust-based JIT compiler using modified auxtools, dmasm and Inkwell LLVM wrapper for boosting Byond DM performance without any hassle!

A Simple, But amazing telegram bot, Made using the Rust language!

Rust bindings to Cloudflare Worker KV Stores using wasm-bindgen and js-sys.

Showing how to deploy a Terra smart contract using Chainlink Data Feeds