Org mode structural parser/emitter with an emphasis on modularity and avoiding edits unrelated to changes.

Introduction

The goal of this library is to parse very efficiently, and edit acceptably efficiently, to support workflows that parse a large file, make a few changes, then save, without spurious deltas in the file.

Features

  • Fast minimal structural parser that splits file into a tree of headlines.

  • Every UTF-8 string is a valid input, and emit(parse(text)) == text for all UTF-8 strings.

  • Unmodified headlines will be emitted exactly as they were input, even if other headlines were changed. (Note that there are edge cases related to which section a newline is part of).

  • With the headline-parser feature flag, adds a parser/generator for headlines (tags, keyword, priority, etc.) that operates on top of the structural tree.

  • With the orgize-integration feature flag, uses Orgize to parse/generate the properties drawer and planning (deadline, scheduled, etc.).

  • Headlines are represented in memory as text, making both parsing and emitting very fast, and permitting a two-way mapping between text offset and each headline that remains valid even as the in-memory document is modified.

  • Arena allocator provides fast performance and precise control over memory.

  • Copy-on-write text storage using Ropey.

  • Reparse-based edit model ensures that the few tree invariants we have are never broken, that the in-memory format cannot represent invalid state, and that edits which could change the tree structure unintentionally are not allowed. For example, changing the text of a node such that it parses into multiple nodes is rejected, but any other change is allowed. Should such a change be desirable, functions are provided that manipulate the tree structure directly.

    This helps limit the blast radius of bugs to the headline(s) affected, even if the bug itself results in adding new text that could be parsed as a headline.
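As an illustration of the offset mapping mentioned in the feature list, here is a minimal, self-contained sketch. It does not use this crate: `start_offsets` and `section_at` are hypothetical helpers, and a plain slice of text chunks stands in for the document. The idea is only that, because each section is stored as text, a section's start offset is the sum of the lengths of the sections before it, and the reverse lookup is a binary search.

```rust
/// Hypothetical helper: byte offset at which each chunk starts when the
/// chunks are concatenated in order.
fn start_offsets(chunks: &[&str]) -> Vec<usize> {
    let mut starts = Vec::with_capacity(chunks.len());
    let mut offset = 0;
    for chunk in chunks {
        starts.push(offset);
        offset += chunk.len();
    }
    starts
}

/// Hypothetical helper: index of the chunk containing byte `offset`,
/// or `None` if the offset is at or past the end of the text.
fn section_at(starts: &[usize], total_len: usize, offset: usize) -> Option<usize> {
    if offset >= total_len {
        return None;
    }
    // Number of chunks starting at or before `offset`; the last of those
    // contains it. Empty chunks sharing a start offset are skipped this way.
    Some(starts.partition_point(|&s| s <= offset) - 1)
}

fn main() {
    let chunks = ["intro\n", "* A\nbody\n", "** B\n"];
    let starts = start_offsets(&chunks);
    let total: usize = chunks.iter().map(|c| c.len()).sum();

    assert_eq!(starts, vec![0, 6, 15]);
    assert_eq!(section_at(&starts, total, 7), Some(1)); // inside "* A"
    assert_eq!(section_at(&starts, total, 15), Some(2)); // first byte of "** B"
    assert_eq!(section_at(&starts, total, total), None);
}
```

A real implementation would need to keep these offsets current as sections are edited; this sketch simply recomputes them.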

Limitations

  • Parses only a small subset of Org mode.

    I have no plans to extend this, except possibly adding native parsing of properties, planning, and timestamps rather than relying on Orgize. I recommend using orgize to parse section contents.

  • Since sections are stored as text, every change to a headline requires reparsing the entire headline. A builder is provided to batch such changes if desired.

  • The fuzz test produces many uninteresting cases where Orgize and Starsector parse the same input differently. There is some logic to filter out known differences, but the fuzz test remains fairly noisy, and its output has to be reviewed manually to determine whether a difference is actually new. It's still very useful, but it's not expected to produce no violations.

Getting Started

See examples/edit.rs for a comprehensive example on parsing and editing.

Arena

Text is stored using a rope, which allows sharing with other Arenas as well as other code. This also allows multiple versions of a document to be stored efficiently. Section::clone_subtree is helpful here.

The API is currently built around IndexTree to model the tree structure. This means that nodes refer to other nodes by identifier, rather than by content, and that you can change the text of a document by changing only a node within it. Multiple documents may be stored in a given Arena, and it can be thought of as a sort of "builder" for trees.

As such, the only mutable state is stored in the Arena. Specific nodes are referred to with the type Section (if you're familiar with IndexTree, this is a wrapper around NodeId), which consists of an identifier into the tree. Most functions are called on Section, and take its Arena by reference (or mutable reference).

Although we may reuse IndexTree nodes internally, any Section provided to client code is guaranteed to remain valid as long as the Arena lives. This means that, e.g., you can remove a Section from a document, but it will remain valid, so you could later attach it to another document, or elsewhere in the same one.
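The identifier-based model can be sketched without the library. The following is a toy arena, not the crate's implementation: `ToyArena`, `SectionId`, and all method names are made up for illustration. It shows the two properties described above: nodes are referred to by id rather than by content, and a detached node stays valid and can be attached elsewhere.

```rust
/// Hypothetical stand-in for `Section`: a plain index into the arena.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SectionId(usize);

#[derive(Debug)]
struct Node {
    text: String,
    parent: Option<SectionId>,
    children: Vec<SectionId>,
}

/// Toy arena: owns every node. Nodes are never freed, so ids stay valid.
#[derive(Default)]
struct ToyArena {
    nodes: Vec<Node>,
}

impl ToyArena {
    /// Creates a detached node and returns its id.
    fn new_section(&mut self, text: &str) -> SectionId {
        self.nodes.push(Node {
            text: text.to_string(),
            parent: None,
            children: Vec::new(),
        });
        SectionId(self.nodes.len() - 1)
    }

    /// Removes `id` from its parent, if any. The node itself is kept.
    fn detach(&mut self, id: SectionId) {
        let parent = self.nodes[id.0].parent.take();
        if let Some(parent) = parent {
            self.nodes[parent.0].children.retain(|c| *c != id);
        }
    }

    /// Makes `child` the last child of `parent`, detaching it first.
    fn append(&mut self, parent: SectionId, child: SectionId) {
        self.detach(child);
        self.nodes[child.0].parent = Some(parent);
        self.nodes[parent.0].children.push(child);
    }

    fn text(&self, id: SectionId) -> &str {
        &self.nodes[id.0].text
    }

    fn children(&self, id: SectionId) -> &[SectionId] {
        &self.nodes[id.0].children
    }
}

fn main() {
    let mut arena = ToyArena::default();
    // Two documents share one arena; each root is an empty level-0 section.
    let doc_a = arena.new_section("");
    let doc_b = arena.new_section("");
    let todo = arena.new_section("* TODO Write docs\n");

    arena.append(doc_a, todo);
    arena.detach(todo);
    // The id is still usable after removal from doc_a...
    assert_eq!(arena.text(todo), "* TODO Write docs\n");
    // ...and can be attached to another document later.
    arena.append(doc_b, todo);
    assert_eq!(arena.children(doc_b).to_vec(), vec![todo]);
    assert!(arena.children(doc_a).is_empty());
}
```

Because nothing is ever freed, the toy arena only grows over time.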

This means that over time, the Arena will accumulate nodes. They are quite small so this is unlikely to be a problem, but it may be necessary with long lived Arenas that undergo many edits (inotify-based reparsing, etc) to periodically emit text, create a new arena, and reparse. If this is an inconvenience, we could look at adding a convenience function for this -- the main tricky part is that Sections all need to be re-numbered, since preserving existing ones would require copying all data, defeating the purpose.

Layered Parsing

There is no one specification for the Org format. The draft spec, org-element.el, and Org mode commands disagree on how to handle certain edge cases. Different Org mode commands may even be inconsistent among themselves. Yet in practice, the behavior is usually consistent, and in the cases where it varies, it is unlikely that the user would notice or care. See the Orgize issue tracker (open and closed) for examples.

Rather than attempt to produce a single parse tree that agrees on all edge cases, this project takes a layered approach consisting of a structural parser for the entire file, a headline parser, and a properties/planning parser that currently uses Orgize. Orgize can also be used to fully parse the contents of a headline.

Org mode itself does not operate on a parse tree. Commands are written to operate on raw text, which makes it possible for different commands to interpret the grammar differently. While frustrating to a parser, this approach does provide strong abstraction. Org mode grew out of a text editor, and in many ways its commands can be thought of as highly specialized editing commands that, being invoked by a user, can take context into account. It's not how I'd design it, but I have to admit there is an elegance to it.

Hence, we take a similar approach: Parse the structure into a tree of chunks of text, and let client code decide what to do with it.

Structural Parser

The structural parser uses the bare minimum grammar necessary to split the file into a tree of headlines. We refer to each headline as a section consisting of the line with the stars itself and all text below that line until the next headline (or end of file). Sections are organized into a tree structure, with child headlines represented as children of their parent section. There is also a special section at the root of the document that does not correspond to a headline, with level 0. Level refers to the number of stars in the headline.
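To make the tree shape concrete, here is a small sketch that derives each section's parent from the sequence of levels in document order. It is an illustration only: `parent_indices` is a hypothetical function, not part of this crate, and it assumes that a headline's parent is the nearest preceding section with strictly fewer stars (so a level-3 headline directly under a level-1 headline becomes its child).

```rust
/// For each section in document order, returns the index of its parent,
/// or `None` for the root. Assumes `levels[0] == 0` (the root section).
fn parent_indices(levels: &[usize]) -> Vec<Option<usize>> {
    let mut parents = Vec::with_capacity(levels.len());
    // Indices of the currently open ancestors, shallowest first.
    let mut stack: Vec<usize> = Vec::new();
    for (i, &level) in levels.iter().enumerate() {
        // Close every open section that is not strictly shallower.
        while let Some(&top) = stack.last() {
            if levels[top] < level {
                break;
            }
            stack.pop();
        }
        parents.push(stack.last().copied());
        stack.push(i);
    }
    parents
}

fn main() {
    // root, *, **, **, *, ***
    let levels = [0, 1, 2, 2, 1, 3];
    let parents = parent_indices(&levels);
    assert_eq!(
        parents,
        vec![None, Some(0), Some(1), Some(1), Some(0), Some(4)]
    );
}
```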

The semantics were chosen to match Org mode as closely as possible. In particular, a newline refers only to '\n', and literal ASCII space ' ' must follow the stars. Despite this, Unicode should be fully supported, albeit with a specific interpretation of significant whitespace.

The subtree rooted at any section can be emitted as an Org file by calling to_rope. Since sections are stored as text, this just traverses the tree in order and concatenates each section with a newline. This will produce identical text to the input except for three newline edge cases.

The document itself can also be emitted as an Org file, but it handles those three edge cases by storing additional state from the original parse, such that emit(parse(text)) == text for all UTF-8 input. If the document is modified, it will have the same three edge cases.

This design allows us to model all edge cases in the document, meaning that headlines can be freely added, moved, deleted, and edited while maintaining the "just concatenate the chunks of text" invariant.
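The "concatenate the chunks" idea can be shown with a toy splitter. This is a simplified sketch, not the crate's parser: `headline_level` and `split_sections` are hypothetical names, and unlike the library, the toy keeps every newline inside its chunk, so it sidesteps the newline edge cases rather than modeling them.

```rust
/// Returns the headline level if `line` starts with one or more '*'
/// followed by a literal ASCII space, otherwise `None`.
fn headline_level(line: &str) -> Option<usize> {
    let stars = line.bytes().take_while(|b| *b == b'*').count();
    if stars > 0 && line.as_bytes().get(stars) == Some(&b' ') {
        Some(stars)
    } else {
        None
    }
}

/// Splits `text` into (level, chunk) pairs in document order. The first
/// pair is always the level-0 root section, which may be empty.
fn split_sections(text: &str) -> Vec<(usize, &str)> {
    let mut sections = Vec::new();
    let mut level = 0;
    let mut start = 0;
    let mut offset = 0;
    // Only '\n' counts as a line terminator.
    for line in text.split_inclusive('\n') {
        if let Some(new_level) = headline_level(line) {
            sections.push((level, &text[start..offset]));
            level = new_level;
            start = offset;
        }
        offset += line.len();
    }
    sections.push((level, &text[start..]));
    sections
}

fn main() {
    let text = "intro\n* A\nbody\n** B\n*not a headline\n* C";
    let sections = split_sections(text);
    assert_eq!(
        sections,
        vec![
            (0, "intro\n"),
            (1, "* A\nbody\n"),
            (2, "** B\n*not a headline\n"),
            (1, "* C"),
        ]
    );

    // Emitting is just concatenation, so the round trip is the identity.
    let emitted: String = sections.iter().map(|(_, chunk)| *chunk).collect();
    assert_eq!(emitted, text);
}
```

Since a chunk is only ever a slice of the input, editing one chunk cannot change the bytes emitted for any other.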

Sections are stored as plain text. The text may be modified directly with set_level and set_raw, provided the modification does not break the tree invariants. For example, removing a star from a headline is only allowed if the section still has strictly more stars than the section above it in the tree. Likewise, changing the text so that it parses as multiple sections is not permitted when editing the raw text; the structural editing commands (append, prepend, etc.) that operate on the tree structure must be used instead.

This restriction is a feature: client code that operates on a section cannot cause changes in any other section, nor can it corrupt the tree structure if a bug accidentally introduces a line starting with a star into the body. Isolating changes in this way makes it easier to write programs that safely read and write large or complex Org mode files frequently.
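A reparse-style check of this kind might look roughly like the sketch below. It is an assumption-laden toy, not the crate's code: `check_raw_edit`, `EditError`, and `headline_level` are hypothetical, and the check ignores constraints imposed by the section's own children.

```rust
#[derive(Debug, PartialEq)]
enum EditError {
    /// The first line is not "stars, then an ASCII space".
    NotAHeadline,
    /// The new level is not strictly deeper than the parent's.
    LevelTooShallow { level: usize, parent_level: usize },
    /// A later line would start a new section (1-based line number).
    MultipleSections { line: usize },
}

/// Returns the headline level if `line` starts with one or more '*'
/// followed by a literal ASCII space, otherwise `None`.
fn headline_level(line: &str) -> Option<usize> {
    let stars = line.bytes().take_while(|b| *b == b'*').count();
    if stars > 0 && line.as_bytes().get(stars) == Some(&b' ') {
        Some(stars)
    } else {
        None
    }
}

/// Checks whether `new_text` could replace the raw text of a non-root
/// section whose parent has `parent_level` stars. Returns the new level.
fn check_raw_edit(new_text: &str, parent_level: usize) -> Result<usize, EditError> {
    let mut lines = new_text.split('\n');
    let first = lines.next().unwrap_or("");
    let level = headline_level(first).ok_or(EditError::NotAHeadline)?;
    if level <= parent_level {
        return Err(EditError::LevelTooShallow { level, parent_level });
    }
    // Any further headline would make the text parse as several sections.
    for (i, line) in lines.enumerate() {
        if headline_level(line).is_some() {
            return Err(EditError::MultipleSections { line: i + 2 });
        }
    }
    Ok(level)
}

fn main() {
    assert_eq!(check_raw_edit("** TODO Fix bug\nbody\n", 1), Ok(2));
    assert_eq!(
        check_raw_edit("* TODO Fix bug\n", 1),
        Err(EditError::LevelTooShallow { level: 1, parent_level: 1 })
    );
    assert_eq!(
        check_raw_edit("** A\nbody\n* oops\n", 1),
        Err(EditError::MultipleSections { line: 3 })
    );
    assert_eq!(check_raw_edit("no stars", 0), Err(EditError::NotAHeadline));
}
```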

Headline Parser

In most cases, operating on the raw text will be inconvenient. Often we wish to operate only on the text under the headline, or only on the headline itself to change priority, keyword, tags, etc.

When the headline-parser feature flag is enabled (default), headline editing commands become available. These commands are built on top of the structural parser, and parse a single headline at a time. Each time an accessor is called, we parse the headline and return it. To modify a headline property, we parse the headline, change the property, emit the new headline as text, and then replace the text of the section with the headline.

We chose this approach so that the headline parser does not need to satisfy the identity invariant we provide for the overall file. Additionally, headline parsing brings many edge cases that vary between implementations. This design means that headlines which client code changes will be interpreted and reformatted in a standardized way, but only modified headlines will be affected.
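The parse, change, emit cycle can be sketched with a toy headline type. Everything here is hypothetical and much simpler than the real headline parser: `ToyHeadline`, `toy_parse_headline`, and `to_line` are illustrative names, only TODO and DONE are recognized as keywords, and only the headline line itself (not the body) is handled.

```rust
#[derive(Clone, Debug, PartialEq)]
struct ToyHeadline {
    level: usize,
    keyword: Option<String>,
    priority: Option<char>,
    title: String,
    tags: Vec<String>,
}

/// Parses a single headline line: stars, optional keyword, optional
/// "[#A]" priority cookie, title, optional trailing ":tag1:tag2:".
fn toy_parse_headline(line: &str) -> Option<ToyHeadline> {
    let level = line.bytes().take_while(|b| *b == b'*').count();
    if level == 0 || line.as_bytes().get(level) != Some(&b' ') {
        return None;
    }
    let mut rest = line[level..].trim();

    let mut keyword = None;
    for kw in ["TODO", "DONE"] {
        if let Some(after) = rest.strip_prefix(kw) {
            if after.is_empty() || after.starts_with(' ') {
                keyword = Some(kw.to_string());
                rest = after.trim_start();
                break;
            }
        }
    }

    let mut priority = None;
    let b = rest.as_bytes();
    if b.len() >= 4 && b[0] == b'[' && b[1] == b'#' && b[2].is_ascii_uppercase() && b[3] == b']' {
        priority = Some(b[2] as char);
        rest = rest[4..].trim_start();
    }

    let mut tags = Vec::new();
    if let Some(last) = rest.split_whitespace().last() {
        if last.len() > 2 && last.starts_with(':') && last.ends_with(':') {
            tags = last[1..last.len() - 1].split(':').map(String::from).collect();
            rest = rest[..rest.len() - last.len()].trim_end();
        }
    }

    Some(ToyHeadline { level, keyword, priority, title: rest.to_string(), tags })
}

impl ToyHeadline {
    /// Emits the headline in one fixed format, single-space separated.
    fn to_line(&self) -> String {
        let mut out = "*".repeat(self.level);
        if let Some(kw) = &self.keyword {
            out.push(' ');
            out.push_str(kw);
        }
        if let Some(p) = self.priority {
            out.push_str(&format!(" [#{}]", p));
        }
        if !self.title.is_empty() {
            out.push(' ');
            out.push_str(&self.title);
        }
        if !self.tags.is_empty() {
            out.push_str(&format!(" :{}:", self.tags.join(":")));
        }
        if out.len() == self.level {
            out.push(' '); // keep the mandatory space after the stars
        }
        out
    }
}

fn main() {
    let original = "** TODO [#B] Write docs    :work:rust:";
    let mut headline = toy_parse_headline(original).expect("valid headline");
    headline.keyword = Some("DONE".to_string());
    // The tag alignment of the original is not preserved on emit.
    assert_eq!(headline.to_line(), "** DONE [#B] Write docs :work:rust:");
}
```

Note how the emitted line normalizes the spacing before the tags. That is the kind of reformatting that applies only to headlines the client code actually changes.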

As a convenience, individual properties may be changed by calling set_keyword, set_priority, etc on the section. It is also possible to get a Headline by calling parse_headline. This is a value type which provides access to the headline properties (including the body text). Calling to_builder provides a HeadlineBuilder which may be used to change multiple properties at once, before building a new headline by calling headline on it. You can then call set_headline on the Section.

As with changing the section's raw text, edits which break tree invariants will fail.

Planning/Properties Parser

With the orgize-integration feature flag (enabled by default), functions that get and set planning (scheduled, deadline, closed) and properties become available on both Section and HeadlineBuilder. These work similarly to the headline parser, except that they rely on Orgize to parse the headline.

I'd like to implement my own parser for these as I've done for headlines, since this has the potential to reformat the entire headline, and there are some inputs Orgize can't handle (e.g., timestamps missing the day of week).

Future Plans

No promises, but I'd like to clean this up a bit and publish it to crates.io.

Other than that, I would like to implement my own parsing for properties drawers, planning, and timestamps to integrate them better. I have no plans to replicate any other Orgize functionality.

A copy-on-write API for editing trees would be interesting, but I've been unhappy with previous prototypes along those lines. Ropey seems to make it work well, however, so it can be done.

Test coverage is quite solid for the structural parser, and adequate for the headline parser, but the APIs built on top of them could use more coverage (possibly doubling as documentation/examples).

Owner
Alex Roper