sgmlish

sgmlish is a library for parsing, manipulating and deserializing SGML.

It's not intended to be a full-featured implementation of the SGML spec; in particular, DTDs are not supported. That means case normalization and entities must be configured before parsing, and any desired validation or normalization, like inserting omitted tags, must be either performed through a built-in transform or implemented manually.

Still, its support is complete enough to successfully parse SGML documents for common applications, like OFX 1.x, and with little extra work it's ready to delegate to Serde.

Non-goals

  • Parsing HTML. Even though the HTML 4 spec was defined as an SGML DTD, browsers of that era were never close to conformant to all the intricacies of SGML, and websites were built with nearly zero regard for that anyway.

    Attempting to use an SGML parser to understand real-world HTML is a losing battle; the HTML5 spec was thus built with that in mind, describing how to handle all the ways web pages can be malformed in the best possible manner, based on how old browsers understood it.

    If you need to parse HTML, even old HTML, please use something like html5ever.

  • Parsing XML. This space is well-served by existing libraries like xml-rs. serde-xml-rs offers a very similar deserialization experience to this library.

  • The following SGML features are hard to properly implement without full doctype awareness during parsing, and are therefore currently considered beyond the scope of this library:

    • NET (Null End Tag) forms, such as <foo/bar/
    • Custom definitions of character sets, like SEPCHAR or LCNMSTRT

Usage

This is a quick guide on deriving deserialization of data structures with Serde.

First, add sgmlish and serde to your dependencies:

# Cargo.toml
[dependencies]
serde = { version = "1.0", features = ["derive"] }
sgmlish = "0.2"

Defining your data structures is similar to using any other Serde library:

use serde::Deserialize;

#[derive(Deserialize)]
struct Example {
  name: String,
  version: Option<String>,
}

Usage is typically performed in three steps:

let input = r##"
    <CRATE>
        <NAME>sgmlish
        <VERSION>0.2
    </CRATE>
"##;
// Step 1: configure parser, then parse string
let sgml = sgmlish::Parser::build()
    .lowercase_names()
    .parse(input)?;
// Step 2: normalization/validation
let sgml = sgmlish::transforms::normalize_end_tags(sgml)?;
// Step 3: deserialize into the desired type
let example = sgmlish::from_fragment::<Example>(sgml)?;

  1. Parsing: configure a sgmlish::Parser as desired — for example, by normalizing tag names or defining how entities (&example;) should be resolved. Once it's configured, feed it the SGML string.

  2. Normalization/validation: as the parser is not aware of DTDs, it does not know how to insert implied end tags, if those are accepted in your use case, or how to handle other more esoteric SGML features, like empty tags. This must be fixed before proceeding with deserialization.

    A normalization transform is offered with this library: normalize_end_tags. It assumes end tags are only omitted when the element cannot contain child elements. This algorithm is good enough for many SGML applications, like OFX.

  3. Deserialization: once the event stream is normalized, pass on to Serde and let it do its magic.
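To make the assumption behind normalize_end_tags more concrete, here is a minimal standalone sketch of the idea: an element whose end tag is missing is treated as unable to contain child elements, so it is closed right before the next tag. The Event type and the insert_end_tags function are hypothetical names made up for this illustration; this is not the library's implementation or its event type, and it treats elements still open at the end of input as having omitted end tags, which is also an assumption.

```rust
use std::collections::HashSet;

// Hypothetical event type for this sketch only; sgmlish's own events differ.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Open(String),
    Text(String),
    Close(String),
}

/// Inserts omitted end tags, assuming that an element with an omitted end tag
/// has no child elements and therefore ends right before the next tag.
fn insert_end_tags(events: &[Event]) -> Result<Vec<Event>, String> {
    // Pass 1: find the Open events that never get a matching Close.
    let mut stack: Vec<(String, usize)> = Vec::new(); // (name, index of Open)
    let mut unclosed: HashSet<usize> = HashSet::new();
    for (i, event) in events.iter().enumerate() {
        match event {
            Event::Open(name) => stack.push((name.clone(), i)),
            Event::Close(name) => {
                // Find the nearest open element with this name.
                let pos = stack
                    .iter()
                    .rposition(|(open, _)| open == name)
                    .ok_or_else(|| format!("unexpected end tag </{name}>"))?;
                // Everything opened after it never saw its own end tag.
                unclosed.extend(stack.drain(pos + 1..).map(|(_, idx)| idx));
                stack.pop();
            }
            Event::Text(_) => {}
        }
    }
    // Assumption: elements still open at end of input also omitted their end tag.
    unclosed.extend(stack.drain(..).map(|(_, idx)| idx));

    // Pass 2: copy events, closing each unclosed element before the next tag.
    let mut output = Vec::with_capacity(events.len() + unclosed.len());
    let mut pending: Option<String> = None;
    for (i, event) in events.iter().enumerate() {
        if !matches!(event, Event::Text(_)) {
            if let Some(name) = pending.take() {
                output.push(Event::Close(name));
            }
        }
        output.push(event.clone());
        if let Event::Open(name) = event {
            if unclosed.contains(&i) {
                pending = Some(name.clone());
            }
        }
    }
    if let Some(name) = pending.take() {
        output.push(Event::Close(name));
    }
    Ok(output)
}
```

Under these assumptions, text always stays with the element that was opened just before it, which is the shape OFX-style documents rely on.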

Interpretation when deserializing

  • Primitives and strings: values can be either an attribute directly on the container element, or a child element with text content.

    The following are equivalent to the deserializer:

    <example foo="bar"></example>
    <example><foo>bar</foo></example>

  • Booleans: the strings true, false, 1 and 0 are accepted, both as attribute values and as text content.

    In the case of attributes, HTML-style flags are also accepted: an empty value (explicit or implicit) and a value equal to the attribute name (case insensitive) are treated as true.

    The following all set checked to true:

    <example checked></example>
    <example checked=""></example>
    <example checked="1"></example>
    <example checked="checked"></example>
    <example checked="true"></example>
    <example><checked>true</checked></example>

  • Structs: the tag name comes from the parent struct's field, not from the value type!

    #[derive(Deserialize)]
    struct Root {
      // Expects a <config> element, not <MyConfiguration>
      config: MyConfiguration,
    }

    If you want to capture the text content of an element, you can make use of the special name $value:

    #[derive(Deserialize)]
    struct Example {
      foo: String,
      #[serde(rename = "$value")]
      content: String,
    }

    When $value is used, all other fields must come from attributes in the container element.

  • Sequences: sequences are read from a contiguous series of elements with the same name. Similarly to structs, the tag name comes from the parent struct's field.

    #[derive(Deserialize)]
    struct Example {
      // Expects a series of <host> elements, not <Hostname>
      #[serde(rename = "host")]
      hosts: Vec<Hostname>,
    }

  • Enums: for externally tagged enums, fieldless enums (that is, enums where none of the variants have any data) can be read either as strings (from element text or an attribute value) or from tag names:

    #[derive(Deserialize)]
    struct Transaction {
      operation: Operation,
    }
    
    #[derive(Deserialize)]
    #[serde(rename_all = "lowercase")]
    enum Operation {
      Credit,
      Debit,
    }

    <transaction operation="credit"></transaction>
    <transaction><operation>credit</operation></transaction>
    <transaction><operation><credit></credit></operation></transaction>

    Enums with fields must always use the element form:

    #[derive(Deserialize)]
    struct Example {
      background: Background,
    }
    
    #[derive(Deserialize)]
    #[serde(rename_all = "lowercase")]
    enum Background {
      Color(String),
      Gradient { from: String, to: String },
    }

    <example>
      <background>
        <color>red</color>
      </background>
    </example>

    <example>
      <background>
        <gradient from="blue" to="navy"></gradient>
      </background>
    </example>

    <example>
      <background>
        <gradient>
          <from>black</from>
          <to>gold</to>
        </gradient>
      </background>
    </example>
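
The boolean rules from the list above can be summarized in a small standalone sketch. The function names bool_from_text and bool_from_attribute are hypothetical and are not part of sgmlish. The sketch also assumes exact, case-sensitive matching of true, false, 1 and 0 with no whitespace trimming, which the text above does not specify.

```rust
/// Interprets a boolean given as element text content.
/// Assumption: matching is exact and case-sensitive, without trimming.
fn bool_from_text(text: &str) -> Option<bool> {
    match text {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}

/// Interprets a boolean given as an attribute.
/// `value` is `None` for an implicit value, as in `<example checked>`.
fn bool_from_attribute(name: &str, value: Option<&str>) -> Option<bool> {
    match value {
        // HTML-style flags: implicit or explicit empty value means true.
        None | Some("") => Some(true),
        // A value equal to the attribute name (case insensitive) means true.
        Some(v) if v.eq_ignore_ascii_case(name) => Some(true),
        // Otherwise fall back to the same strings accepted as text content.
        Some(v) => bool_from_text(v),
    }
}
```

Returning None for anything else leaves it to the caller to decide how to report an invalid value.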

Crate features

  • serde — includes support for Serde deserialization.

    Since this is the main use case for this library, this feature is enabled by default. To disable it, set default-features = false in your Cargo.toml file.

Releases

  • v0.2.0 (Oct 25, 2021)

    • Radically changed API — case normalization, marked section expansion and entity expansion are now performed at parse time
    • No more Data enum — all data is stored in events in their expanded (formerly Data::CData) form
    • Most SgmlEvent variants now have named fields
    • Terminology change:
      • SgmlEvent::Data is now SgmlEvent::Character
      • deserialize feature is now serde
Owner
Daniel Luz