serde-like serialization and deserialization of static Rust types in XML

Overview

static-xml

static-xml is a serde-like serialization and deserialization library for XML, currently written as a layer on top of xml-rs.

Status: in early development, docs mostly unwritten. Not yet recommended for use.

Design notes

This library is divided into two crates:

  1. static-xml has the main logic for driving serialization and deserialization, the basic traits, and a few helpers for Serialize/Deserialize impls to use to reduce code size.

    static-xml is meant to support any useful format. Its current implementation has some limitations. E.g. it doesn't handle processing instructions. In theory, though, it could be extended to exactly round-trip any series of XML events via the underlying XML library.

  2. static-xml-derive has macros for automatically deriving Serialize and Deserialize impls on struct and enum types. It's designed to work with simple Rust types that can convey the semantic meaning of typical XML types but often will not round-trip to the exact same events or bytes. You may need to bypass it for some types. For example, the structs it supports don't have any way of conveying order between their fields. To make this concrete, when fed an XHTML document, deserialization would lose the distinction between <p>foo<i>bar</i>baz</p> and <p>foobaz<i>bar</i></p>.
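
To illustrate the ordering limitation, here is a minimal sketch using only the standard library. The Event and Paragraph types and the collect function are hypothetical stand-ins, not part of static-xml or xml-rs. It shows that once mixed content is sorted into one slot per field, two inputs that differ only in where the <i> child sits relative to the text become the same value.

```rust
/// A simplified event stream for the contents of a `<p>` element.
/// (Hypothetical type for illustration; not the xml-rs or static-xml API.)
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    /// A run of character data directly inside `<p>`.
    Text(String),
    /// An `<i>` child element together with its text content.
    Italic(String),
}

/// The kind of struct a derive macro can fill in: one slot per field,
/// with no record of how the fields were interleaved in the input.
#[derive(Debug, Default, PartialEq)]
pub struct Paragraph {
    pub text: String,
    pub italic: Vec<String>,
}

/// Sorts each event into its field, discarding relative order.
pub fn collect(events: &[Event]) -> Paragraph {
    let mut p = Paragraph::default();
    for event in events {
        match event {
            Event::Text(t) => p.text.push_str(t),
            Event::Italic(t) => p.italic.push(t.clone()),
        }
    }
    p
}
```

Feeding the events for <p>foo<i>bar</i>baz</p> and for <p>foobaz<i>bar</i></p> through collect yields equal Paragraph values, which is why a type that needs exact round-tripping would have to bypass the derive macros.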

xml-rs vs quick-xml or other alternatives

This library is written on top of xml-rs's stream-of-events interface. I considered other libraries like quick-xml but chose xml-rs for a couple of reasons:

  1. It's the most widely used, and there are companion crates like xmltree available.
  2. It aims to be standards-compliant, and others don't.

xml-rs is not without problems. E.g. its author wrote that "xml-rs has been first created ages ago, long before the first stable version of Rust was available. Therefore some details of its API are not really up-to-date. In particular, xml-rs allocates a lot. Ideally, it should work like quick-xml does, i.e. reading data to its internal buffer and give out references to it."

I'm open to porting static-xml to another library if there's one that aims for reasonably good standards compliance, or to a new xml-rs version if someone takes on the task of freshening that crate.

static-xml could even support multiple underlying XML crates via feature flags. The library makes use of &dyn Trait indirection internally, so additional code bloat should be minimal. Some interface choices borrowed from xml-rs would have to change to take the most advantage of an underlying library that allocates less.
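
As a rough sketch of why &dyn Trait indirection keeps the cost of extra backends low, the following uses a hypothetical EventSource trait and a stand-in VecSource backend. None of these names come from static-xml or xml-rs. The driver function is not generic, so it is compiled once no matter how many backends implement the trait.

```rust
/// Minimal event type shared by all backends (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Event {
    Start(String),
    Text(String),
    End(String),
}

/// What the core logic needs from a backend. The trait is object-safe,
/// so callers can pass `&mut dyn EventSource`.
pub trait EventSource {
    fn next_event(&mut self) -> Option<Event>;
}

/// Stand-in backend that replays a prepared list of events. A real backend
/// would wrap a streaming parser instead.
pub struct VecSource(pub std::vec::IntoIter<Event>);

impl EventSource for VecSource {
    fn next_event(&mut self) -> Option<Event> {
        self.0.next()
    }
}

/// Non-generic driver: one copy of this code serves every backend.
/// Returns the deepest element nesting level seen in the stream.
pub fn max_depth(source: &mut dyn EventSource) -> usize {
    let (mut depth, mut max) = (0usize, 0usize);
    while let Some(event) = source.next_event() {
        match event {
            Event::Start(_) => {
                depth += 1;
                max = max.max(depth);
            }
            Event::End(_) => depth = depth.saturating_sub(1),
            Event::Text(_) => {}
        }
    }
    max
}
```

The owned String payloads here mirror an allocating backend; taking full advantage of a parser that hands out borrowed data would need a different event type, as noted above.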

DOM tree support

It's possible to deserialize Rust types from an in-memory DOM tree rather than XML events (and vice versa). This was suggested in this comment. I don't believe this simplifies the implementation much: it's beneficial anyway to use the program stack to represent the types during deserialization.

The streaming interface is strictly more general: just as it's possible for static-xml to support multiple underlying XML streaming libraries, it could also support traversing a DOM tree.
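
A minimal sketch of that claim, assuming a toy Node tree and Event type (both hypothetical, not xmltree's or xml-rs's types): a depth-first walk over an in-memory tree produces the same kind of start/text/end stream that a streaming parser would.

```rust
/// Streaming-style events borrowed from the tree (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Event<'a> {
    Start(&'a str),
    Text(&'a str),
    End(&'a str),
}

/// A toy in-memory DOM node (hypothetical; attributes and namespaces omitted).
pub enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

/// Depth-first traversal that appends the events a streaming parser
/// would have produced for `node`.
pub fn walk<'a>(node: &'a Node, out: &mut Vec<Event<'a>>) {
    match node {
        Node::Text(t) => out.push(Event::Text(t.as_str())),
        Node::Element { name, children } => {
            out.push(Event::Start(name.as_str()));
            for child in children {
                walk(child, out);
            }
            out.push(Event::End(name.as_str()));
        }
    }
}
```

The reverse direction (building a tree from events) needs an explicit stack or recursion, which is one reason a streaming interface is the more general starting point.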

In the other direction, I plan to add an xmltree feature to static-xml which supplies a Serialize and Deserialize impl on xmltree::Element. This would allow retaining unknown field values easily:

#[derive(Deserialize, Serialize)]
struct Foo {
    known_field: String,
    #[static_xml(flatten)]
    unknown_fields: xmltree::Element,
}

Future work: table-driven Visitor impl.

Currently static-xml-derive writes explicit generated code. E.g., the Deserialize impl for Foo above looks roughly as follows:

const ELEMENTS: &[ExpandedNameRef; 1] = &[
    ExpandedNameRef { local_name: "known_field", namespace: "" },
];

impl Deserialize for Foo {
    fn deserialize(element: ElementReader<'_>) -> Result<Self, VisitorError> {
        // Start with an empty builder for each field.
        let mut builder = FooVisitor {
            known_field: <String as DeserializeField>::init(),
            unknown_fields: <xmltree::Element as DeserializeFlatten>::init(),
        };
        // Feed the element's children to the visitor defined below.
        element.read_to(&mut builder)?;
        // Turn each field's builder into its final value.
        Ok(Self {
            known_field: <String as DeserializeField>::finalize(builder.known_field)?,
            unknown_fields: <xmltree::Element as DeserializeFlatten>::finalize(builder.unknown_fields)?,
        })
    }
}

pub struct FooVisitor {
    known_field: <String as DeserializeField>::Builder,
    unknown_fields: <xmltree::Element as DeserializeFlatten>::Builder,
}

impl ElementVisitor for FooVisitor {
    fn element<'a>(
        &mut self,
        child: ElementReader<'a>
    ) -> Result<Option<ElementReader<'a>>, VisitorError> {
        match find(&child.expanded_name(), ELEMENTS) {
            Some(0usize) => {
                ::static_xml::de::DeserializeFieldBuilder::element(&mut self.known_field, child)?;
                return Ok(None);
            }
            _ => delegate_element(&mut [&mut self.unknown_fields], child),
        }
    }
}

I believe this is close to the minimal size with this approach. Next I'd like to experiment with a different approach in which the Visitor impl is replaced with a table that holds the offset within FooVisitor of each field, and a pointer to an element function. The generated code would use unsafe, but soundness only has to be proved once in the generator, and this seems worthwhile if it can achieve significant code size reduction.
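
To give a feel for the table-driven idea, here is a safe approximation, not the planned implementation. Instead of raw field offsets and unsafe, each table row holds a function pointer that returns the field as a trait object. FieldBuilder, Entry, FOO_TABLE, and dispatch are all hypothetical names, and the tags field is added only so the table has two rows.

```rust
/// Anything that can absorb the text of one child element
/// (hypothetical stand-in for a field builder).
pub trait FieldBuilder {
    fn element(&mut self, text: &str);
}

impl FieldBuilder for Option<String> {
    fn element(&mut self, text: &str) {
        *self = Some(text.to_owned());
    }
}

impl FieldBuilder for Vec<String> {
    fn element(&mut self, text: &str) {
        self.push(text.to_owned());
    }
}

/// One table row: an element name plus an accessor for the matching field.
/// (The unsafe variant described above would store a byte offset instead.)
pub struct Entry<V> {
    pub name: &'static str,
    pub field: fn(&mut V) -> &mut dyn FieldBuilder,
}

#[derive(Default)]
pub struct FooVisitor {
    pub known_field: Option<String>,
    pub tags: Vec<String>,
}

fn known_field(v: &mut FooVisitor) -> &mut dyn FieldBuilder {
    &mut v.known_field
}

fn tags(v: &mut FooVisitor) -> &mut dyn FieldBuilder {
    &mut v.tags
}

/// The only per-type data: a static table instead of a hand-written match.
pub static FOO_TABLE: &[Entry<FooVisitor>] = &[
    Entry { name: "known_field", field: known_field },
    Entry { name: "tag", field: tags },
];

/// Shared lookup logic. Returns false if no row matches `name`.
pub fn dispatch<V>(visitor: &mut V, table: &[Entry<V>], name: &str, text: &str) -> bool {
    match table.iter().find(|e| e.name == name) {
        Some(entry) => {
            (entry.field)(visitor).element(text);
            true
        }
        None => false,
    }
}
```

Because dispatch is generic over V, this safe version is still instantiated once per visitor type; erasing V through offsets and unsafe is what would let a single copy serve every type.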

Comparison with other crates

static-xml vs a serde data format

There are several XML serialization crates that plug into serde as a data format (serde::Deserializer and serde::Serializer impls), including serde-xml-rs and xml-serde, both discussed below.

This is an attractive idea: take advantage of serde's high-quality derive macro implementation and maybe even a few existing #[derive(Serialize)] annotations in popular crates.

I discarded this approach because I found it frustrating to combine serde's generic data model and XML's complex, unique data model. The challenge is to make it possible to use serde attributes to describe an XML data format easily.

A few examples of the mismatch:

  • XML distinguishes between elements and attributes. serde-xml-rs doesn't support attributes. xml-serde uses a special $attr: rename prefix.
  • XML not only is namespaced but does so indirectly, by assigning prefixes to namespaces and referencing prefixes in element and attribute names. serde-xml-rs doesn't support namespaces. xml-serde uses a {namespace}prefix:element name for every field, which can be verbose both in the struct definition and the generated XML (not supporting binding a prefix at a higher level than it is used).
  • Even XML schema's "simple types" (the strings within text nodes and attribute values) can be quite complex:
    • They can represent a list of values separated by spaces. The deserializer might be able to hint it's expecting this by calling eg deserialize_seq rather than deserialize_string.
    • They can represent a "union": any of several possible subtypes. Now we need to support accumulating them in some buffer and backtracking. The buffer needs to also support these deserializer hints. The caller likely needs to request this buffering in some fashion, likely by wrapping with a type from this library, dropping the serde data format abstraction.
    • They support three modes of whitespace normalization. There's no way to pass this through serde, other than custom types or #[serde(deserialize_with)] functions.
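
For concreteness, here is a small sketch of two of the simple-type behaviors listed above: the three whitespace modes from XML Schema (preserve, replace, collapse) and whitespace-separated lists. The WhiteSpace enum and both function names are hypothetical, not static-xml's API.

```rust
/// The three values of the XML Schema `whiteSpace` facet.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WhiteSpace {
    /// Leave the text untouched.
    Preserve,
    /// Turn each tab, line feed, and carriage return into a space.
    Replace,
    /// Like `Replace`, then squeeze runs of spaces and trim both ends.
    Collapse,
}

fn is_xml_space(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\n' | '\r')
}

pub fn normalize(input: &str, mode: WhiteSpace) -> String {
    match mode {
        WhiteSpace::Preserve => input.to_owned(),
        WhiteSpace::Replace => input
            .chars()
            .map(|c| if is_xml_space(c) { ' ' } else { c })
            .collect(),
        WhiteSpace::Collapse => input
            .split(is_xml_space)
            .filter(|s| !s.is_empty())
            .collect::<Vec<_>>()
            .join(" "),
    }
}

/// Parses a whitespace-separated list, failing on the first bad item.
pub fn parse_list<T: std::str::FromStr>(input: &str) -> Result<Vec<T>, T::Err> {
    input
        .split(is_xml_space)
        .filter(|s| !s.is_empty())
        .map(|s| s.parse::<T>())
        .collect()
}
```

With an XML-specific trait for text parsing, behavior like this can live in ordinary impls rather than being threaded through a generic data model.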

These problems can likely be solved, but I find it much easier to understand a data model specific to XML. It can be extended to support as much of XML as necessary without wedging a square peg into a round hole.

static-xml vs yaserde

yaserde is conceptually similar to static-xml but suffers from poor implementation quality.

Error handling

yaserde's generated code will panic on invalid data, eg if a non-digit is found where an i32 is expected:

thread 'tests::basic_deserialization' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', schema/src/onvif.rs:4030:50

static-xml instead returns a nicely formatted error:

invalid digit found in string @ 14:25

XML element stack:
   4: <tt:Hour> @ 14:25
   3: <tt:Time> @ 13:21
   2: <tt:UTCDateTime> @ 12:17
   1: <tds:SystemDateAndTime> @ 6:13
   0: <tds:GetSystemDateAndTimeResponse> @ 3:9
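
A minimal sketch of how such an error can be produced without panicking, assuming hypothetical Pos and StackError types (static-xml's real error type may differ): the parse failure is mapped into an error value that carries the position and the stack of open elements, and Display prints the stack innermost first.

```rust
use std::fmt;

/// Line and column of a position in the input.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Pos {
    pub line: u32,
    pub column: u32,
}

/// An error plus the open elements at the point of failure (hypothetical type).
#[derive(Debug)]
pub struct StackError {
    pub message: String,
    pub pos: Pos,
    /// Open elements, outermost first.
    pub stack: Vec<(String, Pos)>,
}

impl fmt::Display for StackError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "{} @ {}:{}", self.message, self.pos.line, self.pos.column)?;
        writeln!(f)?;
        writeln!(f, "XML element stack:")?;
        // Print the innermost element first, numbered by depth.
        for (depth, (name, pos)) in self.stack.iter().enumerate().rev() {
            writeln!(f, "{:>4}: <{}> @ {}:{}", depth, name, pos.line, pos.column)?;
        }
        Ok(())
    }
}

impl std::error::Error for StackError {}

/// Parses element text as an i32, returning an error instead of unwrapping.
pub fn parse_i32(text: &str, pos: Pos, stack: &[(String, Pos)]) -> Result<i32, StackError> {
    text.trim().parse::<i32>().map_err(|e| StackError {
        message: e.to_string(),
        pos,
        stack: stack.to_vec(),
    })
}
```

The caller decides what to do with the Err value, so malformed input from a device or peer cannot take down the process.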

Bugs

yaserde has several variations of unsolved bugs involving nested elements with the same name (eg #76). The root cause is that it doesn't have a well-defined contract for the deserialization interface and doesn't track the depth reliably.

static-xml is based on the proposal in #84 which solves these problems systematically, introducing a deserialization contract which is enforced by Rust's type system.

yaserde also has several bugs involving namespaces, e.g. #126, and enum element name comparisons that ignore the namespace entirely. These are believed to be addressed by static-xml, although many tests have yet to be written.

Bloat

With large schemas, yaserde bloats binaries and compilation time. Using lumeohq/onvif-rs 247b90c and Rust, look at the code sizes below, particularly for the schema crate that contains yaserde's generated code.

$ cargo bloat --release --example camera --crates
...
 File  .text      Size Crate
 7.5%  21.4% 1016.4KiB schema
 4.8%  13.7%  654.1KiB std
 4.1%  11.7%  556.6KiB reqwest
 3.3%   9.3%  443.9KiB yaserde
 2.2%   6.2%  295.5KiB clap
 1.6%   4.4%  211.6KiB h2
 1.2%   3.4%  160.4KiB regex_syntax
 1.1%   3.1%  147.7KiB onvif
 1.0%   2.9%  139.0KiB tokio
 0.9%   2.6%  124.6KiB hyper
 0.9%   2.6%  122.7KiB tracing_subscriber
 0.6%   1.8%   83.7KiB regex_automata
 0.6%   1.7%   80.8KiB xml
 0.5%   1.5%   72.8KiB regex
 0.4%   1.1%   52.4KiB http
 0.4%   1.0%   48.9KiB url
 0.3%   0.9%   43.7KiB num_bigint
 0.3%   0.9%   42.9KiB chrono
 0.3%   0.8%   36.8KiB idna
 0.3%   0.7%   35.4KiB encoding_rs
 2.3%   6.4%  305.0KiB And 62 more crates. Use -n N to show more.
35.2% 100.0%    4.6MiB .text section size, the file size is 13.2MiB

Note: numbers above are a result of guesswork. They are not 100% correct and never will be.

Compare to numbers from a WIP branch based on static-xml (which are likely to further improve):

 File  .text     Size Crate
 5.0%  17.1% 655.0KiB std
 4.3%  14.5% 557.9KiB reqwest
 2.3%   7.7% 295.5KiB clap
 2.2%   7.4% 282.4KiB schema
 1.7%   5.7% 218.5KiB regex
 1.6%   5.5% 211.6KiB h2
 1.4%   4.8% 185.9KiB regex_syntax
 1.1%   3.7% 142.1KiB tokio
 1.0%   3.3% 125.1KiB tracing_subscriber
 1.0%   3.2% 124.3KiB hyper
 0.8%   2.7% 104.2KiB onvif
 0.6%   2.2%  82.8KiB regex_automata
 0.6%   1.9%  73.8KiB aho_corasick
 0.5%   1.8%  68.7KiB xml
 0.5%   1.7%  65.9KiB static_xml
 0.4%   1.4%  52.0KiB http
 0.4%   1.3%  48.9KiB url
 0.3%   1.1%  43.8KiB num_bigint
 0.3%   1.1%  42.9KiB chrono
 0.3%   1.0%  36.8KiB idna
 2.5%   8.5% 327.0KiB And 63 more crates. Use -n N to show more.
29.6% 100.0%   3.7MiB .text section size, the file size is 12.7MiB

On a powerful 12-core/24-thread AMD Ryzen 5900X machine, cargo bloat --release --example camera --times says the yaserde-based schema crate takes 97.99s to compile; the static-xml-based version takes 33.43s to compile. The difference is even more dramatic on older machines. On several of my SBC setups, the yaserde version fails to compile without enabling zramfs.

Compile-time errors

yaserde's derive macros will panic in some cases with an unhelpful error message. In others, they emit code that doesn't compile and doesn't have the proper spans. E.g., if a field's type doesn't implement the required YaSerialize trait, the compiler error describes the problem but points at the derive attribute rather than the offending field:

error[E0277]: the trait bound `Foo: YaSerialize` is not satisfied
   --> schema/src/common.rs:38:37
    |
38  | #[derive(Default, PartialEq, Debug, YaSerialize, YaDeserialize)]
    |                                     ^^^^^^^^^^^ the trait `YaSerialize` is not implemented for `Foo`
    |
note: required by a bound in `yaserde::YaSerialize::serialize`
   --> /home/slamb/.cargo/registry/src/github.com-1ecc6299db9ec823/yaserde-0.7.1/src/lib.rs:106:19
    |
106 |   fn serialize<W: Write>(&self, writer: &mut ser::Serializer<W>) -> Result<(), String>;
    |                   ^^^^^ required by this bound in `yaserde::YaSerialize::serialize`
    = note: this error originates in the derive macro `YaSerialize` (in Nightly builds, run with -Z macro-backtrace for more info)

static-xml's derive macros always try to add a relevant span, so the analogous error points at the field itself:

error[E0277]: the trait bound `Foo: ParseText` is not satisfied
  --> schema/src/common.rs:49:9
   |
49 |     pub foo: Foo,
   |         ^^^ the trait `ParseText` is not implemented for `Foo`
   |
   = note: required because of the requirements on the impl of `Deserialize` for `Foo`
   = note: required because of the requirements on the impl of `DeserializeFieldBuilder` for `Foo`
   = help: see issue #48214

Flexibility

yaserde requires that every deserializable type implement Default, which is particularly awkward for enums. It also doesn't support required fields or distinguishing between absent fields and ones set to the default value.

static-xml avoids this by defining a builder type matching each deserializable type. Some caveats apply: currently the builders' finalize methods are a significant source of code bloat, so there's a direct knob to turn them off. I'd like to see if I can reduce the bloat without giving up the builders' advantages.
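
The builder idea can be sketched as follows, using a hypothetical Time type with a required Hour and an optional Minute (hand-written here; static-xml's generated builders and traits differ). Time does not implement Default, and the builder keeps "absent" distinct from "present with value 0".

```rust
/// The final type. Note that it has no `Default` impl.
#[derive(Debug, PartialEq)]
pub struct Time {
    pub hour: i32,
    pub minute: Option<i32>,
}

/// Builder with one optional slot per field (hypothetical, hand-written).
#[derive(Default)]
pub struct TimeBuilder {
    hour: Option<i32>,
    minute: Option<i32>,
}

impl TimeBuilder {
    /// Records the text of one child element.
    pub fn element(&mut self, name: &str, text: &str) -> Result<(), String> {
        let slot = match name {
            "Hour" => &mut self.hour,
            "Minute" => &mut self.minute,
            other => return Err(format!("unexpected element <{}>", other)),
        };
        if slot.is_some() {
            return Err(format!("duplicate element <{}>", name));
        }
        *slot = Some(text.trim().parse::<i32>().map_err(|e| e.to_string())?);
        Ok(())
    }

    /// Checks required fields and produces the final value.
    pub fn finalize(self) -> Result<Time, String> {
        Ok(Time {
            hour: self
                .hour
                .ok_or_else(|| "missing required element <Hour>".to_string())?,
            minute: self.minute,
        })
    }
}
```

The finalize step is where the per-field checks accumulate, which matches the observation above that these methods account for a noticeable share of generated code.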

License

Your choice of MIT or Apache; see LICENSE-MIT.txt or LICENSE-APACHE, respectively.

Comments
  • document static_xml_derive

    This is the main interface to the crate, but it's not documented at all. Write up all the different modes of operation and the attributes, with some examples.

    opened by scottlamb 1
  • Add option for writing document declaration

    By default, xml-rs always writes a document declaration (e.g. <?xml version="1.0" encoding="utf-8"?>) before outputting any serialized elements. This PR adds a flag to the serializer to (optionally) turn this off.

    opened by fwcd 0
  • `static_xml::de::read` should check the root element name

    static_xml::de::read currently ignores the name of the root element, as described here:

    https://github.com/scottlamb/static-xml/blob/ca51ec148462d60acd3595c71545863374306537/static-xml/src/de/mod.rs#L250-L254

    This is no good. At the very least we should check that it matches the one expected element name. It'd also be nice to introduce a way to deserialize to an enum, getting names from the variants.

    opened by scottlamb 0
  • add tests of compilation failures

    The README promises good compilation errors when you make a mistake using the derive macros. There are no tests of this, though. It should be possible to add them.

    opened by scottlamb 0
  • support derive macros on types with generics

    There's some cargo-culted code for the generics bounds and where clauses in the #[derive(Deserialize)] struct support. I'd be surprised if it actually worked. Nothing at all in the enum support or the ParseText/ToText iirc. Add tests, make them work.

    opened by scottlamb 0
  • try out table-driven `Visitor::element` and `Visitor::attribute` to reduce code size

    Try out the approach to reduce the per-type overhead of these methods described in the README:

    Currently static-xml-derive writes explicit generated code.

    E.g., with this type:

    #[derive(Deserialize, Serialize)]
    struct Foo {
        known_field: String,
        #[static_xml(flatten)]
        unknown_fields: xmltree::Element,
    }
    

    The Deserialize macro produces code roughly as follows:

    const ELEMENTS: &[ExpandedNameRef; 1] = &[
        ExpandedNameRef { local_name: "known_field", namespace: "" },
    ];
    
    impl Deserialize for Foo {
        fn deserialize(element: ElementReader<'_>) -> Result<Self, VisitorError> {
            // Start with an empty builder for each field.
            let mut builder = FooVisitor {
                known_field: <String as DeserializeField>::init(),
                unknown_fields: <xmltree::Element as DeserializeFlatten>::init(),
            };
            // Feed the element's children to the visitor defined below.
            element.read_to(&mut builder)?;
            // Turn each field's builder into its final value.
            Ok(Self {
                known_field: <String as DeserializeField>::finalize(builder.known_field)?,
                unknown_fields: <xmltree::Element as DeserializeFlatten>::finalize(builder.unknown_fields)?,
            })
        }
    }
    
    pub struct FooVisitor {
        known_field: <String as DeserializeField>::Builder,
        unknown_fields: <xmltree::Element as DeserializeFlatten>::Builder,
    }
    
    impl ElementVisitor for FooVisitor {
        fn element<'a>(
            &mut self,
            child: ElementReader<'a>
        ) -> Result<Option<ElementReader<'a>>, VisitorError> {
            match find(&child.expanded_name(), ELEMENTS) {
                Some(0usize) => {
                    ::static_xml::de::DeserializeFieldBuilder::element(&mut self.known_field, child)?;
                    return Ok(None);
                }
                _ => delegate_element(&mut [&mut self.unknown_fields], child),
            }
        }
    }
    

    I believe this is close to the minimal size with this approach. Next I'd like to experiment with a different approach in which the Visitor impl is replaced with a table that holds the offset within FooVisitor of each field, and a pointer to an element function. The generated code would use unsafe, but soundness only has to be proved once in the generator, and this seems worthwhile if it can achieve significant code size reduction.

    opened by scottlamb 6
  • decide approach for `xs:choice`

    Decide how to best support types like the one represented by this XML schema fragment:

    <xs:complexType name="Outer">
      <xs:sequence>
        <xs:element name="FixedField" />
        <xs:choice>
          <xs:element name="ChoiceA" />
          <xs:element name="ChoiceB" />
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
    

    Currently I'm following the lead of yaserde and xsd-parser-rs and representing this like so:

    #[derive(Deserialize, Serialize)]
    struct Outer {
        #[static_xml(rename="FixedField")]
        fixed_field: String,
        #[static_xml(flatten)]
        choice: Choice,
    }
    
    #[derive(Deserialize, Serialize)]
    enum Choice {
        ChoiceA(String),
        ChoiceB(String),
    }
    

    There are a couple things I don't like about this:

    • the indirection of the generated code. I'd prefer the outer type know the element names and do everything in the single match clause. I can imagine a type ending up with several different flattened enums.
    • that I don't support Option<Choice> or Vec<Choice>. You can do a Vec within the variant of the enum (like having maxOccurs="unbounded" on the elements within the choice) and it works (except with #[static_xml(direct)] as noted in #1). But you can't have a choice: Vec<Choice> (like having maxOccurs="unbounded" on the choice itself). I can't just impl DeserializeFlattened for Vec<MyChoice> because of the orphan rule. Likewise Option.

    I could probably solve the latter problem by introducing a new trait, although I want to keep the number of traits reasonably low for understandability.

    The former I think requires not using flattened fields. I don't think it's possible for a derive macro to look at anything beyond the current type's definition, but maybe I could do something like this:

    #[derive(Deserialize, Serialize)]
    struct Outer {
        #[static_xml(rename="FixedField")]
        fixed_field: String,
        #[static_xml(variant(ChoiceA => variant attributes go here))]
        #[static_xml(variant(ChoiceB => variant attributes go here))]
        choice: Choice,
    }
    
    enum Choice {
        ChoiceA(String),
        ChoiceB(String),
    }
    

    and then it can be worked into the match clause for OuterVisitor::element. And it'd support Vec<Choice> and the like.

    This might be my preferred approach, despite the repetition of having the variants listed in two places.

    opened by scottlamb 1
Owner
Scott Lamb