xml-rs is an XML library for the Rust programming language

Overview

xml-rs, an XML library for Rust

xml-rs is an XML library for the Rust programming language. It is heavily inspired by the Java Streaming API for XML (StAX).

This library currently contains a pull parser much like the StAX event reader. It provides an iterator API, so you can leverage Rust's existing iterator library features.

It also provides a streaming document writer much like the StAX event writer. This writer consumes its own set of events, but reader events can be converted to writer events easily, so it is possible to write XML transformation chains in a pretty clean manner.

This parser is mostly full-featured; however, there are limitations:

  • no encodings other than UTF-8 are supported yet, because no stream-based encoding library is available now; when (or if) one becomes available, I'll try to make use of it;
  • DTD validation is not supported and declarations are completely ignored, so custom entities are not supported either; internal DTD declarations are likely to cause parsing errors;
  • attribute value normalization is not performed, and end-of-line characters are not normalized either.
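
To make the last limitation concrete, here is a minimal sketch of the two normalization steps the parser skips, written against plain strings rather than xml-rs itself. The function names are hypothetical, and the attribute helper is simplified: it only covers whitespace handling for CDATA-type attributes and ignores character and entity references.

```rust
/// End-of-line normalization: "\r\n" and a lone "\r" both become "\n".
fn normalize_eol(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' {
            // Swallow the '\n' of a "\r\n" pair so it yields a single '\n'.
            if chars.peek() == Some(&'\n') {
                chars.next();
            }
            out.push('\n');
        } else {
            out.push(c);
        }
    }
    out
}

/// Simplified attribute-value normalization (assumption: CDATA attributes,
/// no references): every remaining whitespace character becomes a space.
fn normalize_attr_whitespace(value: &str) -> String {
    normalize_eol(value)
        .chars()
        .map(|c| if c == '\n' || c == '\t' { ' ' } else { c })
        .collect()
}

fn main() {
    println!("{:?}", normalize_eol("line one\r\nline two\rline three"));
    println!("{:?}", normalize_attr_whitespace("a\tb\r\nc"));
}
```

An application that needs normalized text today could apply something along these lines to the strings it receives from the reader.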

Other than that, the parser tries to be mostly XML 1.0 compliant.

The writer is also mostly full-featured, with the following limitations:

  • no support for encodings other than UTF-8, for the same reason as above;
  • no support for emitting declarations;
  • more validation of input is needed, for example, checking that namespace prefixes are bound or that comments are well-formed.
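
As an illustration of the kind of input validation mentioned in the last item, the sketch below checks the rule for comment text from the XML 1.0 grammar: the text must not contain "--" and must not end with "-". The function name is hypothetical and nothing like it is claimed to exist in xml-rs; the check for characters that are illegal anywhere in XML is left out.

```rust
/// Returns true if `text` may be written between "<!--" and "-->".
fn is_valid_comment_text(text: &str) -> bool {
    !text.contains("--") && !text.ends_with('-')
}

fn main() {
    for text in &["a plain comment", "double -- hyphen", "trailing-"] {
        println!("{:?}: {}", text, is_valid_comment_text(text));
    }
}
```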

What is planned (highest priority first, approximately):

  1. missing features required by the XML standard (e.g. the aforementioned normalization and proper DTD parsing);
  2. miscellaneous features of the writer;
  3. parsing into a DOM tree and its serialization back to XML text;
  4. a SAX-like callback-based parser (fairly easy to implement over the pull parser);
  5. DTD validation;
  6. (let's dream a bit) XML Schema validation.

Building and using

xml-rs uses Cargo, so just add a dependency to your project's manifest:

[dependencies]
xml-rs = "0.8"

The package exposes a single crate called xml:

extern crate xml;

Reading XML documents

xml::reader::EventReader requires a Read instance to read from. When a proper stream-based encoding library is available, xml-rs will likely switch to whatever character stream structure that library provides, but currently it is a Read.

Using EventReader is very straightforward. Just provide a Read instance to obtain an iterator over events:

extern crate xml;

use std::fs::File;
use std::io::BufReader;

use xml::reader::{EventReader, XmlEvent};

// Builds the indentation prefix for the given nesting depth.
fn indent(size: usize) -> String {
    const INDENT: &str = "    ";
    INDENT.repeat(size)
}

fn main() {
    let file = File::open("file.xml").unwrap();
    let file = BufReader::new(file);

    let parser = EventReader::new(file);
    let mut depth = 0;
    for e in parser {
        match e {
            Ok(XmlEvent::StartElement { name, .. }) => {
                println!("{}+{}", indent(depth), name);
                depth += 1;
            }
            Ok(XmlEvent::EndElement { name }) => {
                depth -= 1;
                println!("{}-{}", indent(depth), name);
            }
            Err(e) => {
                println!("Error: {}", e);
                break;
            }
            _ => {}
        }
    }
}

EventReader implements the IntoIterator trait, so you can use it in a for loop directly. Document parsing can end normally or with an error. Regardless of the exact cause, the parsing process will be stopped, and the iterator will terminate normally.

You can also have finer control over when to pull the next event from the parser using its own next() method:

match parser.next() {
    ...
}

Upon the end of the document or an error, the parser will remember that last event and will always return it from subsequent next() calls. If the iterator is used, it will yield the error or end-of-document event once and will produce None afterwards.
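
The difference between the two behaviours can be modelled with a toy reader. The sketch below does not use xml-rs; `Event`, `ToyReader`, and `ToyEvents` are hypothetical stand-ins that only mimic the semantics described above.

```rust
use std::collections::VecDeque;

// Hypothetical stand-ins; these are not xml-rs types.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartElement(String),
    EndElement(String),
    EndDocument,
}

type ParseResult = Result<Event, String>;

/// True for the events that end parsing: end of document or an error.
fn is_terminal(ev: &ParseResult) -> bool {
    match ev {
        Ok(Event::EndDocument) | Err(_) => true,
        _ => false,
    }
}

struct ToyReader {
    queue: VecDeque<ParseResult>,
    last: Option<ParseResult>,
}

impl ToyReader {
    fn new(events: Vec<ParseResult>) -> Self {
        ToyReader { queue: events.into(), last: None }
    }

    /// Pull-style API: once a terminal event is reached, keep returning it.
    fn next(&mut self) -> ParseResult {
        if let Some(terminal) = &self.last {
            return terminal.clone();
        }
        let ev = self.queue.pop_front().unwrap_or(Ok(Event::EndDocument));
        if is_terminal(&ev) {
            self.last = Some(ev.clone());
        }
        ev
    }
}

/// Iterator-style API: yield the terminal event once, then stop.
struct ToyEvents {
    reader: ToyReader,
    done: bool,
}

impl Iterator for ToyEvents {
    type Item = ParseResult;

    fn next(&mut self) -> Option<ParseResult> {
        if self.done {
            return None;
        }
        let ev = self.reader.next();
        self.done = is_terminal(&ev);
        Some(ev)
    }
}

fn main() {
    let events = vec![
        Ok(Event::StartElement("a".to_string())),
        Ok(Event::EndElement("a".to_string())),
    ];
    let mut reader = ToyReader::new(events.clone());
    for _ in 0..4 {
        println!("next() -> {:?}", reader.next());
    }
    let iter = ToyEvents { reader: ToyReader::new(events), done: false };
    println!("iterator yielded {} items", iter.count());
}
```

A practical consequence is that a bare loop over next() needs its own exit condition on the end-of-document event, whereas a for loop stops by itself.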

It is also possible to tweak the parsing process a little using the xml::reader::ParserConfig structure. See its documentation for more information and examples.

You can find a more extensive example of using EventReader in src/analyze.rs, a small program that shows various statistics about a specified XML document (it is built with cargo build and can be run after that). It can also be used to check XML documents for well-formedness: if a document is not well-formed, the program will exit with an error.

Writing XML documents

xml-rs also provides a streaming writer much like the StAX event writer. With it you can write an XML document to any Write implementor.

extern crate xml;

use std::fs::File;
use std::io::{self, Write};

use xml::writer::{EventWriter, EmitterConfig, XmlEvent, Result};

// Turns one line of user input into a writer event:
// "+name" opens an element, "-" closes the current one, anything else is text.
fn handle_event<W: Write>(w: &mut EventWriter<W>, line: String) -> Result<()> {
    let line = line.trim();
    let event: XmlEvent = if line.starts_with("+") && line.len() > 1 {
        XmlEvent::start_element(&line[1..]).into()
    } else if line.starts_with("-") {
        XmlEvent::end_element().into()
    } else {
        XmlEvent::characters(line).into()
    };
    w.write(event)
}

fn main() {
    let mut file = File::create("output.xml").unwrap();

    let mut input = io::stdin();
    let mut output = io::stdout();
    // The writer is created from its configuration; indentation is enabled here.
    let mut writer = EmitterConfig::new().perform_indent(true).create_writer(&mut file);
    loop {
        print!("> ");
        output.flush().unwrap();
        let mut line = String::new();
        match input.read_line(&mut line) {
            // Zero bytes read means end of input.
            Ok(0) => break,
            Ok(_) => match handle_event(&mut writer, line) {
                Ok(_) => {}
                Err(e) => panic!("Write error: {}", e)
            },
            Err(e) => panic!("Input error: {}", e)
        }
    }
}

The code example above also demonstrates how to create a writer from its configuration. A similar approach also works with EventReader.

The library provides an XML event building DSL which helps to construct complex events, e.g. ones having namespace definitions. Some examples:

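// <a:hello a:param="value" xmlns:a="urn:some:document">
XmlEvent::start_element("a:hello").attr("a:param", "value").ns("a", "urn:some:document")

// <hello b:config="value" xmlns="urn:default:uri">
XmlEvent::start_element("hello").attr("b:config", "value").default_ns("urn:default:uri")

// <![CDATA[some unescaped text]]>
XmlEvent::cdata("some unescaped text")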

Of course, one can create XmlEvent enum variants directly instead of using the builder DSL. There are more examples in the xml::writer::XmlEvent documentation.

The writer has multiple configuration options; see the EmitterConfig documentation for more information.

Other things

No performance tests or measurements have been done. The implementation is rather naive, and no specific optimizations have been made. Hopefully the library is sufficiently fast to process documents of common size. I intend to add benchmarks in the future, but not until more important features are added.

Known issues

All known issues are tracked on the GitHub issue tracker: http://github.com/netvl/xml-rs/issues. Feel free to report any problems you find there.

License

This library is licensed under the MIT license.


Copyright (C) Vladimir Matveev, 2014-2020

Comments
  • Proposal: Stream support reopen parser

    Hello,

    I'm working on an xmpp library and I need to be able to reopen the parser to consume new bytes.

    It's a first step towards being able to reset some conditions. I use it like this:

    loop {
        // My parser is an instance of EventReader<Buffer>.
        // Buffer is a custom struct that holds my buffer and exposes some useful methods.
        // It implements Read.
        if self.parser.source().available_data() > 0 {
            self.parser.reopen_parser();
        }
        match self.parser.next() { ... }
    }
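
    For readers following along, a buffer of the kind described in the comments above could look roughly like the sketch below. This is an assumption, not the author's actual type: only the name `Buffer` and the method `available_data` come from the snippet, while `feed` and the internal `VecDeque` are invented for illustration.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical growable byte buffer; not part of xml-rs.
#[derive(Default)]
struct Buffer {
    data: VecDeque<u8>,
}

impl Buffer {
    /// Appends newly received bytes (for example, from a socket).
    fn feed(&mut self, bytes: &[u8]) {
        self.data.extend(bytes.iter().copied());
    }

    /// Number of bytes that have been fed but not yet read.
    fn available_data(&self) -> usize {
        self.data.len()
    }
}

impl Read for Buffer {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        // Copy as many buffered bytes as fit; Ok(0) signals "nothing left".
        let n = out.len().min(self.data.len());
        for (slot, byte) in out.iter_mut().zip(self.data.drain(..n)) {
            *slot = byte;
        }
        Ok(n)
    }
}

fn main() {
    let mut buf = Buffer::default();
    buf.feed(b"<stream>");
    let mut out = [0u8; 4];
    while buf.available_data() > 0 {
        let n = buf.read(&mut out).unwrap();
        println!("read {} bytes: {:?}", n, String::from_utf8_lossy(&out[..n]));
    }
}
```

    An empty buffer returns Ok(0), which a reader normally treats as end of input; that is why the loop above needs a way to resume once more data has been fed.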
    

    What do you think?

    Signed-off-by: Freyskeyd [email protected]

    opened by Freyskeyd 13
  • very slow

    I took the usage example from the project main page and applied it to my 3 MB XML file, with this result:

    python: 00.968 sec (CPython/xml.etree; I have Java parsers that are ~10x faster than this one)
    rust: 04.310 sec (without prints, traversing only!)

    • Win7 64 x
    • cargo build --release (rust 1.1)
    opened by s-trooper 10
  • Fails to parse /> as part of XML body

    xml-rs fails to parse XML that contains /> as part of its body.

    Per https://www.w3.org/TR/REC-xml/#syntax

    The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

    It seems like this should be allowed, as only & and < are reserved and must be escaped; > merely may be escaped. As an example, xml-rs accepts <b>></b> as valid XML but rejects <b>/></b>.
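
    The quoted rule can be sketched as a small escaping function for character data. It is a standalone illustration with a hypothetical name, not code from xml-rs: & and < are always escaped, while > is escaped only where it would complete the sequence "]]>".

```rust
/// Escapes text content according to the rule quoted above.
fn escape_text(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for (i, c) in text.char_indices() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            // '>' only has to be escaped when it directly follows "]]".
            '>' if text[..i].ends_with("]]") => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", escape_text("/>"));
    println!("{}", escape_text("a < b && c"));
    println!("{}", escape_text("x]]>y"));
}
```

    Under this rule the body /> needs no escaping at all, which is consistent with the expectation that <b>/></b> should parse.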

    Minimally reproducible example:

    cargo.toml

    [package]
    name = "xml-minimal-error"
    version = "0.1.0"
    authors = ["Jeff LaJoie <[email protected]>"]
    edition = "2018"
    
    # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
    
    [dependencies]
    xml-rs="0.8"
    

    main.rs

    use xml::reader::ParserConfig;
    
    fn main() {
        let xml_str: &[u8] = b"<b>/></b>";
        let parser_config = ParserConfig::default();
        let parser = parser_config.create_reader(xml_str);
    
        for event in parser {
            match event {
                Err(e) => {
                    println!("Error = {:?}", e);
                }
                _ => {},
            }
        }
    }
    

    Output

    ➜  xml-minimal-error git:(master) ✗ cargo run
       Compiling xml-minimal-error v0.1.0 (/Users/jlajoie/workspace/xml-minimal-error)
        Finished dev [unoptimized + debuginfo] target(s) in 0.23s
         Running `target/debug/xml-minimal-error`
    Error = Error { pos: 1:4, kind: Syntax("Unexpected token: />") }
    

    Additionally, this example does work in other language XML parsers.

    NodeJS Example

    package.json

    {
      "name": "xml-minimal",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
      },
      "author": "",
      "license": "ISC",
      "dependencies": {
        "fast-xml-parser": "^3.19.0"
      }
    }
    

    index.js

    const parser = require('fast-xml-parser');
    
    let jsonObj = parser.parse("<b>/></b>");
    
    console.dir(jsonObj);
    

    output

    ➜  node-xml-minimal node index.js
    { b: '/>' }
    
    opened by jlajoie 7
  • Allow peeking into next event and getting current event

    While parsing an XML document, it would be nice to know what the next event in the reader will be so I can dispatch it to the correct parser.

    Another way of doing this would be getting the current event (i.e. not having to call next to get an event).
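
    Since the reader can be used as an iterator, one possible approach (an assumption, not something confirmed in this thread) is the standard library's `Peekable` adaptor. The sketch uses a hypothetical `Event` enum and `route` function instead of xml-rs types so that it stays self-contained.

```rust
use std::iter::Peekable;

// Hypothetical stand-in for the reader's event type.
#[derive(Debug, PartialEq)]
enum Event {
    Start(&'static str),
    End,
}

/// Looks at the upcoming event without consuming it and names a handler.
fn route<I: Iterator<Item = Event>>(events: &mut Peekable<I>) -> &'static str {
    match events.peek() {
        Some(Event::Start(name)) if *name == "item" => "item parser",
        Some(Event::Start(_)) => "generic element parser",
        Some(Event::End) => "end-of-element handler",
        None => "nothing left",
    }
}

fn main() {
    let mut events = vec![Event::Start("item"), Event::End]
        .into_iter()
        .peekable();
    println!("next up: {}", route(&mut events));
    // The peeked event is still there for whoever handles it.
    println!("consumed: {:?}", events.next());
    println!("next up: {}", route(&mut events));
}
```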

    opened by pedrohjordao 7
  • EventWriter::flush()

    It would be handy if EventWriter implemented a flush() method, that would just flush the underlying sink.

    My use-case is a long-running measurement application that occasionally saves a complex XML element into a file backed by a BufWriter. I'd like to flush the writer each time I save a new element so I have the relevant info stored in case of failure.

    opened by dvtomas 6
  • Compile error using nightly compiler

    Looks like xml-rs stopped compiling with the nightly compiler from 2017-05-21. Nightly build from 2017-05-20 still worked.

    error: no rules expected the token `flags`
       --> /home/pg/.cargo/registry/src/github.com-1ecc6299db9ec823/xml-rs-0.1.26/src/writer/emitter.rs:115:5
        |
    115 |     flags IndentFlags: u8 {
        |     ^^^^^
    
    error: Could not compile `xml-rs`.
    
    opened by pgerber 6
  • Add a streaming API

    When a document is received in chunks (on XMPP for example), it makes sense to initialize the parser on the first chunk, and then feed it data as it comes.

    xml::ParserConfig would get a new streaming boolean that would make it never emit xml::XmlEvent::EndDocument before the root tag is closed; xml::EventReader would get a feed method, taking a string and emitting new xml::XmlEvents as they are parsed, and finally a method to abort the stream.

    opened by linkmauve 6
  • Move EventReader integration tests into a separate file and enable them

    With these tests enabled people can hack on the parser more confidently. A drawback of using include_bytes! is that cargo doesn't notice changes to any included files (this might just be a cargo bug).

    opened by gkoz 6
  • Bump to 0.1.5 on crates.io for core::slice::Iter support

    Getting this error as of rust nightly 12-23:

    src/namespace.rs:2:5: 2:23 error: unresolved import `core::slice::Items`. There is no `Items` in `core::slice`
    src/namespace.rs:2 use core::slice::Items;
                           ^~~~~~~~~~~~~~~~~~
    

    Changed as of this commit: https://github.com/rust-lang/rust/commit/f8cfd2480b69a1cc266fc91d0b60c825a9dc18a7#diff-91f9d2237c7851d61911b0ca64792a88

    opened by tcr 6
  • Please expose a function to get the current position when parsing

    This would be very helpful to construct error messages when figuring out where in a multi-megabyte XML monstrosity a parser goes off the tracks. It looks like it should be a relatively straightforward change too.

    feature desirable 
    opened by Blei 6
  • Attempted to fix the parsing as characters of ]]> and ?>

    I was playing around with xml-rs in my own project (https://github.com/jdalberg/cwmp) and decided to use QuickCheck for validation: feed a bunch of randomness into fields, generate some XML, parse it, and compare the result with the input across different permutations of content. In doing so I came across what I consider to be a bug in the parser: it would not recognize "?>" as characters.

    Looking into xml-rs, I could see that in line 127 of reader/lexer.rs the token for "?>" was missing from the function. Looking at issue #32, I introduced two new test cases and added the two tokens for "]]>" and "?>" to the list of tokens that are possibly characters.

    opened by jdalberg 5
  • EventReader never returns Result::Err after document end

    EventReader never returns Result::Err after the document ends. It returns Ok(EndDocument) over and over instead. This does not depend on the ignore_end_of_stream flag.

    So the following code gets stuck in an endless loop:

    use std::io::{BufReader, Cursor};
    use xml::{EventReader, ParserConfig};
    
    
    fn main() {
    
        let content = "<a></a>";
    
        let reader = BufReader::new(Cursor::new(content));
        let mut parser = EventReader::new_with_config(reader, ParserConfig::new().ignore_end_of_stream(true /* or false  */) );
    
        loop {
            if let Err(_) = parser.next() {
                break;
            }
        }
    }
    
    opened by chabapok 0
  • Parsing of comments <!-- <!-->

    The tricky case of <!-- <!--> should be parsed as a single comment, ignoring <! in the comment. xml-rs parses this as two unclosed comments.
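
    The expected behaviour can be sketched without xml-rs: once "<!--" has been seen, the comment ends at the first "-->", and any "<!" in between is ordinary comment text. The helper name is hypothetical.

```rust
/// Returns the comment text if `input` starts with a complete comment.
fn comment_body(input: &str) -> Option<&str> {
    let rest = input.strip_prefix("<!--")?;
    // The first "-->" closes the comment; a nested "<!" is plain text.
    let end = rest.find("-->")?;
    Some(&rest[..end])
}

fn main() {
    println!("{:?}", comment_body("<!-- <!-->"));
    println!("{:?}", comment_body("<!-- never closed"));
}
```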

    opened by kornelski 0
  • Restricted XmlEvent?

    Opening this to ask whether that's been considered (a search didn't turn up anything and discussions are not enabled): ParserConfig allows some fairly extensive customisation to the emittable XmlEvent variants, down to 6 (and up to 9, from a default of 8 if I'm reading everything right).

    However currently the user still has to "deal" with the un-emittable variants.

    Granted most applications will have a single loop processing the input events, but still, would a more type-heavy interface be an option?

    Sadly Rust still has no polymorphic or type-based variants, so the syntactic overhead would be fairly large, but it would also be fairly simple code, just annoying to write.

    opened by masklinn 0