Rust high performance xml reader and writer

Johann Tuffe

Last update: Dec 31, 2022

Related tags

Encoding XML html serialization xml writer deserialization xml-parser pull-parser performance-xml

Overview

quick-xml

High performance xml pull reader/writer.

The reader:

is almost zero-copy (it uses Cow whenever possible)
is easy on memory allocation (the API provides a way to reuse buffers)
supports various encodings (with the encoding feature), namespace resolution, and special characters.
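To make the "Cow whenever possible" point concrete, here is a minimal std-only sketch. The `unescape_basic` helper is hypothetical and is not quick-xml's implementation; it only shows the general pattern of borrowing the input when nothing has to change and allocating only when a rewrite is needed.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of quick-xml): replaces `&amp;`, `&lt;`
/// and `&gt;`, allocating only when the input contains an `&`.
fn unescape_basic(input: &str) -> Cow<'_, str> {
    if !input.contains('&') {
        // Nothing to rewrite: hand back a borrow of the caller's data.
        return Cow::Borrowed(input);
    }
    // At least one candidate entity: build an owned, rewritten copy.
    // `&amp;` is replaced last so that `&amp;lt;` becomes `&lt;`, not `<`.
    Cow::Owned(
        input
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&"),
    )
}
```

Text without entities comes back as `Cow::Borrowed`, so the common case costs no allocation.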

docs.rs

Syntax is inspired by xml-rs.

Example

Reader

use quick_xml::Reader;
use quick_xml::events::Event;

let xml = r#"<tag1 att1 = "test">
                <tag2><!--Test comment-->Test</tag2>
                <tag2>
                    Test 2
                </tag2>
            </tag1>"#;

let mut reader = Reader::from_str(xml);
reader.trim_text(true);

let mut count = 0;
let mut txt = Vec::new();
let mut buf = Vec::new();

// The `Reader` does not implement `Iterator` because it outputs borrowed data (`Cow`s)
loop {
    match reader.read_event(&mut buf) {
        Ok(Event::Start(ref e)) => {
            match e.name() {
                b"tag1" => println!("attributes values: {:?}",
                                    e.attributes().map(|a| a.unwrap().value).collect::<Vec<_>>()),
                b"tag2" => count += 1,
                _ => (),
            }
        },
        Ok(Event::Text(e)) => txt.push(e.unescape_and_decode(&reader).unwrap()),
        Ok(Event::Eof) => break, // exits the loop when reaching end of file
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
        _ => (), // There are several other `Event`s we do not consider here
    }

    // if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
    buf.clear();
}
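The `buf.clear()` call at the end of the loop is what keeps allocation low. The following std-only sketch shows the same pattern with `BufRead::read_until`; `count_chunks` is a hypothetical name used for illustration and has nothing to do with quick-xml's internals.

```rust
use std::io::{self, BufRead};

/// Counts `>`-terminated chunks in `input`, reusing one buffer for all reads.
/// Returns the chunk count and the buffer's final capacity.
fn count_chunks<R: BufRead>(mut input: R) -> io::Result<(usize, usize)> {
    let mut buf = Vec::new();
    let mut chunks = 0;
    loop {
        // `read_until` appends to `buf`, so it must be cleared between reads.
        let n = input.read_until(b'>', &mut buf)?;
        if n == 0 {
            break; // end of input
        }
        chunks += 1;
        // `clear` resets the length but keeps the allocation for the next read.
        buf.clear();
    }
    Ok((chunks, buf.capacity()))
}
```

Because `Vec::clear` keeps the capacity, the buffer only grows to the size of the largest chunk and is never reallocated for smaller ones.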

Writer

use quick_xml::Writer;
use quick_xml::Reader;
use quick_xml::events::{Event, BytesEnd, BytesStart};
use std::io::Cursor;
use std::iter;

let xml = r#"<this_tag k1="v1" k2="v2"><child>text</child></this_tag>"#;
let mut reader = Reader::from_str(xml);
reader.trim_text(true);
let mut writer = Writer::new(Cursor::new(Vec::new()));
let mut buf = Vec::new();
loop {
    match reader.read_event(&mut buf) {
        Ok(Event::Start(ref e)) if e.name() == b"this_tag" => {

            // creates a new element ... alternatively we could reuse `e` by calling
            // `e.into_owned()`
            let mut elem = BytesStart::owned(b"my_elem".to_vec(), "my_elem".len());

            // collect existing attributes
            elem.extend_attributes(e.attributes().map(|attr| attr.unwrap()));

            // add a new my-key="some value" attribute
            elem.push_attribute(("my-key", "some value"));

            // writes the event to the writer
            assert!(writer.write_event(Event::Start(elem)).is_ok());
        },
        Ok(Event::End(ref e)) if e.name() == b"this_tag" => {
            assert!(writer.write_event(Event::End(BytesEnd::borrowed(b"my_elem"))).is_ok());
        },
        Ok(Event::Eof) => break,
        // you can use either `e` or `&e` if you don't want to move the event
        Ok(e) => assert!(writer.write_event(&e).is_ok()),
        Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
    }
    buf.clear();
}

let result = writer.into_inner().into_inner();
let expected = r#"<my_elem k1="v1" k2="v2" my-key="some value"><child>text</child></my_elem>"#;
assert_eq!(result, expected.as_bytes());

Serde

When using the serialize feature, quick-xml can be used with serde's Serialize/Deserialize traits.

Here is an example deserializing crates.io source:

// Cargo.toml
// [dependencies]
// serde = { version = "1.0", features = [ "derive" ] }
// quick-xml = { version = "0.21", features = [ "serialize" ] }
extern crate serde;
extern crate quick_xml;

use serde::Deserialize;
use quick_xml::de::{from_str, DeError};

#[derive(Debug, Deserialize, PartialEq)]
struct Link {
    rel: String,
    href: String,
    sizes: Option<String>,
}

#[derive(Debug, Deserialize, PartialEq)]
#[serde(rename_all = "lowercase")]
enum Lang {
    En,
    Fr,
    De,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Head {
    title: String,
    #[serde(rename = "link", default)]
    links: Vec<Link>,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Script {
    src: String,
    integrity: String,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Body {
    #[serde(rename = "script", default)]
    scripts: Vec<Script>,
}

#[derive(Debug, Deserialize, PartialEq)]
struct Html {
    lang: Option<String>,
    head: Head,
    body: Body,
}

fn crates_io() -> Result<Html, DeError> {
    let xml = "<!DOCTYPE html>
        <html lang=\"en\">
          <head>
            <meta charset=\"utf-8\">
            <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">
            <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">

            <title>crates.io: Rust Package Registry</title>


        <!-- EMBER_CLI_FASTBOOT_TITLE --><!-- EMBER_CLI_FASTBOOT_HEAD -->
        <link rel=\"manifest\" href=\"/manifest.webmanifest\">
        <link rel=\"apple-touch-icon\" href=\"/cargo-835dd6a18132048a52ac569f2615b59d.png\" sizes=\"227x227\">

            <link rel=\"stylesheet\" href=\"/assets/vendor-8d023d47762d5431764f589a6012123e.css\" integrity=\"sha256-EoB7fsYkdS7BZba47+C/9D7yxwPZojsE4pO7RIuUXdE= sha512-/SzGQGR0yj5AG6YPehZB3b6MjpnuNCTOGREQTStETobVRrpYPZKneJwcL/14B8ufcvobJGFDvnTKdcDDxbh6/A==\" >
            <link rel=\"stylesheet\" href=\"/assets/cargo-cedb8082b232ce89dd449d869fb54b98.css\" integrity=\"sha256-S9K9jZr6nSyYicYad3JdiTKrvsstXZrvYqmLUX9i3tc= sha512-CDGjy3xeyiqBgUMa+GelihW394pqAARXwsU+HIiOotlnp1sLBVgO6v2ZszL0arwKU8CpvL9wHyLYBIdfX92YbQ==\" >


            <link rel=\"shortcut icon\" href=\"/favicon.ico\" type=\"image/x-icon\">
            <link rel=\"icon\" href=\"/cargo-835dd6a18132048a52ac569f2615b59d.png\" type=\"image/png\">
            <link rel=\"search\" href=\"/opensearch.xml\" type=\"application/opensearchdescription+xml\" title=\"Cargo\">
          </head>
          <body>
            <!-- EMBER_CLI_FASTBOOT_BODY -->
            <noscript>
                <div id=\"main\">
                    <div class='noscript'>
                        This site requires JavaScript to be enabled.
                    </div>
                </div>
            </noscript>

            <script src=\"/assets/vendor-bfe89101b20262535de5a5ccdc276965.js\" integrity=\"sha256-U12Xuwhz1bhJXWyFW/hRr+Wa8B6FFDheTowik5VLkbw= sha512-J/cUUuUN55TrdG8P6Zk3/slI0nTgzYb8pOQlrXfaLgzr9aEumr9D1EzmFyLy1nrhaDGpRN1T8EQrU21Jl81pJQ==\" ></script>
            <script src=\"/assets/cargo-4023b68501b7b3e17b2bb31f50f5eeea.js\" integrity=\"sha256-9atimKc1KC6HMJF/B07lP3Cjtgr2tmET8Vau0Re5mVI= sha512-XJyBDQU4wtA1aPyPXaFzTE5Wh/mYJwkKHqZ/Fn4p/ezgdKzSCFu6FYn81raBCnCBNsihfhrkb88uF6H5VraHMA==\" ></script>

          </body>
        </html>
";
    let html: Html = from_str(xml)?;
    assert_eq!(&html.head.title, "crates.io: Rust Package Registry");
    Ok(html)
}

Credits

This has largely been inspired by serde-xml-rs. quick-xml follows its convention for deserialization, including the $value special name.

Parsing the "value" of a tag

If you have an input of the form <foo abc="xyz">bar</foo>, and you want to get at the bar, you can use the special name $value:

struct Foo {
    pub abc: String,
    #[serde(rename = "$value")]
    pub body: String,
}

Performance

Note that despite not focusing on performance (there are several unnecessary copies), the serde support remains about 10x faster than serde-xml-rs.

Features

encoding: support non-UTF-8 XML documents
serialize: support serde Serialize/Deserialize

Performance

Benchmarking is hard and the results depend on your input file and your machine.

Here on my particular file, quick-xml is around 50 times faster than the xml-rs crate.

// quick-xml benches
test bench_quick_xml            ... bench:     198,866 ns/iter (+/- 9,663)
test bench_quick_xml_escaped    ... bench:     282,740 ns/iter (+/- 61,625)
test bench_quick_xml_namespaced ... bench:     389,977 ns/iter (+/- 32,045)

// same bench with xml-rs
test bench_xml_rs               ... bench:  14,468,930 ns/iter (+/- 321,171)

// serde-xml-rs vs serialize feature
test bench_serde_quick_xml      ... bench:   1,181,198 ns/iter (+/- 138,290)
test bench_serde_xml_rs         ... bench:  15,039,564 ns/iter (+/- 783,485)
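As a worked check of the "around 50 times" and "about 10x" figures, the ratios can be computed directly from the numbers above. This sketch assumes the single xml-rs figure is comparable to each of the three quick-xml benches, which the listing does not state explicitly; the function names are made up for illustration.

```rust
/// Ratio of two `ns/iter` figures: how many times faster the second one is.
fn speedup(slow_ns: u64, fast_ns: u64) -> f64 {
    slow_ns as f64 / fast_ns as f64
}

/// Speedups derived from the benchmark numbers quoted above.
fn quoted_speedups() -> [(&'static str, f64); 4] {
    const XML_RS: u64 = 14_468_930;
    [
        ("plain", speedup(XML_RS, 198_866)),
        ("escaped", speedup(XML_RS, 282_740)),
        ("namespaced", speedup(XML_RS, 389_977)),
        ("serde", speedup(15_039_564, 1_181_198)),
    ]
}
```

Under that assumption the ratios come out at roughly 73x, 51x and 37x for the reader benches and roughly 13x for the serde comparison.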

For a feature and performance comparison, you can also have a look at RazrFalcon's parser comparison table.

Contribute

Any PR is welcome!

License

MIT

Comments

Split `Reader` into `SliceReader` and `BufferedReader`

This PR was split from #417.

This splits Reader into two new structs, SliceReader and IoReader to better separate which kind of byte source the Reader uses to read bytes. Changes are based on https://github.com/tafia/quick-xml/pull/417#issuecomment-1181318331. A Reader<SliceReader> also explicitly doesn't have methods for buffered access anymore.

opened by 999eagle 25
Losing attributes during serialization
Hello,

I have an XML with some elements that have attributes. It deserializes into a struct ok, but when I serialize to output, the attributes are no longer attributes, but rather individual elements. How do I correctly indicate that attributes need to stay as attributes and nothing else?

For example, I have the following element with attributes:

<Representation audioSamplingRate="48000" bandwidth="63700" codecs="mp4a.40.2" id="519678732693145249_AO-00-00-00-71">

When I read the element into the struct:

#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
struct Representation {
    #[serde(rename = "audioSamplingRate")]
    a_audio_sampling_rate: Option<String>,
    #[serde(rename = "bandwidth")]
    a_bandwidth: String,
    #[serde(rename = "codecs")]
    a_codecs: String,
    #[serde(rename = "id")]
    a_id: String,
}

It looks and feels right. However, during the output, the serialization step, I get the output that looks like this:

<Representation>
    <audioSamplingRate>48000</audioSamplingRate>
    <bandwidth>63700</bandwidth>
    <codecs>mp4a.40.2</codecs>
    <id>519678732693145249_AO-00-00-00-71</id>
</Representation>

What am I not doing right? Or, what am I missing in this process? I'm reading the element indicated above, as a string. Also, the output is a string as well.

Thank you, Max
enhancement serde
opened by mlevkov 19
Allow using tokio's AsyncBufRead

I would like to use quick-xml together with tokio's AsyncRead, as I am fetching an object from S3 with rusoto_s3 and that returns a body that implements AsyncRead. This is an attempt at making that happen.

opened by endor 18
Allow using tokio's AsyncBufRead [Rebased]

This is a rebase of this PR https://github.com/tafia/quick-xml/pull/233 on top of the current master.

I don't know if there was a better way to do this, and my intention is not to step on anyone's work but rather to keep it moving along. I'm happy to take this PR down if you want to update yours @endor, or honestly whatever works for people. This PR is a proper rebase of endor/async on top of tafia/master, so line-by-line credit is still intact except for the few places where I had to fix rebase conflicts.

@endor, @tafia happy to help with whatever y'all need help with here, lemme know!

opened by itsgreggreg 17
Rewrite serializer
This PR rewrites serializer and fixes the following issues:

Fixes #252

Fixes #280

Fixes #287

Fixes #343

Fixes #346

Invalidates #354 (but it still should be checked that it works aligned with the serializer)

Fixes #361

Partially addresses #368

Fixes #429

Fixes #430

Notable changes:

Removed the $unflatten= prefix, as it is not used anymore

Removed the $primitive= prefix, as it is not used anymore

The $value special name is split into two special names, #text and ~#any~ #content. @dralley, please check in particular whether the documentation is clear about the differences.

Changed how attributes are defined. An attribute must now have a name starting with the @ character. Fields without that prefix are serialized as elements and can be deserialized only from elements. Deserializing from either an element or an attribute into one field is no longer possible. It's a weird wish anyway.
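The `@` naming rule can be illustrated without serde. The sketch below is not the serializer from this PR; `render` is a hypothetical function that only demonstrates the convention that a field name starting with `@` becomes an attribute and any other name becomes a child element.

```rust
/// Illustration only: renders `fields` under `tag`, treating names that
/// start with `@` as attributes and all other names as child elements.
/// Values are written as-is (no escaping), which real code must not do.
fn render(tag: &str, fields: &[(&str, &str)]) -> String {
    let mut attrs = String::new();
    let mut children = String::new();
    for (name, value) in fields {
        match name.strip_prefix('@') {
            // `@id` is written as ` id="..."` on the opening tag.
            Some(attr) => attrs.push_str(&format!(" {attr}=\"{value}\"")),
            // Any other name is written as a nested element.
            None => children.push_str(&format!("<{name}>{value}</{name}>")),
        }
    }
    format!("<{tag}{attrs}>{children}</{tag}>")
}
```

With this convention, the same field can no longer be read from both an attribute and an element, which matches the restriction described above.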

bug enhancement serde arrays
opened by Mingun 16

Getting "invalid type: map, expected a sequence" when deserialising

Hello. First of all, thanks for all the great work on this crate!

I'm trying to use the serde implementation to deserialise some XML that looks like this:

<Attribute Name="Example">
    <Array>
        <DataObject ObjectType="TestObject">A</DataObject>
        <DataObject ObjectType="TestObject">B</DataObject>
        <DataObject ObjectType="TestObject">C</DataObject>
    </Array>
</Attribute>

I would like to read this as:

struct Example {
    value: Vec<TestObject>,
}

Reading the DataObject as a TestObject is easy since all DataObjects in this context will be TestObjects. The Example part was harder, but I got it working using an enum and serde's tag. However, I can't get the Vec to read correctly. I'm getting the error Custom("invalid type: map, expected a sequence").

To make sure there wasn't anything more complex causing a problem, I also created an Array object which only contains the Vec, but still get the same error.

If I remove the parent Example object and just try to deserialise the Array, it works as expected.

I also tried serialising the data structure I want, and I get exactly the output I expect, but deserialising this output fails with the error. This is strange since (as far as I can tell) my serialisation logic is identical to my deserialisation logic.

I also have some code here that reproduces this last example:

Example

use std::{error::Error, fmt::Display};

use serde::{Serialize, Deserialize};
use serde_with::serde_as;

#[derive(Serialize, Deserialize, Debug)]
struct Array<T> {
    #[serde(rename = "DataObject")]
    items: Vec<T>,
}

#[derive(Serialize, Deserialize, Debug, Default)]
struct DataObject {
    #[serde(rename = "ObjectType")]
    object_type: String,

    #[serde(rename = "Attribute", default)]
    attributes: Vec<Attribute>,

    #[serde(rename = "$value")]
    data: String,
}

impl From<TestObject> for DataObject {
    fn from(value: TestObject) -> Self {
        Self { object_type: stringify!(TestObject).into(), attributes: Vec::new(), data: value.data }
    }
}

#[derive(Debug)]
struct BadObjectTypeError {
    expected_object_type: String,
    actual_object_type: String,
}

impl Display for BadObjectTypeError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "Bad object type. Expected {}, found {}", self.expected_object_type, self.actual_object_type)
    }
}

impl Error for BadObjectTypeError {}

impl TryFrom<DataObject> for TestObject {
    type Error = BadObjectTypeError;

    fn try_from(value: DataObject) -> Result<Self, Self::Error> {
        if "TestObject" == value.object_type {
            Ok(Self {
                data: value.data,
            })
        } else {
            Err(BadObjectTypeError {
                expected_object_type: std::any::type_name::<Self>().into(),
                actual_object_type: value.object_type,
            })
        }
    }
}

#[derive(Serialize, Clone, Deserialize, Debug)]
#[serde(try_from = "DataObject", into = "DataObject")]
struct TestObject {
    #[serde(rename = "$value")]
    data: String,
}

#[serde_as]
#[derive(Serialize, Deserialize, Debug)]
#[serde(tag = "Name", rename_all = "SCREAMING_SNAKE_CASE")]
enum Attribute {
    Example {
        #[serde(rename = "Array")]
        value: Array<TestObject>
    }
}

fn main() {
    let attribute = Attribute::Example { value: Array { items: vec![TestObject { data: "A".into() }, TestObject { data: "B".into() }, TestObject { data: "C".into() }] } };
    let xml = quick_xml::se::to_string(&attribute).unwrap();
    println!("{}", &xml);

    let xd = &mut quick_xml::de::Deserializer::from_str(&xml);
    let import: Attribute = serde_path_to_error::deserialize(xd).unwrap();
    dbg!(&import);
}

Would really appreciate any guidance with this!

wontfix serde arrays

opened by davystrong 16

Selectively deserializing XML elements with serde

Hi!

I want to parse a 3MB XML file with serde. I used serde-xml-rs, and it is painfully slow; I hacked my way through serde-xml-rs's code to make it work with quick-xml instead (the APIs are very similar after all), and that sped it up tremendously (from 1.5s to 0.1s).

But! I don't want to deserialize an entire 3MB XML into a giant struct (which has loads of small heap-allocated vecs inside it), when all I want is to scan for a specific element inside it, and deserialize just that one.

I thought of using quick-xml's events to reach the element I want, then read_to_end() to get the whole element as a big text block, and then use serde-xml-rs to parse the text block as xml; except this approach loses all namespacing/encoding info.

I also thought of implementing some sort of from_element(xmlreader, start_element), which would give a partial Deserializer object, which is my current favorite option.

Thoughts? Any better ways to do this?

opened by andreivasiliu 16
Add support for empty elements
Empty elements are represented as the combination of a start element with an end element. This means that reading a file with empty elements:

<some_tag attr='1'/>

would result in an empty start element, followed by an end element:

<some_tag attr='1'></some_tag>

This merge request changes that behaviour so that a read/write roundtrip results in an identical file.
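A rough sketch of why a dedicated empty-element event makes the roundtrip lossless is shown below. The `Event` enum and both functions are hypothetical stand-ins written for this example, not the types changed by the merge request.

```rust
/// Hypothetical event type with a separate variant for `<tag/>`.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Start(&'a str), // <tag ...>
    End(&'a str),   // </tag>
    Empty(&'a str), // <tag .../>
}

/// Parses a single `<...>` tag into an event, borrowing from `tag`.
fn parse_tag(tag: &str) -> Option<Event<'_>> {
    let inner = tag.strip_prefix('<')?.strip_suffix('>')?;
    Some(if let Some(name) = inner.strip_prefix('/') {
        Event::End(name)
    } else if let Some(body) = inner.strip_suffix('/') {
        Event::Empty(body)
    } else {
        Event::Start(inner)
    })
}

/// Writes an event back in the same form it was read.
fn write_tag(event: &Event<'_>) -> String {
    match event {
        Event::Start(body) => format!("<{body}>"),
        Event::End(name) => format!("</{name}>"),
        Event::Empty(body) => format!("<{body}/>"),
    }
}
```

Since `Empty` is preserved as its own variant, writing back `<some_tag attr='1'/>` yields the same bytes instead of a start/end pair.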
enhancement
opened by tmoers 16
Add Vec-less reading of events and borrowing deserialization

Add a new reader.read_event_unbuffered() method that does not require a user-given Vec buffer, and borrows from the input, implemented if the input is a &[u8].
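The idea of borrowing from a `&[u8]` input instead of copying into a `Vec` can be sketched with std alone. `next_tag` below is a hypothetical function, not the method added by this PR; it just shows that a slice input lets the result carry the input's lifetime.

```rust
/// Returns the next `<...>` tag as a sub-slice of `input`, starting the
/// search at `*pos` and advancing `*pos` past the tag. No buffer is needed
/// because the result borrows directly from `input`.
fn next_tag<'a>(input: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let rest = input.get(*pos..)?;
    // Skip any text before the next `<`.
    let start = rest.iter().position(|&b| b == b'<')?;
    // Length of the tag including the closing `>`.
    let len = rest[start..].iter().position(|&b| b == b'>')? + 1;
    let tag = &rest[start..start + len];
    *pos += start + len;
    Some(tag)
}
```

The returned slice stays valid for as long as the input does, independent of any later calls.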

Still needs more polishing, and it needs the buf_position branch to be pushed first, since it is based on those changes. I had to move all of the input-reading methods into a trait, and make them return a reference to the text that was read. Because of that, there's now a new requirement of at most 1 input-reading method being called per read_event(), so I had to rework whitespace skipping, and to move all of the bang element processing into yet another read_until-like function which doesn't return until it has all of the text.

Next up is making a deserializer that can use this to remove the DeserializeOwned restriction and allow user structs to borrow from the input when possible, allowing for truly zero-copy parsing and deserialization.

The only user-facing change is the 1 new method, the rest is completely hidden. I'm not fond of the new method's name, so I'd appreciate any help with figuring out a better name for it.

Bikeshedding for the rest of the names would be appreciated too.

opened by andreivasiliu 13

DeserializeOwned prevents deserialization into structs with lifetime bounds

First of all, thanks for this great crate. It has been serving me very well and I am a big fan.

I am trying to deserialize a struct I just serialized, just a basic round-trip. The struct, to be specific, is this one. You'll notice the lifetimes and Cow<'a, str>s in the nested structs. Serialization works fine, but deserialization does not.

implementation of `_IMPL_DESERIALIZE_FOR_EdiConvertToRequest::_serde::Deserialize` is not general enough
   --> src/main.rs:102:28
    |
102 |           DataFormat::Xml => quick_xml::de::from_str(&req.data)?,
    |                              ^^^^^^^^^^^^^^^^^^^^^^^ implementation of `_IMPL_DESERIALIZE_FOR_EdiConvertToRequest::_serde::Deserialize` is not general enough
    | 
   ::: /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/serde-1.0.104/src/de/mod.rs:531:1
    |
531 | / pub trait Deserialize<'de>: Sized {
532 | |     /// Deserialize this value from the given Serde deserializer.
533 | |     ///
534 | |     /// See the [Implementing `Deserialize`][impl-deserialize] section of the
...   |
569 | |     }
570 | | }
    | |_- trait `_IMPL_DESERIALIZE_FOR_EdiConvertToRequest::_serde::Deserialize` defined here
    |
    = note: `edi::edi_document::EdiDocument<'_, '_>` must implement `_IMPL_DESERIALIZE_FOR_EdiConvertToRequest::_serde::Deserialize<'0>`, for any lifetime `'0`...
    = note: ...but `edi::edi_document::EdiDocument<'_, '_>` actually implements `_IMPL_DESERIALIZE_FOR_EdiConvertToRequest::_serde::Deserialize<'1>`, for some specific lifetime `'1`

Following the guidance on this SO post, I think changing this crate's Deserialize definition to one like serde_json's would work. Note that I am able to successfully deserialize this struct back from JSON without issue, it is just the XML that isn't working.

I report this as an issue because as far as I can tell, there is no workaround besides forking the crate that defines the struct and making everything it contains owned.

enhancement serde optimization

opened by sezna 13

Panic on read_namespaced_event with different buffers

Environment: Debian 9 amd64
Reproducibility: Always
Version: quick_xml 0.12.1. Also known to be reproducible with 0.11.0.
Steps to reproduce: compile and run this:

extern crate quick_xml; // version "0.11.0" or "0.12.1".

fn main()
{
    let xml = r#"<?xml version='1.0'?><a:a xmlns:a='http://example.org/something' xmlns='b:c'><a:d><hello xmlns='x:y:z'><world>earth</world></hello></a:d>"#;
    let mut parser = quick_xml::Reader::from_str(xml); // for quick_xml 0.11.0, use "quick_xml::reader::Reader".
    let mut buf = Vec::new();
    let mut buf_ns = Vec::new();
    for _ in 0..4 {
      let _ = parser.read_namespaced_event(&mut buf, &mut buf_ns);
    }
    let mut buf = Vec::new();
    let mut buf_ns = Vec::new();
    for _ in 0..4 {
        let _ = parser.read_namespaced_event(&mut buf, &mut buf_ns); // <-- panic
    }
}

Actual result: it panics:

$ RUST_BACKTRACE=1 cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `/home/willem/e/crash-quick-xml/target/debug/crash-quick-xml`
thread 'main' panicked at 'index 29 out of range for slice of length 0', libcore/slice/mod.rs:785:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at libstd/sys_common/backtrace.rs:59
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::slice::slice_index_len_fail
             at libcore/slice/mod.rs:785
  10: <core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index
             at /checkout/src/libcore/slice/mod.rs:916
  11: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /checkout/src/libcore/slice/mod.rs:767
  12: quick_xml::reader::Namespace::prefix
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:855
  13: quick_xml::reader::NamespaceBufferIndex::find_namespace_value::{{closure}}
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:902
  14: core::iter::traits::DoubleEndedIterator::rfind::{{closure}}
             at /checkout/src/libcore/iter/traits.rs:580
  15: <core::slice::Iter<'a, T> as core::iter::traits::DoubleEndedIterator>::try_rfold
             at /checkout/src/libcore/slice/mod.rs:1319
  16: core::iter::traits::DoubleEndedIterator::rfind
             at /checkout/src/libcore/iter/traits.rs:579
  17: <core::iter::Rev<I> as core::iter::iterator::Iterator>::find
             at /checkout/src/libcore/iter/mod.rs:459
  18: quick_xml::reader::NamespaceBufferIndex::find_namespace_value
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:899
  19: <quick_xml::reader::Reader<B>>::read_namespaced_event
             at /home/willem/.cargo/registry/src/github.com-1ecc6299db9ec823/quick-xml-0.12.1/src/reader.rs:566
  20: crash_quick_xml::main
             at src/main.rs:15
  21: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  22: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:479
  23: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  24: std::rt::lang_start_internal
             at libstd/panicking.rs:458
             at libstd/panic.rs:358
             at libstd/rt.rs:58
  25: std::rt::lang_start
             at /checkout/src/libstd/rt.rs:74
  26: main
  27: __libc_start_main
  28: _start

Expected result:

if the API is not supposed to be used like this, and it is possible to enforce that with Rust's type system: the API is written in such a way that Rust's type system forbids using the library like this
if the API is not supposed to be used like this, and it is not possible to enforce that with Rust's type system: the English API documentation states that it must not be used like this
if the API does not intend to forbid using it like this: it should not panic.
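The panic message ("index 29 out of range for slice of length 0") together with frames 12 and 13 of the backtrace suggests that offsets recorded against one buffer are later applied to a fresh, empty one. That reading is an assumption, and the `PrefixSpan` type below is hypothetical; the sketch only contrasts plain indexing, which panics in that situation, with `slice::get`, which does not.

```rust
/// Hypothetical record of where a namespace prefix lives inside a buffer.
struct PrefixSpan {
    start: usize,
    len: usize,
}

impl PrefixSpan {
    /// Panics if `buf` is shorter than the recorded span, which is what
    /// happens when the span was recorded against a different buffer.
    fn prefix_unchecked<'a>(&self, buf: &'a [u8]) -> &'a [u8] {
        &buf[self.start..self.start + self.len]
    }

    /// Returns `None` instead of panicking when the span does not fit.
    fn prefix_checked<'a>(&self, buf: &'a [u8]) -> Option<&'a [u8]> {
        buf.get(self.start..self.start + self.len)
    }
}
```

A checked lookup would turn the third expectation above (no panic) into an ordinary error path, though it cannot by itself recover the prefix once the original buffer is gone.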

bug namespaces

opened by willempx 13

Deserialization of a doctype with very long content fails
quick_xml::de::from_reader() parsing fails if the XML contains a doctype with content larger than the internal BufRead capacity. For instance

<!DOCTYPE [  ]> <X></X>

Here is a minimal code to reproduce this issue. It fails with an ExpectedStart error.

use std::io::Write;
use serde::Deserialize;

#[derive(Deserialize)]
struct X {}

fn main() {
    {
        let mut file = std::fs::File::create("test.xml").unwrap();
        // Assumed split: the snippet as posted defines only `header`,
        // so its closing part is taken here to be `footer`.
        let header = "<!DOCTYPE X [";
        let footer = "]><X></X>";
        let padding = 8192 - (header.len() + 2);
        write!(file, "{header}{:1$}{footer}", "", padding).unwrap();
    }
    let file = std::fs::File::open("test.xml").unwrap();
    let reader = std::io::BufReader::new(file);
    let _: X = quick_xml::de::from_reader(reader).unwrap();
}

Cargo.toml content

[package]
name = "test"
version = "0.1.0"
edition = "2021"

[dependencies]
quick-xml = { version = "0.27.1", features = ["serialize"] }
serde = { version = "1.0", features = ["derive"] }

When decreasing the padding size, or using BufReader::with_capacity() to increase the buffer, even by 1 byte, there is no error.
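The 8192 in the reproduction presumably corresponds to the default `BufReader` buffer size (8 KiB in current std, documented as subject to change). Under that assumption, the padding arithmetic and the one-byte workaround can be sketched as follows; both function names are hypothetical.

```rust
use std::io::BufReader;

/// Padding needed so that a header, the padding and a 2-byte terminator
/// together fill exactly `capacity` bytes. Returns `None` if they cannot fit.
fn padding_for(header_len: usize, capacity: usize) -> Option<usize> {
    capacity.checked_sub(header_len + 2)
}

/// Wraps `input` in a reader whose buffer is one byte larger than `capacity`,
/// mirroring the `BufReader::with_capacity()` workaround described above.
fn reader_with_headroom(input: &[u8], capacity: usize) -> BufReader<&[u8]> {
    BufReader::with_capacity(capacity + 1, input)
}
```

For a 13-byte header and an 8192-byte buffer this gives 8177 bytes of padding, so the doctype ends exactly on the buffer boundary.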

Other BufRead implementations don't have this issue (checked with &[u8] and stdin).

Content does not have to be in one "block". The same issue occurs for a doctype split into multiple declarations and comments.

With a longer doctype with real content, the error may be different. For instance, it may complain about an invalid ! from a !ENTITY tag.

No issue with serde-xml-rs, even for larger comments.

Tested on Windows, with rustc 1.66.0.

bug help wanted
opened by benoitryder 0

Deserializing data into an untagged enum with serde

Problem description

Hello 👋 I've been setting up the data models/types to automatically deserialize XML data from an API into a set of types, but I'm struggling with a particular issue involving untagged enums.

The XML data I'm receiving can include data of two formats, and I would like to deserialize the XML data into whichever type it can successfully deserialize into first.

Example XML data

Format 1: text containing a reference/href

       <ServiceLocation xmlns="http://naesb.org/espi/customer">
            <UsagePoints>
                <UsagePoint>https://api.com/DataCustodian/espi/1_1/resource/Subscription/1/UsagePoint/100</UsagePoint>
                <UsagePoint>https://api.com/DataCustodian/espi/1_1/resource/Subscription/1/UsagePoint/200</UsagePoint>
            </UsagePoints>
        </ServiceLocation>

Format 2: fully-expanded

       <ServiceLocation xmlns="http://naesb.org/espi/customer">
            <UsagePoints>
                <UsagePoint>
                  <serialNumber>100</serialNumber>
                  <status>On</status>
                </UsagePoint>
                <UsagePoint>
                  <serialNumber>200</serialNumber>
                  <status>On</status>
                </UsagePoint>
            </UsagePoints>
        </ServiceLocation>

Solution attempt

I want to deserialize this data into a single type, where a field containing an enum can contain either of these variants.

Type definitions

Here's my attempt at a type definition:

#[derive(Debug, serde::Deserialize)]
struct ServiceLocation {
    #[serde(rename = "UsagePoints")]
    usage_points: Option<Vec<UsagePoint>>
}

#[derive(Debug, serde::Deserialize)]
#[serde(untagged)]
enum UsagePoint {
    UsagePointReference(UsagePointReference),
    UsagePointFull(UsagePointFull)
}

#[derive(Debug, Deserialize)]
struct UsagePointReference {
    #[serde(rename = "#text")]
    href: Url,
}

#[derive(Debug, Deserialize)]
struct UsagePointFull {
    #[serde(rename = "serialNumber")]
    serial_number: String,
    status: String
}

Deserialization error

Here's the error I received when providing either of the XML blobs (shown in the previous section) to a call of quick_xml::de::from_str:

called `Result::unwrap()` on an `Err` value: Custom("data did not match any variant of untagged enum UsagePoint")
thread 'resources::customer::tests::should_parse_service_location_with_usagepoint_reference' panicked at 'called `Result::unwrap()` on an `Err` value: Custom("data did not match any variant of untagged enum UsagePoint")',

If you could advise me on how I'm supposed to define the types to handle this variable situation regarding my data, or point me to docs where some similar use-case is elaborated on, I'd really appreciate it.

Thanks!

help wanted serde

opened by seanpianka 3

Merge consecutive text and CDATA events into one string

This PR fixes #474 and introduces a way to read current parser configuration, which was impossible before that.

I've changed how configuration is accessed and modified: instead of having functions to change configuration flags, readers now provide a reference to a Config object. Both immutable and mutable references are provided. This new feature is used to temporarily disable trimming while reading text events in the serde Deserializer.

After fixing #516, all configuration flags are safe to change at any time, because they do not change the internal state of a reader in a user-visible way (for example, expand_empty_elements changes the internal state of a reader, but that change is rolled back after the next call to read_event, so the user cannot see its consequences. It is safe to disable this setting just after reading a fake Start event and still get a fake End event after that).
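The access pattern described here, a config object reachable through shared and mutable references plus a temporary override, can be sketched as below. `Config`, `Reader` and `with_trimming_disabled` are hypothetical stand-ins for illustration and do not reproduce the PR's actual types or field names.

```rust
/// Hypothetical stand-in for a reader's configuration object.
#[derive(Debug, Clone, PartialEq)]
struct Config {
    trim_text: bool,
}

/// Hypothetical stand-in for a reader that owns its configuration.
struct Reader {
    config: Config,
}

impl Reader {
    /// Read-only access to the current configuration.
    fn config(&self) -> &Config {
        &self.config
    }

    /// Mutable access, replacing per-flag setter functions.
    fn config_mut(&mut self) -> &mut Config {
        &mut self.config
    }

    /// Runs `f` with trimming disabled, then restores the previous setting.
    fn with_trimming_disabled<T>(&mut self, f: impl FnOnce(&mut Reader) -> T) -> T {
        let previous = self.config().trim_text;
        self.config_mut().trim_text = false;
        let result = f(self);
        self.config_mut().trim_text = previous;
        result
    }
}
```

Being able to read the current value is what makes the save-and-restore step possible, which is the capability the description says was missing before.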
bug enhancement serde

opened by Mingun 3
Allow to continue parsing after `Error::EndEventMismatch`
I'm trying to parse a Netscape bookmark file which has unclosed tags (i.e. <DL>), and for this I'm explicitly ignoring the EndEventMismatch error during the parsing:

match reader.read_event() {
    Err(e) => match e {
        QuickXmlError::EndEventMismatch { expected: _, found: _ } => (),

The issue is that an Eof event immediately follows this, which causes the reader to stop parsing the rest of the document.

I couldn't find in the documentation if this behaviour is expected, hence why I'm opening the issue.
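For context on how such a mismatch arises, here is a minimal sketch of the usual stack-based nesting check. It is an assumption about the general technique, not a description of quick-xml's internals, and `check_nesting` is a hypothetical function.

```rust
/// Checks a sequence of tag names, where a leading `/` marks an end tag.
/// Returns `Err((expected, found))` at the first end tag that does not match
/// the innermost open element.
fn check_nesting<'a>(tags: &[&'a str]) -> Result<(), (Option<&'a str>, &'a str)> {
    let mut open: Vec<&'a str> = Vec::new();
    for &tag in tags {
        match tag.strip_prefix('/') {
            // Start tag: remember it as the innermost open element.
            None => open.push(tag),
            // End tag: it must close the innermost open element.
            Some(name) => {
                let expected = open.pop();
                if expected != Some(name) {
                    return Err((expected, name));
                }
            }
        }
    }
    Ok(())
}
```

With an unclosed `DT` inside `DL`, the check reports that `DT` was expected but `DL` was found, which is the shape of the error described in this issue.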
enhancement help wanted
opened by moy2010 2
Fix #257 and allow $text to work with tags in the text
This is a very early attempt at solving #257 (awful, but functional, code below). Unfortunately, I ran into a design issue so I'd like to open the discussion now and get your feedback on what to do.

Suppose we have the following struct to deserialize (from the test case attached in serde-de.rs below)

#[derive(Debug, Deserialize, PartialEq)]
struct Trivial<T> {
    #[serde(rename = "$text")]
    value: T,
}

The test case has the following xml

<root>style tags in this document</root>

deserialize into

Trivial { value: "style tags in this document".to_string(), }

The test case also assumes the xml

<outer><root>style tags in this document</root></outer>

should not deserialize (missing field $text)

However, if we change how $text works so that it includes embedded tags, this document would become deserializable: 'outer' would be the root tag, and everything between its tags could be fed into the text field, because root could now be part of the string.

If that newly expected behavior is not acceptable, we could instead make the user describe which tags may be deserialized into text fields rather than read as elements:

#[derive(Debug, Deserialize, PartialEq)]
struct Trivial<T> {
    #[serde(rename = "$text$$")] // only ...s and ...s are allowed to be part of the string
    value: T,
}

Let me know if you have any other ideas on what would be best to do here.
enhancement serde
opened by JOSEPHGILBY 3
[Question]: Deserialize optional vector
Afaik the following line in Cargo.toml should mean that I'm using the latest git version of quick-xml:

quick-xml = { git = "https://github.com/tafia/quick-xml", features = ['serialize']}

Although I've put "Question" in the title, this might be a bug; I can't tell whether I'm doing something wrong or whether it is indeed a bug.

I've got the following xml:

<ENTRY>
    <CUE_V2 NAME="foo"></CUE_V2>
    <CUE_V2 NAME="bar"></CUE_V2>
</ENTRY>

And I've got the following struct to deserialize:

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename = "ENTRY")]
struct Entry {
    #[serde(rename = "CUE_V2")]
    cues: Vec<Cue>,
}

#[derive(Debug, Serialize, Deserialize)]
#[serde_with::serde_as]
struct Cue {
    #[serde(rename = "@NAME")]
    name: String,
}

This works but is incorrect since the cue vector should be optional like so:

#[derive(Debug, Serialize, Deserialize)]
#[serde(rename = "ENTRY")]
struct Entry {
    #[serde(rename = "CUE_V2")]
    cues: Option<Vec<Cue>>,
}

But this gives the error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: UnexpectedEnd([69, 78, 84, 82, 89])', src/main.rs:330:56

There might be a trivial solution, but after two days of using rust in total I can't seem to find it :)

If I may take the opportunity to ask another unrelated question, how do I build the documentation for the git version of the package locally? I've been looking through #490 but I think having the actual documentation would be a bit better :)
bug serde arrays
opened by sandersantema 4

Releases(v0.27.1)

v0.27.1(Dec 28, 2022)
What's Changed

Bug Fixes

#530: Fix an infinite loop leading to unbounded memory consumption that occurs when skipping events on malformed XML with the overlapped-lists feature active.

#530: Fix an error in the Deserializer::read_to_end when overlapped-lists feature is active and malformed XML is parsed

Full Changelog: https://github.com/tafia/quick-xml/compare/v0.27.0...v0.27.1
v0.27.0(Dec 25, 2022)
What's Changed

MSRV was increased from 1.46 to 1.52 in #521.

New Features

#521: Implement Clone for all error types. This required changing Error::Io to contain Arc<std::io::Error> instead of std::io::Error since std::io::Error does not implement Clone.
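The reason for wrapping the I/O error in `Arc` can be illustrated with a small standalone sketch. The `Error` enum below is an illustrative stand-in rather than quick-xml's actual definition, and the `Other` variant is hypothetical; only the `Io(Arc<std::io::Error>)` shape is taken from the note above.

```rust
use std::fmt;
use std::io;
use std::sync::Arc;

/// Illustrative error type. `std::io::Error` is not `Clone`, so deriving
/// `Clone` here only works because the I/O error is held behind an `Arc`.
#[derive(Debug, Clone)]
enum Error {
    /// Shared, reference-counted I/O error.
    Io(Arc<io::Error>),
    /// Hypothetical catch-all variant, present only to make the enum non-trivial.
    Other(String),
}

impl From<io::Error> for Error {
    /// Lets `?` convert an `io::Error` into the clonable wrapper.
    fn from(e: io::Error) -> Self {
        Error::Io(Arc::new(e))
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "I/O error: {}", e),
            Error::Other(msg) => write!(f, "error: {}", msg),
        }
    }
}

impl std::error::Error for Error {}
```

Cloning such an error copies only the reference-counted pointer, so both copies refer to the same underlying `io::Error`.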

Bug Fixes

#490: Ensure that serialization of map keys always produces valid XML names. In particular, maps with numeric and numeric-like keys (for example, "42") can no longer be serialized, because an XML name cannot start with a digit

#500: Fix deserialization of top-level sequences of enums, like
<?xml version="1.0" encoding="UTF-8"?>
<A/>
<C/>

#514: Fix incorrect reporting of Error::EndEventMismatch after disabling and enabling .check_end_names

#517: Fix swapped codes for \r and \n characters when escaping them

#523: Fix incorrect skipping of text and CDATA content before any map-like structures in the serde deserializer, like
unwanted text<struct>...</struct>

#523: Fix incorrect handling of xs:lists with encoded spaces: they still act as delimiters, which is also confirmed by the mature XmlBeans Java library

#473: Fix a hidden requirement to enable serde's derive feature to get quick-xml's serialize feature for edition = 2021 or resolver = 2 crates
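Two of the fixes above (#490 and #517) rest on simple rules that can be sketched without the library. The functions below are hypothetical helpers, not quick-xml's implementation: the name check is a simplified, ASCII-only approximation of the XML name rules, and the escaping function assumes that line breaks are written as numeric character references (carriage return is code 13, line feed is code 10).

```rust
/// ASCII-only approximation of the XML rule for the first character of a name:
/// a letter, an underscore, or a colon. Digits are not allowed here.
fn can_start_xml_name(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_' || c == ':'
}

/// Returns true if `name` looks like a valid XML name under the simplified
/// ASCII rules: a valid first character, then name-start characters,
/// digits, hyphens, or dots.
fn is_plausible_xml_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(first) if can_start_xml_name(first) => {
            chars.all(|c| can_start_xml_name(c) || c.is_ascii_digit() || c == '-' || c == '.')
        }
        // Empty names and names starting with another character are rejected.
        _ => false,
    }
}

/// Replaces carriage returns and line feeds with numeric character references.
/// Swapping the two codes would silently turn "\r" into "\n" on a round trip.
fn escape_line_breaks(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\r' => out.push_str("&#13;"),
            '\n' => out.push_str("&#10;"),
            other => out.push(other),
        }
    }
    out
}
```

Under these rules a map key such as "42" is rejected as an element name while "key42" is accepted, which matches the behavior change described for #490.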

Misc Changes

#490: Removed $unflatten= special prefix for fields for serde (de)serializer, because:

it is useless for deserializer

serializer was rewritten and does not require it anymore

This prefix allowed you to serialize a struct field as an XML element. It is now replaced by a more thoughtful system in which you explicitly indicate that a field should be serialized as an attribute by prepending the @ character to its name

#490: Removed $primitive= prefix. That prefix allowed you to serialize a struct field as an attribute instead of an element. It is now replaced by a more thoughtful system in which you explicitly indicate that a field should be serialized as an attribute by prepending the @ character to its name

#490: In addition to the $value special name for a field, a new $text special name was added:

$text is used if you want to map a field to text content only. No markup is expected (but the text can represent a list as defined by the xs:list type)

$value is used if you want to map elements with different names to one field, which should be represented either by an enum, by a sequence of enums (Vec, tuple, etc.), or by a string. Use it when you want to map a field to any content of the field, text or markup

Refer to documentation for details.

#521: MSRV bumped to 1.52.

#473: The serde feature, which was used to make some types serializable, was renamed to serde-types

#528: Added documentation for XML to serde mapping

New Contributors

@sashka made their first contribution in https://github.com/tafia/quick-xml/pull/498

@ultrasaurus made their first contribution in https://github.com/tafia/quick-xml/pull/504

@zeenix made their first contribution in https://github.com/tafia/quick-xml/pull/521

Full Changelog: https://github.com/tafia/quick-xml/compare/v0.26.0...v0.27.0