CBOR: Concise Binary Object Representation

quininer

Last update: Dec 27, 2022


Overview

CBOR 0x(4+4)9 0x49

“The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation.”

see rfc8949

Compatibility

The core module should be fully compatible with RFC 8949, but some extensions, such as datetime, bignum, and bigfloat, will not be implemented in this crate.

The serde module defines how Rust types are expressed in CBOR. This mapping is not part of any standard, so different crates may behave inconsistently.

This library is intended to be compatible with serde_cbor, but will not follow some unreasonable designs of serde_cbor.

cbor4ii will express the unit type as an empty array instead of null. This avoids the problem that serde_cbor cannot distinguish between None and Some(()). see https://github.com/pyfisch/cbor/issues/185
cbor4ii does not support packed mode. It may be implemented in the future, but it may not be compatible with serde_cbor. If you want packed mode, you should look at bincode.
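The ambiguity behind the first point can be shown at the byte level. This is a minimal sketch and not code from either crate: the two functions are hypothetical and only model which single CBOR byte each convention would emit for an `Option<()>`, assuming `None` is written as null (0xf6) in both cases.

```rust
/// CBOR simple value `null`.
const NULL: u8 = 0xf6;
/// CBOR array header with zero elements.
const EMPTY_ARRAY: u8 = 0x80;

/// Hypothetical model of the "unit is null" convention:
/// `None` and `Some(())` collapse to the same byte.
fn encode_unit_as_null(value: Option<()>) -> u8 {
    match value {
        None => NULL,
        Some(()) => NULL,
    }
}

/// Hypothetical model of the "unit is an empty array" convention:
/// the two cases stay distinguishable on the wire.
fn encode_unit_as_empty_array(value: Option<()>) -> u8 {
    match value {
        None => NULL,
        Some(()) => EMPTY_ARRAY,
    }
}
```

With the first convention a decoder that sees 0xf6 cannot tell which of the two values was written; with the second it can.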

Performance

The implementation is not specifically optimized for performance, but benchmarks show that it performs slightly better than serde_cbor.

It also supports serde's zero-copy deserialization and deserialize_ignored_any, so in some scenarios it may perform better than crates that do not support these features.

Robustness

The decode part has been fuzz tested, and it should not crash or panic during the decoding process.

The decoder in the serde module has a depth limit to prevent stack overflow or OOM caused by specially crafted input. If you want to turn off the depth check or adjust its parameters, you can implement the dec::Read trait yourself.
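To illustrate what a depth limit guards against, here is a minimal standalone sketch. It is not the crate's decoder: `skip_item` and `SkipError` are hypothetical names, and the function only understands small unsigned integers and small definite-length arrays. The point is the `remaining_depth` counter, which is decremented on every nested array and turns deeply nested input into an error instead of unbounded recursion.

```rust
#[derive(Debug, PartialEq)]
enum SkipError {
    Eof,
    DepthLimit,
    Unsupported,
}

/// Skips one data item starting at `*pos`, recursing into arrays.
/// `remaining_depth` is how many more levels of nesting are allowed.
fn skip_item(input: &[u8], pos: &mut usize, remaining_depth: usize) -> Result<(), SkipError> {
    let byte = *input.get(*pos).ok_or(SkipError::Eof)?;
    *pos += 1;
    let major = byte >> 5;
    let info = byte & 0x1f;
    match (major, info) {
        // Small unsigned integer (0..=23): the header byte is the whole item.
        (0, 0..=23) => Ok(()),
        // Small definite-length array: `info` is the element count.
        (4, 0..=23) => {
            let depth = remaining_depth
                .checked_sub(1)
                .ok_or(SkipError::DepthLimit)?;
            for _ in 0..info {
                skip_item(input, pos, depth)?;
            }
            Ok(())
        }
        // Everything else is outside the scope of this sketch.
        _ => Err(SkipError::Unsupported),
    }
}
```

Input consisting of a long run of 0x81 bytes (an array holding one array holding one array, and so on) is rejected as soon as the budget is used up, regardless of how long the run is.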

License

This project is licensed under the MIT license.

Comments

Support for RFC-7049 Canonical CBOR key ordering

This library explicitly specifies RFC-8949, so this request might be out of scope.

In the project I want to use cbor4ii for, I'm stuck with RFC-7049 Canonical CBOR key ordering. This means that keys are sorted by their length first. I wonder if that could perhaps be added behind a feature flag. Here is an implementation that seems to work. I didn't create a PR as this clearly needs more discussion first.

    fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
    where
        K: ser::Serialize,
        V: ser::Serialize,
        I: IntoIterator<Item = (K, V)>,
    {
        #[cfg(not(feature = "use_std"))]
        use crate::alloc::vec::Vec;
        use serde::ser::SerializeMap;

        // TODO vmx 2022-04-04: This could perhaps be upstreamed, or the
        // `cbor4ii::serde::buf_writer::BufWriter` could be made public.
        impl enc::Write for Vec<u8> {
            type Error = crate::alloc::collections::TryReserveError;

            fn push(&mut self, input: &[u8]) -> Result<(), Self::Error> {
                self.try_reserve(input.len())?;
                self.extend_from_slice(input);
                Ok(())
            }
        }

        // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
        // This was later revised with RFC-8949, but we need to stick to the original order to stay
        // compatible with existing data.
        // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
        // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
        // with the length. Once sorted they are written to the actual output.
        let mut buffer: Vec<u8> = Vec::new();
        let mut mem_serializer = Serializer::new(&mut buffer);
        let mut serializer = Collect {
            bounded: true,
            ser: &mut mem_serializer,
        };
        let mut entries = Vec::new();
        for (key, value) in iter {
            serializer.serialize_entry(&key, &value)
               .map_err(|_| enc::Error::Msg("Map entry cannot be serialized.".into()))?;
            entries.push(serializer.ser.writer.clone());
            serializer.ser.writer.clear();
        }

        TypeNum::new(major::MAP << 5, entries.len() as u64).encode(&mut self.writer)?;
        entries.sort_unstable();
        for entry in entries {
            self.writer.push(&entry)?;
        }

        Ok(())
    }

I'd also like to note that I need even more changes for my use case (it's a subset of CBOR), for which I will need to fork this library. Nonetheless, I think it would be a useful addition, and I'd prefer the fork to be as minimal as possible. I thought I'd bring it up to make clear that it won't be a showstopper if this change isn't accepted.
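For readers unfamiliar with the two orderings, the difference can be sketched on already-encoded keys. This is an illustration using only std, not part of the proposed patch; the function names are made up. The RFC-7049 canonical order compares encoded keys by length first and then bytewise, while the RFC-8949 core deterministic order is purely bytewise. When all keys are text strings the two coincide, because the length is part of the string header; they differ once key types are mixed.

```rust
use std::cmp::Ordering;

/// RFC-7049 style: shorter encodings first, ties broken bytewise.
fn length_first(a: &Vec<u8>, b: &Vec<u8>) -> Ordering {
    a.len().cmp(&b.len()).then_with(|| a.cmp(b))
}

/// Sorts encoded map keys in RFC-7049 canonical order.
fn sort_rfc7049(keys: &mut [Vec<u8>]) {
    keys.sort_unstable_by(length_first);
}

/// Sorts encoded map keys in RFC-8949 bytewise lexicographic order.
fn sort_rfc8949(keys: &mut [Vec<u8>]) {
    keys.sort_unstable();
}
```

For example, the encoded keys 100 (0x18 0x64) and -1 (0x20) sort as -1, 100 under the length-first rule but as 100, -1 under the bytewise rule.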

opened by vmx 12

Making more things public
I've now created a working Serde implementation for my needs, that is based on cbor4ii core. Though I still need to patch core as some things that I need for the deserializer are not public.

core::marker, core::dec::peek_one, core::dec::pull_one and core::dec::decode_len are currently pub(crate), but are needed by my deserializer (which is more or less a copy of yours). Could those be made public?

Besides those, I've also copied and pasted other things. I'm OK with having those duplicated if you don't think they should be part of the public interface, though I'd certainly prefer having less code in my crate. Those are:

core::util::ScopeGuard: I've a full copy of that without any changes.

serde::io_buf_reader::IoReader: I've added a constructor. As I don't use the serde1 feature, it would be cool if it could perhaps be added to core::utils.

serde::io_writer::IoWrite: I've added a constructor. Same as with the reader, having it in core::utils would be great.
opened by vmx 6

Should this work?

use serde::{Deserialize, Serialize};

#[derive(Deserialize, Serialize)]
enum Foo {
    A(String),
}

fn main() {
    let foo = Foo::A(String::new());

    let mut data = Vec::new();
    cbor4ii::serde::to_writer(&mut data, &foo).unwrap();

    let reader = std::io::BufReader::new(data.as_slice());
    let _: Foo = cbor4ii::serde::from_reader(reader).unwrap();
}

I get:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RequireBorrowed { name: "str" }', src/main.rs:15:54

Can you tell me what I am doing wrong?

opened by smoelius 4

core: tag parsing without value

It's not easily possible to parse a tag without the value with the current public API. This commit introduces a TagStart, which is similar to ArrayStart and MapStart. It only parses the tag into a u64, without advancing to the value.

This is the API I came up with, as always, if there's a better way to solve my problem of parsing a tag value with the current public API, I'm happy to hear about it.
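To make the idea concrete, here is a standalone sketch of reading only the head of a tag. It is not the TagStart API from this commit: `decode_tag_head` is a hypothetical helper working on a plain byte slice. It returns the tag number as a u64 together with the number of header bytes, and leaves the tagged value untouched.

```rust
/// Reads the head of a CBOR tag (major type 6) from the front of `input`.
/// Returns `(tag_number, header_len)`; the tagged value is not parsed.
fn decode_tag_head(input: &[u8]) -> Option<(u64, usize)> {
    let (&first, rest) = input.split_first()?;
    if first >> 5 != 6 {
        return None; // not a tag
    }
    // Number of extra bytes carrying the tag number.
    let extra = match first & 0x1f {
        n @ 0..=23 => return Some((u64::from(n), 1)),
        24 => 1,
        25 => 2,
        26 => 4,
        27 => 8,
        _ => return None, // reserved or indefinite: invalid for tags
    };
    let bytes = rest.get(..extra)?;
    // The tag number is stored big-endian.
    let mut tag = 0u64;
    for &b in bytes {
        tag = (tag << 8) | u64::from(b);
    }
    Some((tag, 1 + extra))
}
```

A caller can inspect the tag number and then decide how to decode the value that starts at `input[header_len..]`.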

opened by vmx 3
serde-derived `Serialize`/`Deserialize` impl for `Value` is not working as expected
The Value type is expected to serialize/deserialize from data it represents. For the following test code:

fn test_deserialize_null_value() {
    let val = crate::core::Value::Null;
    let ret = crate::serde::to_vec(vec![], &val).unwrap();
    eprintln!("{:02x?}", &ret);
}

Expected Output:

[f6]

which is the primitive value null, but

Actual Output:

[64, 4e, 75, 6c, 6c]

which is a 4-character UTF-8 text Null.

It looks like the implementation generated by the derive directives interprets Value as an externally tagged enum. Is it a mistake or a feature by design?
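The two outputs above can be reproduced by hand, which shows why the actual bytes look like a variant name. This is a std-only sketch with a hypothetical helper, not crate code: a text string shorter than 24 bytes is encoded as one header byte (0x60 plus the length) followed by its UTF-8 bytes, whereas null is the single byte 0xf6.

```rust
/// CBOR simple value `null`.
const NULL: u8 = 0xf6;

/// Encodes a text string shorter than 24 bytes; longer strings need a
/// multi-byte header and are out of scope for this sketch.
fn encode_short_text(s: &str) -> Option<Vec<u8>> {
    if s.len() >= 24 {
        return None;
    }
    let mut out = Vec::with_capacity(1 + s.len());
    // Major type 3 (text string) in the top three bits, length in the low five.
    out.push(0x60 | s.len() as u8);
    out.extend_from_slice(s.as_bytes());
    Some(out)
}
```

Encoding the variant name "Null" this way yields 64 4e 75 6c 6c, matching the actual output, while the expected output is just the `NULL` byte.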
opened by bdbai 3
serde: use scopeguard dependency

Instead of having a custom scope guard implementation use the established scopeguard crate.

I made this change to my Serde implementation and thought I'd upstream it. I'm a fan of having less code to maintain. scopeguard itself doesn't have any further dependencies, so the overall build time/code size should stay similar. Though I can also understand if you prefer keeping your dependencies to a minimum.

opened by vmx 2
serde: give access to underlying writer

When using the Serializer from an external crate it can be useful to be able to access the underlying writer.

It turns out that the code I posted at https://github.com/quininer/cbor4ii/issues/13#issuecomment-1095342912 only works with this patch.

opened by vmx 2
core: expose a public in-memory writer

BufWriter can be used to serialize things into memory. The API is inspired by std::io::BufWriter.

In case you wonder why the decode tests suddenly need use_std, that's been the case even before this change.
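As a rough picture of what an in-memory writer of this kind involves, here is a self-contained sketch. The `Write` trait below is a local stand-in modelled on the `push`-based trait shown in the key-ordering issue above, and the `BufWriter` here is hypothetical; the names and methods of the type added by this PR may differ.

```rust
use std::collections::TryReserveError;

/// Local stand-in for a byte sink; an assumption, not the crate's trait.
trait Write {
    type Error;
    fn push(&mut self, input: &[u8]) -> Result<(), Self::Error>;
}

/// Hypothetical in-memory writer wrapping a `Vec<u8>`.
struct BufWriter(Vec<u8>);

impl BufWriter {
    fn new(buf: Vec<u8>) -> Self {
        BufWriter(buf)
    }

    /// Borrows the bytes written so far.
    fn buffer(&self) -> &[u8] {
        &self.0
    }

    /// Consumes the writer and returns the underlying buffer.
    fn into_inner(self) -> Vec<u8> {
        self.0
    }
}

impl Write for BufWriter {
    type Error = TryReserveError;

    fn push(&mut self, input: &[u8]) -> Result<(), Self::Error> {
        // Report allocation failure as an error instead of aborting.
        self.0.try_reserve(input.len())?;
        self.0.extend_from_slice(input);
        Ok(())
    }
}
```

Wrapping the `Vec<u8>` in a newtype avoids implementing the trait directly on `Vec<u8>`, which is what the earlier snippet had to do.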

opened by vmx 2
core: fix decoding of the maximum negative 64-bit value as i128

Decoding of the maximum negative 64-bit value in CBOR (-2^64 = -18446744073709551616) wasn't possible and resulted in an overflow error.

This commit also adds tests for smaller values, e.g. decoding the maximum negative 32-bit value as i64. Those were already working correctly.
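The arithmetic behind this can be sketched without the crate. CBOR stores a negative integer as an unsigned argument n that stands for the value -1 - n, so the largest argument maps to -2^64, which needs more than 64 bits. The function names below are illustrative and not taken from the commit.

```rust
use std::convert::TryFrom;

/// Maps the unsigned argument of a CBOR negative integer to its value.
/// Widening to i128 first means the subtraction cannot overflow.
fn negative_from_argument(n: u64) -> i128 {
    -1 - i128::from(n)
}

/// Same value narrowed to i64, or `None` if it does not fit.
fn negative_as_i64(n: u64) -> Option<i64> {
    i64::try_from(negative_from_argument(n)).ok()
}
```

`negative_from_argument(u64::MAX)` gives -18446744073709551616, and the maximum negative 32-bit value (argument `u32::MAX`) fits comfortably in an i64.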

opened by vmx 1
core: decode_len can be private

core::dec::decode_len doesn't need to be public to the crate, it can be private. This PR is based on the comment by @quininer at https://github.com/quininer/cbor4ii/issues/16#issuecomment-1098666029

opened by vmx 1
serde: fix overflow on i64::MIN
When decoding the minimum value of i64 (-9223372036854775808), there is an overflow error. With this fix it can be deserialized correctly.

The reason for the failure was the order of the operations. Prior to this change the decoding had these steps:

1. Decode the bytes into a u64 => 9223372036854775807
2. Add 1 => 9223372036854775808
3. Cast to i64 => error as i64::MAX is 9223372036854775807.
4. Negate the number

The new steps are:

1. Decode the bytes into a u64 => 9223372036854775807
2. Cast to i64 => 9223372036854775807
3. Negate the number => -9223372036854775807
4. Subtract 1 => -9223372036854775808
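Both orders of operations can be written out in a few lines of plain Rust. This is a sketch of the steps as described, with made-up function names, rather than the code from the fix itself.

```rust
use std::convert::TryFrom;

/// Old order: add 1, cast to i64, negate.
/// Fails for the argument 9223372036854775807 because the sum exceeds i64::MAX.
fn decode_negative_old(n: u64) -> Option<i64> {
    let plus_one = n.checked_add(1)?;
    let value = i64::try_from(plus_one).ok()?;
    value.checked_neg()
}

/// New order: cast to i64, negate, subtract 1.
/// After the cast the value is in 0..=i64::MAX, so neither step can overflow.
fn decode_negative_new(n: u64) -> Option<i64> {
    let value = i64::try_from(n).ok()?;
    Some(-value - 1)
}
```

For every other argument that fits, the two functions agree; only the i64::MIN case distinguishes them.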
opened by vmx 1
Support zero copy using the bytes crate
I have a few cases where I would like to deserialize CBOR from an incoming BytesMut into a struct that looks something like this:

struct Foo {
    pub data: Bytes,
    // other small fields
}

where I very much need to avoid copying the data (that data is often on the order of 100s of MB).

I tried hacking something together, but didn't quite get anywhere, so I'm wondering if you could give me some hints on how this could be done. Thanks.
opened by dignifiedquire 4
Remove `*Start` types from dec module

Instead of having TagStart, ArrayStart and MapStart, implement functions directly on the concrete types.

This PR tries to address the comment at https://github.com/quininer/cbor4ii/pull/22#issuecomment-1108791827.

I've based it on the 0.3 branch. This also means that I haven't tested it with my own Serde implementation (which is still on 0.2.x), but I don't see a reason why it shouldn't work there. I didn't want to spend too much time on this PR as I'm not sure whether that's a good approach. Though I'm happy to spend more time on it, in case it's the right direction.

opened by vmx 2

improve some case

I noticed some cases where Cow<str> is not enough.

For example, decoding a struct with a short-lifetime reader requires a memory allocation for each field name. This is unnecessary, because we only need to check whether the key is the expected one; we don't need to keep it. Also, the automatic allocation of memory on the heap makes it difficult for us to improve this.

I'm thinking of exposing the decode_buf interface in some form to get around this.

Change `Decode` trait

I considered changing the Decode trait to allow this optimization, like this:

trait Decode<'de, T> {
    fn decode<R: Read<'de>>(&mut self, reader: &mut R) -> Result<T, Error>;
}

This allows decoding the object without allocating any memory, just checking whether it is as expected. It would look like this:

struct Expect<'a> {
    expect: &'a str,
    count: usize
}

impl<'a, 'de> Decode<'de, bool> for Expect<'a> {
    fn decode<R: Read<'de>>(&mut self, reader: &mut R) -> Result<bool, Error> {
        let mut result = true;
        // Loop until every expected byte has been compared.
        while self.count < self.expect.len() {
            let want_len = self.expect.len() - self.count;
            let buf = reader.fill(want_len)?;
            if buf.is_empty() { return Err(Error::Eof) };
            let len = cmp::min(buf.len(), want_len);
            if self.expect.as_bytes()[self.count..][..len] != buf[..len] {
                // Mismatch: the key is not the expected one.
                result = false;
                break
            }
            self.count += len;
            reader.advance(len);
        }
        Ok(result)
    }
}

This also allows for more precise memory allocation, such as decoding bytes onto the stack:

struct StackVec([u8; 1024]);

impl<'de> Decode<'de, &[u8]> for StackVec {
    // Sketch only: the lifetime of the returned slice is left unspecified.
    fn decode<R: Read<'de>>(&mut self, reader: &mut R) -> Result<&[u8], Error> {
        let mut len = decode_len(reader)?;
        let mut count = 0;
        while len != 0 {
            let buf = reader.fill(len)?;
            if buf.is_empty() { return Err(Error::Eof) };
            let buf_len = cmp::min(buf.len(), len);
            if buf_len + count > self.0.len() { return Err(Error::Eof) };
            self.0[count..][..buf_len].copy_from_slice(&buf[..buf_len]);
            count += buf_len;
            // Track how many bytes are still missing so the loop terminates.
            len -= buf_len;
            reader.advance(buf_len);
        }
        Ok(&self.0[..count])
    }
}

enhancement

opened by quininer 1

Remove decode_with

The idea of decode_with is to avoid peeking at a byte multiple times, which might be good for performance, but it makes the API a little more complicated.

We should try removing it and see if there is a negative impact on performance.

opened by quininer 0
Error refactor
Error seems a bit confusing now, especially DecodeError.

Split core::Error and serde::Error so that users of the core module don't need to care about the Msg kind.

Merge kinds such as Eof/RequireLength, Mismatch/TypeMismatch, etc.

Since this affects the api, it will be in the next version.
opened by quininer 2