Skip to content
This repository has been archived by the owner on May 20, 2022. It is now read-only.

I really don't understand serde custom deserialization #3

Closed
simoncozens opened this issue Apr 15, 2021 · 13 comments
Closed

I really don't understand serde custom deserialization #3

simoncozens opened this issue Apr 15, 2021 · 13 comments

Comments

@simoncozens
Copy link
Owner

I need to write a custom deserializer for Counted in types.rs and I just don't know how to do it. :-/

@cmyr
Copy link
Collaborator

cmyr commented Apr 15, 2021

Happy to take a look.

@simoncozens
Copy link
Owner Author

simoncozens commented Apr 15, 2021

Thanks. I’ve got the hang of serialisation types but I just don’t see how deserialisation keeps state and knows which methods to call. If I could see what the derived macro code was doing I would probably be able to customise it, but I don’t know how to get at that either.

We need counted arrays, offsets and some other tricks (see #2). Have a look at the Counted module in otspec/types.rs and the Font struct in src/font.rs for how I’ve done the serialisation.

@cmyr
Copy link
Collaborator

cmyr commented Apr 15, 2021

You can get the derived macro code with cargo expand.

I'm looking now, and I'm also a bit confused; hard to keep track of what is supposed to be in the Deserializer and what is in the custom Deserialize impl.

A potentially useful reference I'm looking through now is the bincode project, which handles deserialization of binary data.

To check my understanding: what I would expect here is that we are writing out a u16 representing the length of the collection, and then writing out that number of items, and then the inverse on deserialize? I couldn't see in your ser code where you wrote out the length field.

@simoncozens
Copy link
Owner Author

It’s the same question: what should be done by the Serialize trait and what should be done by the Serializer? Because our data layout is specific to the data we’re serialising there’s overlap. What I have done so far is to tell the Serialize trait that a Counted array maps to a serde seq, and tell the Serializer that a seq is written out with the length first. This seems to be how other implementations do it, and serialize_seq takes a length probably so that it can be written out first.

I notice that, like the deserializer, calling a method on the serializer causes a move, so you can’t serialize twice in the same method. (ie if you try calling serialize_u16 to emit the count you then can’t serialize the vec.) serialize_seq returns a handle that can serialize multiple elements, so that was another hint that I should go in that direction.

@cmyr
Copy link
Collaborator

cmyr commented Apr 15, 2021

Yea, I think you're on the right track. I have some progress here: excuse the code-dump but I'm going to step out shortly.

This is coming fairly close, but the test is failing the equality check; won't have time to dig into that today, unfortunately. :(

pub mod Counted {
    use serde::de::{SeqAccess, Visitor};
    use serde::ser::SerializeSeq;
    use serde::Serialize;
    use serde::{Deserialize, Deserializer, Serializer};

    pub fn serialize<S, T>(v: &[T], serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
        T: Serialize,
    {
        let mut my_seq = serializer.serialize_seq(Some(v.len()))?;
        my_seq.serialize_element(&(v.len() as u16));
        for k in v {
            my_seq.serialize_element(&k)?;
        }
        my_seq.end()
    }
    pub fn deserialize<'de, D, T: serde::Deserialize<'de>>(d: D) -> Result<Vec<T>, D::Error>
    where
        D: Deserializer<'de>,
    {
        d.deserialize_seq(SeqVisitor::new())
    }

    struct SeqVisitor<T> {
        len: usize,
        _phantom: std::marker::PhantomData<T>,
    }

    impl<T> SeqVisitor<T> {
        fn new() -> Self {
            SeqVisitor {
                len: 0,
                _phantom: std::marker::PhantomData,
            }
        }
    }

    impl<'de, T> Visitor<'de> for SeqVisitor<T>
    where
        T: serde::Deserialize<'de>,
    {
        type Value = Vec<T>;

        fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
            write!(formatter, "A sequence of {} values", self.len)
        }

        fn visit_seq<A: SeqAccess<'de>>(mut self, mut seq: A) -> Result<Self::Value, A::Error> {
            self.len = seq
                .next_element::<u16>()?
                .ok_or_else(|| serde::de::Error::custom("Count type must begin with length"))? as usize;

            let mut result = Vec::with_capacity(self.len);
            for i in 0..self.len {
                let next = seq
                    .next_element::<T>()?
                    .ok_or_else(|| serde::de::Error::invalid_length(i, &self))?;
                result.push(next)
            }
            Ok(result)
        }
    }
}

@simoncozens
Copy link
Owner Author

Thank you so much! I just added a small test case for the Counted module, and that works, so the problem is somewhere else. I think I can work on that from here. Thanks again.

@simoncozens
Copy link
Owner Author

Stupid cut-and-paste error, I was deserialising an F2DOT14 with a I32Visitor.

@cmyr
Copy link
Collaborator

cmyr commented Apr 16, 2021

great, happy to hear. I definitely stared at this blankly for a long time before it clicked; it was actually in the process of writing a response about how I was confused and couldn't figure it out that I realized my mistake. Happy to see continued progress on this, it looks like it could be a really useful utility!

@simoncozens
Copy link
Owner Author

Yep. I now have (mostly) Font objects serializing and deserialising. I'm hacking around the offset issue for now but I do want to get that sorted.

@cmyr
Copy link
Collaborator

cmyr commented Apr 16, 2021

Yep. I now have (mostly) Font objects serializing and deserialising. I'm hacking around the offset issue for now but I do want to get that sorted.

What's the offset issue?

@simoncozens
Copy link
Owner Author

In lots of cases, we need to be able to read data potentially out-of-order somewhere else in the big data structure, based on an offset value stored in the struct. e.g. in the top-level table:

struct TableRecord {
    tag: Tag,
    checksum: uint32,
    offset: uint32,
    length: uint32,
}

But I can't find a way to tell serde "go to this position in the stream and read this many bytes".

@simoncozens
Copy link
Owner Author

Mainly because (a) I can't pass any data into a .serialize_... / .deserialize_... method, and (b) I can't work out how to call my own methods the serializer/deserializer, given that serde wants a serialise method to be generic over all potential serialisers.

@cmyr
Copy link
Collaborator

cmyr commented Apr 16, 2021

right, this is definitely a tricky thing to fit into the serde model. I'll let it stew for a bit, there may be some more or less hacky solutions. One might be to deserialize to an intermediate helper type, and then process that into your final representation afterwards, something like:

struct cmap {
    header: CmapHeader,
    tables: Vec<CmapTable>,
}

#[derive(Deserialize)]
struct CmapHelper<'a> {
    header: CmapHeader,
    table_data: &'a [u8],
}

impl<'de> Deserialize<'de> for cmap {
    fn deserialize<D>(deserializer: D) -> Result<cmap, D::Error>
    where
        D: Deserializer<'de>,
    {
        let helper = CmapHelper::deserialize(d)?;
        let mut tables = Vec::with_capacity(helper.records.len());
    
        for record in &helper.header.records {
            tables.push(CmapTable::decode(record, helper.header.table_data)?);
        }
        
        Ok(cmap { header: helper.header, tables })
    }
}

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants