TLV-C encoding support.

Overview

TLV-C: Tag - Length - Value - Checksum

TLV-C is a variant on the traditional TLV format that adds a whole mess of checksums and whatnot. Why, you ask?

To support checking integrity of randomly accessed bits of data within a nested TLV-C structure, without necessarily needing to read the whole thing. (TLV itself provides no integrity checking.)
To enable basic structural traversal of a structure without needing to understand the type tags. (TLV can only be interpreted with reference to the definition of each type tag, making a general pretty-printer, for example, difficult.)

The format

TLV-C is stored as a blob of bytes. What it's stored in can vary -- a file on a disk, an I2C EEPROM, mercury delay lines, etc. Regardless of the storage media the format is the same.

Chunks

TLV-C content consists of a sequence of (zero or more) chunks. Each chunk consists of, in order:

A tag, distinguishing a particular chunk from other kinds of chunks,
A length, giving the number of bytes in the body (below),
A header checksum, computed over the previous two fields,
A body, containing zero or more bytes of data, and
A body checksum, computed over the body contents.

Now to walk through those fields in more detail.

Tag. A chunk's tag is a four-byte field used to indicate how the contents of the chunk should be interpreted. In TLV, tags are often chosen to be four-character ASCII strings, so that the chunk can be mnemonic to both humans and computers, and we continue this tradition -- except for consistency the field is encoded as UTF-8. Any four-byte UTF-8 string can be a tag. If you want a chunk tag that encodes to fewer than four bytes (like, say, "HI"), pad it with NUL or spaces and include the padding in your tag/format definition.
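As a minimal sketch of the padding rule, the helper below builds a four-byte tag from a shorter string. The name `pad_tag` is hypothetical and not part of the tlvc API; the pad byte is left to the caller, since each tag definition has to pick NUL or space for itself.

```rust
/// Hypothetical helper (not from the tlvc crate): pads the UTF-8 bytes
/// of `name` out to four bytes with `pad`. Returns `None` if the name
/// already encodes to more than four bytes.
fn pad_tag(name: &str, pad: u8) -> Option<[u8; 4]> {
    let bytes = name.as_bytes();
    if bytes.len() > 4 {
        return None;
    }
    // Start with four pad bytes, then overwrite the front with the name.
    let mut tag = [pad; 4];
    tag[..bytes.len()].copy_from_slice(bytes);
    Some(tag)
}
```

With this, `pad_tag("HI", b' ')` and `pad_tag("HI", 0)` produce two different tags, which is why the padding has to be part of the tag definition.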

Length. The length gives the number of bytes in the body, encoded as a 32-bit little-endian unsigned integer. This helps a reader know how far to skip to ignore the chunk, or to find the body checksum (below). Note that the body itself is padded to a multiple of four bytes, but that padding is not included in the length.

Header Checksum. The header checksum is computed over the tag and length fields. It is separate from the body checksum so that a header can be quickly integrity-checked (or distinguished from random data) without having to read the entire body, which could be large. For efficient implementation on modern microcontrollers, the header checksum is a simple integer multiply-accumulate expression over the tag (interpreted as an unsigned 32-bit integer) and the length:

header_checksum = complement(tag_as_u32 * 0x6b32_9f69 + length)

...where the magic number is a randomly selected 32-bit prime, and the result is bitwise-complemented to ensure that zeros don't hash to zero.
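The expression above might be written in Rust roughly as follows. This is a sketch under three assumptions the text does not spell out: the tag bytes are read as a little-endian integer, the arithmetic wraps on overflow, and the header checksum is stored little-endian like the length. The function names are illustrative only.

```rust
/// Multiplier from the header checksum expression.
const HEADER_MAGIC: u32 = 0x6b32_9f69;

/// Computes the header checksum over the tag and length.
/// Assumptions: little-endian tag interpretation, wrapping arithmetic.
fn header_checksum(tag: [u8; 4], length: u32) -> u32 {
    let tag_as_u32 = u32::from_le_bytes(tag);
    !(tag_as_u32.wrapping_mul(HEADER_MAGIC).wrapping_add(length))
}

/// Lays out the 12-byte header: tag, length, header checksum.
/// Assumption: the checksum is stored little-endian.
fn encode_header(tag: [u8; 4], length: u32) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&tag);
    out[4..8].copy_from_slice(&length.to_le_bytes());
    out[8..12].copy_from_slice(&header_checksum(tag, length).to_le_bytes());
    out
}
```

Note that an all-zero tag and length give a checksum of 0xFFFF_FFFF, not zero, which is the effect of the final complement.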

Body. The body consists of zero or more bytes (as determined by the length field), plus padding if required to make it a multiple of four bytes in length. The body contents can be arbitrary, and their interpretation is subject to the type tag and any associated specification. However, it is common to nest TLV-C structures, so the APIs provide utility routines for interpreting body contents as TLV-C. We can do this without knowledge of the type tag / specification by checking the checksums, to (probabilistically) distinguish TLV-C from random data.

Body checksum. The body checksum follows the body (and any padding), and is a 32-bit little-endian integer. It is computed using CRC-32 over the body (excluding the padding) using the iSCSI polynomial. We selected this polynomial to have decent error detection over smallish Hamming distances for our expected chunk sizes, which are on the order of 10 KiB.
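To make the body layout concrete, here is a dependency-free sketch of the body plus its checksum. It assumes the usual CRC-32/iSCSI (CRC-32C) parameters, meaning reflected bit order, an initial value of 0xFFFF_FFFF, and a final complement, and it assumes padding bytes are zero; the text above fixes only the polynomial. A real implementation would more likely use a table-driven or hardware CRC.

```rust
/// Bitwise CRC-32C using the iSCSI polynomial in reflected form.
/// Assumption: standard CRC-32/iSCSI init and final-xor values.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Rounds a body length up to the next multiple of four bytes.
fn padded_len(length: usize) -> usize {
    (length + 3) / 4 * 4
}

/// Emits body, padding, then the little-endian body checksum.
/// The checksum covers the unpadded body only.
/// Assumption: padding bytes are zero.
fn encode_body(body: &[u8]) -> Vec<u8> {
    let mut out = body.to_vec();
    out.resize(padded_len(body.len()), 0);
    out.extend_from_slice(&crc32c(body).to_le_bytes());
    out
}
```

A seven-byte body therefore occupies twelve bytes on the medium: seven data bytes, one padding byte, and four checksum bytes.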

Concatenating chunks

A single chunk is a valid TLV-C structure. Any concatenation of chunks, with no intervening non-chunk data, is also a valid TLV-C structure. A particular application of TLV-C could expect a single chunk to be found on a medium, or could continue seeking and reading chunks until it runs out.

Given a TLV-C structure followed by non-TLV-C noise -- such as zero fill, erased Flash reading as 0xFF, or random data -- a reader can probabilistically determine the end of the valid TLV-C sequence by stopping when a "header" without a valid checksum is found. One caveat here is if TLV-C data is written over existing TLV-C, e.g. when rewriting an EEPROM; in this case, a valid TLV-C chunk from the previous data may follow the new data and be interpreted along with it. To avoid this case, terminate any TLV-C structure with an invalid header; the easiest invalid header is a sequence of 12 0x00 bytes (a zero tag and zero length would require a header checksum of 0xFFFF_FFFF, not zero).
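The end-of-sequence rule can be sketched as a loop that walks headers until one fails its checksum. The function `chunk_offsets` is hypothetical, and the sketch assumes a 12-byte header (tag, length, checksum), a little-endian tag interpretation, and a little-endian stored checksum. It deliberately checks headers only, leaving body checksums unverified.

```rust
/// Header checksum as described earlier.
/// Assumptions: little-endian tag interpretation, wrapping arithmetic.
fn header_checksum(tag: [u8; 4], length: u32) -> u32 {
    !(u32::from_le_bytes(tag).wrapping_mul(0x6b32_9f69).wrapping_add(length))
}

/// Hypothetical helper: returns the byte offset of every chunk in
/// `data`, stopping at the first invalid or truncated header.
fn chunk_offsets(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0usize;
    while let Some(h) = data.get(pos..pos + 12) {
        let tag = [h[0], h[1], h[2], h[3]];
        let length = u32::from_le_bytes([h[4], h[5], h[6], h[7]]);
        let stored = u32::from_le_bytes([h[8], h[9], h[10], h[11]]);
        if stored != header_checksum(tag, length) {
            break; // noise, terminator, or corruption
        }
        // Header, body padded to four bytes, then the body checksum.
        let padded = (u64::from(length) + 3) / 4 * 4;
        let end = pos as u64 + 12 + padded + 4;
        if end > data.len() as u64 {
            break; // chunk claims more bytes than the medium holds
        }
        offsets.push(pos);
        pos = end as usize;
    }
    offsets
}
```

With a 12-byte zero terminator written after the new data, the walk stops there even if a stale but valid chunk from an earlier write follows it.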

Nesting chunks

As noted above, it's common for the body contents of a chunk to be more TLV-C data. In this case, we wind up with bytes being covered by multiple checksums:

The container's body checksum covers the entire body.
The header checksum of a chunk in the body redundantly covers the header.
The body checksums of chunks in the body redundantly cover their portion of the body contents.
And so forth recursively.

This makes the data more difficult to generate, but is a deliberate concession to making it easy to process. It's possible to integrity check an arbitrarily-complex nested TLV-C structure by only examining the top-level header and body checksums, and reading its contents in one pass. On the other hand, it's also possible to perform integrity checks of deeply nested data, by inspecting header checksums while traversing the nested structure, and then the final body checksum covering the data in question.
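The sketch below illustrates both directions: it encodes a nested structure from the inside out, then checks either the outermost chunk alone or a nested chunk on its own. All function names are hypothetical. It assumes standard CRC-32C parameters, zero padding, a little-endian tag interpretation, and lengths that fit in a u32.

```rust
fn header_checksum(tag: [u8; 4], length: u32) -> u32 {
    !(u32::from_le_bytes(tag).wrapping_mul(0x6b32_9f69).wrapping_add(length))
}

/// Bitwise CRC-32C (assumed standard init and final-xor values).
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            crc = (crc >> 1) ^ (0x82F6_3B78 & (crc & 1).wrapping_neg());
        }
    }
    !crc
}

/// Encodes one chunk; nested chunks are encoded first and passed in
/// as the body, so inner checksums end up covered by the outer one.
fn encode_chunk(tag: [u8; 4], body: &[u8]) -> Vec<u8> {
    let length = body.len() as u32; // assumes the body fits in 32 bits
    let mut out = Vec::new();
    out.extend_from_slice(&tag);
    out.extend_from_slice(&length.to_le_bytes());
    out.extend_from_slice(&header_checksum(tag, length).to_le_bytes());
    out.extend_from_slice(body);
    out.resize(12 + (body.len() + 3) / 4 * 4, 0); // zero padding (assumed)
    out.extend_from_slice(&crc32c(body).to_le_bytes());
    out
}

fn read_u32_le(data: &[u8], at: usize) -> Option<u32> {
    let b = data.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns the body of the first chunk in `data` if both its header
/// and body checksums match, without looking inside the body.
fn checked_body(data: &[u8]) -> Option<&[u8]> {
    let t = data.get(0..4)?;
    let length = read_u32_le(data, 4)?;
    if read_u32_le(data, 8)? != header_checksum([t[0], t[1], t[2], t[3]], length) {
        return None;
    }
    let len = length as usize;
    let body = data.get(12..12 + len)?;
    let stored = read_u32_le(data, 12 + (len + 3) / 4 * 4)?;
    if stored == crc32c(body) { Some(body) } else { None }
}
```

Calling `checked_body` once on the outermost chunk validates everything inside it in a single pass; calling it again on the returned body validates just the first nested chunk.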

Text format

The tlvctool program uses a text format to make it easier to write and view TLV-C content. In this format, the data structure is described in UTF-8 text using the following notation for a chunk:

("atag", [ body contents ])

Body contents can be arbitrary sequences of bytes (written in square brackets) or more chunks. For instance, a chunk with tag "BARC" and seven bytes of body content would be written as

("BARC", [ [8, 6, 7, 5, 3, 0, 9] ])

whereas the same chunk containing a nested "FOOB" chunk with the same body contents, followed by an empty "QUUX" chunk, would be written as

("BARC", [
    ("FOOB", [ [8, 6, 7, 5, 3, 0, 9] ]),
    ("QUUX", []),
])

In this format, the body of a chunk can alternate freely between chunks and arbitrary sequences of bytes, which is useful for describing complex formats, or for writing test fixtures that include chunks with invalid checksums. (Since checksums are implicit in the text format, you can't otherwise express an invalid checksum.)

Whitespace and newlines are insignificant, trailing commas in square-bracket lists can be included or omitted, hex and binary are available with the 0x and 0b prefixes, and comments in both // C++ style and /* C style */ are accepted.

This format happens to be valid RON.
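One way to model this notation in memory is a small recursive enum, sketched below. The `Piece` type and its `encoded_len` method are illustrative assumptions rather than the documented tlvc-text API; the size arithmetic assumes a 12-byte header and a 4-byte body checksum as described earlier.

```rust
/// Hypothetical in-memory model of the text format: a body is a list
/// of pieces, each either raw bytes or a nested chunk.
enum Piece {
    Bytes(Vec<u8>),
    Chunk([u8; 4], Vec<Piece>),
}

impl Piece {
    /// Number of bytes this piece occupies once encoded.
    fn encoded_len(&self) -> usize {
        match self {
            // Raw bytes are emitted as-is.
            Piece::Bytes(bytes) => bytes.len(),
            // Header, body padded to four bytes, then body checksum.
            Piece::Chunk(_, body) => {
                let body_len: usize =
                    body.iter().map(Piece::encoded_len).sum();
                12 + (body_len + 3) / 4 * 4 + 4
            }
        }
    }
}
```

Under these assumptions the flat "BARC" example above occupies 24 bytes, and the nested one occupies 56.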

Comments

Update tlvc-text for use as a library; run clippy
There were three different implementations of TLV-C encoding in the repo:

tool/src/main.rs

tool/src/text.rs

text/src/lib.rs

Of the three, only tool/src/main.rs actually worked.

This PR consolidates to a single implementation in text/src/lib.rs, in order to use it as a library elsewhere.

In addition, it adds a full encode/decode test.

Finally, it includes Clippy and rustfmt fixes (with our standard 80-column .rustfmt.toml).
opened by mkeeter 1

Owner

Oxide Computer Company
