Universal Schema Analysis

Last update: Jul 13, 2022

schema_analysis

Universal-ish Schema Analysis

Ever wished you could figure out what was in that json file? Or maybe it was xml... Ehr, yaml? It was definitely toml.

Alas, many great tools will only work with one of those formats, and the internet is not so nice a place as to finally understand that no, xml is not an acceptable document format.

Enter this neat little tool, a single interface to any self-describing format supported by our gymnast friend, serde.

Features

Works with any self-describing format with a Serde implementation.
Suitable for large files.
Keeps track of some useful info for each type.
Keeps track of null/normal/missing/duplicate values separately.
Integrates with Schemars and json_typegen to produce types and json schema if needed.
There's a demo website here.

Usage

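// Assumed imports: these names are expected at the crate root of schema_analysis.
use schema_analysis::{InferredSchema, Schema, StructuralEq};

let data: &[u8] = b"true";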

// Just pick your format, and deserialize InferredSchema as if it were a normal type.
let inferred: InferredSchema = serde_json::from_slice(data)?;
// let inferred: InferredSchema = serde_yaml::from_slice(data)?;
// let inferred: InferredSchema = serde_cbor::from_slice(data)?;
// let inferred: InferredSchema = toml::from_slice(data)?;
// let inferred: InferredSchema = rawbson::de::from_bytes(data)?;
// let inferred: InferredSchema = quick_xml::de::from_reader(data)?;

// InferredSchema is a wrapper around Schema
let schema: Schema = inferred.schema;
let expected: Schema = Schema::Boolean(Default::default());
assert!(schema.structural_eq(&expected));

// The wrapper is there so we can both do the magic above, and also store the data for later
let serialized_schema: String = serde_json::to_string_pretty(&schema)?;

That's it.

Check Schema to see what info you get, and targets to see the available integrations (which include code and json schema generation).

Advanced Usage

I know, I know, the internet is evil and has decided to plague you with not one, but thousands, maybe even millions, of files.

Unfortunately Serde relies on type information to work, so ~~there is nothing we can do about it~~ we can bring out the big guns: DeserializeSeed. It's everything you love about Serde, but with runtime state.

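// Brings `.deserialize(...)` into scope for `&mut InferredSchema`.
use serde::de::DeserializeSeed;

let a_lot_of_json_files: &[&str] = &[ "1", "2", "1000" ];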
let mut iter = a_lot_of_json_files.iter();

if let Some(file) = iter.next() {
    // We use the first file to generate a new schema to work with.
    let mut inferred: InferredSchema = serde_json::from_str(file)?;

    // Then we iterate over the rest to expand the schema.
    for file in iter {
        let mut json_deserializer = serde_json::Deserializer::from_str(file);
        // DeserializeSeed is implemented on &mut InferredSchema
        // So here it borrows the data mutably and runs it against the deserializer.
        let () = inferred.deserialize(&mut json_deserializer)?;
    }

    // The result in this case would be a simple integer schema
    // that 'has met' the numbers 1, 2, and 1000.
    let mut context: NumberContext<i128> = Default::default();
    context.aggregate(&1);
    context.aggregate(&2);
    context.aggregate(&1000);

    assert_eq!(inferred.schema, Schema::Integer(context));
}

Furthermore, if you need to generate separate schemas (for example to run the analysis on multiple threads) you can use the Coalesce trait to merge them after-the-fact.
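
To make the merge-after-the-fact idea concrete, here is a minimal sketch using only the standard library. It does not use the crate's Coalesce trait: `ToyIntegerSchema`, `aggregate`, `coalesce` and `analyse_in_parallel` are hypothetical names made up for illustration, and the real Schema tracks far more than a count, a minimum and a maximum.

```rust
use std::thread;

/// Hypothetical, heavily simplified stand-in for a schema: it only knows
/// about integers and remembers how many it saw plus their min and max.
#[derive(Debug, Clone, PartialEq, Default)]
struct ToyIntegerSchema {
    count: usize,
    min: Option<i128>,
    max: Option<i128>,
}

impl ToyIntegerSchema {
    /// Record one more value.
    fn aggregate(&mut self, value: i128) {
        self.count += 1;
        self.min = Some(self.min.map_or(value, |m| m.min(value)));
        self.max = Some(self.max.map_or(value, |m| m.max(value)));
    }

    /// Merge another partial result into this one (the "coalesce" step).
    fn coalesce(&mut self, other: ToyIntegerSchema) {
        self.count += other.count;
        self.min = match (self.min, other.min) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        };
        self.max = match (self.max, other.max) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
    }
}

/// Analyse each chunk on its own thread, then merge the partial results.
fn analyse_in_parallel(chunks: Vec<Vec<i128>>) -> ToyIntegerSchema {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                let mut partial = ToyIntegerSchema::default();
                for value in chunk {
                    partial.aggregate(value);
                }
                partial
            })
        })
        .collect();

    let mut merged = ToyIntegerSchema::default();
    for handle in handles {
        merged.coalesce(handle.join().expect("worker thread panicked"));
    }
    merged
}
```

The property that matters is that merging partial results gives the same answer as feeding every value through a single schema, so the split across threads is invisible in the output.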

I really wish I could convert that Schema into something, you know, actually useful.

You are in luck! Check out the integrations with json_typegen and Schemars here to convert the analysis into useful files like Rust types and json schemas. You can also find a demo website here.

How does this work?

For the short story made long, go here. The juicy bit is that Serde is kind enough to let the format tell us what it is working with; we take it from there and construct a nice schema from that info.
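
As a rough illustration of that idea, here is a standard-library-only sketch. In Serde itself the format reports what it found by calling `Visitor` methods (`visit_bool`, `visit_i64`, and so on) when asked to `deserialize_any`; the sketch replaces that machinery with a plain `Value` enum. `Value`, `ToySchema`, `infer` and `merge` are hypothetical names, not the crate's types, and none of the per-type statistics are kept.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for what a self-describing format reports
/// while it is being parsed.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Integer(i64),
    Text(String),
    Sequence(Vec<Value>),
    Map(Vec<(String, Value)>),
}

/// Hypothetical, much-reduced schema: only the shape, no statistics.
#[derive(Debug, Clone, PartialEq)]
enum ToySchema {
    Null,
    Boolean,
    Integer,
    Text,
    /// Element schema, or `None` if the sequence was empty.
    Sequence(Option<Box<ToySchema>>),
    /// In this toy, a later duplicate key overwrites an earlier one.
    Struct(BTreeMap<String, ToySchema>),
    /// The same position held values of different shapes.
    Union(Vec<ToySchema>),
}

/// Build a schema from whatever the format said it found.
fn infer(value: &Value) -> ToySchema {
    match value {
        Value::Null => ToySchema::Null,
        Value::Bool(_) => ToySchema::Boolean,
        Value::Integer(_) => ToySchema::Integer,
        Value::Text(_) => ToySchema::Text,
        Value::Sequence(items) => {
            let element = items.iter().map(infer).reduce(merge);
            ToySchema::Sequence(element.map(Box::new))
        }
        Value::Map(entries) => ToySchema::Struct(
            entries
                .iter()
                .map(|(key, value)| (key.clone(), infer(value)))
                .collect(),
        ),
    }
}

/// Combine two schemas seen at the same position. Equal schemas collapse,
/// different ones become a flat union. Structs and sequences are compared
/// as a whole here, not merged field by field.
fn merge(left: ToySchema, right: ToySchema) -> ToySchema {
    let mut variants: Vec<ToySchema> = Vec::new();
    for schema in [left, right] {
        let parts = match schema {
            ToySchema::Union(inner) => inner,
            other => vec![other],
        };
        for part in parts {
            if !variants.contains(&part) {
                variants.push(part);
            }
        }
    }
    if variants.len() == 1 {
        variants.remove(0)
    } else {
        ToySchema::Union(variants)
    }
}
```

Nothing in `infer` knows whether the values came from json, yaml or xml, which is the whole trick: any format that can describe its own values can drive the same analysis.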

Performance

These are not proper benchmarks, but they should give a vague idea of the performance on a 3-year-old i7 laptop with the raw data already loaded into memory.

Size	wasm time (MB/s)	native time (MB/s)	Format	Files
~180MB	~20s (9)	~5s (36)	json	1
~650MB	~150s (4.3)	~50s (13)	json	1
~1.7GB	~470s (3.6)	~145s (11.7)	json	1
~2.1GB	^a	~182s (11.5)	json	1
~13.3GB^b	-	~810s (16.4)	xml	~200k

^a This one seems to go over some kind of browser limit when fetching the data in the Web Worker; I believe I would have to split large files to handle it.

^b ~2.7GB compressed. This one seems like it would be a worst-case scenario because it includes decompression overhead and the files had a section that was formatted text which resulted in crazy schemas. (The json pretty printed schema was almost 0.5GB!)
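
The numbers in parentheses are simply size divided by time. The sketch below recomputes them from the table, assuming decimal units (1GB = 1000MB), which is the reading that reproduces the listed figures. `RUNS`, `throughput_mb_per_s`, `native_throughputs` and `wasm_slowdowns` are names made up for this example.

```rust
/// (size in MB, native seconds, wasm seconds if a time was reported),
/// copied from the table above with 1GB taken as 1000MB.
const RUNS: [(f64, f64, Option<f64>); 5] = [
    (180.0, 5.0, Some(20.0)),
    (650.0, 50.0, Some(150.0)),
    (1700.0, 145.0, Some(470.0)),
    (2100.0, 182.0, None),
    (13300.0, 810.0, None),
];

/// Throughput in MB/s for `size_mb` megabytes processed in `seconds`.
fn throughput_mb_per_s(size_mb: f64, seconds: f64) -> f64 {
    size_mb / seconds
}

/// Native throughput for every row of the table.
fn native_throughputs() -> Vec<f64> {
    RUNS.iter()
        .map(|&(size, native, _)| throughput_mb_per_s(size, native))
        .collect()
}

/// How many times slower wasm was than native, for rows with a wasm time.
fn wasm_slowdowns() -> Vec<f64> {
    RUNS.iter()
        .filter_map(|&(_, native, wasm)| wasm.map(|w| w / native))
        .collect()
}
```

On these rough figures wasm comes out about 3 to 4 times slower than native for the three files where both were timed.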

Comments

Support for preserving field order in the generated schema
This crate is already super useful as is, but I think it would be even better if the generated schema had the fields in the same order as in the file.

For example if the file contains:

{ "Id": 1, "Username": "test", "BetaUser": true, "AccessedAccount": false }

the generated schema (chose typescript because it's the most concise) currently looks like:

export interface Root {
    AccessedAccount: boolean;
    BetaUser: boolean;
    Id: number;
    Username: string;
}

with the fields in sorted order. With preserve order, it would look like this:

export interface Root {
    Id: number;
    Username: string;
    BetaUser: boolean;
    AccessedAccount: boolean;
}

serde_json has a feature called preserve_order and when it is enabled it will use indexmap::IndexMap instead of a BTreeMap: https://github.com/serde-rs/json/blob/df1fb717badea8bda80f7e104d80265da0686166/src/map.rs#L15

serde_yaml has this behavior by default by using linked_hash_map: https://github.com/dtolnay/serde-yaml/blob/4684ca06ad5092ff337ffc7805d5388c86ed4cf3/src/mapping.rs#L13
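
A small standard-library sketch of the difference being described: collecting fields into a `BTreeMap` sorts them by name, while keeping them in a `Vec` (or an insertion-ordered map such as `IndexMap`) preserves file order. The function names are hypothetical, and whether schema_analysis stores struct fields in a `BTreeMap` internally is an assumption here, not something stated above.

```rust
use std::collections::BTreeMap;

/// Field names as a key-sorted map would yield them.
fn sorted_field_names(fields: &[(&str, &str)]) -> Vec<String> {
    let map: BTreeMap<&str, &str> = fields.iter().copied().collect();
    map.keys().map(|name| name.to_string()).collect()
}

/// Field names in the order they appeared in the file.
fn file_order_field_names(fields: &[(&str, &str)]) -> Vec<String> {
    fields.iter().map(|(name, _)| name.to_string()).collect()
}

/// The fields from the example file, paired with their TypeScript types.
fn example_fields() -> Vec<(&'static str, &'static str)> {
    vec![
        ("Id", "number"),
        ("Username", "string"),
        ("BetaUser", "boolean"),
        ("AccessedAccount", "boolean"),
    ]
}
```

Calling `sorted_field_names(&example_fields())` gives the order of the first interface above, and `file_order_field_names(&example_fields())` gives the order of the second.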
enhancement
opened by icsaszar 3
