schema_analysis

Universal-ish Schema Analysis

Ever wished you could figure out what was in that json file? Or maybe it was xml... Ehr, yaml? It was definitely toml.

Alas, many great tools will only work with one of those formats, and the internet is not so nice a place as to finally understand that no, xml is not an acceptable document format.

Enter this neat little tool, a single interface to any self-describing format supported by our gymnast friend, serde.

Features

  • Works with any self-describing format with a Serde implementation.
  • Suitable for large files.
  • Keeps track of some useful info for each type.
  • Keeps track of null/normal/missing/duplicate values separately.
  • Integrates with Schemars and json_typegen to produce types and json schema if needed.
  • There's a demo website here.

Usage

use schema_analysis::{InferredSchema, Schema};

let data: &[u8] = b"true";

// Just pick your format, and deserialize InferredSchema as if it were a normal type.
let inferred: InferredSchema = serde_json::from_slice(data)?;
// let inferred: InferredSchema = serde_yaml::from_slice(data)?;
// let inferred: InferredSchema = serde_cbor::from_slice(data)?;
// let inferred: InferredSchema = toml::from_slice(data)?;
// let inferred: InferredSchema = rawbson::de::from_bytes(data)?;
// let inferred: InferredSchema = quick_xml::de::from_reader(data)?;

// InferredSchema is a wrapper around Schema
let schema: Schema = inferred.schema;
let expected: Schema = Schema::Boolean(Default::default());
assert!(schema.structural_eq(&expected));

// The wrapper is there so we can both do the magic above, and also store the data for later
let serialized_schema: String = serde_json::to_string_pretty(&schema)?;

That's it.

Check Schema to see what info you get, and targets to see the available integrations (which include code and json schema generation).
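
If you want to poke at the result in code first, a match on the variants works. Here is a minimal sketch that only uses the two variants already shown in this README (Schema has more, see its docs):

use schema_analysis::{InferredSchema, Schema};

fn describe(schema: &Schema) -> &'static str {
    // Only the variants used elsewhere in this README; Schema has more.
    match schema {
        Schema::Boolean(_) => "a boolean",
        Schema::Integer(_) => "an integer",
        _ => "something else entirely",
    }
}

let inferred: InferredSchema = serde_json::from_slice(b"true")?;
assert_eq!(describe(&inferred.schema), "a boolean");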

Advanced Usage

I know, I know, the internet is evil and has decided to plague you with not one, but thousands, maybe even millions, of files.

Unfortunately Serde relies on type information to work, so there is nothing we can do about it... unless we bring out the big guns: DeserializeSeed. It's everything you love about Serde, but with runtime state.

use serde::de::DeserializeSeed;
// (Exact import paths may differ between crate versions.)
use schema_analysis::{context::NumberContext, InferredSchema, Schema};

let a_lot_of_json_files: &[&str] = &[ "1", "2", "1000" ];
let mut iter = a_lot_of_json_files.iter();

if let Some(file) = iter.next() {
    // We use the first file to generate a new schema to work with.
    let mut inferred: InferredSchema = serde_json::from_str(file)?;

    // Then we iterate over the rest to expand the schema.
    for file in iter {
        let mut json_deserializer = serde_json::Deserializer::from_str(file);
        // DeserializeSeed is implemented on &mut InferredSchema
        // So here it borrows the data mutably and runs it against the deserializer.
        let () = inferred.deserialize(&mut json_deserializer)?;
    }

    // The result in this case would be a simple integer schema
    // that 'has met' the numbers 1, 2, and 1000.
    let mut context: NumberContext<i128> = Default::default();
    context.aggregate(&1);
    context.aggregate(&2);
    context.aggregate(&1000);

    assert_eq!(inferred.schema, Schema::Integer(context));
}

Furthermore, if you need to generate separate schemas (for example to run the analysis on multiple threads) you can use the Coalesce trait to merge them after-the-fact.
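
If you need that, here is a minimal sketch; it assumes Coalesce is exported from the crate root and exposes a coalesce(&mut self, other: Self) method, so double-check the crate docs for the exact path and signature:

use schema_analysis::{Coalesce, InferredSchema};

// Pretend these two schemas were built on different worker threads.
let mut first: InferredSchema = serde_json::from_str("1")?;
let second: InferredSchema = serde_json::from_str("2")?;

// Merge the independently inferred schemas after-the-fact.
// (Assumed signature: fn coalesce(&mut self, other: Self).)
first.schema.coalesce(second.schema);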

I really wish I could convert that Schema into something, you know, actually useful.

You are in luck! You can check out the integrations with json_typegen and Schemars here; they convert the analysis into useful artifacts like Rust types and json schemas. You can also find a demo website here.

How does this work?

For the long story go here; the juicy bit is that Serde is kind enough to let the format tell us what it is working with, and we take it from there and construct a nice schema from that info.
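
To make that concrete, here is a toy visitor (not this crate's actual implementation) that asks serde_json to report what it found via deserialize_any:

use serde::de::{self, Deserializer, Visitor};

// A toy visitor that records which primitive the format reported.
struct TypeSpy;

impl<'de> Visitor<'de> for TypeSpy {
    type Value = &'static str;

    fn expecting(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        f.write_str("any self-describing value")
    }
    fn visit_bool<E: de::Error>(self, _: bool) -> Result<Self::Value, E> {
        Ok("boolean")
    }
    fn visit_i64<E: de::Error>(self, _: i64) -> Result<Self::Value, E> {
        Ok("integer")
    }
    // A real implementation would cover every visit_* method.
}

let mut json = serde_json::Deserializer::from_str("true");
// deserialize_any lets the format, not the type, drive the visitor.
assert_eq!(json.deserialize_any(TypeSpy)?, "boolean");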

Performance

These are not proper benchmarks, but they should give a vague idea of the performance on a three-year-old i7 laptop with the raw data already loaded into memory. The numbers in parentheses are throughput in MB/s.

Size         wasm (MB/s)    native (MB/s)   Format   File #
~180MB       ~20s (9)       ~5s (36)        json     1
~650MB       ~150s (4.3)    ~50s (13)       json     1
~1.7GB       ~470s (3.6)    ~145s (11.7)    json     1
~2.1GB       [a]            ~182s (11.5)    json     1
~13.3GB[b]   -              ~810s (16.4)    xml      ~200k

[a] This one seems to go over some kind of browser limit when fetching the data in the Web Worker; I believe I would have to split large files to handle it.

[b] ~2.7GB compressed. This one seems like a worst-case scenario: it includes decompression overhead, and the files had a section of formatted text that resulted in crazy schemas. (The pretty-printed json schema was almost 0.5GB!)

You might also like...
Universal configuration library parser

LIBUCL Table of Contents generated with DocToc Introduction Basic structure Improvements to the json notation General syntax sugar Automatic arrays cr

Universal changelog generator using conventional commit+ with monorepo support. Written in Rust.

chlog Universal changelog generator using conventional commit+ with monorepo support. chlog can generate the changelog from the conventional commits w

a universal meta-transliterator that can decipher arbitrary encoding schemas, built in pure Rust

transliterati a universal meta-transliterator that can decipher arbitrary encoding schemas, built in pure Rust what does it do? You give it this: Барл

A universal, distributed package manager

Cask A universal, distributed package manager. Installation | Usage | How to publish package? | Design | Contributing | Cask.toml If you are tired of:

An interactive, universal Wordle solver

Eldrow (Wordle in reverse) is an interactive, universal Wordle solver that attempts to achieve near to mathematically perfect performance without rely

A universal SDK for FDU. Powered by Rust.

libfdu A universal SDK for FDU. Building You need Rust Nightly installed: $ rustup default nightly Build the library by running: $ cargo build or $ ca

Universal Windows library for discovering common render engines functions. Supports DirectX9 (D3D9), DirectX10 (D3D10), DirectX11 (D3D11), DirectX12 (D3D12).

Shroud Universal library for discovering common render engines functions. Supports DirectX9 (D3D9), DirectX10 (D3D10), DirectX11 (D3D11), DirectX12 (D

ABQ is a universal test runner that runs test suites in parallel. It’s the best tool for splitting test suites into parallel jobs locally or on CI

🌐 abq.build   🐦 @rwx_research   💬 discord   📚 documentation ABQ is a universal test runner that runs test suites in parallel. It’s the best tool f

Truly universal encoding detector in pure Rust - port of Python version

Charset Normalizer A library that helps you read text from an unknown charset encoding. Motivated by original Python version of charset-normalizer, I'

🗽 Universal Node Package Manager

🗽 NY Universal Node Package Manager node • yarn • pnpm Features Universal - Picks the right package manager for you based on the lockfile in your fol

A universal load testing framework for Rust, with real-time tui support.

rlt A Rust Load Testing framework with real-time tui support. rlt provides a simple way to create load test tools in Rust. It is designed to be a univ

openapi schema serialization for rust

open api Rust crate for serializing and deserializing open api documents Documentation install add the following to your Cargo.toml file [dependencies

Rust version of the Haskell ERD tool. Translates a plain text description of a relational database schema to dot files representing an entity relation diagram.

erd-rs Rust CLI tool for creating entity-relationship diagrams from plain text markup. Based on erd (uses the same input format and output rendering).

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages.

Turbine Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages. It’s described as a to

A node package based on jsonschema-rs for performing JSON schema validation.

Visualize your database schema

dbviz Visualize your database schema. The tool loads database schema and draws it as a graph. Usage $ dbviz -d database_name | dot -Tpng schema.png

Typify - Compile JSON Schema documents into Rust types.

Typify Compile JSON Schema documents into Rust types. This can be used ... via the macro import_types!("types.json") to generate Rust types directly i

Avro schema compatibility checker

DeGauss Your friendly neighborhood Avro schema compatibility checker. Install cargo install degauss Example Check the compatibility of your schemas d

Comments
  • Support for preserving field order in the generated schema

    This crate is already super useful as is, but I think it would be even better if the generated schema had the fields in the same order as in the file.

    For example if the file contains:

    {
        "Id": 1,
        "Username": "test",
        "BetaUser": true,
        "AccessedAccount": false
    }
    

    the generated schema (I chose TypeScript because it's the most concise) currently looks like:

    export interface Root {
        AccessedAccount: boolean;
        BetaUser: boolean;
        Id: number;
        Username: string;
    }
    

    with the fields in sorted order. With preserve_order, it would look like this:

    export interface Root {
        Id: number;
        Username: string;
        BetaUser: boolean;
        AccessedAccount: boolean;
    }
    

    serde_json has a feature called preserve_order and when it is enabled it will use indexmap::IndexMap instead of a BTreeMap: https://github.com/serde-rs/json/blob/df1fb717badea8bda80f7e104d80265da0686166/src/map.rs#L15

    serde_yaml has this behavior by default by using linked_hash_map: https://github.com/dtolnay/serde-yaml/blob/4684ca06ad5092ff337ffc7805d5388c86ed4cf3/src/mapping.rs#L13
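
    To illustrate the difference, here is a standalone sketch (it only assumes an indexmap dependency; nothing below is specific to this crate):

    use indexmap::IndexMap;
    use std::collections::BTreeMap;

    let keys = ["Id", "Username", "BetaUser", "AccessedAccount"];

    // BTreeMap iterates in sorted key order, losing the file's order...
    let sorted: BTreeMap<&str, ()> = keys.iter().map(|k| (*k, ())).collect();
    assert_eq!(sorted.keys().copied().collect::<Vec<_>>(),
               ["AccessedAccount", "BetaUser", "Id", "Username"]);

    // ...while IndexMap iterates in insertion order, preserving it.
    let ordered: IndexMap<&str, ()> = keys.iter().map(|k| (*k, ())).collect();
    assert_eq!(ordered.keys().copied().collect::<Vec<_>>(), keys);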

    enhancement 
    opened by icsaszar 3
Schema2000 is a tool that parses existing JSON documents and tries to derive a JSON schema from these documents.

Schema 2000 Schema2000 is a tool that parses existing JSON documents and tries to derive a JSON schema from these documents. Currently, Schema2000 is

sblade or switchblade it's a multitool in one capable of doing simple analysis with any type of data, attempting to speed up ethical hacking activities

A crate to convert bytes to something more usable and the other way around in a way compatible with the Confluent Schema Registry. Supporting Avro, Protobuf, Json schema, and both async and blocking.

#schema_registry_converter This library provides a way of using the Confluent Schema Registry in a way that is compliant with the Java client. The rel

Infer a JSON schema from example data, produce nonsense synthetic data (drivel) according to the schema

drivel drivel is a command-line tool written in Rust for inferring a schema from an example JSON (or JSON lines) file, and generating synthetic data (

Retina is a network analysis framework that supports 100+ Gbps traffic analysis on a single server with no specialized hardware.

Retina Retina is a network analysis framework that enables operators and researchers to ask complex questions about high-speed (>100gbE) network links

A parallel universal-ctags wrapper for git repository

ptags A parallel universal-ctags wrapper for git repository Description ptags is a universal-ctags wrapper to have the following features. Search git

Cargo subcommand to automatically create universal libraries for iOS.

cargo lipo Provides a cargo lipo subcommand which automatically creates a universal library for use with your iOS application. Maintenance Status Plea

Bindings for Binomial LLC's basis-universal Supercompressed GPU Texture Codec

basis-universal-rs Bindings for Binomial LLC's basis-universal Supercompressed GPU Texture Codec basis-universal functionality can be grouped into two

Wikit - A universal dictionary

Wikit - A universal dictionary What is it? To be short, Wikit is a tool which can (fully, may be in future) render and create dictionary file in MDX/M
