Infer a JSON schema from example data, and produce nonsense synthetic data (drivel) according to the schema

Overview

drivel

drivel is a command-line tool written in Rust for inferring a schema from an example JSON (or JSON lines) file, and generating synthetic data (the drivel in question) based on the inferred schema.

Features

  • Schema Inference: drivel can analyze JSON input and infer its schema, including data types, array lengths, and object structures.
  • Data Generation: Based on the inferred schema, drivel can generate synthetic data that adheres to the inferred structure.
  • Easy Integration: drivel reads JSON input from stdin and writes its output to stdout, allowing for easy integration into pipelines and workflows.
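Because drivel reads from stdin and writes to stdout, it slots into ordinary shell pipelines. A minimal sketch (the endpoint URL and output file name are illustrative, not part of drivel):

```shell
# Fetch a sample record from an API, then generate 100 synthetic
# records shaped like it (URL and file name are placeholders):
curl -s https://api.example.com/users/1 | drivel produce -n 100 > synthetic_users.json
```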

Installation

Binaries and a shell-based installer are available for each release.

To install the drivel executable through Cargo, ensure you have the Rust toolchain installed and run:

cargo install drivel

To add drivel as a dependency to your project, e.g., to use the schema inference engine, run:

cargo add drivel

Usage

Infer a schema from JSON input, and generate synthetic data based on the inferred schema.

Usage: drivel [OPTIONS] <COMMAND>

Commands:
  describe  Describe the inferred schema for the input data
  produce   Produce synthetic data adhering to the inferred schema
  help      Print this message or the help of the given subcommand(s)

Options:
      --infer-enum                     Infer that some string fields are enums based on the number of unique values seen
      --enum-max-uniq <ENUM_MAX_UNIQ>  The maximum ratio of unique values to total values for a field to be considered an enum. Default = 0.1
      --enum-min-n <ENUM_MIN_N>        The minimum number of strings to consider when inferring enums. Default = 1
  -h, --help                           Print help
  -V, --version                        Print version
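For example, to enable enum inference with a stricter sample-size requirement (the specific flag values here are illustrative):

```shell
# Infer enums for string fields, but only consider fields with at least
# 50 observed values, of which at most 10% are unique:
cat input.json | drivel describe --infer-enum --enum-min-n 50 --enum-max-uniq 0.1
```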

Examples

Consider a JSON file input.json:

{
  "name": "John Doe",
  "id": "0e3a99a5-0201-4444-9ab1-8343fac56233",
  "age": 30,
  "is_student": false,
  "grades": [85, 90, 78],
  "address": {
    "city": "New York",
    "zip_code": "10001"
  }
}

Running drivel in 'describe' mode:

cat input.json | drivel describe

Output:

{
  "age": int (30),
  "address": {
    "city": string (8),
    "zip_code": string (5)
  },
  "is_student": boolean,
  "grades": [
    int (78-90)
  ] (3),
  "name": string (8),
  "id": string (uuid)
}

Running drivel in 'produce' mode:

cat input.json | drivel produce -n 3

Output:

[
  {
    "address": {
      "city": "o oowrYN",
      "zip_code": "01110"
    },
    "age": 30,
    "grades": [83, 88, 88],
    "is_student": true,
    "name": "nJ heo D",
    "id": "9e0a7687-800d-404b-835f-e7d803b60380"
  },
  {
    "address": {
      "city": "oro wwNN",
      "zip_code": "11000"
    },
    "age": 30,
    "grades": [83, 88, 89],
    "is_student": false,
    "name": "oeoooeeh",
    "id": "c6884c6b-4f6a-4788-a048-e749ec30793d"
  },
  {
    "address": {
      "city": "orww ok ",
      "zip_code": "00010"
    },
    "age": 30,
    "grades": [85, 90, 86],
    "is_student": false,
    "name": "ehnDoJDo",
    "id": "71884608-2760-4853-8c12-e11149c642cd"
  }
]
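Since produce emits plain JSON, its output can be piped straight into other tools such as jq; for instance, to extract a single field from each generated record (assuming jq is installed):

```shell
# Generate 3 records and pull out just the "id" field of each:
cat input.json | drivel produce -n 3 | jq -r '.[].id'
```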

Contributing

We welcome contributions from anyone interested in improving or extending drivel! Whether you have ideas for new features, bug fixes, or improvements to the documentation, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Comments
  • Support multi-object files

    It currently panics when you give it a file with one object per line, like

    {"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.09175515174865723, "timestamp": "2023-09-05T21:38:34+0000", "response_len": 450, "properties": {}}
    {"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.0032494068145751953, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 672, "properties": {}}
    {"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/file", "status_code": 200, "latency": 0.3335745334625244, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 85195, "properties": {}}
    {"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.04735732078552246, "timestamp": "2023-09-05T22:13:40+0000", "response_len": 459, "properties": {}}
    {"user_id": 3, "user_email": "[email protected]", "user_name": "Justin", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.002116680145263672, "timestamp": "2023-09-05T22:13:42+0000", "response_len": 672, "properties": {}}
    

    An easy way to handle this might be to ignore everything after the first complete object, e.g. this works fine:

    ❯ cat usage_log.json | head -n1 | drivel describe
    

    But realistically, I'd want it treated as if the top level is an implicit array.
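    Until that lands, one way to approximate the implicit-array behaviour is jq's slurp mode, which wraps a stream of objects into a single top-level array (file name as in the example above):

    ```shell
    # -s (slurp) reads the whole JSON-lines stream into one array:
    jq -s '.' usage_log.json | drivel describe
    ```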

    opened by tkellogg 5
  • Automatically build and release binaries

    Use a tool like https://opensource.axo.dev/cargo-dist/book/introduction.html to automatically build common binaries and make them available as a release whenever a tag is pushed.

    opened by hgrsd 1
  • perf: Optimise string inference

    Using some basic heuristics, we can decrease the amount of regex matching that we do. This leads to a significant speedup (15-20% in some local benchmarks) for large data sets, depending on how many strings match the specific string types.

    opened by hgrsd 0
  • Enable enum inference

    Closes #5

    Adds a command line option to infer enums from strings, with the ability for a user to specify what ratio of unique strings they want to consider the "cutoff" for thinking of a string as an enum, as well as a minimum sample size.

    opened by hgrsd 0
  • use multithreading

    Closes #8 by introducing rayon to parallelise array schema inference as well as value production.

    Benchmarking shows that jemalloc is required in order to see a perf benefit when operating on large (100+ MB) JSON files. Without changing the allocator, there is a point at which multi-threading becomes slower than single-threaded on Linux, due to the allocator working overtime.

    opened by hgrsd 0
  • Use multiple CPU cores for schema inference and data production

    When working on seriously large JSON files (100s of MBs to GBs), schema inference starts to become a little bit slow. The same is true for producing a large number of values.

    We can leverage a crate like rayon to utilise multiple CPU cores for the inference and production logic, where things can be parallelised. For inference, for instance, we can parallelise inferring the type of each array element and folding the merge function over the results. For production, we can produce all values in the root array in parallel.

    opened by danielhogers-toast 0
  • Make configurable when to generate random strings vs choose from a sample

    For some data sets, it might not be useful to generate random strings based on the sample of characters observed. Instead, it might be better to treat strings with multiple values as enums, where one of the enum variants is picked at random whenever a value is produced.

    opened by hgrsd 0
  • Add support for user-defined schemas

    Description generated by Claude 3

    Currently, drivel infers the schema from the provided JSON input and generates data based on that inferred schema. While this is useful, there are cases where users may want to define their own schema and have drivel generate data based on that schema.

    We should add a new feature that allows users to provide a schema file (in a format to be determined, possibly JSON Schema or a custom format) that describes the desired data structure. drivel should then generate data that conforms to this user-defined schema.

    This feature would be beneficial in scenarios where:

    • The user wants to generate data with a specific structure that may not be easily inferred from a single JSON example.
    • The user wants to generate data with certain constraints or patterns that are not present in the example JSON.
    • The user wants to generate data for a schema that they have designed beforehand, without needing to provide a JSON example.

    To implement this feature, we would need to:

    • Design a schema format (or adopt an existing one) that allows users to define the desired data structure, including field names, types, constraints, and relationships between fields.
    • Add a new command-line flag (e.g., --schema-file) that allows users to specify the path to their schema file.
    • Modify the produce mode to check for the presence of a user-defined schema file. If present, use that schema for data generation instead of inferring the schema from the input JSON.
    • Update the documentation to describe how to use this new feature, including examples of the schema format and how to generate data from a user-defined schema.

    This feature would greatly enhance the flexibility and usefulness of drivel, allowing users to generate data for a wider variety of use cases and scenarios.
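    As a sketch only, a user-defined schema file in this spirit might look something like the following. The format, field names, and constraint keys are entirely hypothetical (loosely JSON Schema-like), not an existing drivel format:

    ```json
    {
      "type": "object",
      "properties": {
        "name": { "type": "string", "length": 8 },
        "age": { "type": "int", "min": 18, "max": 65 },
        "id": { "type": "string", "format": "uuid" }
      }
    }
    ```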

    opened by nathannorman-toast 0
Releases(v0.2.2)
Owner
Daniël
Software Engineer & Law PhD
Neovide - No Nonsense Neovim Client in Rust

Neovide This is a simple graphical user interface for Neovim (an aggressively refactored and updated Vim editor). Where possible there are some graphi

Neovide 9.3k Jan 5, 2023
Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages.

Turbine Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages. It’s described as a to

Justin 2 Jan 21, 2022
Traverse DMMF of Prisma schema, in your terminal

Let's DMMF Traverse DMMF of Prisma schema, in your terminal. Powered by jless. Installation brew tap yujong-lee/tap brew install letsdmmf Usage # lets

Jayden 6 Dec 9, 2022
Event-sourcing Schema Definition Language

ESDL Event-sourcing Schema Definition Language Schema definition language for defining aggregates, commands, events & custom types. Heavily inspired b

null 35 Dec 15, 2022
This crate provides a set of functions to generate SQL statements for various PostgreSQL schema objects

This crate provides a set of functions to generate SQL statements for various PostgreSQL schema objects, such as tables, views, materialized views, functions, triggers, and indexes. The generated SQL statements can be useful for schema introspection, documentation, or migration purposes.

Tyr Chen 11 Apr 4, 2023
An ultra-fast CLI app that fixes JSON files in large codebase or folders

minosse An ultra fast CLI app that fixes json files in large codebase or folders USAGE: minosse [OPTIONS] <input-dir> FLAGS: -h, --help Prints

Antonino Bertulla 5 Oct 17, 2022
A simple CLI for combining json and yaml files

A simple CLI for combining json and yaml files

Avencera 16 Jul 4, 2022
A Rust CLI to provide last publish dates for packages in a package-lock.json file

NPM Package Age A Rust CLI which if you provide a npm lockfile (package-lock.json to start), it will give you a listing of all of the packages & the l

Benjamin Lannon 1 Feb 4, 2022
This is a simple command line application to convert bibtex to json written in Rust and Python

bibtex-to-json This is a simple command line application to convert bibtex to json written in Rust and Python. Why? To enable you to convert very big

null 3 Mar 23, 2022
hj is a command line tool to convert HTTP/1-style text into JSON

hj hj is a command line tool to convert HTTP/1-style text into JSON. This command is inspired by yusukebe/rj, which is a standalone HTTP client that s

FUJI Goro 10 Aug 21, 2022
CLI application to run clang-format on a set of files specified using globs in a JSON configuration file.

run_clang_format CLI application for running clang-format for an existing .clang-format file on a set of files, specified using globs in a .json confi

martin 6 Dec 16, 2022
A small CLI tool to query ArcGIS REST API services, implemented in Rust. The server response is returned as pretty JSON.

A small CLI tool to query ArcGIS REST API services, implemented in Rust. The server response is returned as pretty JSON.

Andrew Vitale 2 Apr 25, 2022
CLI application to run clang-tidy on a set of files specified using globs in a JSON configuration file.

run-clang-tidy CLI application for running clang-tidy for an existing .clang-tidy file on a set of files, specified using globs in a .json configurati

martin 7 Nov 4, 2022
A Faster(⚡) formatter, linter, bundler, and more for JavaScript, TypeScript, JSON, HTML, Markdown, and CSS Lapce Plugin

Lapce Plugin for Rome Lapce-rome is a Lapce plugin for rome, The Rome is faster ⚡ , A formatter, linter, compiler, bundler, and more for JavaScript, T

xiaoxin 7 Dec 16, 2022
jf "jf: %q" "JSON Format"

jf jf "jf: %q" "JSON Format" jf is a jo alternative to help safely format and print JSON objects in the commandline. However, unlike jo, where you bui

Arijit Basu 15 Apr 1, 2023
A simple CLI tool for converting CSV file content to JSON.

fast-csv-to-json A simple CLI tool for converting CSV file content to JSON. 我花了一個小時搓出來,接著優化了兩天的快速 CSV 轉 JSON CLI 小工具 Installation Install Rust with ru

Ming Chang 3 Apr 5, 2023
This project returns Queried value from SOAP(XML) in form of JSON.

About This is project by team SSDD for HachNUThon (TechHolding). This project stores and allows updating SOAP(xml) data and responds to various querie

Sandipsinh Rathod 3 Apr 30, 2023
This CLI utility facilitates effortless manipulation and exploration of TOML, YAML, JSON and RON files.

???????? This CLI utility facilitates effortless manipulation and exploration of TOML, YAML, JSON and RON files.

Moe 3 Apr 26, 2023
List key patterns of a JSON file for jq.

jqk jqk lists key patterns of a JSON file for jq. Why? jq is a useful command line tool to filter values from a JSON file quickly on a terminal; howev

Kentaro Wada 8 Jun 25, 2023