Infer a JSON schema from example data, and produce nonsense synthetic data (drivel) according to the schema

Overview

drivel

drivel is a command-line tool written in Rust for inferring a schema from an example JSON (or JSON lines) file, and generating synthetic data (the drivel in question) based on the inferred schema.

Features

  • Schema Inference: drivel can analyze JSON input and infer its schema, including data types, array lengths, and object structures.
  • Data Generation: Based on the inferred schema, drivel can generate synthetic data that adheres to the inferred structure.
  • Easy Integration: drivel reads JSON input from stdin and writes its output to stdout, allowing for easy integration into pipelines and workflows.
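Because drivel reads from stdin and writes to stdout, it slots into ordinary shell pipelines. A minimal sketch (the endpoint URL and output file name are illustrative, not part of drivel):

```shell
# Fetch a sample record from an API, then generate 100 synthetic
# records shaped like it (URL and file name are placeholders):
curl -s https://api.example.com/users/1 | drivel produce -n 100 > synthetic_users.json
```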

Installation

Binaries and a shell-based installer are available for each release.

To install the drivel executable through Cargo, ensure you have the Rust toolchain installed and run:

cargo install drivel

To add drivel as a dependency to your project, e.g., to use the schema inference engine, run:

cargo add drivel

Usage

Infer a schema from JSON input, and generate synthetic data based on the inferred schema.

Usage: drivel [OPTIONS] <COMMAND>

Commands:
  describe  Describe the inferred schema for the input data
  produce   Produce synthetic data adhering to the inferred schema
  help      Print this message or the help of the given subcommand(s)

Options:
      --infer-enum                     Infer that some string fields are enums based on the number of unique values seen
      --enum-max-uniq <ENUM_MAX_UNIQ>  The maximum ratio of unique values to total values for a field to be considered an enum. Default = 0.1
      --enum-min-n <ENUM_MIN_N>        The minimum number of strings to consider when inferring enums. Default = 1
  -h, --help                           Print help
  -V, --version                        Print version
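For example, to enable enum inference with a stricter sample-size requirement (the specific flag values here are illustrative):

```shell
# Infer enums for string fields, but only consider fields with at least
# 50 observed values, of which at most 10% are unique:
cat input.json | drivel describe --infer-enum --enum-min-n 50 --enum-max-uniq 0.1
```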

Examples

Consider a JSON file input.json:

{
  "name": "John Doe",
  "id": "0e3a99a5-0201-4444-9ab1-8343fac56233",
  "age": 30,
  "is_student": false,
  "grades": [85, 90, 78],
  "address": {
    "city": "New York",
    "zip_code": "10001"
  }
}

Running drivel in 'describe' mode:

cat input.json | drivel describe

Output:

{
  "age": int (30),
  "address": {
    "city": string (8),
    "zip_code": string (5)
  },
  "is_student": boolean,
  "grades": [
    int (78-90)
  ] (3),
  "name": string (8),
  "id": string (uuid)
}

Running drivel in 'produce' mode:

cat input.json | drivel produce -n 3

Output:

[
  {
    "address": {
      "city": "o oowrYN",
      "zip_code": "01110"
    },
    "age": 30,
    "grades": [83, 88, 88],
    "is_student": true,
    "name": "nJ heo D",
    "id": "9e0a7687-800d-404b-835f-e7d803b60380"
  },
  {
    "address": {
      "city": "oro wwNN",
      "zip_code": "11000"
    },
    "age": 30,
    "grades": [83, 88, 89],
    "is_student": false,
    "name": "oeoooeeh",
    "id": "c6884c6b-4f6a-4788-a048-e749ec30793d"
  },
  {
    "address": {
      "city": "orww ok ",
      "zip_code": "00010"
    },
    "age": 30,
    "grades": [85, 90, 86],
    "is_student": false,
    "name": "ehnDoJDo",
    "id": "71884608-2760-4853-8c12-e11149c642cd"
  }
]
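Since produce emits plain JSON, its output can be piped straight into other tools such as jq; for instance, to extract a single field from each generated record (assuming jq is installed):

```shell
# Generate 3 records and pull out just the "id" field of each:
cat input.json | drivel produce -n 3 | jq -r '.[].id'
```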

Contributing

We welcome contributions from anyone interested in improving or extending drivel! Whether you have ideas for new features, bug fixes, or improvements to the documentation, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Comments
  • Support multi-object files

    It currently panics when you give it a file with one object per line, like

    {"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.09175515174865723, "timestamp": "2023-09-05T21:38:34+0000", "response_len": 450, "properties": {}}
    {"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.0032494068145751953, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 672, "properties": {}}
    {"user_id": 2, "user_email": "[email protected]", "user_name": "Tim", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/file", "status_code": 200, "latency": 0.3335745334625244, "timestamp": "2023-09-05T21:38:35+0000", "response_len": 85195, "properties": {}}
    {"user_id": null, "user_email": null, "user_name": null, "email_domain": null, "org": null, "method": "POST", "path": "/logged_in", "status_code": 200, "latency": 0.04735732078552246, "timestamp": "2023-09-05T22:13:40+0000", "response_len": 459, "properties": {}}
    {"user_id": 3, "user_email": "[email protected]", "user_name": "Justin", "email_domain": "fr.ai", "org": null, "method": "GET", "path": "/datasets/web", "status_code": 200, "latency": 0.002116680145263672, "timestamp": "2023-09-05T22:13:42+0000", "response_len": 672, "properties": {}}
    

    An easy way to handle this might be to ignore everything after the first complete object, e.g. this works fine:

    ❯ cat usage_log.json | head -n1 | drivel describe
    

    But realistically, I'd want it treated as if the top level is an implicit array.
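    Until that lands, one way to approximate the implicit-array behaviour is jq's slurp mode, which wraps a stream of objects into a single top-level array (file name as in the example above):

    ```shell
    # -s (slurp) reads the whole JSON-lines stream into one array:
    jq -s '.' usage_log.json | drivel describe
    ```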

    opened by tkellogg 5
  • Automatically build and release binaries

    Use a tool like https://opensource.axo.dev/cargo-dist/book/introduction.html to automatically build common binaries and make them available as a release whenever a tag is pushed.

    opened by hgrsd 1
  • perf: Optimise string inference

    Using some basic heuristics, we can decrease the amount of regex matching that we do. This leads to a significant speedup (15-20% in some local benchmarks) for large data sets, depending on how many strings match the specific string types.

    opened by hgrsd 0
  • Enable enum inference

    Closes #5

    Adds a command line option to infer enums from strings, with the ability for a user to specify what ratio of unique strings they want to consider the "cutoff" for thinking of a string as an enum, as well as a minimum sample size.

    opened by hgrsd 0
  • use multithreading

    Closes #8 by introducing rayon to parallelise array schema inference as well as value production.

    Benchmarking shows that jemalloc is required in order to see a perf benefit when operating on large (100+ MB) JSON files. Without changing the allocator, there is a point at which multi-threading becomes slower than single-threaded on Linux, due to the allocator working overtime.

    opened by hgrsd 0
  • Use multiple CPU cores for schema inference and data production

    When working on seriously large JSON files (100s of MBs to GBs), schema inference starts to become a little bit slow. The same is true for producing a large number of values.

    We can leverage a crate like rayon to utilise multiple CPU cores for the inference and production logic, where things can be parallelised. For inference, for instance, we can parallelise inferring the type of each array element and folding the merge function over the results. For production, we can produce all values in the root array in parallel.

    opened by danielhogers-toast 0
  • Make configurable when to generate random strings vs choose from a sample

    For some data sets, it might not be useful to generate random strings based on the sample of characters observed. Instead, it might be better to treat strings with multiple values as enums, where one of the enum variants is picked at random whenever a value is produced.

    opened by hgrsd 0
  • Add support for user-defined schemas

    Description generated by Claude 3

    Currently, drivel infers the schema from the provided JSON input and generates data based on that inferred schema. While this is useful, there are cases where users may want to define their own schema and have drivel generate data based on that schema.

    We should add a new feature that allows users to provide a schema file (in a format to be determined, possibly JSON Schema or a custom format) that describes the desired data structure. drivel should then generate data that conforms to this user-defined schema.

    This feature would be beneficial in scenarios where:

    • The user wants to generate data with a specific structure that may not be easily inferred from a single JSON example.
    • The user wants to generate data with certain constraints or patterns that are not present in the example JSON.
    • The user wants to generate data for a schema that they have designed beforehand, without needing to provide a JSON example.

    To implement this feature, we would need to:

    • Design a schema format (or adopt an existing one) that allows users to define the desired data structure, including field names, types, constraints, and relationships between fields.
    • Add a new command-line flag (e.g., --schema-file) that allows users to specify the path to their schema file.
    • Modify the produce mode to check for the presence of a user-defined schema file. If present, use that schema for data generation instead of inferring the schema from the input JSON.
    • Update the documentation to describe how to use this new feature, including examples of the schema format and how to generate data from a user-defined schema.

    This feature would greatly enhance the flexibility and usefulness of drivel, allowing users to generate data for a wider variety of use cases and scenarios.
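    As a sketch only, a user-defined schema file in this spirit might look something like the following. The format, field names, and constraint keys are entirely hypothetical (loosely JSON Schema-like), not an existing drivel format:

    ```json
    {
      "type": "object",
      "properties": {
        "name": { "type": "string", "length": 8 },
        "age": { "type": "int", "min": 18, "max": 65 },
        "id": { "type": "string", "format": "uuid" }
      }
    }
    ```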

    opened by nathannorman-toast 0
Releases(v0.2.2)
Owner
Daniël
Software Engineer & Law PhD
Neovide - No Nonsense Neovim Client in Rust

Neovide This is a simple graphical user interface for Neovim (an aggressively refactored and updated Vim editor). Where possible there are some graphi

Neovide 9.3k Jan 5, 2023
Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages.

Turbine Turbine is a toy CLI app for converting Rails schema declarations into equivalent type declarations in other languages. It’s described as a to

Justin 2 Jan 21, 2022
Traverse DMMF of Prisma schema, in your terminal

Let's DMMF Traverse DMMF of Prisma schema, in your terminal. Powered by jless. Installation brew tap yujong-lee/tap brew install letsdmmf Usage # lets

Jayden 6 Dec 9, 2022
Event-sourcing Schema Definition Language

ESDL Event-sourcing Schema Definition Language Schema definition language for defining aggregates, commands, events & custom types. Heavily inspired b

null 35 Dec 15, 2022
This crate provides a set of functions to generate SQL statements for various PostgreSQL schema objects

This crate provides a set of functions to generate SQL statements for various PostgreSQL schema objects, such as tables, views, materialized views, functions, triggers, and indexes. The generated SQL statements can be useful for schema introspection, documentation, or migration purposes.

Tyr Chen 11 Apr 4, 2023
An ultra-fast CLI app that fixes JSON files in large codebase or folders

minosse An ultra fast CLI app that fixes json files in large codebase or folders USAGE: minosse [OPTIONS] <input-dir> FLAGS: -h, --help Prints

Antonino Bertulla 5 Oct 17, 2022
A simple CLI for combining json and yaml files

A simple CLI for combining json and yaml files

Avencera 16 Jul 4, 2022
A Rust CLI to provide last publish dates for packages in a package-lock.json file

NPM Package Age A Rust CLI which if you provide a npm lockfile (package-lock.json to start), it will give you a listing of all of the packages & the l

Benjamin Lannon 1 Feb 4, 2022
This is a simple command line application to convert bibtex to json written in Rust and Python

bibtex-to-json This is a simple command line application to convert bibtex to json written in Rust and Python. Why? To enable you to convert very big

null 3 Mar 23, 2022
hj is a command line tool to convert HTTP/1-style text into JSON

hj hj is a command line tool to convert HTTP/1-style text into JSON. This command is inspired by yusukebe/rj, which is a standalone HTTP client that s

FUJI Goro 10 Aug 21, 2022
CLI application to run clang-format on a set of files specified using globs in a JSON configuration file.

run_clang_format CLI application for running clang-format for an existing .clang-format file on a set of files, specified using globs in a .json confi

martin 6 Dec 16, 2022
A small CLI tool to query ArcGIS REST API services, implemented in Rust. The server response is returned as pretty JSON.

A small CLI tool to query ArcGIS REST API services, implemented in Rust. The server response is returned as pretty JSON.

Andrew Vitale 2 Apr 25, 2022
CLI application to run clang-tidy on a set of files specified using globs in a JSON configuration file.

run-clang-tidy CLI application for running clang-tidy for an existing .clang-tidy file on a set of files, specified using globs in a .json configurati

martin 7 Nov 4, 2022
A Faster(⚡) formatter, linter, bundler, and more for JavaScript, TypeScript, JSON, HTML, Markdown, and CSS Lapce Plugin

Lapce Plugin for Rome Lapce-rome is a Lapce plugin for rome, The Rome is faster ⚡ , A formatter, linter, compiler, bundler, and more for JavaScript, T

xiaoxin 7 Dec 16, 2022
jf "jf: %q" "JSON Format"

jf jf "jf: %q" "JSON Format" jf is a jo alternative to help safely format and print JSON objects in the commandline. However, unlike jo, where you bui

Arijit Basu 15 Apr 1, 2023
A simple CLI tool for converting CSV file content to JSON.

fast-csv-to-json A simple CLI tool for converting CSV file content to JSON. 我花了一個小時搓出來,接著優化了兩天的快速 CSV 轉 JSON CLI 小工具 Installation Install Rust with ru

Ming Chang 3 Apr 5, 2023
This project returns Queried value from SOAP(XML) in form of JSON.

About This is project by team SSDD for HachNUThon (TechHolding). This project stores and allows updating SOAP(xml) data and responds to various querie

Sandipsinh Rathod 3 Apr 30, 2023
This CLI utility facilitates effortless manipulation and exploration of TOML, YAML, JSON and RON files.

???????? This CLI utility facilitates effortless manipulation and exploration of TOML, YAML, JSON and RON files.

Moe 3 Apr 26, 2023
List key patterns of a JSON file for jq.

jqk jqk lists key patterns of a JSON file for jq. Why? jq is a useful command line tool to filter values from a JSON file quickly on a terminal; howev

Kentaro Wada 8 Jun 25, 2023