
rjp: Rapid JSON-lines processor

A fast and simple command-line tool for common operations over JSON-lines files, such as:

  • converting to and from text and TSV files
  • joining files on (multiple) keys
  • merging files line by line
  • adding, removing, selecting fields
  • ...

You could use jq for some of these tasks (and in fact, jq is a far more general tool), but:

  • rjp is designed for the JSON-lines format specifically
  • rjp can be faster
  • some common tasks are more easily done in rjp

This is my attempt to learn a bit of Rust, so don't take this tool too seriously. That being said, it is pretty quick and handy, at least for me.

Build & Installation

Get Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

Clone and build rjp:

git clone https://github.com/ales-t/rjp.git
cd rjp
cargo build --release

You will find the binary in target/release/rjp. You can add it to your PATH e.g. like this:

export PATH="$(pwd)/target/release:$PATH"

Basic usage

rjp < input_file [INPUT_CONVERSION] [PROCESSOR [PROCESSOR...]] [OUTPUT_CONVERSION] > output_file

rjp runs a chain of processors on each instance in the input stream (STDIN), finally printing the processed instances to STDOUT.
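
For example, the following pipeline (with illustrative file and field names) converts a TSV file to JSON lines, parses the age column as a number, and writes selected fields back out as TSV:

  • rjp < people.tsv from_tsv name,age to_number age:age_num select name,age_num to_tsv name,age_num > out.tsv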

Input conversions

By default, rjp reads the input file as JSON lines. You can optionally specify a file conversion as the first positional argument.

TSV

Convert TSV lines to JSON, using the specified (comma-separated) field names.

Aliases: tsv_to_json, from_tsv

Examples:

  • rjp < in.tsv from_tsv first_field_name,second_field_name,... [PROCESSORS] [OUTPUT_CONVERSION] > output_file

Plain text

Conversion from plain text treats the whole input line as a single string field; you need to specify the field's name.

Aliases: txt_to_json, from_txt

Examples:

  • rjp < in.txt from_txt field_name [PROCESSORS] [OUTPUT_CONVERSION] > output_file

Processors

The following processors are implemented (shorthand aliases are listed for each):

Add fields

Add new fields with constant values.

Aliases: add_fields, af, add

Examples:

  • rjp < in.json add_fields new_field_name:value1,another_field:value2 > out.json

Drop fields

Remove existing fields.

Aliases: drop_fields, df, drop

Examples:

  • rjp < in.json drop_fields to_drop,another_to_drop > out.json

Extract items

Extract items from arrays and objects.

Aliases: extract_items, e, extract

Examples:

  • rjp < in.json extract_items array_field[0]:new_field,object_field[key]:another_field > out.json

Join

Perform inner join with another input stream (with optional file conversion).

Note on performance: while the main stream is processed line by line, the stream to join is fully loaded into RAM, so use the smaller file as the joined stream.

Aliases: join, j, inner_join

Examples:

  • rjp < in.json join file.json key_field_1,key_field_2 > out.json
  • With file conversion: rjp < in.json join file.tsv key from_tsv key,tsv_value > out.json

Left join

Identical to join, except that lines from the main stream that don't have a corresponding instance in the joined stream are kept (and no additional fields are added to them).

Aliases: lj, left_join
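
Examples (the syntax mirrors join; file and field names are illustrative):

  • rjp < in.json left_join file.json key_field_1,key_field_2 > out.json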

Merge

Merge with another input stream line by line, with an optional file conversion.

Aliases: merge, mrg

Examples:

  • rjp < in.json merge file_to_merge.json > out.json
  • With file conversion: rjp < in.json merge to_merge.tsv from_tsv col_a,col_b > out.json

Rename fields

Rename fields in instances.

Aliases: rename, rnm

Examples:

  • rjp < in.json rename old_name:new_name,another_old:another_new > out.json

Select fields

Select a subset of fields (the rest are dropped).

Aliases: select_fields, sf, select, sel

Examples:

  • rjp < in.json select_fields first,second > out.json

To number

Convert a string field to a numeric one.

Aliases: to_number, num

Examples:

  • rjp < in.json to_number string_field_name:new_numeric_field_name,another_string:another_numeric > out.json

Output conversions

By default, rjp will produce JSON-lines. You can change that with a file conversion.

TSV

Convert into TSV lines with the specified fields.

Aliases: to_tsv, json_to_tsv, tsv

Examples:

  • rjp < in.json to_tsv field_1,field_2 > out.tsv
Comments
  • Testing scaffolding

    I managed to decouple the CLI from the main worker, which I now call main_worker. There are some issues associated with it that are described in test.rs, such as the trait bounds.

    Copying the description of the test_in_dirs function:

    Runs all tests in the tests directory. This assumes that the tests directory contains multiple directories, each with at least two files: command and output. The command is executed and its result compared with output (via an assert). The test is run from the top-level project directory, but arguments in the command ending with .json or .tsv are automatically prefixed with the path to the subdirectory. The first parameter of the command is the file that's redirected to the worker's standard input (imagine prefixing the whole command with a <). To add more tests, simply copy one of the existing folders.
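
    For illustration, a single test case might then look like this (the join_basic name and file contents are hypothetical):

    tests/
      join_basic/
        command    (e.g. "in.json join file.json key"; the first argument is redirected to STDIN)
        output     (the expected STDOUT, one JSON line per instance)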

    Pros of this approach:

    • Very easy to add more simple tests (simply create a new directory)

    Cons of this approach:

    • When running cargo test, all these tests from all folders run under the single name test_in_dirs and count as one test.
    • It is also hard to see which test failed.

    There are three options:

    1. Accept this code as is
    2. Find a way to resolve the cons while keeping the current structure
    3. Turn the written test code into an easily assertable function that takes a command string (written in code, one per test function) and a path to the expected output.

    I personally lean towards 3 because it easily fixes all the current cons. What I don't like about it is that the command to run rjp is written in code; from the perspective of testing, that seems more like data and should therefore live in a separate file. But maybe there's an intermediate solution that combines the current directory structure with the more modular functions.

    I currently added two tests from your comment in #3 and did not think much about their validity. There are more issues with this code, such as panicking whenever rjp panics, but I think that for an initial shot at testing it's OK. Apologies it took me so long to get to this.

    Let me know whether option 3 is a good way to go and whether you have some more remarks. It should be fairly easy to implement it.

    opened by zouharvi 4
  • Test cases

    I'm still in the process of transforming the code to be testable (first step is detaching the CLI from the code entry point). @ales-t do you have some test cases which you used during development that could guide me in this?

    opened by zouharvi 3
  • Reorganize readme

    Despite the readme processor descriptions being properly organized using markdown headings, I find them hard to navigate. Putting them in tables saves on vertical space (less heading padding) and makes it easier to find the commands.

    opened by zouharvi 2
  • reorganize project into modules

    This PR reorganizes the project into directories and modules for easier management. It may not be the most elegant solution but gets rid of the flat directory structure.

    opened by zouharvi 0
  • Misc. Non-breaking Updates

    There's no real content in this PR as it is mostly cosmetics.

    • README typos
    • Begin work on making code testable (separation of CLI from entrypoint)
    • Lint every file with rustfmt
    • Accept suggestions by clippy
    opened by zouharvi 0
  • Large test file generation for benchmarking

    Since the current test files are very small and cargo bench rounds the runtime to 0.0 s, I was thinking about running the tests on larger files (notably (1) very long lines, (2) many lines, or (3) both). The issue is that nobody really wants 0.5 GB files in the repository.

    Since their content is rather arbitrary, maybe we could just generate random data (with a fixed seed) and only generate it locally? A few lines of a Python script would do, but maybe it'd be cleaner to add a simple binary crate to this package.

    In any case, I'm not sure there's an elegant way to express the dependency other than erroring in tests when the files are not found, telling the user they probably forgot to generate them.
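
    For illustration, even a fixed-seed awk one-liner could generate such a file locally (the field layout, sizes, and file name are arbitrary, and exact reproducibility depends on the awk implementation):

    awk -v seed=42 -v n=1000000 'BEGIN { srand(seed); for (i = 0; i < n; i++) printf("{\"id\": %d, \"value\": %f}\n", i, rand()) }' > big.json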

    opened by zouharvi 0
  • Templating processor

    In jq one can write the following (complex) query: jq '[.[] | {message: .commit.message, name: .commit.committer.name, parents: [.parents[].html_url]}]'. I find this templating quite intuitive and useful, but at the same time it seems like there's a lot of work behind it (mostly the parsing).

    Currently rjp is based around simple operations and piping multiple rjp calls together, so I'm not sure this would even fit into the scheme. If it were implemented, it would replace extract, rename and select.
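
    For comparison, part of that jq query could be approximated today by chaining existing processors; this is only a sketch (it assumes chained extracts can unpack nested objects step by step, ignores the parents array, and uses an illustrative file name):

    rjp < commits.json extract commit[message]:message,commit[committer]:committer extract committer[name]:name select message,name > out.json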

    enhancement 
    opened by zouharvi 0
  • Argument parser

    Relevant to #7, we should definitely transition from the handwritten argument parser to something like clap. I'm however wary of implementing it now because version 3.0 should come out soon™ and I had issues including the beta in Cargo.toml.

    opened by zouharvi 1
  • Tolerant mode

    Currently, rjp will stop when encountering any error, such as:

    • Select/rename not finding the required fields in an instance.
    • serde failing to parse an input line.
    • Join not finding the keys to join on in an instance.
    • Merge when the stream lengths are mismatched.
    • ...

    Oftentimes, JSON lines files are noisy and contain lines with problems. It would be helpful to have some way to request "tolerant" behavior. For example, rename_field would not change an instance if the input fields are not found. Alternatively, problematic instances may be skipped in the output stream.

    It's currently not clear to me how the various processors should behave in this tolerant setting, or whether there should be several modes (for instance --skip-bad-instances, --stop-on-bad-instance, --keep-bad-instances?).

    opened by ales-t 2
  • Way to approach parallelization

    There are multiple ways in which we could do parallelization. Assuming that each line can be processed independently (which may not be true in the future if we add more processors), then with workers A, B, C the parallelization may look as follows (assume 12 lines):

    1. ABC ABC ABC ABC: each worker is fed exactly one line, and once all three are finished, the results are sent to the buffer. This is the easiest solution but may not even be worth it, given that thread spawning is not zero-cost.
    2. AABBCC AABBCC: the next step is to increase the number of lines fed to each worker (here 2). There is certainly a threshold chunk size above which threading becomes beneficial (that number probably being larger than 1).
    3. AAA BBB CCC: this chunks the whole data into equal parts and feeds them to the workers. It creates the largest chunks and is the fastest from a parallelization perspective, but it requires that the whole data first be loaded into memory, which may not be viable for large files. (In a sense this is an extreme case of 2.)
    4. ABC AAABB CAC: this is the most complicated solution (though probably the fastest one). It assigns work dynamically: e.g. if A is finished while B and C are working on long and difficult lines, then A gets the next lines assigned.

    I think that 2 is the best option because it can be parametrized via two CLI parameters. Maybe I missed some approach.

    opened by zouharvi 1
Owner

Ales Tamchyna
Related projects

  • jq: a lightweight and flexible command-line JSON processor. (Stephen Dolan, 23.9k stars, Jan 4, 2023)
  • jless: a command-line JSON viewer; a replacement for whatever combination of less, jq, cat and your editor you currently use for viewing JSON files. Written in Rust and installable as a single standalone binary. (3.5k stars, Jan 8, 2023)
  • GJSON: a Rust crate that provides a fast and simple way to get values from a JSON document. (Josh Baker, 160 stars, Dec 29, 2022)
  • nftables-json: a Serde JSON model for interacting with the nftables nft executable. (Alex Forster, 3 stars, Jan 8, 2023)
  • COMPACTO: a fast way to minify JSON (work in progress). (Eduardo Stuart, 4 stars, Feb 27, 2022)
  • jsondiff: a tool that outputs the semantic difference of JSON (sorts object keys, and optionally arrays, before comparison). (niboshi, 3 stars, Sep 22, 2021)
  • hocon:vert: a CLI tool to convert HOCON into valid JSON or YAML, written in Rust. (Mathias Oertel, 23 stars, Jan 6, 2023)
  • JQL: a JSON Query Language CLI tool built with Rust. (Davy Duperron, 872 stars, Jan 1, 2023)
  • happyscribe2podlove: a small CLI tool that converts Happy Scribe JSON to a sane VTT format that works with the Podlove Publisher. (Arne Bahlo, 2 stars, Feb 7, 2022)
  • jsonptr: data structures and logic for resolving, assigning, and deleting by JSON Pointers (RFC 6901). (Chance, 38 stars, Aug 28, 2022)
  • assert_json: an easy and declarative way to test JSON input in Rust, heavily inspired by the serde json macro. (Charles Vandevoorde, 8 stars, Dec 5, 2022)
  • cargo-action-fmt: converts cargo check (and cargo clippy) JSON output to the GitHub Action error format. (Oliver Gould, 8 stars, Oct 12, 2022)
  • serde_esri: Esri JSON struct definitions and serde integration. (Josiah Parry, 5 stars, Nov 23, 2023)
  • Pikkr: a JSON parser which picks up values directly without performing tokenization, in Rust. (615 stars, Dec 29, 2022)
  • Serde JSON: a strongly typed JSON library for Rust. (3.6k stars, Jan 5, 2023)
  • json-rust: a JSON implementation in Rust; parse and serialize JSON with ease. (Maciej Hirsz, 500 stars, Dec 21, 2022)
  • A-JSON: a Rust port of gjson; get JSON values by dotpath syntax. (Chen Jiaju, 90 stars, Dec 6, 2022)
  • rurl: a curl-like CLI tool that takes its parameters from a JSON configuration file per request. (Bruno Ribeiro da Silva, 6 stars, Sep 10, 2022)
  • Zotero to Obsidian script: converts a Better BibTeX JSON file exported from Zotero into an organised collection of reference notes in Obsidian. (Sashin Exists, 3 stars, Oct 9, 2022)