A fast and simple command-line tool for common operations over JSON-lines files

Ales Tamchyna

Last update: Jul 8, 2022

Overview

rjp: Rapid JSON-lines processor

A fast and simple command-line tool for common operations over JSON-lines files, such as:

converting to and from text files, TSV files
joining files on (multiple) keys
merging files line by line
adding, removing, selecting fields
...

You could use jq for some of these tasks (and in fact, jq is a far more general tool) but:

rjp is designed for the JSON-lines format specifically
rjp can be faster
some common tasks are more easily done in rjp

This is my attempt to learn a bit of Rust, so don't take this tool too seriously. That being said, it is pretty quick and handy, at least for me.

Build & Installation

Get Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

Clone and build rjp:

git clone https://github.com/ales-t/rjp.git
cd rjp
cargo build --release

You will find the binary in target/release/rjp. You can add it to your PATH e.g. like this:

export PATH="$(pwd)/target/release:$PATH"

Basic usage

rjp < input_file [INPUT_CONVERSION] [PROCESSOR [PROCESSOR...]] [OUTPUT_CONVERSION] > output_file

rjp runs a chain of processors on each instance in the input stream (STDIN), finally printing the processed instances to STDOUT.
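
As a rough illustration of the chaining idea, the sketch below models a processor as a trait and folds one instance through a list of processors. Everything in it is hypothetical: the `Processor` trait, the `AddFields` and `DropFields` structs, `run_chain`, and the string-only map used as the instance type are illustrative assumptions, not rjp's actual internals.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for one JSON-lines instance (string values only).
type Instance = BTreeMap<String, String>;

/// Hypothetical processor interface: transform one instance or fail.
trait Processor {
    fn process(&self, instance: Instance) -> Result<Instance, String>;
}

/// Adds constant fields, similar in spirit to `add_fields`.
struct AddFields(Vec<(String, String)>);

impl Processor for AddFields {
    fn process(&self, mut instance: Instance) -> Result<Instance, String> {
        for (name, value) in &self.0 {
            instance.insert(name.clone(), value.clone());
        }
        Ok(instance)
    }
}

/// Removes fields, similar in spirit to `drop_fields`.
struct DropFields(Vec<String>);

impl Processor for DropFields {
    fn process(&self, mut instance: Instance) -> Result<Instance, String> {
        for name in &self.0 {
            instance.remove(name);
        }
        Ok(instance)
    }
}

/// Runs every processor in order on a single instance.
/// The first error stops the chain for that instance.
fn run_chain(chain: &[Box<dyn Processor>], instance: Instance) -> Result<Instance, String> {
    chain.iter().try_fold(instance, |inst, p| p.process(inst))
}
```

With a chain of `AddFields` followed by `DropFields`, `run_chain` returns the instance with the constant field added and the dropped field removed.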

Input conversions

By default, rjp reads the input file as JSON lines. You can optionally specify a file conversion as the first positional argument.

TSV

Convert TSV lines with specified field names.

Aliases: tsv_to_json, from_tsv

Examples:

rjp < in.tsv from_tsv first_field_name,second_field_name,... [PROCESSORS] [OUTPUT_CONVERSION] > output_file
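
A minimal sketch of what this conversion amounts to, assuming every column becomes a string field and that a column-count mismatch is an error (how rjp actually handles a mismatch is not stated here). The helper names `json_string` and `tsv_to_json_line` are made up for the example.

```rust
/// Renders a string as a JSON string literal, escaping quotes,
/// backslashes and control characters.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Pairs each tab-separated column with the field name at the same
/// position and renders the result as one JSON object on a single line.
/// Assumption: a differing number of columns and names is an error.
fn tsv_to_json_line(line: &str, names: &[&str]) -> Result<String, String> {
    let columns: Vec<&str> = line.split('\t').collect();
    if columns.len() != names.len() {
        return Err(format!(
            "expected {} columns, found {}",
            names.len(),
            columns.len()
        ));
    }
    let fields: Vec<String> = names
        .iter()
        .zip(columns.iter())
        .map(|(name, value)| format!("{}:{}", json_string(name), json_string(value)))
        .collect();
    Ok(format!("{{{}}}", fields.join(",")))
}
```

In this sketch the values stay strings; the to_number processor described further down is the documented way to turn such a field into a numeric one.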

Plain text

Conversion from TXT treats the whole input line as a single string field; you need to specify its name.

Aliases: txt_to_json, from_txt

Examples:

rjp < in.txt from_txt field_name [PROCESSORS] [OUTPUT_CONVERSION] > output_file

Processors

The following processors are implemented (brackets list shorthand aliases):

Add fields

Add new fields with constant values.

Aliases: add_fields, af, add

Examples:

rjp < in.json add_fields new_field_name:value1,another_field:value2 > out.json

Drop fields

Remove existing fields.

Aliases: drop_fields, df, drop

Examples:

rjp < in.json drop_fields to_drop,another_to_drop > out.json

Extract items

Extract items from arrays and objects.

Aliases: extract_items, e, extract

Examples:

rjp < in.json extract_items array_field[0]:new_field,object_field[key]:another_field > out.json

Join

Perform inner join with another input stream (with optional file conversion).

Note on performance: while the main stream is processed line-by-line, the stream to join is loaded in RAM (i.e. use the smaller file as the joined stream).

Aliases: join, j, inner_join

Examples:

rjp < in.json join file.json key_field_1,key_field_2 > out.json
With file conversion: rjp < in.json join file.tsv key from_tsv key,tsv_value > out.json

Left join

Identical to join, except that lines from the main stream that don't have a corresponding instance in the joined stream are kept (and no additional fields are added to them).

Aliases: lj, left_join
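
The difference between the two joins, and the reason the joined stream has to fit in RAM, can be sketched as follows. This is an illustration under assumptions, not rjp's code: instances are simplified to string-only maps, `join_key` and `join` are hypothetical names, a later duplicate key in the joined stream overwrites an earlier one, fields already present in the main instance win on name clashes, and a main-stream instance that lacks a key field is simply treated as unmatched.

```rust
use std::collections::{BTreeMap, HashMap};

/// Simplified stand-in for one JSON-lines instance (string values only).
type Instance = BTreeMap<String, String>;

/// Builds the composite join key, or None if any key field is missing.
fn join_key(instance: &Instance, keys: &[&str]) -> Option<Vec<String>> {
    keys.iter().map(|k| instance.get(*k).cloned()).collect()
}

/// Inner join when `keep_unmatched` is false, left join when it is true.
fn join(
    main_stream: impl IntoIterator<Item = Instance>,
    joined_stream: Vec<Instance>,
    keys: &[&str],
    keep_unmatched: bool,
) -> Vec<Instance> {
    // The joined stream is indexed in memory up front.
    let mut index: HashMap<Vec<String>, Instance> = HashMap::new();
    for instance in joined_stream {
        if let Some(key) = join_key(&instance, keys) {
            index.insert(key, instance);
        }
    }

    // The main stream is consumed one instance at a time.
    let mut out = Vec::new();
    for mut instance in main_stream {
        let matched = join_key(&instance, keys).and_then(|key| index.get(&key));
        match matched {
            Some(other) => {
                // Copy over fields that the main instance does not have yet.
                for (name, value) in other {
                    instance.entry(name.clone()).or_insert_with(|| value.clone());
                }
                out.push(instance);
            }
            // Left join: keep the instance unchanged.
            None if keep_unmatched => out.push(instance),
            // Inner join: drop it.
            None => {}
        }
    }
    out
}
```

Only the index grows with the size of the joined stream, which is why passing the smaller file as the joined stream keeps memory use down.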

Merge

Merge with another input stream line-by-line, with optional file conversion.

Aliases: merge, mrg

Examples:

rjp < in.json merge file_to_merge.json > out.json
With file conversion: rjp < in.json merge to_merge.tsv from_tsv col_a,col_b > out.json

Rename fields

Rename fields in instances.

Aliases: rename, rnm

Examples:

rjp < in.json rename old_name:new_name,another_old:another_new > out.json

Select fields

Select a subset of fields (the rest are dropped).

Aliases: select_fields, sf, select, sel

Examples:

rjp < in.json select_fields first,second > out.json

To number

Convert a string field to a numeric one.

Aliases: to_number, num

Examples:

rjp < in.json to_number string_field_name:new_numeric_field_name,another_string:another_numeric > out.json

Output conversions

By default, rjp will produce JSON-lines. You can change that with a file conversion.

TSV

Convert into TSV lines with specified fields.

Aliases: to_tsv, json_to_tsv, tsv

Examples:

rjp < in.json to_tsv field_1,field_2 > out.tsv

Comments

Testing scaffolding
I managed to decouple the CLI from the main worker, which I now call main_worker. There are some issues associated with it, such as the trait bounds, that are described in test.rs.

Copying the description of the test_in_dirs function:

Runs all tests in the tests directory. This assumes that the directory tests contains multiple directories, each with at least two files: command and output. The command is executed and compared with the output (assert). The test is run from the top-level project directory but arguments in the command ending with .json and .tsv are automatically prefixed with the path to the subdirectory. The first parameter of the command is the file that's redirected to the worker (imagine prefixing the whole command with a <). To add more tests, simply copy one of the existing folders.
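
The argument handling described above could look roughly like the following. This is a sketch based only on the description, not a copy of test.rs; the function name `prepare_command` and the use of `/` as the path separator are assumptions.

```rust
/// Splits a command string on whitespace and prefixes every argument
/// ending in `.json` or `.tsv` with the test's subdirectory.
/// Returns the input file (the first parameter, which is redirected to
/// the worker) and the remaining arguments, or None for an empty command.
fn prepare_command(command: &str, test_dir: &str) -> Option<(String, Vec<String>)> {
    let mut args = command.split_whitespace().map(|arg| {
        if arg.ends_with(".json") || arg.ends_with(".tsv") {
            format!("{}/{}", test_dir, arg)
        } else {
            arg.to_string()
        }
    });
    let input = args.next()?;
    Some((input, args.collect()))
}
```

For a directory `tests/join_tsv` (a hypothetical name) whose command file contains `in.json join other.tsv id from_tsv id,value`, this yields the input path `tests/join_tsv/in.json` and rewrites only the `other.tsv` argument.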

Pros of this approach:

Very easy to add more simple tests (simply create a new directory)

Cons of this approach:

When running cargo test, all these tests from all folders are under the same name test_in_dirs and count as one test.

It is also hard to see what test failed.

There are three options:

1. Accept this code as is

2. Find a way to resolve the cons while keeping the current structure

3. Turn the written test code into an easily assertable function that takes in a string of commands (written in code, each in individual functions) and a path to expected output.

I personally lean towards 3 because it easily fixes all the current cons. What I don't like about it is that the command to run rjp is written in code. From the perspective of tests that seems rather like data and should therefore be in a separate file. But maybe there's an intermediate solution that combines the current directory structure with the more modular functions.

So far I have added two tests from your comment in #3 and did not think much about their validity. There are more issues with this code, such as panicking whenever rjp panics, but I think it's OK for an initial shot at testing. Apologies it took me so long to get to this.

Let me know whether option 3 is a good way to go and whether you have some more remarks. It should be fairly easy to implement it.
opened by zouharvi 4
Test cases

I'm still in the process of transforming the code to be testable (first step is detaching the CLI from the code entry point). @ales-t do you have some test cases which you used during development that could guide me in this?

opened by zouharvi 3
Reorganize readme

Despite the readme processor descriptions being properly organized using markdown headings, I find them hard to navigate. Putting them in tables saves on vertical space (less heading padding) and makes it easier to find the commands.

opened by zouharvi 2
reorganize project into modules

This PR reorganizes the project into directories and modules for easier management. It may not be the most elegant solution but gets rid of the flat directory structure.

opened by zouharvi 0
Misc. Non-breaking Updates
There's no real content in this PR as it is mostly cosmetics.

README typos

Begin work on making code testable (separation of CLI from entrypoint)

Lint every file with rustfmt

Accept suggestions by clippy
opened by zouharvi 0
Large test file generation for benchmarking

Since the current test files are very small and cargo bench rounds the runtime to 0.0s, I was thinking about running the test on larger files (notably on (1) very long lines, (2) lots of lines or (3) both). The issue is that nobody really wants to have 0.5GB files in the repository.

Since their content is rather arbitrary, maybe we could just generate random data (with fixed seed) and have them be generated only locally? A few lines of a Python script would do but maybe it'd be purer adding a simple binary crate to this package.

In any case, I'm not sure there's an elegant way to link the dependency other than having the tests fail with an error saying that the files were not found and that the user probably forgot to generate them.
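
A minimal sketch of the fixed-seed idea in Rust, using only the standard library so that no extra dependency is needed. The `Lcg` struct, `write_lines`, and the shape of the generated records (an `id` plus a `text` field of pseudo-random tokens) are all placeholders; line count and line length are the two knobs for the "lots of lines" and "very long lines" cases.

```rust
use std::io::{self, Write};

/// Tiny linear congruential generator, so the output depends only on the seed.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Drop the low bits, which are the weakest in an LCG.
        self.0 >> 33
    }
}

/// Streams `lines` JSON-lines records to `out`, each with
/// `words_per_line` pseudo-random tokens, without buffering the whole file.
fn write_lines<W: Write>(
    out: &mut W,
    seed: u64,
    lines: usize,
    words_per_line: usize,
) -> io::Result<()> {
    let mut rng = Lcg(seed);
    for id in 0..lines {
        let words: Vec<String> = (0..words_per_line)
            .map(|_| format!("w{}", rng.next_u64() % 1000))
            .collect();
        writeln!(out, "{{\"id\":{},\"text\":\"{}\"}}", id, words.join(" "))?;
    }
    Ok(())
}
```

Passing a buffered file handle as `out` would produce the benchmark file locally, and the same seed reproduces it byte for byte.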

opened by zouharvi 0
Templating processor

In jq one can write the following (complex) query jq '[.[] | {message: .commit.message, name: .commit.committer.name, parents: [.parents[].html_url]}]'. I find this templating quite intuitive and useful but at the same time it seems like there's a lot of work behind this (mostly the parsing).

Currently rjp is based around simple operations and piping multiple rjp calls together, so I'm not sure that this would even fit into this scheme. If it were implemented, it would replace extract, rename and select.
enhancement

opened by zouharvi 0
Argument parser

Relevant to #7, we should definitely transition from the handwritten argument parser to something like clap. However, I'm wary of implementing it now because version 3.0 should come out soon™ and I had issues including the beta in Cargo.toml.

opened by zouharvi 1
Tolerant mode
Currently, rjp will stop when encountering any error, such as:

Select/rename not finding the required fields in an instance.

serde failing to parse an input line.

Join not finding the keys to join on in an instance.

Merge when the stream lengths are mismatched.

...

Oftentimes, JSON lines files are noisy and contain lines with problems. It would be helpful to have some way to request "tolerant" behavior. For example, rename_field would not change an instance if the input fields are not found. Alternatively, problematic instances may be skipped in the output stream.

It's currently not clear to me how the various processors should behave in this tolerant setting, or whether there should be more ways (for instance --skip-bad-instances, --stop-on-bad-instance, --keep-bad-instances?).
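
One way to think about the three candidate flags is as a single error policy applied around any fallible processor. The sketch below is hypothetical throughout: `OnError`, `apply_with_policy`, and the simplified `rename_field` are illustrative names, and instances are modeled as string-only maps rather than parsed JSON.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for one JSON-lines instance (string values only).
type Instance = BTreeMap<String, String>;

/// Hypothetical policy mirroring the three flags floated above.
#[derive(Clone, Copy)]
enum OnError {
    Stop, // abort on the first bad instance (the current behavior)
    Skip, // drop bad instances from the output stream
    Keep, // pass bad instances through unchanged
}

/// Simplified rename that fails when the source field is missing.
fn rename_field(instance: &Instance, old: &str, new: &str) -> Result<Instance, String> {
    let mut renamed = instance.clone();
    let value = renamed
        .remove(old)
        .ok_or_else(|| format!("field '{}' not found", old))?;
    renamed.insert(new.to_string(), value);
    Ok(renamed)
}

/// Applies `processor` to every instance and resolves failures by `policy`.
fn apply_with_policy<F>(
    instances: Vec<Instance>,
    policy: OnError,
    processor: F,
) -> Result<Vec<Instance>, String>
where
    F: Fn(&Instance) -> Result<Instance, String>,
{
    let mut out = Vec::new();
    for instance in instances {
        match (processor(&instance), policy) {
            (Ok(processed), _) => out.push(processed),
            (Err(message), OnError::Stop) => return Err(message),
            (Err(_), OnError::Skip) => {}
            (Err(_), OnError::Keep) => out.push(instance),
        }
    }
    Ok(out)
}
```

Keeping the policy outside the processors would mean each processor only has to report failure, although it leaves open the question of whether every processor should honor every policy.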
opened by ales-t 2
Way to approach parallelization
There are multiple ways in which we could do parallelization. Assuming that each line can be processed independently (which may not be true in the future if we add more processors), then with workers A, B and C the parallelization may look as follows (assume 12 lines):

1. ABC ABC ABC ABC: each worker is fed exactly one line and when all three are finished, the results are sent to the buffer. This is the easiest solution but may not even be worth it given that thread spawning is not zero cost.

2. AABBCC AABBCC: another step is to increase the number of lines fed to each worker (here 2). There is certainly a threshold number from which it becomes beneficial to use threading (that number probably being larger than 1).

3. AAA BBB CCC: this chunks the whole data into equal parts and feeds them to the workers. It creates the largest chunks and is the fastest from a parallelization perspective but requires that the whole data is first loaded into memory, which may not be viable for large files. (In a sense this is an extreme case of 2.)

4. ABC AAABB CAC: this is the most complicated solution (though probably the fastest one). It schedules the processing dynamically, e.g. if A is finished and B and C are working on some long and difficult lines, then A can get the next lines assigned.

I think that 2 is the best option because it can be parametrized via two parameters from the CLI. Maybe I missed some approach.
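
Option 2 can be sketched with scoped threads from the standard library (`std::thread::scope`, available since Rust 1.63). `process_in_batches` and its parameters `workers` and `lines_per_worker` are hypothetical names standing in for the two CLI parameters, and the `process` closure stands in for the processor chain. Note that this version spawns fresh threads every round, so it still pays the spawning cost mentioned for the first option.

```rust
use std::thread;

/// Processes `lines` in rounds: each round hands up to `workers` batches
/// of `lines_per_worker` lines to separate threads, then collects the
/// results in input order before starting the next round.
fn process_in_batches<F>(
    lines: &[String],
    workers: usize,
    lines_per_worker: usize,
    process: F,
) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    assert!(workers > 0 && lines_per_worker > 0, "parameters must be positive");
    let mut output = Vec::with_capacity(lines.len());

    for round in lines.chunks(workers * lines_per_worker) {
        let results: Vec<Vec<String>> = thread::scope(|scope| {
            let mut handles = Vec::new();
            for batch in round.chunks(lines_per_worker) {
                let process = &process;
                handles.push(scope.spawn(move || {
                    batch
                        .iter()
                        .map(|line| process(line.as_str()))
                        .collect::<Vec<String>>()
                }));
            }
            // Joining in spawn order preserves the original line order.
            handles
                .into_iter()
                .map(|handle| handle.join().expect("worker panicked"))
                .collect()
        });
        output.extend(results.into_iter().flatten());
    }
    output
}
```

Only one round of lines needs to be held at a time, so a streaming version could read `workers * lines_per_worker` lines, process them, print them, and repeat.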
opened by zouharvi 1