JSON parser which picks up values directly without performing tokenization in Rust

Overview

Pikkr

Abstract

Pikkr is a JSON parser, written in Rust, which picks up values directly without performing tokenization. It is implemented based on Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. In VLDB, 2017.

This JSON parser extracts values from a JSON record without using finite state machines (FSMs) or performing tokenization. It parses JSON records in the following steps:

  1. [Indexing] Creates an index which maps logical locations of queried fields to their physical locations by using SIMD instructions and bit manipulation.
  2. [Basic parsing] Finds values of queried fields by scanning a JSON record using the index created in the previous step, and learns their logical locations (i.e. the pattern of the JSON structure) in the early stages.
  3. [Speculative parsing] Speculates logical locations of queried fields using the learned patterns, jumps directly to their physical locations and extracts values in the later stages. Falls back to basic parsing if the speculation fails.
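
As a rough illustration of the indexing step, the sketch below builds bitmaps for one chunk of up to 64 bytes and uses bit manipulation to keep only the colons that lie outside string literals. It is a scalar stand-in, not Pikkr's code: the function names `bitmap`, `structural_colons` and `bit_positions` are hypothetical, the prefix-XOR trick is one known way to derive a string mask and is not necessarily the method the paper or Pikkr uses, and escape handling is deliberately simplified.

```rust
/// Bit i of the result is set when chunk[i] == target.
/// A SIMD version would compare many bytes per instruction; this loop is a scalar stand-in.
fn bitmap(chunk: &[u8], target: u8) -> u64 {
    assert!(chunk.len() <= 64, "this sketch indexes one 64-byte word at a time");
    let mut bits = 0u64;
    for (i, &b) in chunk.iter().enumerate() {
        if b == target {
            bits |= 1u64 << i;
        }
    }
    bits
}

/// Marks the colons that are not inside a string literal.
fn structural_colons(chunk: &[u8]) -> u64 {
    let colons = bitmap(chunk, b':');
    let quotes = bitmap(chunk, b'"');
    let backslashes = bitmap(chunk, b'\\');
    // Simplification (assumption): a quote directly after a backslash counts as escaped.
    let unescaped_quotes = quotes & !(backslashes << 1);
    // Prefix XOR: bit i ends up set when an odd number of quotes occur at or
    // before position i, i.e. when position i is inside a string literal.
    let mut in_string = unescaped_quotes;
    for shift in &[1u32, 2, 4, 8, 16, 32] {
        in_string ^= in_string << *shift;
    }
    colons & !in_string
}

/// Lists the positions of the set bits, lowest first.
fn bit_positions(mut bits: u64) -> Vec<usize> {
    let mut positions = Vec::new();
    while bits != 0 {
        positions.push(bits.trailing_zeros() as usize);
        bits &= bits - 1; // clear the lowest set bit
    }
    positions
}
```

For the record `{"a:b": "c", "d": 1}`, the colon inside the key `a:b` is filtered out and only the two colons that separate keys from values remain.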

This JSON parser performs well when a JSON data stream or JSON collection contains only a limited number of structural variants, which is a common case in the data analytics field.
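
The idea of speculating and falling back can be sketched in a few lines. This is a heavily simplified model under stated assumptions, not Pikkr's implementation: it works on an already extracted list of keys instead of raw bytes, remembers only the most recent position of one field, and the type `Speculator` and its method `find` are hypothetical names.

```rust
/// Remembers at which key position a queried field was last found.
struct Speculator {
    field: String,
    guess: Option<usize>,
}

impl Speculator {
    fn new(field: &str) -> Self {
        Speculator { field: field.to_string(), guess: None }
    }

    /// Looks up the field among `keys` (the keys of one record, in order).
    /// Returns the position and whether the speculation hit.
    fn find(&mut self, keys: &[&str]) -> Option<(usize, bool)> {
        // Speculative path: jump to the remembered position and verify it.
        if let Some(i) = self.guess {
            if keys.get(i).map_or(false, |k| *k == self.field) {
                return Some((i, true));
            }
        }
        // Fallback: scan every key and remember the position for later records.
        let pos = keys.iter().position(|k| *k == self.field)?;
        self.guess = Some(pos);
        Some((pos, false))
    }
}
```

When most records share the same layout, almost every lookup takes the speculative path; a record with a different key order only costs one failed check plus the ordinary scan.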

Please read the paper cited in the opening paragraph for the details of the JSON parsing algorithm.

Performance

Benchmark Result

Hardware

Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 3.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 16 GB

Rust

$ cargo --version
cargo 0.23.0-nightly (34c0674a2 2017-09-01)

$ rustc --version
rustc 1.22.0-nightly (d93036a04 2017-09-07)

Crates

JSON Data

Benchmark Code

Example

Code

extern crate pikkr;

fn main() {
    let queries = vec![
        "$.f1".as_bytes(),
        "$.f2.f1".as_bytes(),
    ];
    let train_num = 2; // Number of records used as training data
                       // before Pikkr starts speculative parsing.
    let mut p = match pikkr::Pikkr::new(&queries, train_num) {
        Ok(p) => p,
        Err(err) => panic!("There was a problem creating a JSON parser: {:?}", err.kind()),
    };
    let recs = vec![
        r#"{"f1": "a", "f2": {"f1": 1, "f2": true}}"#,
        r#"{"f1": "b", "f2": {"f1": 2, "f2": true}}"#,
        r#"{"f1": "c", "f2": {"f1": 3, "f2": true}}"#, // Speculative parsing starts from this record.
        r#"{"f2": {"f2": true, "f1": 4}, "f1": "d"}"#,
        r#"{"f2": {"f2": true, "f1": 5}}"#,
        r#"{"f1": "e"}"#
    ];
    for rec in recs {
        match p.parse(rec.as_bytes()) {
            Ok(results) => {
                for result in results {
                    print!("{} ", match result {
                        Some(result) => String::from_utf8(result.to_vec()).unwrap(),
                        None => String::from("None"),
                    });
                }
                println!();
            },
            Err(err) => println!("There was a problem parsing a record: {:?}", err.kind()),
        }
    }
    /*
    Output:
        "a" 1
        "b" 2
        "c" 3
        "d" 4
        None 5
        "e" None
    */
}

Build

$ cargo --version
cargo 0.23.0-nightly (34c0674a2 2017-09-01) # Make sure that nightly release is being used.
$ RUSTFLAGS="-C target-cpu=native" cargo build --release

Run

$ ./target/release/[package name]
"a" 1
"b" 2
"c" 3
"d" 4
None 5
"e" None

Documentation

Restrictions

  • Pikkr uses AVX2 instructions, so the Rust nightly channel and a CPU with AVX2 support are needed both to build Rust source code which depends on Pikkr and to run the resulting executable.

Contributing

Any kind of contribution (e.g. comment, suggestion, question, bug report and pull request) is welcome.

Comments
  • Benchmarks need to be reproducible

    In the readme I see links to the benchmark code and a collection of 5 large JSON datasets, but I can't tell which of those datasets were used, what queries were executed, and what size the training data was. I would need all of these to be able to reproduce the benchmarks.

    opened by dtolnay 6
  • doesn't build on fresh copy of nightly 1.23

    I tried this out on a fresh copy of nightly 1.23 and the build fails with the following errors.

    error[E0432]: unresolved import `x86intrin::mm256_setr_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/avx.rs:1:24
      |
    1 | use x86intrin::{m256i, mm256_setr_epi8};
      |                        ^^^^^^^^^^^^^^^ no `mm256_setr_epi8` in the root. Did you mean to use `mm_setr_epi8`?
    
    error[E0432]: unresolved import `x86intrin::mm256_cmpeq_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:24
      |
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                        ^^^^^^^^^^^^^^^^ no `mm256_cmpeq_epi8` in the root. Did you mean to use `mm_cmpeq_epi8`?
    
    error[E0432]: unresolved import `x86intrin::mm256_movemask_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:42
      |
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                                          ^^^^^^^^^^^^^^^^^^^ no `mm256_movemask_epi8` in the root. Did you mean to use `mm_movemask_epi8`?
    
    error[E0432]: unresolved import `x86intrin::mm256_setr_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:63
      |
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                                                               ^^^^^^^^^^^^^^^ no `mm256_setr_epi8` in the root. Did you mean to use `mm_setr_epi8`?
    
    error: aborting due to 4 previous errors
    
    error: failed to compile `desafio-5-rust-pikkr v0.1.0 (file:///mnt/c/Users/soapdog/prog/osprogramadores/desafio-5-rust-pikkr)`, intermediate artifacts can be found at `/mnt/c/Users/soapdog/prog/osprogramadores/desafio-5-rust-pikkr/target`
    
    Caused by:
      Could not compile `pikkr`.
    

    This was on an empty project trying cargo install just with Pikkr in the toml file.

    opened by soapdog 3
  • allow build without AVX

    This relegates AVX support to the avx-accel feature and builds an "emulation" layer by default instead. I have not benchmarked this yet; I expect it to be rather slow, but fast enough to serve as a baseline to improve upon.
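
To make the idea of an emulation layer concrete, here is a minimal scalar sketch of the operation pair that the build errors elsewhere on this page refer to (a 32-byte equality compare followed by a movemask). It is not the code from this pull request; the function name `cmpeq_movemask` is hypothetical.

```rust
/// Scalar stand-in for a 32-byte equality compare followed by a movemask:
/// bit i of the result is set when block[i] == needle.
fn cmpeq_movemask(block: &[u8; 32], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in block.iter().enumerate() {
        // `b == needle` is cast to 0 or 1 and moved to bit position i.
        mask |= ((b == needle) as u32) << i;
    }
    mask
}
```

The loop touches one byte per iteration where AVX2 handles all 32 at once, which is why such a fallback is expected to be slower while producing the same mask.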

    opened by llogiq 3
  • Support building without AVX2

    Could there be a way to fall back to a slower codepath if AVX2 is not available?

    For now this could be a Cargo feature and later it could be detected automatically by https://github.com/rust-lang/rfcs/pull/2045.
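
For illustration only, runtime detection on a current stable toolchain could look like the sketch below. It relies on the standard library macro `is_x86_feature_detected!`; the functions `has_avx2` and `select_backend` are hypothetical and are not part of Pikkr.

```rust
/// Reports whether the CPU running this program supports AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn has_avx2() -> bool {
    is_x86_feature_detected!("avx2")
}

/// On other architectures AVX2 is never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn has_avx2() -> bool {
    false
}

/// Picks the code path to use; the names are placeholders for real implementations.
fn select_backend() -> &'static str {
    if has_avx2() {
        "avx2"
    } else {
        "scalar"
    }
}
```

A library could call such a check once at start-up and route every later call to the matching implementation.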

    enhancement 
    opened by dtolnay 3
  • Remove rustfmt settings from ci.sh

    Installing rustfmt makes the total CI run time very long. IMO, the code formatter should be applied before committing, not in CI.

    opened by ubnt-intrepid 2
  • won't compile: use of unstable library feature 'option_entry'

    I just cloned this repo and tried to build it with Rust 1.21.0 nightly. Here are the compiler's complaints.

       Compiling pikkr v0.10.1
    error: use of unstable library feature 'option_entry' (see issue #39288)
       --> /home/congee/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.10.1/src/pikkr.rs:180:47
        |
    180 |             let mut children = query.children.get_or_insert(FnvHashMap::default());
        |                                               ^^^^^^^^^^^^^
        |
        = help: add #![feature(option_entry)] to the crate attributes to enable
    
    error: aborting due to previous error
    
    error: Could not compile `pikkr`.
    
    To learn more, run the command again with --verbose.
    

    Is there a workaround to enable this feature?

    opened by Congee 2
  • Change internal structure in `Query` and `QueryTree`

    It also renames Query to QueryNode to clarify its role as a node in the pattern tree. Note that the level argument of Parser::basic_parse() and Parser::speculative_parse() has also moved to QueryNode, because it is the same as the level of the node in the pattern tree.

    opened by ubnt-intrepid 1
  • Integer overflow in parser

    Discovered by https://github.com/rust-fuzz/targets/pull/89:

    extern crate pikkr;
    
    fn main() {
        let j = b"\\\":";
        let q = vec!["$.a".as_bytes()];
        pikkr::Pikkr::new(&q, 1).parse(j);
    }
    
    thread 'main' panicked at 'attempt to subtract with overflow', .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/parser.rs:26:15
    stack backtrace:
      10: pikkr::parser::basic_parse
                 at .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/parser.rs:26
      11: pikkr::pikkr::Pikkr::parse
                 at .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/pikkr.rs:110
      12: testing::main
                 at src/main.rs:6
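
The panic message indicates an unsigned subtraction that went below zero. What parser.rs line 26 actually computes is not shown here, so the following is only a generic sketch of guarding such arithmetic with `checked_sub`; the function `backslashes_before` is hypothetical and is not the fix that was applied.

```rust
/// Counts the backslashes immediately before `pos` without ever computing
/// `0 - 1` on a usize, which would panic in a debug build.
fn backslashes_before(rec: &[u8], pos: usize) -> usize {
    let mut count = 0;
    let mut i = pos;
    // `checked_sub` returns None at index 0 instead of overflowing.
    while let Some(prev) = i.checked_sub(1) {
        if rec.get(prev) != Some(&b'\\') {
            break;
        }
        count += 1;
        i = prev;
    }
    count
}
```

With the three-byte input from this report (a backslash, a quote and a colon), the function returns 1 for the quote at index 1 and 0 for index 0 without panicking.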
    
    opened by dtolnay 1
  • Miscellaneous fixups about CI configuration and Cargo

    It also removes Cargo.lock from version control, because the actual versions of dependencies may change on the user side (see FAQ in crates.io for details).

    opened by ubnt-intrepid 0
  • Refactor: define a new struct `Parser` to manage parser context

    Note that Parser uses interior mutability for colon_positions to avoid a compile error from the borrow checker. RefCell adds some runtime cost, but it is negligibly small as far as I measured.
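
The pattern described here can be sketched as follows. The struct `ParserContext` and its method `index_colons` are hypothetical stand-ins and do not reproduce the actual `Parser` from this pull request; only the field name `colon_positions` is taken from the description.

```rust
use std::cell::RefCell;

/// Hypothetical parser context: `colon_positions` is scratch space that is
/// rewritten on every call, even though callers only hold `&self`.
struct ParserContext {
    colon_positions: RefCell<Vec<usize>>,
}

impl ParserContext {
    fn new() -> Self {
        ParserContext { colon_positions: RefCell::new(Vec::new()) }
    }

    /// Recomputes the colon positions of `rec` through a shared reference
    /// and returns how many were found.
    fn index_colons(&self, rec: &[u8]) -> usize {
        // The borrow is checked at runtime here instead of at compile time.
        let mut positions = self.colon_positions.borrow_mut();
        positions.clear();
        for (i, &b) in rec.iter().enumerate() {
            if b == b':' {
                positions.push(i);
            }
        }
        positions.len()
    }
}
```

The runtime cost mentioned above comes from the borrow flag that `borrow_mut` checks and updates on each call.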

    opened by ubnt-intrepid 0
  • Performance Tuning

    https://users.rust-lang.org/t/are-there-any-ways-to-use-stack-not-heap-when-handling-a-sequence-of-elements-whose-length-is-determined-at-runtime-not-compilation-time/12784/3?u=kjmrknsn

    opened by kjmrknsn 0
  • Weird result when accessing fields of objects in arrays

    I'm not 100% sure whether this crate is supposed to support accessing fields of objects within arrays but I'm seeing strange behaviour when doing so...

    Given a JSON record:

    {"list_of_objects": [{"first_field": 1, "second_field": true}]}
    

    The results of the following queries are:

    $.list_of_objects.first_field       => 1
    $.list_of_objects.second_field      => true}]
    

    Or as a runnable example (modelled on the example from the readme):

    fn main() {
        let queries = vec![
            "$.list_of_objects.first_field".as_bytes(),
            "$.list_of_objects.second_field".as_bytes(),
        ];
        let mut p = pikkr::Pikkr::new(&queries, 0).unwrap();
        let recs = vec![
            r#"{"list_of_objects": [{"first_field": 1, "second_field": true}]}"#,
        ];
        for rec in recs {
            let results = p.parse(rec.as_bytes()).unwrap();
    
            for result in results {
                print!("{} ", match result {
                    Some(result) => String::from_utf8(result.to_vec()).unwrap(),
                    None => String::from("None"),
                });
            }
            println!();
        }
    }
    
    opened by daniel-ferguson 1
  • Parses and returns invalid JSON

    What guarantees does this library make about rejecting invalid JSON and returning valid JSON?

    In particular, the following program accepts the input which is not valid JSON and returns b"0," as output which is not valid JSON.

    extern crate pikkr;
    
    fn main() {
        let j = br#" {"x":0,} "#;
        let q = vec!["$.x".as_bytes()];
        let mut p = pikkr::Pikkr::new(&q, 1);
        for r in p.parse(j) {
            println!("{}", std::str::from_utf8(r.unwrap()).unwrap());
        }
    }
    
    opened by dtolnay 3
  • Update kostya benchmarks

    This seems like it would be applicable to kostya's JSON benchmark. Rust is already significantly faster than C++ but D does some partial parsing tricks that pikkr may be able to overcome.

    • https://github.com/kostya/benchmarks#json
    • https://github.com/kostya/benchmarks/tree/master/json/json.rs
    enhancement 
    opened by dtolnay 0
  • Publish C hooks

    Awesome project! Could C hooks be published, so that non-Rust languages can take advantage of this accelerated library?

    https://doc.rust-lang.org/1.5.0/book/rust-inside-other-languages.html

    enhancement 
    opened by mcandre 2
Releases(0.14.0)