JSON parser which picks up values directly without performing tokenization in Rust





Pikkr is a JSON parser written in Rust that picks up values directly, without performing tokenization. It is implemented based on Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. In VLDB, 2017.

This JSON parser extracts values from a JSON record without using finite state machines (FSMs) or performing tokenization. It parses JSON records in the following steps:

  1. [Indexing] Creates an index that maps the logical locations of queried fields to their physical locations, using SIMD instructions and bit manipulation.
  2. [Basic parsing] In the early stages, finds the values of queried fields by scanning the JSON record with the index created in the previous step, and learns their logical locations (i.e. the pattern of the JSON structure).
  3. [Speculative parsing] In the later stages, speculates the logical locations of queried fields from the learned patterns, jumps directly to their physical locations, and extracts the values. Falls back to basic parsing if the speculation fails.
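The indexing step can be illustrated with a small scalar sketch. This is not Pikkr's code: the function names (`char_bitmap`, `prefix_xor`, `structural_colons`) are hypothetical, the bitmaps are built byte by byte rather than with SIMD, and escaped quotes (`\"`) are assumed not to occur. It only shows how bit manipulation can separate structural colons from colons inside string values.

```rust
/// Bit i of word w is set when rec[w * 64 + i] == ch.
/// (Scalar stand-in for the SIMD comparison a real index builder would use.)
fn char_bitmap(rec: &[u8], ch: u8) -> Vec<u64> {
    let mut words = vec![0u64; (rec.len() + 63) / 64];
    for (i, &b) in rec.iter().enumerate() {
        if b == ch {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Prefix XOR over one word: bit i of the result is the parity of
/// bits 0..=i of `x`.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Returns the byte offsets of colons that are not inside a string.
/// Assumption: the record contains no escaped quotes.
pub fn structural_colons(rec: &[u8]) -> Vec<usize> {
    let quotes = char_bitmap(rec, b'"');
    let colons = char_bitmap(rec, b':');
    let mut in_string_carry = false; // did the previous word end inside a string?
    let mut out = Vec::new();
    for (w, (&q, &c)) in quotes.iter().zip(colons.iter()).enumerate() {
        // Bits between an opening quote (inclusive) and its closing quote
        // (exclusive) are set in the string mask.
        let mut string_mask = prefix_xor(q);
        if in_string_carry {
            string_mask = !string_mask;
        }
        in_string_carry = string_mask >> 63 == 1;
        // Keep only colons outside strings, then list their positions.
        let mut bits = c & !string_mask;
        while bits != 0 {
            out.push(w * 64 + bits.trailing_zeros() as usize);
            bits &= bits - 1; // clear the lowest set bit
        }
    }
    out
}
```

Calling `structural_colons` on a record whose keys or values contain `:` returns only the offsets of the colons that separate keys from values, which is the information a parser needs to jump from a field name to its value.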

This JSON parser performs well when a JSON data stream or JSON collection contains only a limited number of structural variants, which is a common case in the data analytics field.
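The interplay between training, speculation, and fallback can be sketched as follows. This is a simplified illustration under stated assumptions, not Pikkr's implementation: `SpeculativeLookup` is a hypothetical type, a record is modelled as an already-split slice of top-level `(key, value)` pairs, and the "pattern" learned is just the most frequent position of one key.

```rust
/// Hypothetical helper (not part of Pikkr): looks up one top-level key,
/// learning its usual position during training and guessing afterwards.
pub struct SpeculativeLookup {
    key: String,
    train_num: usize,     // records handled by basic parsing before speculating
    seen: usize,          // training records processed so far
    counts: Vec<usize>,   // counts[i] = times the key was found at position i
    guess: Option<usize>, // most frequent position after training
    pub hits: usize,      // speculation succeeded
    pub misses: usize,    // speculation failed, fell back to a scan
}

impl SpeculativeLookup {
    pub fn new(key: &str, train_num: usize) -> Self {
        SpeculativeLookup {
            key: key.to_string(),
            train_num,
            seen: 0,
            counts: Vec::new(),
            guess: None,
            hits: 0,
            misses: 0,
        }
    }

    /// Basic path: scan every field until the key matches.
    fn scan(&self, rec: &[(&str, &str)]) -> Option<usize> {
        rec.iter().position(|(k, _)| *k == self.key.as_str())
    }

    pub fn get<'a>(&mut self, rec: &[(&'a str, &'a str)]) -> Option<&'a str> {
        if self.seen < self.train_num {
            // Training phase: scan and remember where the key was found.
            let pos = self.scan(rec);
            if let Some(i) = pos {
                if self.counts.len() <= i {
                    self.counts.resize(i + 1, 0);
                }
                self.counts[i] += 1;
            }
            self.seen += 1;
            if self.seen == self.train_num {
                self.guess = self
                    .counts
                    .iter()
                    .enumerate()
                    .max_by_key(|&(_, &c)| c)
                    .map(|(i, _)| i);
            }
            return pos.map(|i| rec[i].1);
        }
        // Speculative phase: jump straight to the learned position and verify.
        if let Some(g) = self.guess {
            if let Some(&(k, v)) = rec.get(g) {
                if k == self.key.as_str() {
                    self.hits += 1;
                    return Some(v);
                }
            }
        }
        // Speculation failed (or nothing was learned): fall back to a scan.
        self.misses += 1;
        self.scan(rec).map(|i| rec[i].1)
    }
}
```

With few structural variants, almost every lookup after training takes the speculative branch and touches a single field. A stream whose field order keeps changing would pay for a failed guess plus a full scan on most records.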

Please read the paper mentioned in the opening paragraph for the details of the JSON parsing algorithm.


Benchmark Result


Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 3.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 16 GB


$ cargo --version
cargo 0.23.0-nightly (34c0674a2 2017-09-01)

$ rustc --version
rustc 1.22.0-nightly (d93036a04 2017-09-07)



Benchmark Code



extern crate pikkr;

fn main() {
    // The two queries are inferred from the output shown below:
    // one top-level field and one nested field.
    let queries = vec![
        "$.f1".as_bytes(),
        "$.f2.f1".as_bytes(),
    ];
    let train_num = 2; // Number of records used as training data
                       // before Pikkr starts speculative parsing.
    let mut p = match pikkr::Pikkr::new(&queries, train_num) {
        Ok(p) => p,
        Err(err) => panic!("There was a problem creating a JSON parser: {:?}", err.kind()),
    };
    let recs = vec![
        r#"{"f1": "a", "f2": {"f1": 1, "f2": true}}"#,
        r#"{"f1": "b", "f2": {"f1": 2, "f2": true}}"#,
        r#"{"f1": "c", "f2": {"f1": 3, "f2": true}}"#, // Speculative parsing starts from this record.
        r#"{"f2": {"f2": true, "f1": 4}, "f1": "d"}"#,
        r#"{"f2": {"f2": true, "f1": 5}}"#,
        r#"{"f1": "e"}"#,
    ];
    for rec in recs {
        match p.parse(rec.as_bytes()) {
            Ok(results) => {
                // One result per query, in query order; None if the field is absent.
                for result in results {
                    print!("{} ", match result {
                        Some(result) => String::from_utf8(result.to_vec()).unwrap(),
                        None => String::from("None"),
                    });
                }
                println!();
            },
            Err(err) => println!("There was a problem parsing a record: {:?}", err.kind()),
        }
    }
    /*
    Output:
        "a" 1
        "b" 2
        "c" 3
        "d" 4
        None 5
        "e" None
    */
}


$ cargo --version
cargo 0.23.0-nightly (34c0674a2 2017-09-01) # Make sure that nightly release is being used.
$ RUSTFLAGS="-C target-cpu=native" cargo build --release


$ ./target/release/[package name]
"a" 1
"b" 2
"c" 3
"d" 4
None 5
"e" None



  • The Rust nightly channel and a CPU with AVX2 are required to build code that depends on Pikkr and to run the resulting binary, because Pikkr uses AVX2 instructions.
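To give a sense of what the AVX2 dependency buys, here is a minimal sketch of a 32-byte-wide comparison of the kind an index builder can use to locate a structural character. It is not Pikkr's code. It assumes a newer toolchain than the one listed above, where `std::arch` and `is_x86_feature_detected!` are available on stable Rust, and the function names (`byte_mask`, `byte_mask_avx2`, `byte_mask_scalar`) are hypothetical. A scalar path is included so the sketch also runs on CPUs without AVX2.

```rust
/// Scalar fallback: bit i of the result is set when chunk[i] == needle.
fn byte_mask_scalar(chunk: &[u8; 32], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if b == needle {
            mask |= 1u32 << i;
        }
    }
    mask
}

/// AVX2 path: compares all 32 bytes at once and packs the result into a bitmask.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn byte_mask_avx2(chunk: &[u8; 32], needle: u8) -> u32 {
    use std::arch::x86_64::{
        __m256i, _mm256_cmpeq_epi8, _mm256_loadu_si256, _mm256_movemask_epi8, _mm256_set1_epi8,
    };
    let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    let eq = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(needle as i8));
    _mm256_movemask_epi8(eq) as u32
}

/// Picks the AVX2 path when the CPU supports it, the scalar path otherwise.
pub fn byte_mask(chunk: &[u8; 32], needle: u8) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound because AVX2 support was just verified at run time.
            return unsafe { byte_mask_avx2(chunk, needle) };
        }
    }
    byte_mask_scalar(chunk, needle)
}
```

For a 32-byte chunk of a record, `byte_mask(&chunk, b':')` yields one bit per byte, set where the byte is a colon. Both paths produce the same mask, so the choice between them only affects speed.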


Any kind of contribution (e.g. comment, suggestion, question, bug report and pull request) is welcome.

  • Benchmarks need to be reproducible


    In the readme I see links to the benchmark code and a collection of 5 large JSON datasets, but I can't tell which of those datasets were used, what queries were executed, and what size the training data was. I would need all of these to be able to reproduce the benchmarks.

    opened by dtolnay 6
  • doesn't build on fresh copy of nightly 1.23


    I tried this out on a fresh copy of nightly 1.23 and it fails to build with the following errors.

    error[E0432]: unresolved import `x86intrin::mm256_setr_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/avx.rs:1:24
    1 | use x86intrin::{m256i, mm256_setr_epi8};
      |                        ^^^^^^^^^^^^^^^ no `mm256_setr_epi8` in the root. Did you mean to use `mm_setr_epi8`?
    error[E0432]: unresolved import `x86intrin::mm256_cmpeq_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:24
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                        ^^^^^^^^^^^^^^^^ no `mm256_cmpeq_epi8` in the root. Did you mean to use `mm_cmpeq_epi8`?
    error[E0432]: unresolved import `x86intrin::mm256_movemask_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:42
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                                          ^^^^^^^^^^^^^^^^^^^ no `mm256_movemask_epi8` in the root. Did you mean to use `mm_movemask_epi8`?
    error[E0432]: unresolved import `x86intrin::mm256_setr_epi8`
     --> /home/soapdog/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.16.0/src/index_builder.rs:4:63
    4 | use x86intrin::{m256i, mm256_cmpeq_epi8, mm256_movemask_epi8, mm256_setr_epi8};
      |                                                               ^^^^^^^^^^^^^^^ no `mm256_setr_epi8` in the root. Did you mean to use `mm_setr_epi8`?
    error: aborting due to 4 previous errors
    error: failed to compile `desafio-5-rust-pikkr v0.1.0 (file:///mnt/c/Users/soapdog/prog/osprogramadores/desafio-5-rust-pikkr)`, intermediate artifacts can be found at `/mnt/c/Users/soapdog/prog/osprogramadores/desafio-5-rust-pikkr/target`
    Caused by:
      Could not compile `pikkr`.

    This was on an empty project trying cargo install just with Pikkr in the toml file.

    opened by soapdog 3
  • allow build without AVX


    This relegates AVX support to the avx-accel feature, building an "emulation" layer instead by default. I have not yet benchmarked this, but I guess it should be rather slow, but fast enough as a baseline to improve upon.

    opened by llogiq 3
  • Support building without AVX2


    Could there be a way to fall back to a slower codepath if AVX2 is not available?

    For now this could be a Cargo feature and later it could be detected automatically by https://github.com/rust-lang/rfcs/pull/2045.

    opened by dtolnay 3
  • Remove rustfmt settings from ci.sh


    Installing rustfmt makes the total CI test time huge. IMO, the code formatter should be applied before committing, not in CI.

    opened by ubnt-intrepid 2
  • won't compile: use of unstable library feature 'option_entry'


    I just cloned this repo and tried to build it with Rust 1.21.0 nightly. Here are the compiler's complaints.

       Compiling pikkr v0.10.1
    error: use of unstable library feature 'option_entry' (see issue #39288)
       --> /home/congee/.cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.10.1/src/pikkr.rs:180:47
    180 |             let mut children = query.children.get_or_insert(FnvHashMap::default());
        |                                               ^^^^^^^^^^^^^
        = help: add #![feature(option_entry)] to the crate attributes to enable
    error: aborting due to previous error
    error: Could not compile `pikkr`.
    To learn more, run the command again with --verbose.

    Is there a workaround to enable this feature?

    opened by Congee 2
  • Change internal structure in `Query` and `QueryTree`


    It also renames Query to QueryNode to clarify its role as a node in the pattern tree. Note that the level argument of Parser::basic_parse() and Parser::speculative_parse() has also moved to QueryNode, because it is the same as the level of the node in the pattern tree.

    opened by ubnt-intrepid 1
  • Integer overflow in parser


    Discovered by https://github.com/rust-fuzz/targets/pull/89:

    extern crate pikkr;
    fn main() {
        let j = b"\\\":";
        let q = vec!["$.a".as_bytes()];
        pikkr::Pikkr::new(&q, 1).parse(j);
    }

    thread 'main' panicked at 'attempt to subtract with overflow', .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/parser.rs:26:15
    stack backtrace:
      10: pikkr::parser::basic_parse
                 at .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/parser.rs:26
      11: pikkr::pikkr::Pikkr::parse
                 at .cargo/registry/src/github.com-1ecc6299db9ec823/pikkr-0.5.1/src/pikkr.rs:110
      12: testing::main
                 at src/main.rs:6
    opened by dtolnay 1
  • Miscellaneous fixups about CI configuration and Cargo


    It also removes Cargo.lock from version control, because the actual versions of dependencies may differ on the user side (see the FAQ on crates.io for details).

    opened by ubnt-intrepid 0
  • Refactor: define a new struct `Parser` to manage parser context


    Note that Parser uses interior mutability for colon_positions to avoid a borrow checker error. RefCell adds some runtime cost, but it is negligibly small as far as I measured.

    opened by ubnt-intrepid 0
  • Performance Tuning



    opened by kjmrknsn 0
  • Weird result when accessing fields of objects in arrays


    I'm not 100% sure whether this crate is supposed to support accessing fields of objects within arrays but I'm seeing strange behaviour when doing so...

    Given a JSON record:

    {"list_of_objects": [{"first_field": 1, "second_field": true}]}

    The results of the following queries are:

    $.list_of_objects.first_field       => 1
    $.list_of_objects.second_field      => true}]

    Or as a runnable example (modelled on the example from the readme):

    fn main() {
        let queries = vec![
            "$.list_of_objects.first_field".as_bytes(),
            "$.list_of_objects.second_field".as_bytes(),
        ];
        let mut p = pikkr::Pikkr::new(&queries, 0).unwrap();
        let recs = vec![
            r#"{"list_of_objects": [{"first_field": 1, "second_field": true}]}"#,
        ];
        for rec in recs {
            let results = p.parse(rec.as_bytes()).unwrap();
            for result in results {
                print!("{} ", match result {
                    Some(result) => String::from_utf8(result.to_vec()).unwrap(),
                    None => String::from("None"),
                });
            }
        }
    }

    opened by daniel-ferguson 1
  • Parses and returns invalid JSON


    What guarantees does this library make about rejecting invalid JSON and returning valid JSON?

    In particular, the following program accepts the input which is not valid JSON and returns b"0," as output which is not valid JSON.

    extern crate pikkr;
    fn main() {
        let j = br#" {"x":0,} "#;
        let q = vec!["$.x".as_bytes()];
        let mut p = pikkr::Pikkr::new(&q, 1);
        for r in p.parse(j) {
            println!("{}", std::str::from_utf8(r.unwrap()).unwrap());
        }
    }

    opened by dtolnay 3
  • Update kostya benchmarks


    This seems like it would be applicable to kostya's JSON benchmark. Rust is already significantly faster than C++ but D does some partial parsing tricks that pikkr may be able to overcome.

    • https://github.com/kostya/benchmarks#json
    • https://github.com/kostya/benchmarks/tree/master/json/json.rs
    opened by dtolnay 0
  • Publish C hooks


    Awesome project! Could C hooks be published, so that non-Rust languages can take advantage of this accelerated library?


    opened by mcandre 2
