Tool to allow parsing large JSON files without loading them into memory

Overview

JSON Lineage

Introduction

JSON Lineage is a tool that converts JSON to JSONL (JSON Lines) format and iteratively parses JSON that contains a list of objects.

The underlying program is written in Rust and is built to feed one JSON object at a time to the parser. This allows for the parsing of very large JSON files that would otherwise not fit into memory. In addition to saving memory, this program is capable of parsing JSON files faster than the built-in Python JSON parser as the file size increases.
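The idea of handing the parser one object at a time can be illustrated in pure Python. The sketch below is not how the Rust program is implemented; it is a minimal standard-library illustration that reads a JSON array in small chunks and yields each element as soon as it is complete. The names iter_array_items and chunk_size are hypothetical and are not part of JSON Lineage.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_array_items(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield each element of a top-level JSON array, reading the stream in chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    started = False  # True once the opening "[" has been consumed
    eof = False
    while True:
        buffer = buffer.lstrip()
        if buffer and not started:
            if buffer[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buffer, started = buffer[1:], True
            continue
        if buffer:
            if buffer[0] == "]":
                return  # end of the array
            if buffer[0] == ",":
                buffer = buffer[1:]  # lenient: separators are simply skipped
                continue
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more data is coming, so the input is invalid
            else:
                rest = buffer[end:].lstrip()
                # Accept the value only once its delimiter is visible; this
                # avoids yielding a number that was cut off at a chunk boundary.
                if rest[:1] in (",", "]"):
                    yield value
                    buffer = rest
                    continue
        if eof:
            raise ValueError("unexpected end of input")
        chunk = stream.read(chunk_size)
        eof = chunk == ""
        buffer += chunk


sample = io.StringIO('[{"id": 1}, {"id": 2}]')
for item in iter_array_items(sample, chunk_size=8):
    print(item)
```

Only the current element and at most one extra chunk are held in memory, so memory use depends on the largest single element rather than on the file size. On invalid input this simple version keeps reading until the end of the stream before raising.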

Additionally, this project contains adapters for easy integration into other programming languages. Currently, there is only a Python adapter, but more are planned.

Adapters

Python

The Python adapter is a wrapper around the underlying Rust program, providing seamless integration into Python applications. It is designed to have a similar feel to Python's built-in json module.

Why Not Just Use Python's json Library?

You might wonder why you would choose to use this library instead of Python's built-in JSON library. The answer depends on your specific use case.

If you are parsing a small JSON file, Python's JSON library is likely sufficient and performs well. However, when dealing with very large JSON files that exceed the available memory, JSON Lineage offers significant benefits.

Python's JSON library is written in C and is highly optimised for speed. However, it loads the entire JSON file into memory, making it unsuitable for parsing very large JSON files. This is where JSON Lineage shines.

JSON Lineage is specifically designed to parse very large JSON files that would not fit into memory. It achieves this by parsing the JSON file one object at a time.

Functionality

The following functionality is provided:

  • load - Generates an iterator that yields each object in a JSON file.
  • aload - Generates an asynchronous iterator that yields each object in a JSON file.

A CLI is also provided for easy conversion of JSON files to JSONL files. For information on how to use the CLI, run: python -m json_lineage --help.
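To make the target format concrete, the following sketch shows what JSONL looks like for a small list of objects. It uses only Python's built-in json module and loads the whole input, so it illustrates the output format, not the CLI or its memory behaviour. The helper name to_jsonl is hypothetical.

```python
import json


def to_jsonl(objects):
    """Serialise an iterable of JSON-compatible values as JSON Lines text."""
    # json.dumps escapes newlines inside strings, so each value stays on one line.
    return "".join(json.dumps(obj) + "\n" for obj in objects)


records = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]')
print(to_jsonl(records), end="")
# {"id": 1, "name": "a"}
# {"id": 2, "name": "b"}
```

Because every line is a complete JSON value, a JSONL file can be read line by line without parsing the lines that follow.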

Performance Comparison

The following graphs compare the speed and memory usage of Python's JSON library vs JSON Lineage.

The benchmarks show that up to a file size of 500MB, the speed difference is negligible. However, already at this point, Python requires almost 2GB of memory to parse the JSON file, while JSON Lineage only requires 1.5MB.

As the file size continues to grow, Python's JSON library continues to be faster, but the memory usage continues to grow at a linear rate. JSON Lineage, in contrast, continues to use the same amount of memory.
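A comparison of this kind can be reproduced on your own data with the standard library. The sketch below is not the project's benchmarking script; it is a minimal, assumed approach that times a callable and records its peak Python-level memory with tracemalloc. The helper name measure is hypothetical.

```python
import json
import time
import tracemalloc


def measure(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for a single call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak


payload = json.dumps([{"id": i, "name": f"item-{i}"} for i in range(10_000)])
result, elapsed, peak = measure(json.loads, payload)
print(f"parsed {len(result)} objects in {elapsed:.4f}s, peak {peak / 1024:.0f} KiB")
```

tracemalloc only sees allocations made by the Python interpreter, so memory used by a separate native process would have to be measured by other means, and tracing itself slows the timed call.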

Benchmark of difference in time as file size grows

Benchmark of difference in memory as file size grows

Installation

pip install json-lineage

Usage

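Iterating over a JSON file

from json_lineage import load

# load returns an iterator that yields one object at a time.
jsonl_iter = load("path/to/file.json")

for obj in jsonl_iter:
    do_something(obj)  # do_something is a placeholder for your own processing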

Iterating over a JSON file asynchronously

import asyncio
from random import randint

from json_lineage import aload

# aload returns an asynchronous iterator over the objects in the file.
jsonl_iter = aload("path/to/file.json")


async def do_something(i):
    # Simulate slow, I/O-bound work on a single object.
    await asyncio.sleep(randint(1, 2))
    print(i)


async def main():
    tasks = []
    async for i in jsonl_iter:
        tasks.append(asyncio.create_task(do_something(i)))

    await asyncio.gather(*tasks)


asyncio.run(main())
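
The asynchronous example creates one task per object and keeps all of them until the end, which may hold many objects in memory for a very large file. One possible variation, sketched here under assumptions, caps the number of tasks in flight. The sketch uses a stand-in async generator (fake_aload, hypothetical) so that it runs without the library; it assumes the iterator returned by aload could be passed in its place. The names process_bounded and limit are also hypothetical.

```python
import asyncio


async def fake_aload(n):
    """Hypothetical stand-in for an async iterator of parsed objects."""
    for i in range(n):
        yield {"id": i}


async def process_bounded(objects, handler, limit=10):
    """Run handler on each object with at most `limit` tasks in flight."""
    pending = set()
    results = []
    async for obj in objects:
        if len(pending) >= limit:
            # Wait until at least one running task finishes before adding more.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        pending.add(asyncio.create_task(handler(obj)))
    if pending:
        done, _ = await asyncio.wait(pending)
        results.extend(task.result() for task in done)
    return results  # completion order, not input order


async def handle(obj):
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    return obj["id"]


ids = asyncio.run(process_bounded(fake_aload(25), handle, limit=4))
print(len(ids))
```

The results are collected in completion order, so sort or key them if the original order matters.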

Poorly Formatted JSON

When parsing a JSON file, the program assumes that the file is well formatted. If it is not, pass messy=True to either load or aload:

from json_lineage import load

jsonl_iter = load("path/to/file.json", messy=True)


for obj in jsonl_iter:
    do_something(obj)

The output is the same, but the file is parsed differently. This option makes the program slower, but it allows it to parse JSON files that are not well formatted.

If you are using the CLI, then you can use the --messy flag to achieve the same result.

Under the Hood

The underlying program is written in Rust. The full documentation for the underlying program can be found here.

Comments
  • Can improve performance by reading entire line at a time

    At present, the underlying Rust program reads and processes one character at a time. Realistically, most JSON files should be formatted correctly. Therefore, rather than reading and processing one character at a time, the entire line could be read in one go, using its first and last characters to decide the bracket state.

    enhancement 
    opened by Salaah01 0
Releases(v0.2.2)
  • v0.2.2(Jul 11, 2023)

    What's Changed

    • new bins, updated version by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/17
    • Python adapter converts each object provided by the bin stdout to dictionaries @Salaah01 in https://github.com/Salaah01/json-lineage/pull/18
    • bumped version and updated img. paths on readme by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/19

    Full Changelog: https://github.com/Salaah01/json-lineage/compare/v0.2.1...v0.2.2

  • v0.2.1(Jul 1, 2023)

    What's Changed

    • more performance updates and updated docs by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/13
    • changed benchmark output by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/14
    • benchmarking script and updated readme by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/15
    • updated rust docs by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/16

    Full Changelog: https://github.com/Salaah01/json-lineage/compare/v0.2.0...v0.2.1

  • v0.2.0(Jun 28, 2023)

    What's Changed

    • Performance improvements by @Salaah01 in https://github.com/Salaah01/json-lineage/pull/12

    Full Changelog: https://github.com/Salaah01/json-lineage/compare/v0.1.0...v0.2.0

  • v0.1.0(Jun 25, 2023)

    What's Changed

    • Rust library for converting JSON to JSONL
    • Python adapter for reading JSON iteratively with sync and async functions

    New Contributors

    • @Salaah01 made their first contribution in https://github.com/Salaah01/json-lineage/pull/1

    Full Changelog: https://github.com/Salaah01/json-lineage/commits/v0.1.0

Owner
Salaah Amin
Co-Founder of Bluish Pink (bluishpink.com), Developer at @sky-uk. Software development enthusiast.