Command line tool for inspecting Parquet files

Overview

  • pqrs is a command line tool for inspecting Parquet files
  • It is a replacement for the parquet-tools utility, written in Rust
  • Built using the Rust implementation of Parquet and Arrow
  • pqrs roughly means "parquet-tools in rust"

Installation

Recommended Method

You can download release binaries from the project's GitHub releases page.

Alternative methods

Using Homebrew

For macOS users, pqrs is available as a Homebrew tap.

brew tap manojkarthick/pqrs
brew install pqrs

Using nix

If you are a nix user, you can install pqrs from nixpkgs

nix-env --install pqrs

Building and running from source

Make sure you have rustc and cargo installed on your machine.

git clone https://github.com/manojkarthick/pqrs.git
cd pqrs
cargo build --release
./target/release/pqrs
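
Before building, it can help to confirm that the required tools are actually on your PATH. The following is a minimal sketch and not part of pqrs; `missing_tools` is a hypothetical helper name introduced here for illustration.

```shell
# Report which of the given tools are not on PATH.
# Prints one "missing: <tool>" line per absent tool, and nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done
}

# Usage (hypothetical): check the toolchain before cloning and building.
# missing_tools git rustc cargo
```

If the function prints nothing, all of the listed tools were found.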

Running

The snippet below shows the available subcommands:

❯ pqrs --help
pqrs 0.1.1
Manoj Karthick
Apache Parquet command-line utility

USAGE:
    pqrs [FLAGS] [SUBCOMMAND]

FLAGS:
    -d, --debug      Show debug output
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    cat         Prints the contents of Parquet file(s)
    head        Prints the first n records of the Parquet file
    help        Prints this message or the help of the given subcommand(s)
    merge       Merge file(s) into another parquet file
    rowcount    Prints the count of rows in Parquet file(s)
    sample      Prints a random sample of records from the Parquet file
    schema      Prints the schema of Parquet file(s)
    size        Prints the size of Parquet file(s)

Subcommand: cat

Prints the contents of the parquet file in a JSON-like format by default. Use --json for JSON output.

❯ pqrs cat data/cities.parquet
{continent: "Europe", country: {name: "France", city: ["Paris", "Nice", "Marseilles", "Cannes"]}}
{continent: "Europe", country: {name: "Greece", city: ["Athens", "Piraeus", "Hania", "Heraklion", "Rethymnon", "Fira"]}}
{continent: "North America", country: {name: "Canada", city: ["Toronto", "Vancouver", "St. John's", "Saint John", "Montreal", "Halifax", "Winnipeg", "Calgary", "Saskatoon", "Ottawa", "Yellowknife"]}}
❯ pqrs cat data/cities.parquet --json
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}
{"continent":"North America","country":{"name":"Canada","city":["Toronto","Vancouver","St. John's","Saint John","Montreal","Halifax","Winnipeg","Calgary","Saskatoon","Ottawa","Yellowknife"]}}
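
Several issues listed under Comments report that some versions print a file header before the rows, which gets in the way when piping the JSON output into other tools. A pull request listed there proposes moving the header to stderr. Until then, one possible workaround is to filter the output. This is a minimal sketch, assuming every record is emitted as a single line beginning with `{` (as in the examples above); `json_rows` is a hypothetical helper name, not a pqrs feature.

```shell
# Keep only lines that look like JSON records (they start with "{").
# Header lines, separator lines and blank lines are dropped.
# "|| true" keeps the filter from failing when no record lines are present.
json_rows() {
    grep '^{' || true
}

# Usage (assumes pqrs is on PATH):
# pqrs cat --json data/cities.parquet | json_rows
```

The filter relies only on the shape of the output, so it passes rows through unchanged when no header is printed.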

Subcommand: head

Prints the first N records of the parquet file. Use the --records flag to set the number of records.

❯ pqrs head data/cities.parquet --json --records 2
{"continent":"Europe","country":{"name":"France","city":["Paris","Nice","Marseilles","Cannes"]}}
{"continent":"Europe","country":{"name":"Greece","city":["Athens","Piraeus","Hania","Heraklion","Rethymnon","Fira"]}}

Subcommand: merge

Merges two Parquet files by placing the row groups (or blocks) of one file after those of the other.

Disclaimer: this does not combine the files into optimized row groups. Do not use it in production!

❯ pqrs merge --input data/pems-1.snappy.parquet data/pems-2.snappy.parquet --output data/pems-merged.snappy.parquet

❯ ls -al data
total 408
drwxr-xr-x   6 manojkarthick  staff     192 Feb 14 08:53 .
drwxr-xr-x  20 manojkarthick  staff     640 Feb 14 08:52 ..
-rw-r--r--   1 manojkarthick  staff     866 Feb  8 19:50 cities.parquet
-rw-r--r--   1 manojkarthick  staff   16468 Feb  8 19:50 pems-1.snappy.parquet
-rw-r--r--   1 manojkarthick  staff   17342 Feb  8 19:50 pems-2.snappy.parquet
-rw-r--r--   1 manojkarthick  staff  160950 Feb 14 08:53 pems-merged.snappy.parquet

Subcommand: rowcount

Print the number of rows present in the parquet file.

❯ pqrs rowcount data/pems-1.snappy.parquet data/pems-2.snappy.parquet
File Name: data/pems-1.snappy.parquet: 2693 rows
File Name: data/pems-2.snappy.parquet: 2880 rows
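
To get a single total across several files, the per-file counts can be summed. This is a minimal sketch, assuming the output format shown above (each line ending in `<count> rows`), which may differ between versions; `total_rows` is a hypothetical helper, not a pqrs feature.

```shell
# Sum the row counts from `pqrs rowcount` output read on stdin.
# Only lines whose last field is "rows" are counted; the count is the field before it.
total_rows() {
    awk '$NF == "rows" { total += $(NF - 1) } END { print total + 0 }'
}

# Usage (assumes pqrs is on PATH):
# pqrs rowcount data/pems-1.snappy.parquet data/pems-2.snappy.parquet | total_rows
```

For the two example files above, this would print 5573 (2693 + 2880).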

Subcommand: sample

Prints a random sample of records from the given parquet file.

❯ pqrs sample data/pems-1.snappy.parquet --records 3
{timeperiod: "01/17/2016 07:01:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 07:47:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}
{timeperiod: "01/17/2016 09:44:27", flow1: 0, occupancy1: 0E0, speed1: 0E0, flow2: 0, occupancy2: 0E0, speed2: 0E0, flow3: 0, occupancy3: 0E0, speed3: 0E0, flow4: null, occupancy4: null, speed4: null, flow5: null, occupancy5: null, speed5: null, flow6: null, occupancy6: null, speed6: null, flow7: null, occupancy7: null, speed7: null, flow8: null, occupancy8: null, speed8: null}

Subcommand: schema

Print the schema from the given parquet file. Use the --detailed flag to get more detailed stats.

❯ pqrs schema data/cities.parquet
Metadata for file: data/cities.parquet

version: 1
num of rows: 3
created by: parquet-mr version 1.5.0-cdh5.7.0 (build ${buildNumber})
message hive_schema {
  OPTIONAL BYTE_ARRAY continent (UTF8);
  OPTIONAL group country {
    OPTIONAL BYTE_ARRAY name (UTF8);
    OPTIONAL group city (LIST) {
      REPEATED group bag {
        OPTIONAL BYTE_ARRAY array_element (UTF8);
      }
    }
  }
}
❯ pqrs schema data/cities.parquet --detailed

num of row groups: 1
row groups:

row group 0:
--------------------------------------------------------------------------------
total byte size: 466
num of rows: 3

num of columns: 3
columns:

column 0:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "continent"
encodings: BIT_PACKED PLAIN_DICTIONARY RLE
file path: N/A
file offset: 4
num of values: 3
total compressed size (in bytes): 93
total uncompressed size (in bytes): 93
data page offset: 4
index page offset: N/A
dictionary page offset: N/A
statistics: {min: [69, 117, 114, 111, 112, 101], max: [78, 111, 114, 116, 104, 32, 65, 109, 101, 114, 105, 99, 97], distinct_count: N/A, null_count: 0, min_max_deprecated: true}

<....output clipped>

Subcommand: size

Prints the compressed or uncompressed size of the parquet file. The uncompressed size is shown by default; use --compressed for the compressed size and --pretty for human-readable units.

❯ pqrs size data/pems-1.snappy.parquet --pretty
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Uncompressed Size: 61 KiB
❯ pqrs size data/pems-1.snappy.parquet --pretty --compressed
Size in Bytes:

File Name: data/pems-1.snappy.parquet
Compressed Size: 12 KiB
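
The two figures above can be turned into a rough compression ratio. This is a small sketch with a hypothetical helper, `compression_ratio`. Because the --pretty values are rounded to whole KiB, the result is only approximate.

```shell
# Ratio of uncompressed size to compressed size, to one decimal place.
# Both arguments must use the same unit; a zero compressed size is rejected.
compression_ratio() {
    awk -v u="$1" -v c="$2" 'BEGIN { if (c == 0) exit 1; printf "%.1f\n", u / c }'
}

# Usage with the values from the example above (61 KiB uncompressed, 12 KiB compressed):
# compression_ratio 61 12
```

With the example values, this prints 5.1, meaning the data takes roughly five times less space on disk than it does uncompressed.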

TODO

  • Add crate
  • Test on Windows

Comments

  • Don't show file header when outputting json

    When using pqrs cat --json the output still contains the file headers for each file, making it much less useful for quickly converting a file or a bunch of files to JSON.

    I feel like the headers should not be present at all when outputting JSON or CSV. There could be an additional flag to add them.

    opened by theduke 3
  • `merge` uses a lot of memory

    Feature request!

    Is it possible for merge to merge files without decompressing and recompressing them?


    My usecase:

    My parquet generator makes 1GB row groups (in memory) and writes them to individual parquet files: <40MB on disk, one row group per file. (It does this because it can't be bothered to deal with schema variations, which is another problem.)

    I'd like to concatenate these files: take the row group out of every file that has the same schema, and make one big file with multiple row groups and exactly the same schema.

    The current merge implementation can do this, but needs >>200GB of memory to merge 8GB of parquet files, which is not ideal.

    opened by FauxFaux 3
  • Moved cat "header" to stderr for easier piping

    Fixes Issue #20 by moving the cat "header" to stderr instead of stdout to allow for piping.

    Feel free to reject this if you think that a --quiet flag might be a better option.

    opened by juan-riveros 3
  • BUG

    joserfjunior@Clodovil Downloads % pqrs cat FND_USER_CLEAN.parquet --json

    ############################ File: FND_USER_CLEAN.parquet ############################

    thread 'main' panicked at 'No such local time', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/offset/mod.rs:173:34
    note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

    opened by JoseRFJuniorBigData 2
  • Schema Command Should Support Structured Output

    Stumbled upon this project and it looks like a real missing link in parquet tooling.

    For the schema subcommand, it would be nice if there was an optional way to output the data in a structured way (eg json) for consumption by other tools. Something like pqrs schema data.parquet --json. Ideally both the simple and detailed flags, but even capturing only the non-detailed data would be handy.

    opened by dbready 1
  • Dependencies update + redone commands in clap 3.0 + clippy

    This merge request contains only:

    • arrow 12
    • parquet 12
    • clap 3.0, with commands rewritten for clap::Parser. I have removed the PQRSCommand trait in favor of structs.
    • The cat and head commands use the Json/Csv writers from arrow.
    • fmt + clippy fixes

    All tests pass.

    The reason for this MR is that it will make it easier to add new sub-commands.

    opened by mateuszkj 1
  • Silence non-row output in `pqrs cat`

    I have to convert formats from time to time and being able to just pipe the outputs is super helpful.

    Unfortunately having the extra 6 lines means having a (slightly) more complicated command which is harder to share among my team.

    pqrs cat -j foo.parquet | tail -n +6 |jq -c . - | gzip -c > foo.ndjson.gz

    Adding a --quiet/-q flag to mute any non-data output would be neat, or possibly just moving that type of output to stderr instead (i.e. changing lines 110-112 from println! to eprintln!)?

    opened by juan-riveros 1
  • Directory support in `head` / pipe support

    Hi, we're very happily using pqrs now and found two small issues with it:

    1. head does not support directories:
    #> pqhead data.parquet 
    Error: ParquetError(General("underlying IO error: Is a directory (os error 21)"))
    
    2. It panics when used in a pipe:
    #> pqcat data.parquet | head
    
    ###########################################################################################################################################################################################################
    File: data.parquet/d66ac6554cc44c3cbfaa56b75fa446e4.parquet
    ###########################################################################################################################################################################################################
    
    [...]
    thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', library/std/src/io/stdio.rs:935:9
    
    opened by Hoeze 1
  • Support for parquet directories

    Hi, I just tried your tool and I find it super useful :+1: However, I noticed that one cannot simply cat whole directories:

    # pqrs cat vcf.parquet 
    Error: ParquetError(General("underlying IO error: Is a directory (os error 21)"))
    

    However, this works: pqrs cat vcf.parquet/**/*.parquet

    It would be a nice convenience function if pqrs could handle this case.

    opened by Hoeze 1
  • Support JSON output for schema subcommand

    Cannot be used alongside --detailed flag currently. Will add support to view detailed json in a future PR. Closes https://github.com/manojkarthick/pqrs/issues/25.

    opened by manojkarthick 0
  • Add support for directories and CSV output for `cat` command

    Directory support:

    • Adds support for directories as a parameter to the cat command.
    • When a directory is provided, the directory is traversed recursively and all the non-hidden, non-symlink files are displayed.

    CSV support:

    • Adds support for displaying Parquet files in CSV format
    • CSV output format is not supported if the Parquet schema contains Struct or Byte fields.
    • If one of the input files for the cat command does not support CSV format, an error message is displayed and that file is skipped. The contents of the other files will still be displayed.

    Version bump: Bumping the version of the package to v0.2.0

    opened by manojkarthick 0
  • Fails to read file(s)

    When calling head or cat on a single largish file, pqrs v0.2.2 (and previous versions) attempts to open lots of files. Why? I only want to read one line at a time and translate .parquet to .csv:

    $ pqrs cat --csv live_int8.parquet > live.csv
    
    #######################
    File: live_int8.parquet
    #######################
    
    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-12.0.0/src/util/io.rs:82:50
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by liborty 0
  • consider removing first few lines of output from pqrs cat --json?

    I would love to pipe the contents of pqrs cat --json <filename> into jq or other tools. Currently the header from pqrs cat prevents me from doing this.

    
    ######################################################
    File: ../myfolder/myfile.parquet
    ######################################################
    
    

    Is there a chance we can either remove this entirely or remove it specifically when used with the --json flag?

    My current version:

    ❯ pqrs --version
    pqrs 0.2.0
    
    opened by AlJohri 1
  • nightly required?

    Going forward will this tool only be available on nightly?

      Installing pqrs v0.2.2
    error: failed to compile `pqrs v0.2.2`, intermediate artifacts can be found at `/tmp/cargo-installWmgUiA`
    
    Caused by:
      failed to download `once_cell v1.15.0`
    
    Caused by:
      unable to get packages from source
    
    Caused by:
      failed to parse manifest at `/home/amooren/.cargo/registry/src/github.com-1ecc6299db9ec823/once_cell-1.15.0/Cargo.toml`
    
    Caused by:
      feature `edition2021` is required
    
      this Cargo does not support nightly features, but if you
      switch to nightly channel you can add
      `cargo-features = ["edition2021"]` to enable this feature
    
    opened by mooreniemi 0
  • Hi, need help with the project?

    Hi,

    I was looking for a command line util to read Parquet files and found your project. I was wondering, would you be interested in contributions to the project, e.g. open issues, new features, etc?

    I'm a software engineer (mostly backend for professional work), and recently took an interest in learning Rust. Practice is the best way to learn so let me know what you think!

    Thanks, have a nice day

    Robo

    opened by robo555 4
  • Could not parse metadata

    Hi, with pqrs v0.2.1 I get the following error when trying to read a parquet file written with Polars:

    Error: ParquetError(General("Could not parse metadata: bad data"))
    

    Pandas can read it without issues. Could it be that there is some missing feature flag for the parquet reader? E.g. some missing compression library?

    opened by Hoeze 0
Releases(v0.2.2)
Owner
Manoj Karthick
Data Engineer.