๐Ÿ—„๏ธ A simple CLI for converting WARC to Parquet.

Overview

warc-parquet

๐Ÿ—„๏ธ A utility for converting WARC to Parquet.

📦 Install

The binary may be installed via cargo:

$ cargo install warc-parquet

To use the crate in your project, add the following to your Cargo.toml file:

[dependencies]
warc-parquet = "0.4"

🤸 Usage

The Binary

Once installed, the warc-parquet utility can be used to transform WARC into Parquet:

$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.snappy.parquet

warc-parquet is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.snappy.parquet
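
The concatenation above presumably leans on a property of the gzip format itself: gzip files joined end to end form a valid multi-member stream that decompresses to the joined payloads. A minimal sketch of that property, assuming only standard gzip and mktemp; plain text stands in for WARC records, the names a.gz and b.gz are placeholders, and warc-parquet itself is not invoked:

```shell
# Use a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# Two independent gzip files; plain text stands in for WARC records.
printf 'first\n' | gzip > "$workdir/a.gz"
printf 'second\n' | gzip > "$workdir/b.gz"

# Joining them with cat yields a multi-member gzip stream, which
# gzip -d decodes as the two payloads back to back.
combined=$(cat "$workdir/a.gz" "$workdir/b.gz" | gzip -d)
printf '%s\n' "$combined"

# Clean up the scratch directory.
rm -r "$workdir"
```

This prints first and second on separate lines, which is why no separate merge step is needed before piping several archives into a single conversion.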

It's also simple to preprocess the input with standard UNIX pipes:

$ cat example.warc.gz | gzip -d | warc-parquet > example.snappy.parquet

Various compression options, including the option to forgo compression altogether, are also available:

$ cat example.warc.gz | warc-parquet --gzipped --compression brotli > example.brotli.parquet

💡 warc-parquet --help displays complete options and usage information.

The Crate

Refer to the docs for more details about how to use the Reader within your own programs.

DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be DuckDB:

$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D select type, id from 'example.snappy.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│   type   │                       id                        │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
│ request  │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
│ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
│ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
│ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
│ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
└──────────┴─────────────────────────────────────────────────┘

D describe select * from 'example.snappy.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│       column_name       │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
Owner
Max Countryman