An ORC reader for Rust

Overview

This project contains tools for working with Apache ORC files from the Rust programming language.

ORC is an open source data format that lets you represent tables of data efficiently (think CSV, but with types, compression, indexing, etc.).

Please note that this software is not "open source", but the source is available for use and modification by individuals, non-profit organizations, and worker-owned businesses (see the license section below for details).

Example use case

I've recently been working with the Twitter Stream Grab, a data set published by the Archive Team and the Internet Archive that includes billions of tweets and Twitter user profiles collected between 2011 and 2021.

The Twitter Stream Grab is 5.2 terabytes of compressed JSON data, and around 50 terabytes uncompressed. It takes many hundreds of hours of computing time to parse this data, which makes repeated processing impractical for personal projects, or for projects by activist groups with limited resources.

Storing this much data can also be impractical. I personally spent several hundred dollars just getting a copy from the Internet Archive's servers to Berlin, and storing a (compressed) copy in S3 currently costs about $122 per month.

There are many kinds of derived data sets and products you might want to build from data like the Twitter Stream Grab. One example is this collection of several million Twitter user profile snapshots for accounts that were active in spreading false claims about voter fraud in 2020. I'm also running a web service that allows users to look up past screen names for Twitter accounts.

I'm using the ORC format to make building projects like these from this data more practical. The basic idea is that instead of re-processing the entire 50 terabytes of JSON data for each application, you parse it once to extract the user profiles (and other information) into a set of ORC tables.

This intermediate representation is also considerably more compact: for example, the original compressed data for December 2020 takes up about 60 gigabytes, but the ORC table I've built for data from that month takes up only about 21 gigabytes. This means that storing the ORC representation of the full 10 years of data costs only around $40 per month using a service like S3, but more importantly it means that it's much, much cheaper and easier to process or query the data.
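The storage figures above can be roughly reproduced with a little arithmetic. The following is only a sketch under two assumptions that are not stated in this README: a flat price of $0.023 per gigabyte per month, and the December 2020 size ratio (21 GB of ORC per 60 GB of compressed JSON) holding for the whole archive. The function names are made up for illustration.

```rust
/// Assumed flat storage price in USD per gigabyte per month (not from this README).
const PRICE_PER_GB_MONTH: f64 = 0.023;

/// Monthly cost in USD of storing `terabytes` of data (1 TB = 1024 GB here).
fn monthly_cost(terabytes: f64) -> f64 {
    terabytes * 1024.0 * PRICE_PER_GB_MONTH
}

/// Estimated ORC size, assuming the December 2020 ratio (21 GB of ORC per
/// 60 GB of compressed JSON) holds for the entire data set.
fn estimated_orc_terabytes(compressed_json_terabytes: f64) -> f64 {
    compressed_json_terabytes * 21.0 / 60.0
}

fn main() {
    let json_tb = 5.2;
    let orc_tb = estimated_orc_terabytes(json_tb);
    println!(
        "compressed JSON: {:.2} TB, ${:.0} per month",
        json_tb,
        monthly_cost(json_tb)
    );
    println!(
        "ORC (estimated): {:.2} TB, ${:.0} per month",
        orc_tb,
        monthly_cost(orc_tb)
    );
}
```

Under these assumptions the estimates land close to the $122 and roughly $40 per month quoted in this section. Actual prices vary by region and storage class.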

AWS's Athena lets you run SQL queries directly against ORC files stored in S3, for example. You can also use Athena to process CSV files in S3, but running any SQL query against compressed CSV files for the entire Twitter Stream Grab would cost at least $2.50 (since all of the two or three terabytes of compressed data have to be scanned), while querying ORC in Athena generally costs a tiny fraction of that, since the ORC format makes it possible to avoid scanning data that isn't relevant to the query.

Products like Athena are useful for exploring data like the Twitter Stream Grab, and ORC makes this practical in terms of cost and time. It's also possible to process the ORC files directly: for example, instead of spending hundreds of hours of computing time to build a relational database of Twitter user info from the raw JSON data, you can spend a few hours extracting the data from the ORC files.

Why this project?

The ORC format was developed to be a native storage format for Apache Hive, which is built on Hadoop, which is firmly in the Java ecosystem. I personally find Hive to be extremely annoying and painful to work with, and I'd rather not write Java.

There is also a C++ API for ORC, but I have a fair amount of related tooling already written in Rust, and I wanted to learn more about the internals of the ORC spec, so I decided to try to put together this implementation, and it only took a couple of days.

Use

The project currently provides one command-line tool that does a couple of things:

$ target/release/orcrs --help
orcrs 0.1.0
Travis Brown <[email protected]>

USAGE:
    orcrs [OPTIONS] <SUBCOMMAND>

OPTIONS:
    -h, --help       Print help information
    -v, --verbose    Level of verbosity
    -V, --version    Print version information

SUBCOMMANDS:
    export    Export the contents of the ORC file
    help      Print this message or the help of the given subcommand(s)
    info      Dump raw info about the ORC file

To list all profiles for verified Twitter accounts from the provided sample data, for example:

$ target/release/orcrs -vvv export --header --columns 0,3,9 examples/ts-10k-2020-09-20.orc | egrep -v "(false|,)$"
id,screen_name,verified
561595762,morinaga_pino,true
1746230882607849472,weareoneEXO,true
29363584,Sandi,true
2067989391190130694,WayV_official,true
36764368,AdamParkhomenko,true
53970806,stephengrovesjr,true
15327404,fox32news,true
1678598579585548288,Mippcivzla,true
158278844,fadlizon,true
79721594,alfredodelmazo,true
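If you'd rather do the filtering step in Rust than with egrep, here is a minimal sketch of an equivalent filter that reads the exported CSV on standard input. It is not part of orcrs, and the `keep_line` helper is a hypothetical name. It keeps the same lines as `egrep -v "(false|,)$"`: every line that does not end in `false` or in a trailing comma (an empty last column).

```rust
use std::io::{self, BufRead, Write};

/// Returns true for lines that `egrep -v "(false|,)$"` would keep:
/// anything that does not end in `false` or in a trailing comma.
fn keep_line(line: &str) -> bool {
    !(line.ends_with("false") || line.ends_with(','))
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();

    // Copy every kept line from standard input to standard output.
    for line in stdin.lock().lines() {
        let line = line?;
        if keep_line(&line) {
            writeln!(out, "{}", line)?;
        }
    }

    Ok(())
}
```

Like the egrep version, this keeps the header line, since it ends in `verified`.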

This tool can currently export around 10 million rows of this data from an 886 megabyte ORC file (representing one day from 2020) in about 5 seconds of wall-clock time:

$ time target/release/orcrs -vvv export --header --columns 0,3,9 /data/tsg/users/v2/2020-09-20.orc | wc
9705227 9705227 314287998

real    0m5.088s
user    0m6.048s
sys     0m0.349s

This is currently completely unoptimized and could be made at least a little faster.

Features

This project currently only supports reading ORC files (writing will probably stay out of scope unless I switch to using bindings to the ORC C++ API at some point).

Feature               | Status | Notes
Integer types         | ✔️     |
String types          | ✔️     |
Floating point types  |        | Coming soon
Date types            |        |
Compound types        |        |
Zlib compression      | ✔️     |
Zstandard compression | ✔️     |
Snappy compression    |        | Probably trivial
Column encryption     |        | Almost certainly permanently out of scope

Also note that right now these tools don't use the indices: you see every row in the file. So far this is fast enough for the things I need to do, but that will probably change in the future.

Known issues

This software is largely untested, undocumented, and unoptimized.

Developing

You'll need to install Rust and Cargo to build the project. Once you've got them, you can check out this repository and run cargo test (to run the tests) and cargo build --release (to build the command-line tool, which will be available as target/release/orcrs).

The Protobuf schemas for the metadata in the ORC file are not distributed with this repository, but they will be downloaded to $OUT_DIR/proto/ during the build. You can update this file as needed either manually or by changing the commit in build.rs. I got frustrated after 15 minutes of trying to figure out how to make the Protobuf code generation work properly with the build, so it's gone. You'll need to copy the scripts/build.rs file into the project directory in order to update the Protobuf schemas (but this shouldn't be necessary very often).

This repository also includes a Java project with some code that I used for generating ORC test data during development.

Previous work

There's a partial implementation of a few pieces of an ORC reader for Rust here. I've borrowed a couple of test cases for the byte run length encoding reader, but my implementation is otherwise unrelated.
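For readers unfamiliar with the byte run length encoding mentioned here, the following is a minimal sketch of a decoder based on my reading of the ORC specification. It is not the implementation used in this project, and `decode_byte_rle` is a hypothetical name. A control byte from 0x00 to 0x7f introduces a run of 3 to 130 copies of the next byte. A control byte from 0x80 to 0xff is a negated count (as a signed byte) of 1 to 128 literal bytes that follow.

```rust
/// Decodes ORC byte run length encoding into a vector of bytes.
/// Returns None if the input ends in the middle of a run or a literal group.
fn decode_byte_rle(input: &[u8]) -> Option<Vec<u8>> {
    let mut output = Vec::new();
    let mut pos = 0;

    while pos < input.len() {
        let control = input[pos];
        pos += 1;

        if control < 0x80 {
            // Run: the next byte is repeated (control + 3) times.
            let value = *input.get(pos)?;
            pos += 1;
            output.extend(std::iter::repeat(value).take(control as usize + 3));
        } else {
            // Literals: the control byte is the negated count as a signed byte.
            let count = 256 - control as usize;
            let literals = input.get(pos..pos + count)?;
            pos += count;
            output.extend_from_slice(literals);
        }
    }

    Some(output)
}

fn main() {
    // The two examples given in the ORC specification.
    assert_eq!(decode_byte_rle(&[0x61, 0x00]), Some(vec![0u8; 100]));
    assert_eq!(decode_byte_rle(&[0xfe, 0x44, 0x45]), Some(vec![0x44, 0x45]));
    println!("byte RLE examples decoded as expected");
}
```

A real reader would decode incrementally from a stream rather than materializing the whole output, but the control byte logic is the same.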

Future work

I'll probably continue to add support for ORC format features as I need them. Eventually it'd be nice to have Rust bindings for the C++ API, and I may end up doing that here.

License

This software is published under the Anti-Capitalist Software License (v. 1.4).

Comments
  • Small optimization

    Optimization isn't a priority right now, but I just glanced at a flamegraph, and this function was the one obvious problem (20.23% of total time, etc.). The quick clean-up here results in almost a 20% speed-up in running times for real data.

    opened by travisbrown 0
  • Support Snappy compression

    This should be pretty easy to add, probably via this Rust library.

    At a glance there are a couple of issues I see:

    1. I'm not sure whether ORC uses the Snappy frame format or just the raw encoded bytes.
    2. Unlike the Zlib and Zstandard decoders I'm using now, snap::read::FrameDecoder doesn't provide an into_inner to take back ownership of the reader, which means we might have to adjust a few things higher up in the implementation here.

    I don't need this so it probably won't get worked on, but it'd be nice to have.
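On the first point, it may help to look at how the ORC specification frames compressed data. As far as I understand it, each compression chunk is preceded by a 3-byte little-endian header holding `length * 2 + is_original`. That suggests the codec itself sees unframed bytes, but this is an assumption to check against the spec, not something confirmed here. A minimal sketch of parsing that header, with hypothetical names:

```rust
/// Parsed form of the 3-byte header that precedes each compression chunk.
#[derive(Debug, PartialEq)]
struct ChunkHeader {
    /// Number of bytes in the chunk body that follows the header.
    length: usize,
    /// True if the body is stored as-is because it did not compress.
    is_original: bool,
}

/// Decodes the little-endian value `length * 2 + is_original` from three bytes.
fn parse_chunk_header(bytes: [u8; 3]) -> ChunkHeader {
    let raw = (bytes[0] as usize) | ((bytes[1] as usize) << 8) | ((bytes[2] as usize) << 16);
    ChunkHeader {
        length: raw >> 1,
        is_original: (raw & 1) == 1,
    }
}

fn main() {
    // Examples from the ORC specification: a chunk that compressed to
    // 100,000 bytes, and 5 bytes that were stored uncompressed.
    println!("{:?}", parse_chunk_header([0x40, 0x0d, 0x03]));
    println!("{:?}", parse_chunk_header([0x0b, 0x00, 0x00]));
}
```

If the chunk body is raw Snappy data, a raw (non-frame) decoder applied to a slice of `length` bytes would avoid the reader ownership question in the second point, since no reader is consumed.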

    opened by travisbrown 1
Releases(v0.4.0)
  • v0.4.0(Jan 26, 2022)

    This release introduces Serde support (#7), which makes working with the library much nicer. Here's a complete example program:

    use orcrs::OrcFile;
    
    #[derive(serde::Deserialize, Debug)] 
    struct TwitterUser {
        id: u64,
        screen_name: String,
        verified: Option<bool>,
        url: Option<String>
    }
    
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let args = std::env::args().collect::<Vec<_>>();
        let mut file = OrcFile::open(&args[1])?;
    
        for user in file.deserialize::<TwitterUser>() {
            println!("{:?}", user?);
        }
    
        Ok(())
    }
    

    It works like this on the provided Twitter Stream Grab example file:

    $ target/debug/demo examples/ts-1k-zlib-2020-09-20.orc | tail
    TwitterUser { id: 101367782, screen_name: "DopeyDisneyMom", verified: Some(false), url: None }
    TwitterUser { id: 21055382, screen_name: "dlnt", verified: Some(false), url: Some("http://www.disneylandnewstoday.com") }
    TwitterUser { id: 794014875529998341, screen_name: "cucknana", verified: Some(false), url: None }
    TwitterUser { id: 1163746138731585537, screen_name: "anipoke_PR", verified: Some(true), url: Some("https://www.tv-tokyo.co.jp/anime/pocketmonster/") }
    TwitterUser { id: 1094312216243822592, screen_name: "6mVrFYB5pV7FqsZ", verified: None, url: None }
    TwitterUser { id: 840474668855316481, screen_name: "6NNE6Bjwfh8RwZo", verified: Some(false), url: None }
    TwitterUser { id: 996378086932279298, screen_name: "marimomaster731", verified: Some(false), url: None }
    TwitterUser { id: 1252033298114555904, screen_name: "FDtYsWhT0CtspRl", verified: None, url: None }
    TwitterUser { id: 1235390933228609538, screen_name: "famima_reply", verified: Some(true), url: Some("http://www.family.co.jp/index.html") }
    TwitterUser { id: 2557161844, screen_name: "lemonteafloat", verified: None, url: None }
    

    The struct field names are matched to column names, and columns that are unused are not read.

Owner
Travis Brown
Functional programmer mostly.