A dataframe manipulation tool inspired by dplyr and powered by Polars.

Overview

dply is a command line tool for viewing, querying, and writing CSV and Parquet files, inspired by dplyr and powered by Polars.

Usage overview

A dply pipeline consists of a series of functions that read, transform, or write data to disk.

The following pipeline reads a Parquet file [1], computes the minimum, mean, and maximum fare for each payment type, saves the result to a fares.csv file, and shows the result:

$ dply -c 'parquet("nyctaxi.parquet") |
    group_by(payment_type) |
    summarize(
        min_price = min(total_amount),
        mean_price = mean(total_amount),
        max_price = max(total_amount)
    ) |
    arrange(payment_type) |
    csv("fares.csv") |
    show()'
shape: (5, 4)
┌──────────────┬───────────┬────────────┬───────────┐
│ payment_type ┆ min_price ┆ mean_price ┆ max_price │
│ ---          ┆ ---       ┆ ---        ┆ ---       │
│ str          ┆ f64       ┆ f64        ┆ f64       │
╞══════════════╪═══════════╪════════════╪═══════════╡
│ Cash         ┆ -61.85    ┆ 18.07      ┆ 86.55     │
│ Credit card  ┆ 4.56      ┆ 22.969491  ┆ 324.72    │
│ Dispute      ┆ -55.6     ┆ -0.145161  ┆ 54.05     │
│ No charge    ┆ -16.3     ┆ 0.086667   ┆ 19.8      │
│ Unknown      ┆ 9.96      ┆ 28.893333  ┆ 85.02     │
└──────────────┴───────────┴────────────┴───────────┘
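
Before building a full pipeline it can help to inspect the input schema; the glimpse function listed below prints an overview of the dataframe's columns and types. A minimal sketch using the same sample file:

$ dply -c 'parquet("nyctaxi.parquet") | glimpse()'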

Supported functions

dply supports the following functions:

  • arrange Sorts rows by column values
  • count Counts unique values in columns
  • csv Reads or writes a dataframe in CSV format
  • distinct Retains unique rows
  • filter Filters rows that satisfy given predicates
  • glimpse Shows a dataframe overview
  • group_by and summarize Perform grouped aggregations
  • head Shows the first few dataframe rows in table format
  • joins Performs left, inner, outer, and cross joins
  • mutate Creates or mutates columns
  • parquet Reads or writes a dataframe in Parquet format
  • relocate Moves columns to new positions
  • rename Renames columns
  • select Selects columns
  • show Shows all dataframe rows
  • unnest Expands list columns into rows

More examples can be found in the tests folder.
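As a further sketch combining several of the functions above on the sample file, the pipeline below keeps two columns, filters rows, and shows the first few results (the comparison predicate syntax is assumed to follow the dplyr style shown in filter's description):

$ dply -c 'parquet("nyctaxi.parquet") |
    select(payment_type, total_amount) |
    filter(total_amount > 20) |
    arrange(total_amount) |
    head(5)'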

Installation

Binaries for Linux, macOS (x86), and Windows, generated by the release GitHub action, are available on the releases page.

You can also install dply using Cargo:

cargo install dply

or by building it from this repository:

git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .

Footnotes

  1. The file nyctaxi.parquet in the tests/data folder is a 250-row Parquet file sampled from the NYC trip record data.

