A dataframe manipulation tool inspired by dplyr and powered by polars.

Last update: May 29, 2023

Overview

dply is a command line tool for viewing, querying, and writing csv and parquet files, inspired by dplyr and powered by polars.

Usage overview

A dply pipeline consists of a number of functions to read, transform, or write data to disk.

The following pipeline reads a parquet file¹, computes the minimum, mean, and maximum fare for each payment type, saves the result to the fares.csv file, and shows the result:

$ dply -c 'parquet("nyctaxi.parquet") |
    group_by(payment_type) |
    summarize(
        min_price = min(total_amount),
        mean_price = mean(total_amount),
        max_price = max(total_amount)
    ) |
    arrange(payment_type) |
    csv("fares.csv") |
    show()'
shape: (5, 4)
┌──────────────┬───────────┬────────────┬───────────┐
│ payment_type ┆ min_price ┆ mean_price ┆ max_price │
│ ---          ┆ ---       ┆ ---        ┆ ---       │
│ str          ┆ f64       ┆ f64        ┆ f64       │
╞══════════════╪═══════════╪════════════╪═══════════╡
│ Cash         ┆ -61.85    ┆ 18.07      ┆ 86.55     │
│ Credit card  ┆ 4.56      ┆ 22.969491  ┆ 324.72    │
│ Dispute      ┆ -55.6     ┆ -0.145161  ┆ 54.05     │
│ No charge    ┆ -16.3     ┆ 0.086667   ┆ 19.8      │
│ Unknown      ┆ 9.96      ┆ 28.893333  ┆ 85.02     │
└──────────────┴───────────┴────────────┴───────────┘
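To make the aggregation step concrete, here is a minimal plain-Python sketch of what the group_by and summarize stages compute. It is not how dply or polars implement it; the sample rows and the helper name summarize_fares are hypothetical, and only the column names are taken from the pipeline above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real nyctaxi.parquet data is not reproduced here.
trips = [
    {"payment_type": "Cash", "total_amount": 10.0},
    {"payment_type": "Cash", "total_amount": 20.0},
    {"payment_type": "Credit card", "total_amount": 5.5},
    {"payment_type": "Credit card", "total_amount": 30.5},
    {"payment_type": "Credit card", "total_amount": 12.0},
]


def summarize_fares(rows):
    """Group rows by payment_type, then compute min, mean, and max of total_amount."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["payment_type"]].append(row["total_amount"])
    # sorted() stands in for the arrange(payment_type) step.
    return [
        {
            "payment_type": key,
            "min_price": min(values),
            "mean_price": mean(values),
            "max_price": max(values),
        }
        for key, values in sorted(groups.items())
    ]


for summary in summarize_fares(trips):
    print(summary)
```

Each output row corresponds to one group, which is why the dply result above has one line per payment type.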

Supported functions

dply supports the following functions:

arrange                 Sorts rows by column values
count                   Counts unique values in columns
csv                     Reads or writes a dataframe in CSV format
distinct                Retains unique rows
filter                  Filters rows that satisfy given predicates
glimpse                 Shows a dataframe overview
group_by and summarize  Performs grouped aggregations
head                    Shows the first few dataframe rows in table format
joins                   Left, inner, outer, and cross joins
mutate                  Creates or mutates columns
parquet                 Reads or writes a dataframe in Parquet format
relocate                Moves column positions
rename                  Renames columns
select                  Selects columns
show                    Shows all dataframe rows
unnest                  Expands list columns into rows

More examples can be found in the tests folder.
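As an illustration of one entry in the list, the sketch below shows in plain Python what expanding a list column into rows means. It is not dply's implementation: the helper name and sample rows are hypothetical, and the handling of empty or null lists (one row with a null value) is an assumption, not something the list above specifies.

```python
def unnest(rows, column):
    """Expand a list-valued column so that each element gets its own row."""
    expanded = []
    for row in rows:
        values = row[column]
        if not values:
            # Assumption: an empty or null list yields a single row with None.
            expanded.append({**row, column: None})
            continue
        for value in values:
            # Every other column is repeated for each list element.
            expanded.append({**row, column: value})
    return expanded


# Hypothetical rows with a list column named "tags".
rows = [
    {"shape_id": 1, "tags": ["tag1", "tag9"]},
    {"shape_id": 2, "tags": []},
]

for row in unnest(rows, "tags"):
    print(row)
```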

Installation

Binaries generated by the release GitHub Action for Linux, macOS (x86), and Windows are available on the releases page.

You can also install dply using Cargo:

cargo install dply

or by building it from this repository:

git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .

¹ The file nyctaxi.parquet in the tests/data folder is a 250-row Parquet file sampled from the NYC trip record data.


Comments

How to specify select columns with names containing `-`
Thanks for the great tool!! I've been looking for something like this for a while, the pandas-backed options weren't working well for me!

I am wondering how best to specify column names that aren't valid Python names. Currently, if a column name includes, for example, a minus sign (-), it doesn't appear to be possible to select the column by simply quoting the name.

I was hoping this would work, but I get an error:

$ dply -c 'parquet("myfile.parquet") | select( column1, column2, "column-3", column_4) | show()'
Error: Match failure '"column-3"' must be an identifier

In the meantime I've been able to use the following as a workaround:

$ dply -c 'parquet("myfile.parquet") | select( column1, column2, starts_with("column-3"), column_4) | show()'
enhancement
opened by tikkanz 2

Add filter for list columns

Add support for filtering on list column values:

$ dply -c 'parquet("lists.parquet") |
    filter(list_contains(tags, "ag9")) |
    head(5)'
shape: (5, 4)
┌──────────┬───────────────┬───────────────────┬────────────────────────────┐
│ shape_id ┆ ints          ┆ floats            ┆ tags                       │
│ ---      ┆ ---           ┆ ---               ┆ ---                        │
│ u32      ┆ list[u32]     ┆ list[f64]         ┆ list[str]                  │
╞══════════╪═══════════════╪═══════════════════╪════════════════════════════╡
│ 2        ┆ [73]          ┆ [3.5, 15.0, 23.0] ┆ ["tag9"]                   │
│ 10       ┆ [6]           ┆ [2.5, 3.5, … 5.0] ┆ ["tag1", "tag3", … "tag9"] │
│ 13       ┆ [9, 23, … 92] ┆ null              ┆ ["tag1", "tag5", … "tag9"] │
│ 20       ┆ [4]           ┆ null              ┆ ["tag1", "tag5", "tag9"]   │
│ 22       ┆ [52, 96]      ┆ [6.0]             ┆ ["tag4", "tag6", "tag9"]   │
└──────────┴───────────────┴───────────────────┴────────────────────────────┘
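The predicate in this example can be pictured with a short plain-Python sketch. It assumes that list_contains matches a row when any list element contains the pattern as a substring, which is what the "ag9" pattern matching "tag9" values above suggests; the exact matching rule is not stated here, and the sample rows are hypothetical.

```python
def list_contains(values, pattern):
    """Return True when any element of the list contains the pattern."""
    # Assumption: substring matching; a null list never matches.
    return values is not None and any(pattern in value for value in values)


# Hypothetical rows with a list[str] column named "tags".
rows = [
    {"shape_id": 1, "tags": ["tag1", "tag3"]},
    {"shape_id": 2, "tags": ["tag9"]},
    {"shape_id": 3, "tags": None},
    {"shape_id": 4, "tags": ["tag4", "tag6", "tag9"]},
]

# Keep matching rows, then take at most five of them, like filter(...) | head(5).
matches = [row for row in rows if list_contains(row["tags"], "ag9")][:5]

for row in matches:
    print(row)
```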

opened by vincev 0

Releases (v0.1.5)

v0.1.5 (May 29, 2023)

Changed 🔧

Enable unnest to work on struct columns.

Add inner_join, left_join, cross_join, and outer_join.

Add semicolon pipeline separator.

summarize: Add list function for creating list columns from grouped data.

Update to Polars 0.30

dply-0.1.5-x86_64-macos-latest.zip(10.71 MB)
dply-0.1.5-x86_64-ubuntu-latest.zip(13.80 MB)
dply-0.1.5-x86_64-windows-latest.zip(9.79 MB)

v0.1.4 (May 16, 2023)

Changed 🔧

Add unnest function for list columns.

dply-v0.1.4-x86_64-macos-latest.zip(10.34 MB)
dply-v0.1.4-x86_64-ubuntu-latest.zip(13.40 MB)
dply-v0.1.4-x86_64-windows-latest.zip(9.37 MB)

v0.1.3 (May 15, 2023)

Changed 🔧

Update to Polars 0.29

filter: Add contains predicate for string and list columns.

filter: Add is_null predicate.

summarize: Now works on ungrouped data.

mutate: Add len function for list columns.

dply-0.1.3-x86_64-macos-latest.zip(10.34 MB)
dply-0.1.3-x86_64-ubuntu-latest.zip(13.41 MB)
dply-0.1.3-x86_64-windows-latest.zip(9.36 MB)

v0.1.2 (May 9, 2023)

Add support for quoting column names #8
dply-0.1.2-x86_64-macos-latest.zip(9.52 MB)
dply-0.1.2-x86_64-ubuntu-latest.zip(12.13 MB)
dply-0.1.2-x86_64-windows-latest.zip(8.84 MB)

v0.1.0 (May 8, 2023)

v0.1.0
dply-0.1.0-x86_64-macos-latest.zip(9.51 MB)
dply-0.1.0-x86_64-ubuntu-latest.zip(12.11 MB)
dply-0.1.0-x86_64-windows-latest.zip(8.83 MB)

