Peakrs Dataframe

Peakrs Dataframe is a library and framework that facilitates the extraction, transformation, and loading (ETL) of data. Its first application is sampling a CSV file:

import peakrs as pr

df = pr.get_csv_sample(file_path, 1000)

Here 1,000 is the number of sample rows you want to get. The file can be split into 1,000 or more partitions, and the first row of each partition is extracted and validated. In many cases the whole process runs almost instantly, even if the file exceeds 10GB or contains billions of rows.
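The partition sampling described above can be sketched in plain Python. This is only an illustration of the idea, not the Peakrs implementation (which is written in Rust); the function name `sample_partition_rows` and the byte-offset strategy are assumptions made for this sketch.

```python
import os


def sample_partition_rows(path, partitions):
    """Return the first complete row found in each byte-range partition.

    Hypothetical helper: it seeks to evenly spaced byte offsets instead of
    scanning the file, so the cost depends on the number of partitions and
    not on the file size.
    """
    size = os.path.getsize(path)
    step = max(size // partitions, 1)
    rows = []
    with open(path, "rb") as f:
        for start in range(0, size, step):
            if start > 0:
                # Step back one byte and skip to the next newline, so the
                # read position is at the beginning of a complete row.
                f.seek(start - 1)
                f.readline()
            if f.tell() >= start + step:
                # The next row starts in a later partition; skip this one
                # so that no row is sampled twice.
                continue
            line = f.readline()
            if line.strip():
                rows.append(line.rstrip(b"\r\n").decode("utf-8", errors="replace"))
    return rows
```

Because only one short read is made per partition, sampling 1,000 rows takes about the same time on a 10GB file as on a 10MB file.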

It can verify whether a file is a comma-separated values (CSV) file and detect its delimiter when that delimiter is not a comma. If the file passes validation, you can instantly preview it, even if it has a billion rows.
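As a rough illustration of delimiter detection, the sketch below picks the candidate character that occurs the same non-zero number of times in every sampled row. The candidate list and the function name `detect_delimiter` are assumptions for this example, quoted fields are ignored for brevity, and Peakrs itself may use a different rule.

```python
# Hypothetical candidate set; real files may use other delimiters.
CANDIDATES = [",", ";", "\t", "|"]


def detect_delimiter(sample_rows):
    """Return the most frequent candidate that splits every row into the
    same number of columns, or None if no candidate is consistent."""
    best, best_count = None, 0
    for delimiter in CANDIDATES:
        # A valid delimiter should appear equally often in every row.
        counts = {row.count(delimiter) for row in sample_rows}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delimiter, count
    return best
```

A result of `None` means the sample does not look like a consistently delimited file. The metadata printed further down shows the delimiter as a number passed to `chr()`; for a comma that number would be `ord(",")`, which is 44.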

pr.view_csv(df)

You can also write all validated rows to a file on disk:

df = pr.write_csv(df)

You can also print the file's metadata:

print("File Size: " + format(df.file_size) + " bytes", end=" ")

print(" Total Column: ", format(df.total_column))

print("Validated Row: ", format(df.validate_row), end=" ")

print(" Estimated Row: ", format(df.estimate_row))

print("Delimiter: " + format(df.delimiter) + " [" + chr(df.delimiter) + "]")

print("Is Line Br 10/13 Exist: ", df.is_line_br_10_exist, "/", df.is_line_br_13_exist)
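The metadata reports an estimated row count alongside the validated rows. How Peakrs computes that estimate is not documented here; one plausible approach, shown below as an assumption, is to divide the file size by the average byte length of the sampled rows. The function name `estimate_row_count` is hypothetical.

```python
def estimate_row_count(file_size, sampled_rows, newline_bytes=1):
    """Estimate the total number of rows from a sample of rows.

    Assumes the sampled rows are representative of the whole file and that
    each row ends with `newline_bytes` line-break bytes (1 for LF, 2 for CRLF).
    """
    if not sampled_rows:
        return 0
    total_bytes = sum(len(row.encode("utf-8")) + newline_bytes for row in sampled_rows)
    average_bytes = total_bytes / len(sampled_rows)
    return round(file_size / average_bytes)
```

The estimate is only as good as the sample: a file whose row lengths vary widely needs more sampled rows to give a stable figure.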

Like the Peaks Consolidation project (https://github.com/hkpeaks/peaks-consolidation), Peakrs lets you configure complex, high-performance operations through a new ETL framework for data transformation. The streaming engine allocates and distributes file partitions to the query engine, which prevents your machine from running out of memory and makes ETL processes simple to set up. In addition, the streaming engine is designed to avoid generating large numbers of temporary files that could exhaust your disk space.
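The general idea of streaming partitions through a query can be sketched with the Python standard library. This is not the Peakrs engine, only a minimal illustration of why partition-by-partition processing keeps memory bounded and needs no temporary files; the names `stream_batches` and `streamed_group_sum` are invented for this sketch.

```python
import csv
from collections import defaultdict
from itertools import islice


def stream_batches(path, batch_rows):
    """Yield the file as lists of at most `batch_rows` rows, so that only
    one batch is held in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(islice(reader, batch_rows))
            if not batch:
                return
            yield batch


def streamed_group_sum(path, key, value, batch_rows=10_000):
    """Sum `value` per `key` across all batches.

    Only the running totals survive between batches, so memory use depends
    on the number of distinct keys rather than on the file size.
    """
    totals = defaultdict(float)
    for batch in stream_batches(path, batch_rows):
        for row in batch:
            totals[row[key]] += float(row[value])
    return dict(totals)
```

Aggregations such as count, sum, max and min combine naturally across batches. Operations such as sorting need more care, because their result depends on all rows at once.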

Peaks Consolidation is written in Go, while Peakrs is written in Rust with Python bindings.

Peaks Consolidation is purely an ETL framework, whereas Peakrs extends it with Python and Rust APIs so that you can also use it as a library.

Peakrs will also be extended to support real-time web applications via WebSocket.

Through its Python bindings, Peakrs can offer an effective means of supporting machine learning work that interacts with PyTorch and TensorFlow.

The "py-peakrs" folder is a Rust app with Python bindings

This app is written in Rust, with Python bindings built using PyO3.

Please refer to the instructions in the ‘run.py’ file. This file allows you to preview CSV files and their metadata instantly, even if the file size exceeds 10GB. Demo video: https://youtu.be/71GHzDnEYno

Command List

The text inside double quotes is written in the syntax of the data transformation framework.

df represents a dataframe; you can use any other name.

df = pr.add_column(df, "column, column => math(new_col_name)")

    where math includes add, subtract, multiply and divide

df = pr.build_keyvalue(df, "column, column => keyvalue_tablename")

df = pr.distinct(df, "column, column")

df = pr.filter(df, "column(compare_operator value) column(compare_operator value)")

df = pr.filter_unmatch(df, "column(compare_operator value) column(compare_operator value)")

    where compare_operator is one of >, <, >=, <=, =, != or a range, e.g. 100..200
          to compare values as integers or floats, add the float prefix, e.g. Float > number, float100..200
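To make the filter syntax concrete, here is a small plain-Python reading of expressions such as "amount(>= 150) qty(!= 5)". It is not the Peakrs parser. It assumes that multiple conditions are combined with AND, that ranges are inclusive at both ends, and that every value is numeric (the float prefix is not handled). The function names are hypothetical.

```python
import operator
import re

OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "=": operator.eq, "!=": operator.ne,
}

# column(operator value)  or  column(low..high)
CONDITION = re.compile(
    r"(\w+)\(\s*(?:(>=|<=|!=|>|<|=)\s*([^()\s]+)|([^()\s]+?)\.\.([^()\s]+))\s*\)"
)


def parse_filter(expression):
    """Turn a filter expression into (column, predicate) pairs."""
    conditions = []
    for column, op, value, low, high in CONDITION.findall(expression):
        if op:
            bound = float(value)
            conditions.append((column, lambda x, f=OPS[op], b=bound: f(x, b)))
        else:
            lo, hi = float(low), float(high)
            # Assumption: both ends of the range are included.
            conditions.append((column, lambda x, lo=lo, hi=hi: lo <= x <= hi))
    return conditions


def filter_rows(rows, expression):
    """Keep the rows that satisfy every condition (assumed AND semantics)."""
    conditions = parse_filter(expression)
    return [
        row for row in rows
        if all(predicate(float(row[column])) for column, predicate in conditions)
    ]
```

Under the same assumptions, filter_unmatch would return the complement: the rows for which at least one condition fails.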

df = pr.groupby(df, "column, column => count() sum(column) max(column) min(column)")

df = pr.join_keyvalue(df, "column, column => join_type(keyvalue_table_name)")

df = pr.jointable(df, "column, column => join_type(keyvalue_table_name)")

    where join_type is either all_match or inner
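The pairing of build_keyvalue with join_keyvalue resembles a classic hash join: build a lookup table keyed by the join columns, then probe it once per row. The sketch below shows that technique in plain Python for an inner join only. The behaviour of all_match is not described here and is therefore not modelled, and the names `build_lookup` and `inner_join` are invented for this example.

```python
def build_lookup(rows, key_columns):
    """Index rows by the values of their key columns."""
    table = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        table.setdefault(key, []).append(row)
    return table


def inner_join(rows, table, key_columns):
    """Emit one merged row for every match found in the lookup table.

    Rows without a match are dropped, which is what makes this an inner join.
    """
    joined = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        for match in table.get(key, []):
            joined.append({**match, **row})
    return joined
```

Building the lookup once and probing it per row keeps the join linear in the number of rows, provided the key-value table fits in memory.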

df = pr.orderby(df,"primary_col(sorting order) secondary_col(sorting order)")

df = pr.orderby(df, "secondary_col(sorting order) => create_folder_lake(primary_col,folder_name or file_name.csv)")

    where sorting order is A (ascending) or D (descending); to sort real numbers, use floatA or floatD
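The distinction between A/D and floatA/floatD matters because values read from a CSV file are text. The plain-Python sketch below assumes that A and D compare values as text while floatA and floatD compare them as numbers. That reading is an assumption about Peakrs, and `order_rows` is a hypothetical helper.

```python
def order_rows(rows, column, order):
    """Sort rows by one column using an order code: A, D, floatA or floatD."""
    numeric = order.startswith("float")
    descending = order.endswith("D")
    if numeric:
        key = lambda row: float(row[column])
    else:
        key = lambda row: row[column]  # text comparison, character by character
    return sorted(rows, key=key, reverse=descending)
```

With the values 10, 9 and 100, a text sort in ascending order yields 10, 100, 9, whereas a numeric sort yields 9, 10, 100.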

df = pr.read_csv(file_path or file_name.csv)

df = pr.select(df, "column, column")

df = pr.select_unmatch(df, "column, column")

df = pr.split_file(file_path or file_name.csv, number_of_split)

df = pr.create_folder_lake(df, "column, column => split_folder_name")

pr.view(df)

df = pr.write_csv(df, file_name.csv or %expand_by_100_time.csv)

Owner
Max Yu
He is researching how to accelerate dataframes to support billions of rows running in memory-limited environments.