A fast and robust MLOps tool for managing data and pipelines

Overview

xvc

A Fast and Robust MLOps Swiss-Army Knife in Rust

When to use xvc?

  • Machine Learning Engineers: When you manage large quantities of unstructured data, like images, documents, audio files. When you create data pipelines on top of this data and want to run these pipelines when the data, code or other dependencies change.
  • Data Engineers: When you want to version data files, and want to track versions across datasets. When you have to provide this data in multiple remote locations, like S3 or local files.
  • Data Scientists: When you want to track which subset of the data you're working with, and how your operations change it.
  • Software Engineers: When you have binary artifacts that you use as dependencies and would like to have a make alternative that considers content changes rather than timestamps.
  • Everyone: When you have photo, audio, media, or document files to back up on Git, but don't want to copy that huge data to all Git clones. When you want to run a command when any of these files change.

✳️ What is xvc for?

  • (for x = files) Track large files on Git, store them on the cloud, retrieve when necessary, label and query for subsets
  • (for x = pipelines) Define and run data -> model pipelines whose dependencies may be files, hyperparameters, regex searches, arbitrary URLs and more.
  • (for x = experiments) Run isolated experiments, share them and store them in Git when necessary (TODO)
  • (for x = data) Annotate data with arbitrary JSON and run queries and retrieve subsets of it. (TODO)
  • (for x = models) Associate models with datasets, metadata and features, then track, store, and deploy them (TODO)

🔽 Installation

You can get the binary files for Linux, macOS and Windows from the releases page. Extract the archive and copy the binary to a directory in your $PATH.

Alternatively, if you have Rust installed, you can build xvc:

$ cargo install xvc

🏃🏾 Quick Start

Xvc tracks your files and directories on top of Git. To start, run the following command in a Git repository.

$ xvc init

It initializes the metafiles in the .xvc/ directory and adds a .xvcignore file in case you want to hide certain elements from Xvc.

Add your data files and directories for tracking.

$ xvc file track my-data/
$ git add .xvc
$ git commit -m "Began to track my-data/ with Xvc"
$ git push

The xvc file track command calculates content hashes of the data (with BLAKE3, by default) and records them. It also copies the files to content-addressed directories under .xvc/b3.
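
The idea behind content-addressed storage can be sketched in a few lines of Python. This is an illustration only, not Xvc's implementation: it uses BLAKE2b from the standard library as a stand-in for BLAKE3 (which needs a third-party package), and the directory layout and the names content_digest, store_in_cache and cache_root are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's content, read in chunks."""
    hasher = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def store_in_cache(path: Path, cache_root: Path) -> Path:
    """Copy a file into a directory named after its content digest.

    The layout used here (first 3 hex characters, then the rest) is
    hypothetical and chosen only to keep directories small.
    """
    digest = content_digest(path)
    target_dir = cache_root / digest[:3] / digest[3:]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"0{path.suffix}"
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return target
```

Because the storage path depends only on the content, two files with the same bytes map to the same cache entry, and a changed file gets a new entry without overwriting the old one.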

Define a file storage to share the files you added.

$ xvc storage new s3 --name my-remote --region us-east-1 --bucket-name my-xvc-remote

You can push the files you added to this remote.

$ xvc file push --to my-remote

You can now delete the files.

$ rm -r my-data/

When you want to access this data later, you can clone the repository and get back the files from file storage.

$ xvc file pull my-data/

If you have commands that depend on data or code elements, Xvc lets you define steps in its default pipeline.

$ xvc pipeline step new --name my-data-update --command 'python3 preprocess.py'
$ xvc pipeline step dependency --step my-data-update --files my-data/ \
                                                     --files preprocess.py \
                                                     --regex 'names.txt:/^Name:' \
                                                     --lines a-long-file.csv::-1000
$ xvc pipeline step output --step-name my-data-update --output-file preprocessed-data.npz

The above commands define a new step in the default pipeline that depends on files in my-data/ directory, and preprocess.py; lines that start with Name: in names.txt; and the first 1000 lines in a-long-file.csv. When any of these change, or the output is missing, the step command (python3 preprocess.py) will run.

$ xvc pipeline run

If none of the dependencies have changed and the output is available, the above command does nothing.
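
The rule described above (run the step when a dependency's content changes or the output is missing) can be sketched roughly as follows. This is a simplified illustration, not Xvc's code: the function names, the dictionaries that hold current and previous digests, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import itertools
import re
from pathlib import Path


def regex_digest(path: Path, pattern: str) -> str:
    """Digest only the lines that match the pattern."""
    rx = re.compile(pattern)
    hasher = hashlib.sha256()
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if rx.search(line):
                hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()


def lines_digest(path: Path, start: int, end: int) -> str:
    """Digest only lines[start:end], e.g. the first 1000 lines."""
    hasher = hashlib.sha256()
    with path.open("r", encoding="utf-8") as f:
        for line in itertools.islice(f, start, end):
            hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()


def needs_run(current: dict, previous: dict, outputs: list) -> bool:
    """Decide whether a step should run.

    current and previous map a dependency label to its digest;
    outputs is a list of Path objects the step is expected to produce.
    """
    if any(not out.exists() for out in outputs):
        return True  # a missing output always triggers the step
    return current != previous  # any changed digest triggers the step
```

Hashing only the matching lines or the selected line range means that edits elsewhere in the file leave the digest untouched, so the step is not rerun for changes it does not depend on.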

You can define fairly complex dependencies with globs, files, directories, regular expression searches in files, lines in files, other steps and pipelines with xvc pipeline step dependency commands. More dependency types like database queries, content from URLs, S3 (and compatible) buckets, REST and GraphQL results are in my mental backlog.

Please check xvc.netlify.app for documentation.

🤟 Big Thanks

xvc stands on the following (giant) crates:

  • serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.
  • Xvc processes files in parallel with pipelines thanks to crossbeam.
  • Xvc uses rayon to calculate content hashes of millions of files in parallel.
  • Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.
  • Xvc has a deep CLI that has subcommands of subcommands like xvc storage new s3, and all these work with minimum bugs thanks to clap.
  • Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.
  • Without implementations of BLAKE3, BLAKE2, SHA-2 and SHA-3 from Rust crypto crate, Xvc couldn't detect file changes so fast.
  • Many thanks to small and well built crates, reflink, relative-path, path-absolutize, glob and wax for file system and glob handling.
  • Thanks to sad_machine for providing a State Machine implementation that I used in xvc pipeline run. A DAG composed of State Machines made it possible to run pipeline steps in parallel with a clean separation of process states.
  • Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for humanity when handling errors.
  • Xvc is split into many crates and owes this organization to cargo workspaces.

And the biggest thanks go to the Rust designers, developers and contributors. Although I don't consider myself expert enough to appreciate it all, it's a fabulous language and environment to work with.

🚁 Support

👐 Contributing

  • Star this repo. I feel very happy for five minutes for every star and send my best wishes to you.
  • Really use xvc, tell me how it works for you, read the documentation, report bugs, dream about features. The greatest contribution might be this now.
  • Write a new test with your workflow to increase test coverage. The existing tests are in the workflow_tests crate.
  • Be my guest when you visit Bursa. I usually don't have time to meet with every guest in person but if you let me know you are coming, I'd like to arrange something. Also, when you visit Galata tower in İstanbul, which is close to where I live, you can buy me a coffee.

⚠️ Disclaimer

This software is fresh and ambitious. Although I use it and test it under close to real-world conditions, it hasn't stood the test of time yet. Xvc can eat your files and spit them into the eternal void!

Releases(v0.4.2-alpha.8)
Owner
Emre Sahin
Polyglot Software and Machine Learning Engineer, Documentor, Writer, DevOps, MLOps.