A fast and robust MLOps tool for managing data and pipelines

Emre Sahin

Last update: Dec 15, 2022

Related tags

Command-line rust devops data-science data machine-learning data-engineering command-line-tool data-pipelines machine-learning-engineering mlops

Overview

xvc

A Fast and Robust MLOps Swiss-Army Knife in Rust

⌛ When to use xvc?

Machine Learning Engineers: When you manage large quantities of unstructured data, like images, documents, audio files. When you create data pipelines on top of this data and want to run these pipelines when the data, code or other dependencies change.
Data Engineers: When you want to version data files, and want to track versions across datasets. When you have to provide this data in multiple remote locations, like S3 or local files.
Data Scientists: When you want to track which subset of the data you're working with, and how it changes by your operations.
Software Engineers: When you have binary artifacts that you use as dependencies and would like to have a make alternative that considers content changes rather than timestamps.
Everyone: When you have photo, audio, media, or document files to back up on Git, but don't want to copy that huge data to all Git clones. When you want to run a command when any of these files change.

✳️ What is xvc for?

(for x = files) Track large files on Git, store them on the cloud, retrieve when necessary, label and query for subsets
(for x = pipelines) Define and run data -> model pipelines whose dependencies may be files, hyperparameters, regex searches, arbitrary URLs and more.
(for x = experiments) Run isolated experiments, share them and store them in Git when necessary (TODO)
(for x = data) Annotate data with arbitrary JSON and run queries and retrieve subsets of it. (TODO)
(for x = models) Associate models with datasets, metadata and features, then track, store, and deploy them (TODO)

🔽 Installation

You can get the binary files for Linux, macOS and Windows from the releases page. Extract the archive and copy the file to a directory in your $PATH.

Alternatively, if you have Rust installed, you can build xvc:

$ cargo install xvc

🏃🏾 Quick Start

Xvc tracks your files and directories on top of Git. To start, run the following command in the repository.

$ xvc init

It initializes the metafiles in the .xvc/ directory and adds a .xvcignore file in case you want to hide certain elements from Xvc.

Add your data files and directories for tracking.

$ xvc file track my-data/
$ git add .xvc
$ git commit -m "Began to track my-data/ with Xvc"
$ git push

The command calculates data content hashes (with BLAKE3, by default) and records them. It also copies files to content-addressed directories under .xvc/b3.
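The idea of content addressing can be sketched in a few lines of Python. This is only an illustration, not Xvc's implementation: BLAKE2b from the standard library stands in for BLAKE3, and the sharded directory layout and the `.xvc/b2` root are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def content_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a hex digest of the file's bytes, read in chunks."""
    # BLAKE2b (stdlib) stands in for BLAKE3, which needs a third-party package.
    hasher = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cache_path(cache_root: Path, digest: str) -> Path:
    """Map a digest to a sharded cache location (hypothetical layout)."""
    return cache_root / digest[:3] / digest[3:6] / digest[6:]
```

Two files with the same bytes get the same digest and therefore the same cache location, which is why identical content is stored only once.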

Define a file storage to share the files you added.

$ xvc storage new s3 --name my-remote --region us-east-1 --bucket-name my-xvc-remote

You can push the files you added to this remote.

$ xvc file push --to my-remote

You can now delete the files.

$ rm -r my-data/

When you want to access this data later, you can clone the repository and get back the files from file storage.

$ xvc file pull my-data/

If you have commands that depend on data or code elements, Xvc lets you define steps in its default pipeline.

$ xvc pipeline step new --name my-data-update --command 'python3 preprocess.py'
$ xvc pipeline step dependency --step my-data-update --files my-data/ \
                                                     --files preprocess.py \
                                                     --regex 'names.txt:/^Name:' \
                                                     --lines a-long-file.csv::-1000
$ xvc pipeline step output --step-name my-data-update --output-file preprocessed-data.npz

The above commands define a new step in the default pipeline that depends on files in my-data/ directory, and preprocess.py; lines that start with Name: in names.txt; and the first 1000 lines in a-long-file.csv. When any of these change, or the output is missing, the step command (python3 preprocess.py) will run.

$ xvc pipeline run

If none of the dependencies have changed and the output is available, the above command will do nothing.
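The rule described above (run when a dependency's content changes or an output is missing) can be sketched as follows. This is a simplified illustration that handles plain file dependencies only; `digest_of`, `step_needs_run` and the idea of a single recorded digest are hypothetical names for this example, not Xvc's internals.

```python
import hashlib
from pathlib import Path


def digest_of(paths):
    """Combine file names and file contents into one hex digest."""
    hasher = hashlib.blake2b(digest_size=32)
    for path in sorted(Path(p) for p in paths):
        hasher.update(str(path).encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(path.read_bytes())
    return hasher.hexdigest()


def step_needs_run(dependencies, outputs, recorded_digest):
    """Decide whether a step should run.

    recorded_digest is the digest saved after the last successful run,
    or None if the step has never run.
    """
    if any(not Path(o).exists() for o in outputs):
        return True  # a missing output always triggers the step
    return digest_of(dependencies) != recorded_digest
```

Because the decision is based on content rather than timestamps, rewriting a file with identical bytes does not trigger the step.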

You can define fairly complex dependencies with globs, files, directories, regular expression searches in files, lines in files, other steps and pipelines with xvc pipeline step dependency commands. More dependency types like database queries, content from URLs, S3 (and compatible) buckets, REST and GraphQL results are in my mental backlog.
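To see why regex and line dependencies are useful, consider a rough Python sketch in which only the relevant part of a file contributes to the digest. The function names and the zero-based, half-open line range are assumptions for this example; Xvc's own semantics may differ.

```python
import hashlib
import re
from itertools import islice


def regex_digest(path, pattern):
    """Digest only the lines that match pattern."""
    regex = re.compile(pattern)
    hasher = hashlib.blake2b(digest_size=32)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()


def lines_digest(path, start, end):
    """Digest lines in the half-open range [start, end), counting from 0."""
    hasher = hashlib.blake2b(digest_size=32)
    with open(path, encoding="utf-8") as f:
        for line in islice(f, start, end):
            hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()
```

With this approach, editing a line that does not match the pattern, or appending rows beyond the tracked range, leaves the digest unchanged, so the dependent step is not invalidated.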

Please check xvc.netlify.app for documentation.

🤟 Big Thanks

xvc stands on the following (giant) crates:

serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.
Xvc processes files in parallel with pipelines thanks to crossbeam.
Xvc uses rayon to calculate content hashes of millions of files in parallel.
Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.
Xvc has a deep CLI that has subcommands of subcommands like xvc storage new s3, and all these work with minimum bugs thanks to clap.
Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.
Without implementations of BLAKE3, BLAKE2, SHA-2 and SHA-3 from Rust crypto crate, Xvc couldn't detect file changes so fast.
Many thanks to small and well built crates, reflink, relative-path, path-absolutize, glob and wax for file system and glob handling.
Thanks to sad_machine for providing a State Machine implementation that I used in xvc pipeline run. A DAG composed of State Machines made it possible to run pipeline steps in parallel with a clean separation of process states.
Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for humanity when handling errors.
Xvc is split into many crates and owes this organization to cargo workspaces.

And, biggest thanks to Rust designers, developers and contributors. Although I don't consider myself expert enough to appreciate it all, it's a fabulous language and environment to work with.

🚁 Support

You can use Discussions to ask questions. I'll answer as much as possible. Thank you.
For consultancy and paid support, you can get in touch with me.

👐 Contributing

Star this repo. I feel very happy for five minutes for every star and send my best wishes to you.
Really use xvc, tell me how it works for you, read the documentation, report bugs, dream about features. This might be the greatest contribution right now.
Write a new test with your workflow to increase test coverage. Tests are in the workflow_tests crate.
Be my guest when you visit Bursa. I usually don't have time to meet with every guest in person but if you let me know you are coming, I'd like to arrange something. Also, when you visit Galata tower in İstanbul, which is close to where I live, you can buy me a coffee.

⚠️ Disclaimer

This software is fresh and ambitious. Although I use it and test it under close to real-world conditions, it hasn't yet stood the test of time. Xvc can eat your files and spit them into the eternal void!

Comments

xvc file send error (S3)

Hi,

I tried to send my files to S3 but got the following error:

Full backtrace:

thread '' panicked at 'called Option::unwrap() on a None value', file/src/send.rs:72:34
stack backtrace:
   0: 0x103726302 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd2e8cbde22b780fc
   1: 0x1037484fa - core::fmt::write::hd6692086cdd356a7
   2: 0x10371e6bc - std::io::Write::write_fmt::h6043124a2486acbb
   3: 0x103727c5b - std::panicking::default_hook::{{closure}}::h87a12b8b06887dd7
   4: 0x103727967 - std::panicking::default_hook::h7f68dad17e0bfaa4
   5: 0x10372827f - std::panicking::rust_panic_with_hook::hd57427cbbfc3717a
   6: 0x103728182 - std::panicking::begin_panic_handler::{{closure}}::h33aab6d04e2bba70
   7: 0x103726798 - std::sys_common::backtrace::__rust_end_short_backtrace::h0e7a76f927db9964
   8: 0x103727e8d - _rust_begin_unwind
   9: 0x10377d193 - core::panicking::panic_fmt::hcf6f3c517c6f3cb3
  10: 0x10377d077 - core::panicking::panic::h46977cf6deabee02
  11: 0x103164a83 - core::ops::function::impls::<impl core::ops::function::FnOnce for &mut F>::call_once::h3ae85f760bd22e23
  12: 0x1031cf9cb - <alloc::vec::Vec as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter::hb4f07769796b97ee
  13: 0x103169a52 - xvc_file::send::cmd_send::h71b8f1414f8bdc26
  14: 0x103184e87 - xvc_file::run::h6f27c3454c6ddc25
  15: 0x102ff6786 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h86d8253b795cae5f
  16: 0x102fd3323 - std::sys_common::backtrace::__rust_begin_short_backtrace::h07d3b68a3599eace
  17: 0x102fc7d3c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h8b20dc35e1a5fb59
  18: 0x10372e1b7 - std::sys::unix::thread::Thread::new::thread_start::haa45038b11bc331d
  19: 0x7fff203768fc - __pthread_start
thread 'main' panicked at 'called Result::unwrap() on an Err value: Any { .. }', lib/src/cli/mod.rs:273:6
stack backtrace:
   0: 0x103726302 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd2e8cbde22b780fc
   1: 0x1037484fa - core::fmt::write::hd6692086cdd356a7
   2: 0x10371e6bc - std::io::Write::write_fmt::h6043124a2486acbb
   3: 0x103727c5b - std::panicking::default_hook::{{closure}}::h87a12b8b06887dd7
   4: 0x103727967 - std::panicking::default_hook::h7f68dad17e0bfaa4
   5: 0x10372827f - std::panicking::rust_panic_with_hook::hd57427cbbfc3717a
   6: 0x1037281c3 - std::panicking::begin_panic_handler::{{closure}}::h33aab6d04e2bba70
   7: 0x103726798 - std::sys_common::backtrace::__rust_end_short_backtrace::h0e7a76f927db9964
   8: 0x103727e8d - _rust_begin_unwind
   9: 0x10377d193 - core::panicking::panic_fmt::hcf6f3c517c6f3cb3
  10: 0x10377d2f5 - core::result::unwrap_failed::ha988429942445917
  11: 0x102fbe283 - xvc::cli::dispatch::h6cb0b458cd3e5b16
  12: 0x102f75284 - xvc::main::h02643c3fbc259b82
  13: 0x102f7518b - std::sys_common::backtrace::__rust_begin_short_backtrace::h6b49ecb8fbe72698
  14: 0x102f751b8 - std::rt::lang_start::{{closure}}::h76c1ee7dd2921ad7
  15: 0x10371890e - std::rt::lang_start_internal::ha5deaf08dab8765b
  16: 0x102f752cf - _main
bug

opened by OmerTrc 9

Add storage tests to GitHub Actions

Remote tests are gated behind feature flags so that they don't run every time the test suite runs.

These tests could be run manually, either per remote or all together.

[ ] Enter all relevant credentials as secrets in GitHub Actions

[ ] Create a new CI file that tests each of the remotes separately, via a matrix

automation

opened by iesahin 5
Rename `xvc file push` to `xvc file send` and `xvc file pull` to `xvc file bring`.
Remove xvc file fetch and merge its functionality as an option in xvc file bring.

Add --checkout-as option to xvc file bring to pass to xvc file checkout.

enhancement
opened by iesahin 2
Converted all CLI structs to clap 4.
parse(from_occurrences) -> ArgAction::Count

command and subcommand #[clap(...)] -> #[command(...)]

argument #[clap(...)] -> #[arg(...)]

refactor
opened by iesahin 2
skip init when storage dir contains xvc guid file for xvc storage new local

This PR also changes the XvcStorageOperations trait so that init returns a new struct carrying the older Guid. This allows reinitializing storage in an already existing directory.
enhancement

opened by iesahin 2
Remove unused deps and upgrade used ones
Removes unused dependencies with cargo udeps

Upgrade all dependencies, including breaking serde_yaml and tokio

Adds cleanup to tests to delete the repository directory

refactor
opened by iesahin 1
new `xvc file list` options and default behavior

The default behavior for xvc file list should be to show actual info.

Add a --cached option to show the cached info.

Add a --diff option to show the differences between cached and actual info only.

opened by iesahin 0
`xvc file versions`

Show versions of a specific file. It can show other metadata like dates and size as well.

In the future, other commands like recheck, copy, move may also receive version strings to operate on specific versions.

opened by iesahin 0
Test XvcMetadata differences on Linux and Windows
Tests on GitHub Actions sometimes fail because different metadata is reported.

For example, the tests for git-branches.md fail when run for xvc file list:

---- expected: stdout ++++ actual: stdout 1 - C=[..] 1 + C< 2022-12-07 18:57:05 19 data.txt 2 2 | stderr:

This test passes on my local Mac. It looks like the file system reports these results differently on each OS.
bug
opened by iesahin 0
If store or ec didn't change, it shouldn't record any files

The entity counter records itself even if it hasn't changed. We need some kind of "dirty bit" mechanism to check whether to save to a new file.

This will prevent most of the spurious Git commits after read only commands.
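A "dirty bit" of this kind could look roughly like the following Python sketch. The class, its JSON file format and the `next_id` method are hypothetical and chosen only to illustrate the mechanism; the actual entity counter is part of the Rust code base.

```python
import json
from pathlib import Path


class EntityCounter:
    """A counter that is written to disk only when it has changed."""

    def __init__(self, value=0):
        self.value = value
        self._dirty = False  # nothing to save until the value changes

    def next_id(self):
        """Advance the counter and mark it as needing a save."""
        self.value += 1
        self._dirty = True
        return self.value

    def save(self, path):
        """Write the counter if dirty; return True when a file was written."""
        if not self._dirty:
            return False
        Path(path).write_text(json.dumps({"value": self.value}))
        self._dirty = False
        return True
```

A read-only command that never calls `next_id` leaves the flag unset, so `save` writes nothing and no new file appears for Git to pick up.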

opened by iesahin 0
`xvc experiment`
The gist of experiments is to compare them. We need a diff facility to compare inputs and results across runs.

We don't have to limit this comparison to defined outputs in the pipeline. Any file changed between two runs can be diffed.

There can be three types of diffs:

Unstructured diffs: This is for binary files that we don't recognize. Only the content digest is reported.

Structured diffs: For a file format that we can parse, we can report the individual differences across runs. JSON, YAML or any other format that we can parse for results can be reported as structured diff.

Text diffs: This is for the source code files that may have led to changes in other files.
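As a rough illustration of these three diff types, here is a Python sketch using only the standard library. The function names and return shapes are assumptions for this example, and the structured diff treats a missing key the same as a null value for brevity.

```python
import difflib
import hashlib


def unstructured_diff(old: bytes, new: bytes):
    """Report only content digests for opaque binary data."""
    def digest(data):
        return hashlib.blake2b(data, digest_size=32).hexdigest()
    return {"old": digest(old), "new": digest(new), "changed": old != new}


def structured_diff(old: dict, new: dict, prefix=""):
    """Return {dotted.key: (old_value, new_value)} for differing leaves."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            changes.update(structured_diff(a, b, prefix=path + "."))
        elif a != b:
            changes[path] = (a, b)
    return changes


def text_diff(old: str, new: str, name="file"):
    """Return unified diff lines, similar to what Git prints."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=f"a/{name}", tofile=f"b/{name}", lineterm=""))
```

All three functions return plain dictionaries or lists, so their results can be serialized to JSON and turned into tables later.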

The workflow is as follows:

User has a bunch of files, source, params, data, model, etc.

User modifies some of these manually. e.g. updating the source code.

User modifies some of these with xvc exp run --input-param command.

User runs a command (or pipeline) on the files.

Xvc clones/rechecks/copies files from the original to a .xvc/exp/KEYWORD-RANDOMSTRING-TIMESTAMP directory.

Xvc links the original cache.

Xvc creates a .xvc-exp directory to store experiment specific data.

Xvc modifies the files with the given modification option.

--input-param params.yaml params.my-param 123,124,135 creates 3 experiments, each changing params.yaml::params.my-param to a given value.

Xvc runs the given command (or pipeline) in the directory

Xvc stores the updated artifacts in the common cache, symlinking the results.

User asks for results diffed from the original.

Xvc compares each of the directories for the changed files.

Xvc shows digest strings for unstructured files.

Xvc shows changed values for structured files.

Xvc shows text file diffs similar to Git.

All results must be reported in JSON. Tables may be built from this JSON.

The second facility xvc exp provides is to modify structured files quickly for each experiment.

xvc exp run --input-param file.yaml dict.key value1,value2,value3 will parse file.yaml, update dict.key with value1 and run an experiment, update with value2 and run another, update with value3 and run another.

xvc exp run --input-param file.json dict.key '0;5;100' will run experiments with 0,5,10,15,20,...,100 (inclusive).
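The two value notations (a comma-separated list and a semicolon-separated start;step;end range) could be expanded as in the Python sketch below. The function name is hypothetical, and the use of Decimal is a design choice for this example so that fractional steps such as 0.2 do not accumulate floating-point error.

```python
from decimal import Decimal


def expand_values(spec: str) -> list[str]:
    """Expand 'a,b,c' into a list, or 'start;step;end' into an inclusive range."""
    if ";" in spec:
        start, step, end = (Decimal(part) for part in spec.split(";"))
        if step <= 0:
            raise ValueError("step must be positive")
        values = []
        current = start
        while current <= end:
            values.append(str(current))
            current += step
        return values
    return spec.split(",")
```

For example, `expand_values("0;5;100")` yields 21 values from "0" to "100", and `expand_values("value1,value2,value3")` yields the three names unchanged.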

Files to be modified are JSON, YAML1.2 and TOML files. (Anything serde can read/write is possible in theory.)
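Updating a nested key such as dict.key can be sketched in Python for the JSON case, which the standard library covers. The helper names are hypothetical; YAML and TOML would need their own parsers but could follow the same pattern.

```python
import json


def set_by_dotted_key(document: dict, dotted_key: str, value):
    """Set document[a][b][c] = value for dotted_key 'a.b.c', in place."""
    *parents, leaf = dotted_key.split(".")
    node = document
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return document


def variants(text: str, dotted_key: str, values):
    """Yield one updated JSON string per value, leaving the input untouched."""
    for value in values:
        document = json.loads(text)  # fresh parse keeps variants independent
        yield json.dumps(set_by_dotted_key(document, dotted_key, value))
```

Each yielded string would become the parameter file of one experiment, while the other keys in the file stay as they were.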

We can extend this functionality to regex. --input-regex file.txt 'my_var = (.*)' 0;0.1;1 updates $1 in regex with the values.
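A minimal Python sketch of the regex variant follows. It replaces only the text captured by group 1 in the first match; the function name and the choice to stop at the first match are assumptions for this example.

```python
import re


def replace_first_group(text: str, pattern: str, value: str) -> str:
    """Replace what capture group 1 matched, in the first match of pattern."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found")
    start, end = match.span(1)
    return text[:start] + value + text[end:]
```

Slicing by the span of the group, rather than rebuilding the line with a substitution template, keeps everything outside the group byte-for-byte identical.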

We can also use --command-template for this. xvc exp run --command-template 'python train.py ${{EXP_VALUE}}' 0;0.2;10 will run python train.py with parameters 0, 0.2, 0.4, .... in different experiments.
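Template substitution of this kind might look like the Python sketch below. The function name is hypothetical, and quoting the values with shlex is an added safety choice for this example, not something the proposal specifies.

```python
import shlex


def render_command(template: str, values: dict[str, str]) -> str:
    """Replace each ${{NAME}} placeholder with its shell-quoted value."""
    command = template
    for name, value in values.items():
        command = command.replace("${{" + name + "}}", shlex.quote(str(value)))
    return command
```

For instance, rendering 'python train.py ${{EXP_VALUE}}' with EXP_VALUE set to 0.2 gives python train.py 0.2, and a value containing spaces would be wrapped in single quotes.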

If there is more than one --input-param, --input-regex, or --command-template parameter, we build all combinations of their values (the Cartesian product). xvc exp run --input-param file.yaml dict.key 1,2,3 --input-param another.yaml another.key 5,6,7 will run 9 experiments.
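The count of 9 comes from pairing every value of the first parameter with every value of the second. A short Python sketch, with a hypothetical function name and a hypothetical file::key naming scheme for the parameters:

```python
from itertools import product


def experiment_grid(parameters: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one {parameter: value} mapping per experiment."""
    names = list(parameters)
    return [dict(zip(names, combo)) for combo in product(*parameters.values())]


grid = experiment_grid({
    "file.yaml::dict.key": ["1", "2", "3"],
    "another.yaml::another.key": ["5", "6", "7"],
})
print(len(grid))  # 3 values x 3 values
```

The number of experiments is the product of the list lengths, so adding a third parameter with four values would raise the total to 36.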

There may be three subcommands for xvc exp run.

xvc exp run pipeline --name: (xvcerp) Runs a pipeline command with the given parameters. (xvc pipeline run --name)

xvc exp run command 'cmd': Runs a generic command as experiment

xvc exp run template 'cmd ${{EXP_VALUE_1}} ${{EXP_VALUE_2}}' 1,2,3 4,5,6: Runs a command by substituting values into the command string.

--input-param and --input-regex options are available to all three of these. Maybe instead of --input-param, it's better to use --update-param and --update-regex. Maybe we can merge these, but I don't like to have corner cases.

--keyword sets the KEYWORD portion of experiment names. By default, this is exp. Users may want to set it to a searchable name.
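One way to build KEYWORD-RANDOMSTRING-TIMESTAMP names is sketched below in Python. The proposal does not fix the format of the random string or the timestamp, so the 8 hex characters and the UTC YYYYMMDDHHMMSS form used here are assumptions.

```python
import secrets
import time


def experiment_dir_name(keyword="exp", now=None):
    """Build a KEYWORD-RANDOMSTRING-TIMESTAMP directory name."""
    random_part = secrets.token_hex(4)  # 8 hex characters (assumed length)
    # now is seconds since the epoch; None means the current time
    timestamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    return f"{keyword}-{random_part}-{timestamp}"
```

A searchable keyword then lets a user list related experiments with a simple prefix match on the directory names.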

The updated params, and run commands are stored in .xvc-exp directory. It may contain the exact script that was run.
opened by iesahin 0

Releases(v0.4.2-alpha.8)

Owner

Emre Sahin

Polyglot Software and Machine Learning Engineer, Documentor, Writer, DevOps, MLOps.

GitHub



Releases(v0.4.2-alpha.8)

v0.4.2-alpha.8(Dec 9, 2022)

v0.4.2-alpha.7(Dec 9, 2022)

v0.4.2-alpha.6(Dec 9, 2022)

v0.4.2-alpha.5(Dec 9, 2022)

v0.4.2-alpha.0(Nov 27, 2022)

v0.4.1-alpha.0(Nov 26, 2022)

v0.4.0(Nov 26, 2022)

v0.3.4(Nov 20, 2022)

v0.3.3(Nov 8, 2022)

v0.3.2(Oct 26, 2022)

v0.3.1(Oct 18, 2022)
