A Fast and Robust MLOps Swiss-Army Knife in Rust
When to use xvc?
- Machine Learning Engineers: When you manage large quantities of unstructured data, like images, documents, audio files. When you create data pipelines on top of this data and want to run these pipelines when the data, code or other dependencies change.
- Data Engineers: When you want to version data files, and want to track versions across datasets. When you have to provide this data in multiple remote locations, like S3 or local files.
- Data Scientists: When you want to track which subset of the data you're working with, and how your operations change it.
- Software Engineers: When you have binary artifacts that you use as dependencies and would like to have a make alternative that considers content changes rather than timestamps.
- Everyone: When you have photo, audio, media, or document files to back up on Git, but don't want to copy that huge data to all Git clones. When you want to run a command when any of these files change.
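To make the "content changes rather than timestamps" point concrete, here is a minimal sketch in Python. It is not Xvc's code: the function names are made up for illustration, and BLAKE2b from the standard library stands in for BLAKE3, which Xvc uses by default.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_digest(path: Path) -> str:
    """Hex digest of the file's bytes (BLAKE2b as a stand-in for BLAKE3)."""
    h = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_by_timestamp(path: Path, recorded_mtime_ns: int) -> bool:
    """make-style check: any difference in modification time counts as a change."""
    return path.stat().st_mtime_ns != recorded_mtime_ns


def changed_by_content(path: Path, recorded_digest: str) -> bool:
    """Content-style check: only different bytes count as a change."""
    return content_digest(path) != recorded_digest


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "artifact.bin"  # hypothetical binary dependency
    artifact.write_bytes(b"\x00\x01\x02")
    recorded_mtime_ns = artifact.stat().st_mtime_ns
    recorded_digest = content_digest(artifact)

    # Simulate `touch`: move the timestamps one second ahead, keep the bytes.
    later = recorded_mtime_ns + 10**9
    os.utime(artifact, ns=(later, later))
    touched_by_timestamp = changed_by_timestamp(artifact, recorded_mtime_ns)
    touched_by_content = changed_by_content(artifact, recorded_digest)

    # Now really change the bytes.
    artifact.write_bytes(b"\x00\x01\x03")
    edited_by_content = changed_by_content(artifact, recorded_digest)

print(touched_by_timestamp, touched_by_content, edited_by_content)
```

A touched but otherwise identical file looks changed to the timestamp check and unchanged to the content check, so a content-based tool can skip a rebuild that a timestamp-based one would trigger.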
What is xvc for?
- (for x = files) Track large files on Git, store them on the cloud, retrieve when necessary, label and query for subsets
- (for x = pipelines) Define and run data -> model pipelines whose dependencies may be files, hyperparameters, regex searches, arbitrary URLs and more.
- (for x = experiments) Run isolated experiments, share them and store them in Git when necessary (TODO)
- (for x = data) Annotate data with arbitrary JSON and run queries and retrieve subsets of it. (TODO)
- (for x = models) Associate models with datasets, metadata and features, then track, store, and deploy them (TODO)
You can get the binary files for Linux, macOS and Windows from the releases page. Extract the archive and copy the file to a directory in your $PATH.
Alternatively, if you have Rust installed, you can build xvc:
$ cargo install xvc
🏃🏾 Quick Start
Xvc tracks your files and directories on top of Git. To start, run the following command in the repository.
$ xvc init
It initializes the metafiles in the .xvc/ directory and adds a .xvcignore file in case you want to hide certain elements from Xvc.
Add your data files and directories for tracking.
$ xvc file track my-data/
$ git add .xvc
$ git commit -m "Began to track my-data/ with Xvc"
$ git push
The command calculates data content hashes (with BLAKE3, by default) and records them. It also copies files to content-addressed directories under
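As a rough illustration of what "content-addressed" means, the sketch below stores each file at a path derived from its digest. Everything here is an assumption for the example: the cache directory name, the way the digest is split into subdirectories, and the use of BLAKE2b instead of BLAKE3. It does not describe Xvc's actual on-disk layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store_path(cache_root: Path, digest: str) -> Path:
    """Hypothetical layout: split the hex digest into nested directories."""
    return cache_root / digest[:3] / digest[3:6] / digest[6:]


def add_to_cache(src: Path, cache_root: Path) -> Path:
    """Copy src into the cache at a path determined only by its bytes."""
    digest = hashlib.blake2b(src.read_bytes(), digest_size=32).hexdigest()
    dest = store_path(cache_root, digest)
    if not dest.exists():  # identical content is stored only once
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"  # hypothetical cache directory
    (root / "a.txt").write_bytes(b"same bytes")
    (root / "b.txt").write_bytes(b"same bytes")
    (root / "c.txt").write_bytes(b"other bytes")

    path_a = add_to_cache(root / "a.txt", cache)
    path_b = add_to_cache(root / "b.txt", cache)
    path_c = add_to_cache(root / "c.txt", cache)

    deduplicated = path_a == path_b
    distinct = path_a != path_c
    stored_count = sum(1 for p in cache.rglob("*") if p.is_file())

print(deduplicated, distinct, stored_count)
```

Because the path depends only on the bytes, two files with the same content share one stored copy, and a recorded hash is enough to find the file again.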
Define a file storage to share the files you added.
$ xvc storage new s3 --name my-remote --region us-east-1 --bucket-name my-xvc-remote
You can push the files you added to this remote.
$ xvc file push --to my-remote
You can now delete the files.
$ rm -r my-data/
When you want to access this data later, you can clone the repository and get back the files from file storage.
$ xvc file pull my-data/
If you have commands that depend on data or code elements, Xvc lets you define steps in its default pipeline.
$ xvc pipeline step new --name my-data-update --command 'python3 preprocess.py'
$ xvc pipeline step dependency --step my-data-update --files my-data/ \
--files preprocess.py \
--regex 'names.txt:/^Name:' \
$ xvc pipeline step output --step-name my-data-update --output-file preprocessed-data.npz
The above commands define a new step in the default pipeline that depends on files in the my-data/ directory and preprocess.py; lines that start with Name: in names.txt; and the first 1000 lines in a-long-file.csv. When any of these change, or the output is missing, the step command (python3 preprocess.py) will run.
$ xvc pipeline run
If none of the dependencies have changed and the output is available, the above command does nothing.
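The run-or-skip decision can be sketched in a few lines of Python. This is only an illustration of the idea, not how Xvc implements it: the helper names, the snapshot dictionary, and the BLAKE2b digest are all assumptions, and the my-data/ directory dependency is left out to keep it short.

```python
import hashlib
import re
import tempfile
from pathlib import Path


def digest(data: bytes) -> str:
    return hashlib.blake2b(data, digest_size=32).hexdigest()


def file_dep(path: Path) -> str:
    """Whole-file dependency: any byte change counts."""
    return digest(path.read_bytes())


def regex_dep(path: Path, pattern: str) -> str:
    """Regex dependency: only lines matching the pattern count."""
    rx = re.compile(pattern)
    matching = [ln for ln in path.read_text().splitlines() if rx.search(ln)]
    return digest("\n".join(matching).encode())


def lines_dep(path: Path, limit: int) -> str:
    """Line dependency: only the first `limit` lines count."""
    return digest("\n".join(path.read_text().splitlines()[:limit]).encode())


def snapshot(root: Path) -> dict:
    return {
        "file:preprocess.py": file_dep(root / "preprocess.py"),
        "regex:names.txt": regex_dep(root / "names.txt", r"^Name:"),
        "lines:a-long-file.csv": lines_dep(root / "a-long-file.csv", 1000),
    }


def needs_run(recorded: dict, current: dict, output: Path) -> bool:
    """Run when any dependency digest differs or the output is missing."""
    return recorded != current or not output.exists()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "preprocess.py").write_text("print('preprocess')\n")
    (root / "names.txt").write_text("Name: Alice\nAge: 30\nName: Bob\n")
    (root / "a-long-file.csv").write_text(
        "".join(f"{i},{i * 2}\n" for i in range(1000))
    )
    output = root / "preprocessed-data.npz"
    recorded = snapshot(root)

    run_when_output_missing = needs_run(recorded, snapshot(root), output)

    output.write_bytes(b"placeholder output")
    run_when_unchanged = needs_run(recorded, snapshot(root), output)

    # Edits outside the tracked parts: a non-matching line and line 1001.
    with (root / "names.txt").open("a") as f:
        f.write("Age: 41\n")
    with (root / "a-long-file.csv").open("a") as f:
        f.write("extra,row\n")
    run_after_unrelated_edits = needs_run(recorded, snapshot(root), output)

    # An edit inside a tracked part: a new line that matches ^Name:.
    with (root / "names.txt").open("a") as f:
        f.write("Name: Carol\n")
    run_after_name_added = needs_run(recorded, snapshot(root), output)

print(run_when_output_missing, run_when_unchanged,
      run_after_unrelated_edits, run_after_name_added)
```

The point of the narrower dependency types shows up in the third check: edits that fall outside the matched lines or beyond the first 1000 lines leave the digests untouched, so the step is skipped.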
You can define fairly complex dependencies with globs, files, directories, regular expression searches in files, lines in files, other steps and pipelines with xvc pipeline step dependency commands. More dependency types like database queries, content from URLs, S3 (and compatible) buckets, REST and GraphQL results are in my mental backlog.
Please check xvc.netlify.app for documentation.
xvc stands on the following (giant) crates:
- serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.
- Xvc processes files in parallel with pipelines thanks to crossbeam.
- Xvc uses rayon to calculate content hashes of millions of files in parallel.
- Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.
- Xvc has a deep CLI that has subcommands of subcommands like xvc storage new s3, and all these work with minimum bugs thanks to clap.
- Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs the excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.
- Without the implementations of BLAKE3, BLAKE2, SHA-2 and SHA-3 from the Rust Crypto crates, Xvc couldn't detect file changes so fast.
- Many thanks to the small and well-built crates reflink, relative-path, path-absolutize, glob and wax for file system and glob handling.
- Thanks to sad_machine for providing a State Machine implementation that I used in xvc pipeline run. A DAG composed of State Machines made it possible to run pipeline steps in parallel with a clean separation of process states.
- Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for humanity when handling errors.
- Xvc is split into many crates and owes this organization to cargo workspaces.
And the biggest thanks go to Rust designers, developers and contributors. Although I don't consider myself expert enough to appreciate it all, it's a fabulous language and environment to work with.
- You can use Discussions to ask questions. I'll answer as much as possible. Thank you.
- For consultancy and paid support, you can get in touch with me.
- Star this repo. I feel very happy for five minutes for every star and send my best wishes to you.
- Really use xvc, tell me how it works for you, read the documentation, report bugs, dream about features. This might be the greatest contribution right now.
- Write a new test with your workflow to increase testing coverage. They are under
- Be my guest when you visit Bursa. I usually don't have time to meet with every guest in person, but if you let me know you are coming, I'd like to arrange something. Also, when you visit Galata Tower in İstanbul, which is close to where I live, you can buy me a coffee.
This software is fresh and ambitious. Although I use it and test it in close to real-world conditions, it hasn't stood the test of time yet. Xvc can eat your files and spit them into the eternal void!