elfshaker is a low-footprint, high-performance version control system fine-tuned for binaries.

Last update: Dec 24, 2022

Overview

`elfshaker`

400 GiB -> 100 MiB, with 1s access time†; when applied to clang builds.

elfshaker is a CLI tool written in the Rust programming language.
It stores snapshots of directories into highly-compressed pack files and provides fast on-demand access to the stored files. It is particularly good for storing lots of similar files, for example object files from an incremental build.
It allows few-second access to any commit of clang with the manyclangs project. For example, this accelerates bisection of LLVM by a factor of 60! This is done by extracting builds of LLVM on demand from locally stored elfshaker packs, each of which contains ~1,800 builds and is about 100 MiB in size, even though the full originals would take TiBs to store. Extracting a single build takes 2-4s on modern hardware.

†Applicability

Or, "how on earth do you get such a phenomenal result?".

It works particularly well for our presented use case because storing pre-link object files has these properties:

There are many files,
Most of them don't change very often so there are a lot of duplicate files,
When they do change, the deltas of the binaries are not huge.

We achieve this in manyclangs by compiling object code with the -ffunction-sections and -fdata-sections compiler flags. This has the effect that if you 'insert' a function into a translation unit, the insertion does not cause all of the addresses to change across the whole object file.

If you looked at the binary delta on the linked executable from such a change, it would be large, because all of the absolute addresses after the insertion change, and so do all references to those addresses. Compression algorithms do not handle these address changes well, which results in a poor compression ratio. The effect is large: if you compress many revisions of clang executables together, you will see a compression ratio of something like 20%. This is pretty good! But elfshaker achieves a ratio closer to 0.01% (or 10,000x), amortized across many builds.
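To make the figures above easier to compare, here is a small arithmetic sketch. The helper names and constants are hypothetical and not part of elfshaker. It converts a pair of sizes into a "times smaller" factor and a percentage; applied to the headline figure (400 GiB stored in 100 MiB) it gives a factor of 4,096, the same order of magnitude as the amortized 10,000x quoted above.

```rust
/// Binary size units, in bytes.
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// How many times smaller the stored form is than the original.
fn compression_factor(original_bytes: u64, stored_bytes: u64) -> f64 {
    original_bytes as f64 / stored_bytes as f64
}

/// Stored size expressed as a percentage of the original size.
fn stored_percent(original_bytes: u64, stored_bytes: u64) -> f64 {
    100.0 * stored_bytes as f64 / original_bytes as f64
}
```

By the same arithmetic, a ratio of 20% corresponds to a factor of 5, and 0.01% to a factor of 10,000.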

Installation guide

Usage guide

Quickstart

Consult our installation and usage guides to make sure you know what you're doing.
elfshaker store -- capture the state of the current working directory into a named snapshot.
elfshaker pack -- capture all 'loose' snapshots into a single pack file (this is what gets you the compression win).
elfshaker extract -- restore the state of a previous snapshot into the current working directory.

For more detail, take a look at our workflow documentation.

System Compatibility

The following platforms are used for our CI tests:

Ubuntu 20.04 LTS

But we aim to support all popular Linux platforms, macOS and Windows in production.

We officially support the following architectures:

AArch64
x86-64

Current Status

The file format and directory structure are stable. We intend that pack files created with the current elfshaker version will remain compatible with future versions. Please kick the tyres and do your own validation, and file bugs if you find any. We have done a fair amount of validation for our use cases but there may be things we haven't needed yet, so please start a discussion and file issues/patches.

Contributing

Contributions are highly appreciated. Refer to our Contributing guide.

Contact

The best way to reach us is to join the elfshaker/community on Gitter. The original authors of elfshaker are Peter Waller (@peterwaller-arm) and Veselin Karaganev (@veselink1), and you may also contact us via email.

Security

Refer to our Security policy.

License

elfshaker is licensed under the Apache License 2.0.

Comments

Missing/unclear docs in contrib/manyclangs
[x] Maybe put something about how to build and install compdb2line

[x] Missing requirement on git-list-between - ../elfshaker/contrib/manyclangs-build-month: line 26: git-list-between: command not found

[x] pv command not found

[x] needs git that's new enough (the one with 18.04 wasn't new enough)

[x] needs a ninja-jobserver version

[ ] document where the output goes, and what architecture it builds for

[ ] time command not found

(these are being found on a fresh docker 18.04 image)
opened by mattgodbolt 20
`elfshaker list` output changes proposal
At the moment 'elfshaker list' shows packs by default. This is inconvenient as the output is not something you can feed into 'elfshaker extract' and other tools.

@veselink1 and I discussed this briefly and here are some considerations:

The output should be sorted.

The sort incurs a delay/cost, but we think it would be worse to have something which 'often appears to be sorted but is not', which would be the alternative. I might revisit this decision if the sorting turned out to be expensive, because we have to read in the full set of snapshots across all of the packs.

The output should be in canonical PACK:SNAPSHOT form.

Should we elide [PACK:] where possible? Probably not because then we might be back to the 'looks like sorted order problem'. Unless we think it's clear enough to have the snapshot name as the sort key.

Only the fully-qualified snapshot name should be shown by default.

Extra information about snapshots (files? size?) is convenient to have but should not come at the expense of breaking machine readability by default, so it should probably be available via a flag. --format?
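As an illustration of the considerations above, here is a minimal sketch of a sorted, machine-readable listing. The SnapshotRef type and render_list function are hypothetical names invented for this example and are not elfshaker's API. Each line is the canonical PACK:SNAPSHOT string with no header, sorted by pack and then by snapshot name.

```rust
/// A snapshot identified by the pack that holds it and its own name.
/// The derived ordering compares `pack` first, then `snapshot`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct SnapshotRef {
    pack: String,
    snapshot: String,
}

/// Render one PACK:SNAPSHOT entry per line, always in sorted order.
fn render_list(mut snapshots: Vec<SnapshotRef>) -> String {
    snapshots.sort();
    snapshots
        .iter()
        .map(|s| format!("{}:{}", s.pack, s.snapshot))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because every line is already in the form that extract accepts, the output could be piped into other tools without post-processing.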

UI inputs

elfshaker list shows all snapshots.

elfshaker list <pack> shows snapshots within a pack.

elfshaker list <snapshot> could show files within a snapshot.

UI outputs

Machine readability applies to all of the above, they should share behaviours when it comes to --format considerations.

I think possibly we should just always drop the header, and document what it is. I think it's also somewhat discoverable in the sense that it is 'somewhat obvious' what it is. Having the header print to stderr is a neat idea but nonstandard and more often than not I end up piping it to /dev/null. I'm flexible on this point.

Unique snapshots

What to do if a snapshot name is not unique? A common way this arises right now is if you have both a loose snapshot and a snapshot living within a pack. One observation I have here is that the snapshot is unique in the presence of duplicates if the hashes of the contents of the extract would be the same. I suspect this is not too expensive to compute/maintain. One possibility is to memoize it as another field to the .pack.idx file and compute it on demand if the information is required but not present.
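The idea of treating duplicate names as one snapshot when their contents match can be sketched as follows. This is an assumption-laden illustration: the map from file path to object checksum, the function names, and the use of the standard library's DefaultHasher are all choices made for this example. A real implementation would more likely use a cryptographic hash and could memoize the result in the .pack.idx file as suggested above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of a snapshot, computed from its (path, object checksum) pairs.
/// A BTreeMap iterates in sorted key order, so insertion order does not matter.
fn snapshot_fingerprint(files: &BTreeMap<String, String>) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (path, checksum) in files {
        path.hash(&mut hasher);
        checksum.hash(&mut hasher);
    }
    hasher.finish()
}

/// Two snapshots with the same name can be treated as the same snapshot
/// when extracting either one would produce the same files.
fn same_contents(a: &BTreeMap<String, String>, b: &BTreeMap<String, String>) -> bool {
    snapshot_fingerprint(a) == snapshot_fingerprint(b)
}
```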

Documentation

We'll need to update docs, probably.

Another thing to consider is that we're talking about changing the interface here, which could be confusing to users not prepared for it. I don't think it's a big deal at this stage in the project, and the failures would be pretty obvious to users in this case, but it is not something I would do lightly in the future.
opened by peterwaller-arm 8
Clang relies on __PRETTY_FUNCTION__ and asserts

IIRC, we observed that use of __PRETTY_FUNCTION__ resulted in binaries which depended on the full path of the sources. In order to make the binaries reproducible, we have a workaround: https://github.com/elfshaker/elfshaker/blob/f69cb7bc70e7f7579669d5e0f873632d4406c453/contrib/manyclangs-cmake#L13-L14

Unfortunately the resulting clang binary is completely broken and refuses to compile anything:

https://llvm.org/doxygen/TypeName_8h_source.html

It asserts on assert(!Name.empty() && "Unable to find the template parameter!");. I suspect that __func__ doesn't contain the template type parameter information in it.

We'll need to find a workaround for this or stop overriding __PRETTY_FUNCTION__ and therefore have binaries which can only be reproduced if built at a specific absolute path.

opened by peterwaller-arm 8

Can't expand snapshot from a pack if pack name is omitted

Again, using the repository mentioned in #20, once I removed the loose objects, I can't extract a snapshot by snapshot name only:

$ cat elfshaker_data/HEAD 
loose/e55674b86a10e695cea2f45c0472402b97cc2dfb:e55674b86a10e695cea2f45c0472402b97cc2dfb
$ elfshaker extract 3f585bdaa7f6fb02753ba7b4918f065357a6b7fd 
[ERROR (main) 30.874291ms]: *FATAL*: couldn't copy /tmp/elfshaker/elfshaker_data/loose/ef/34/7decdd22a4157e9966fd86b7f0a1af5ab3a5 to /tmp/elfshaker/usr/local/lib/libssp_nonshared.a (NotFound)
$ elfshaker extract all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd 
A 	0 files
D 	0 files
M 	21 files
Extracted 'all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd'
$ cat elfshaker_data/HEAD 
all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd

@Mistuke

opened by marxin 6

elfshaker can't extract pack w/o loose objects

I've created 100 loose objects and then packed all of them into a pack named all. Later, I removed the loose objects (rm -rf elfshaker_data/loose/), and now I can't extract a revision:

$ ls
elfshaker_data
$ elfshaker extract all:ff6b2a3e70562f4250504fc10aa83eeb98e653d7
[WARN (main) 10.36703ms]: Expected file "/tmp/elfshaker/usr/local/share/man/man1/gcov-tool.1" to be present!
[ERROR (main) 10.766437ms]: *FATAL*: Some files in the repository have been removed or modified unexpectedly! You can use --force to skip this check, but this might result in DATA LOSS!

Elfshaker repo: https://splichal.eu/tmp/elfshaker-repo.tar

opened by marxin 6

`elfshaker clone`
Motivation

The motivation for adding this command is to enable automatic fetching of remote packs.

clone

Usage

elfshaker clone <url> <directory>

Example

elfshaker clone https://github.com/elfshaker/manyclangs/releases/download/v0.9.0/aarch64-ubuntu2004-manyclangs.esi manyclangs

Implementation

1. Create a directory <directory>

2. Fetch the .esi (ElfShaker Index) file (via HTTP GET)

3. Store the file in <directory>/elfshaker_data/remotes/origin.esi (creating missing directories)

4. Fetch the .pack.idx of all packs listed in packs and store in elfshaker_data/packs/main

In case any of the steps 1-3 fails, <directory> is removed before the process exits.
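The clean-up rule can be sketched like this. It is only an outline under stated assumptions: clone_into is a hypothetical function, the HTTP GET is abstracted behind a caller-supplied closure so that no particular HTTP library is implied, and fetching the .pack.idx files is left out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create `directory`, fetch the index via `fetch_index`, and store it as
/// elfshaker_data/remotes/origin.esi. If anything fails after the directory
/// was created, the directory is removed again before returning the error.
fn clone_into<F>(directory: &Path, fetch_index: F) -> io::Result<()>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    // If this fails (for example because the directory already exists),
    // we return early and never delete something we did not create.
    fs::create_dir(directory)?;

    let result = (|| -> io::Result<()> {
        let esi = fetch_index()?;
        let remotes = directory.join("elfshaker_data").join("remotes");
        fs::create_dir_all(&remotes)?;
        fs::write(remotes.join("origin.esi"), esi)
    })();

    if result.is_err() {
        // Best-effort clean-up; the original error is what gets reported.
        let _ = fs::remove_dir_all(directory);
    }
    result
}
```

Returning early when the directory cannot be created means a pre-existing directory is never removed by the clean-up path.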

update

Usage

elfshaker update

Implementation

Open elfshaker_data/remotes/*.esi

Read the property url

Fetch origin via HTTP GET (Headers: If-Modified-Since: <now> GMT)

Overwrite the .esi file with the response if Status: OK, exit if Status: Not modified, error if other

Fetch the .pack.idx listed in packs and overwrite the files in elfshaker_data/packs/origin, for all .pack.idx which are not available locally and for all .pack.idx whose checksum on disk does not match the checksum in the .esi

The above sequence of operations is carried out for all .esi files in the directory.

Any error is reported on stderr and cancels the operation for the target .esi, but not for any other indexes. The new .esi and .pack.idx are kept, the old ones are lost.
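The decision about which pack indexes to re-fetch during update can be expressed as a small pure function. The RemotePack type, its field names, and indexes_to_fetch are hypothetical and exist only for this sketch. A pack index is fetched when it is missing locally or when its local checksum differs from the one listed in the .esi.

```rust
use std::collections::HashMap;

/// One pack as listed in the remote .esi file.
struct RemotePack {
    name: String,
    index_checksum: String,
}

/// Names of packs whose .pack.idx must be fetched: those that are not
/// available locally, or whose local checksum differs from the remote one.
/// `local` maps a pack name to the checksum of the index found on disk.
fn indexes_to_fetch<'a>(
    remote: &'a [RemotePack],
    local: &HashMap<String, String>,
) -> Vec<&'a str> {
    remote
        .iter()
        .filter(|p| local.get(&p.name) != Some(&p.index_checksum))
        .map(|p| p.name.as_str())
        .collect()
}
```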

Changes to existing commands

The addition of clone changes the behaviour of existing commands.

extract

elfshaker extract [<remote>/<pack>]:<snapshot>

extract is extended to automatically fetch .pack files when those are available from a remote. <remote> and <pack> below are resolved in the usual way (by reading available .pack.idx). If a matching pack cannot be found, the process exits with an appropriate error message.

If elfshaker_data/packs/<remote>/<pack>.pack is not found

Find <pack>.pack in the list of packs in elfshaker_data/remotes/<remote>.esi

Fetch <pack>.pack, verify its checksum, and store to elfshaker_data/packs/<remote>/<pack>.pack

Extract <pack>:<snapshot> with the usual semantics

Otherwise

Proceed with the usual semantics of extract (whether success or error).
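The two branches above amount to a check on whether the pack file is already on disk. A minimal sketch follows, assuming packs live under elfshaker_data/packs/<remote>/ as described in the fetch step. ExtractPlan and plan_extract are hypothetical names, and the existence check is passed in as a closure so the logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// What extract needs to do before the usual extraction can run.
#[derive(Debug, PartialEq)]
enum ExtractPlan {
    /// The pack is already present at this path.
    UseLocal(PathBuf),
    /// The pack must be fetched and verified, then stored at this path.
    FetchThenExtract(PathBuf),
}

/// Decide whether the pack has to be fetched from the remote first.
fn plan_extract(
    data_dir: &Path,
    remote: &str,
    pack: &str,
    exists: impl Fn(&Path) -> bool,
) -> ExtractPlan {
    let pack_path = data_dir
        .join("packs")
        .join(remote)
        .join(format!("{}.pack", pack));
    if exists(&pack_path) {
        ExtractPlan::UseLocal(pack_path)
    } else {
        ExtractPlan::FetchThenExtract(pack_path)
    }
}
```

In real use the closure would typically be `|p| p.exists()`.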

Incompatibilities

Since we are using elfshaker_data/packs/<remote> to store the packs, users should not create a directory with the same name to store packs.

.esi file format

The elfshaker index format is a plain text file. Values are tab-separated.

It starts with the line meta v1. The second line starts with url followed by the URL of the .esi file on the hosting server, which is used to refresh the .esi during elfshaker update. The following lines are tab-separated pack checksum, pack index checksum and URL (relative to url or absolute) from which to fetch the pack file. Pack indexes must be obtainable by appending .idx to the strings in packs.

meta	v1
url	https://github.com/elfshaker/manyclangs/releases/download/v0.9.0/aarch64-ubuntu2004-manyclangs.esi
039c501ac8dfcac91c6f05601cee876e1cc07e17	91768d65e5095a85472378f6dece7c5fe2524e90	aarch64-ubuntu2004-manyclangs-202102.pack
cfd7585fe30db8a6690cb4425b94fbaeaeceb483	7871d5a9eb7d92cf5825dff75127b7d8ebf15dd7	aarch64-ubuntu2004-manyclangs-202103.pack
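A reader for this format could look roughly like the sketch below. It assumes, following the description above, that every line (including the meta and url lines) uses a single tab between fields. The EsiIndex and PackEntry types and the parse_esi function are hypothetical names chosen for this example, not elfshaker's actual types.

```rust
/// One pack line: pack checksum, pack index checksum, and pack location.
#[derive(Debug, PartialEq)]
struct PackEntry {
    pack_checksum: String,
    index_checksum: String,
    location: String,
}

/// Parsed contents of a .esi file.
#[derive(Debug, PartialEq)]
struct EsiIndex {
    url: String,
    packs: Vec<PackEntry>,
}

fn parse_esi(text: &str) -> Result<EsiIndex, String> {
    let mut lines = text.lines();

    // Line 1 must be exactly "meta<TAB>v1".
    match lines.next() {
        Some("meta\tv1") => {}
        other => return Err(format!("expected 'meta<TAB>v1', got {:?}", other)),
    }

    // Line 2 is "url<TAB><address of the .esi file>".
    let url = lines
        .next()
        .and_then(|line| line.strip_prefix("url\t"))
        .ok_or_else(|| "expected 'url<TAB><address>' on line 2".to_string())?
        .to_string();

    // Remaining lines: three tab-separated fields each; blank lines are skipped.
    let mut packs = Vec::new();
    for (offset, line) in lines.enumerate() {
        if line.is_empty() {
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() != 3 {
            return Err(format!("line {}: expected 3 tab-separated fields", offset + 3));
        }
        packs.push(PackEntry {
            pack_checksum: fields[0].to_string(),
            index_checksum: fields[1].to_string(),
            location: fields[2].to_string(),
        });
    }
    Ok(EsiIndex { url, packs })
}
```

The pack index location is not stored separately; per the description it would be obtained by appending .idx to the pack location.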

Future work

The design allows for multiple remotes to be supported in the future, by having multiple .esi files and corresponding sub-directories under elfshaker_data/packs/. This makes the likelihood of a name clash between the names of the remotes and user-created directories in elfshaker_data/packs/ greater, but since those are user-defined identifiers, the expectation is that users would be able to resolve these clashes manually, by naming remotes accordingly.

The operations above are defined in terms of operation on files in elfshaker_data/remotes and should work the same regardless of the number of remotes added. (update updates all remotes, extract looks up all .pack.idx)
opened by veselink1 5
Can we consider storing the test pack another way?

1eae76179128ab1fd7cf6961765c838201634e2f added a test pack.

The function of the test pack is to ensure that future updates don't break our ability to read old files, and that's great.

I don't really like having a ~3MiB binary in the repository though. From experience I have always ended up regretting putting binaries of any significant size in a git repository, even if we don't currently expect it to change in the future.

Could we attach it in the releases and wget it, instead?

opened by peterwaller-arm 5
Instructions on building `compdb2line`

When running manyclangs-build-month it says "Note that compdb2line is part of this project and needs installing." - I can't find any reference on how to do this.

There are some .go files in contrib/compdb2line. I installed elfshaker from source with cargo install --path . There's no other reference to compdb2line in the repo that I can find. How might I build and install this tool?

opened by mattgodbolt 4
Decouple loose indexes
~Depends on #29, so beware full diff contains that diff.~

This builds on #29 (remove repository index).

It removes the notion of a PackId::Loose. I intend to go a bit further here and get rid of PackId as an enum but that can be done later in a no-functional-changes PR.

It removes --pack|-P from the command line interface, simplifying the user interface a bit.

There is a new API, repo.find_snapshot(maybe_canonical_snapshot) -> SnapshotId. This API returns an error if the given string does not uniquely identify a snapshot within a single index.

Users can specify a pack with the syntax pack:snapshot, or just snapshot, which feeds through this API.

Use of : as a packid:snapshot_name separator enables / to be used for namespacing packs by using directories on the filesystem, and for / to appear in snapshot in the future too, if we wanted.
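The lookup behaviour described above can be sketched as follows. This is not the PR's implementation: the signature, the in-memory map from pack name to snapshot names, and the choice to split on the first ':' are assumptions made for illustration. A bare snapshot name must match exactly one pack, otherwise an error is returned.

```rust
use std::collections::BTreeMap;

/// Resolve "pack:snapshot" or a bare "snapshot" to a (pack, snapshot) pair.
/// `packs` maps each pack name (which may contain '/') to its snapshot names.
fn find_snapshot(
    packs: &BTreeMap<String, Vec<String>>,
    query: &str,
) -> Result<(String, String), String> {
    // Fully qualified form: look in the named pack only.
    if let Some((pack, snapshot)) = query.split_once(':') {
        return match packs.get(pack) {
            Some(names) if names.iter().any(|n| n == snapshot) => {
                Ok((pack.to_string(), snapshot.to_string()))
            }
            _ => Err(format!("snapshot '{}' not found", query)),
        };
    }

    // Bare snapshot name: it must be unique across all packs.
    let matches: Vec<&String> = packs
        .iter()
        .filter(|(_, names)| names.iter().any(|n| n == query))
        .map(|(pack, _)| pack)
        .collect();
    match matches.as_slice() {
        [pack] => Ok((pack.to_string(), query.to_string())),
        [] => Err(format!("snapshot '{}' not found", query)),
        _ => Err(format!(
            "snapshot '{}' is ambiguous; qualify it as PACK:SNAPSHOT",
            query
        )),
    }
}
```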

Store is reimplemented. It now no longer has to do work which scales with the number of snapshots which makes it quite a bit cheaper (it was taking 10s of seconds per snapshot when you have lots of snapshots, but now it's a flat 100ms).

Stores are now independent so we can execute multiple of them in parallel sharing the object store safely.

Stores save the snapshot to an index called packs/loose/<snapshot_name>.idx. So a loose snapshot is uniquely identified with loose/snapshot_name:snapshot_name, for the time being, if snapshot_name exists in other packs, or just with the string snapshot_name otherwise.

Packing is generalized to take multiple indexes as input. If the indexes are not specified, all loose indexes are taken.

[x] TODO: Check that I remembered to sort indexes by name.

Including an existing pack during pack creation is not yet supported, but I have a route in mind to implement this: #35.

#34 For now loose packs are identified as index files lacking a pack. I intend to introduce a bit to the beginning of an index file identifying it as loose, so we can have 'indices' without having the packs.

After this PR, loose indices are not removed during pack creation, nor are loose objects cleaned. #17 is therefore a little more important. And probably we want to delete the loose indices by default.

There is one slightly confusing nomenclature issue with this PR, in that we refer to loose packs as packs, when they aren't yet packed. So the loose snapshots live in a 'loose/name.pack.idx'.

Resolves #15.
opened by peterwaller-arm 4
Remove repository index

~Draft.~

We plan to remove the RepositoryIndex for now and make it fast enough to consult the packs directly.

This removes the notion of update-index from the UI, and gives us a single source of truth: the pack indices themselves.

~Depends: #27, #28 (only look at last commit).~

opened by peterwaller-arm 4
Poor 'file not found' errors

Unfortunately when a File::open fails due to a missing file, it does not report an informative error about which file is missing.

I think we should wrap all File::open calls with something that decorates such errors with the filename, so that the user knows what the problem is. I've hit this a couple of times and haven't been able to work out what is wrong without editing the code.
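One possible shape for such a wrapper is sketched below. The name open_file is hypothetical. It keeps the original error kind, so callers can still match on NotFound, while adding the path to the message.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Like File::open, but the returned error names the file that failed to open.
fn open_file(path: &Path) -> io::Result<File> {
    File::open(path).map_err(|e| {
        io::Error::new(
            e.kind(),
            format!("couldn't open {}: {}", path.display(), e),
        )
    })
}
```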
good first issue help wanted

opened by peterwaller-arm 4
WIP: PackIndex version 2
This change adds the new version of the pack index:

file paths are stored as Unicode strings (platform independent)

file mode is saved in the object metadata (defaults to rw-r--r--)

snapshot checksums are stored

Closes #95, #93.
opened by veselink1 0
Refactor: Version the PackIndex structure

Separated the PackIndex definition from its implementation into a PackIndex trait and PackIndexV1 struct.

Also added a VerPackIndex (versioned pack index) enum which will allow us to evolve the pack index format in the future, by handling the differences between the versions of the PackIndex in the PackIndex impl of the VerPackIndex enum.

The intention is to make it easier to evolve the pack index.
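The pattern described here can be illustrated with a small sketch. The names PackIndex, PackIndexV1 and VerPackIndex come from the description above; the snapshot_names method and the struct's field are invented for this example and are not the real definitions.

```rust
/// Operations every pack index version must support (hypothetical method).
trait PackIndex {
    fn snapshot_names(&self) -> Vec<String>;
}

/// Version 1 of the on-disk index (field invented for illustration).
struct PackIndexV1 {
    snapshots: Vec<String>,
}

impl PackIndex for PackIndexV1 {
    fn snapshot_names(&self) -> Vec<String> {
        self.snapshots.clone()
    }
}

/// A pack index of any supported version.
enum VerPackIndex {
    V1(PackIndexV1),
    // A future V2(PackIndexV2) variant would be added here.
}

impl PackIndex for VerPackIndex {
    fn snapshot_names(&self) -> Vec<String> {
        // Differences between versions are handled in one place.
        match self {
            VerPackIndex::V1(index) => index.snapshot_names(),
        }
    }
}
```

Code that only needs the trait can take a VerPackIndex without caring which version was read from disk.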

opened by veselink1 3
`elfshaker gc`
[x] snapshot deduplication

[x] object graph tracing

[x] fs-based lock

[x] performance improvements (compute_snapshot_checksums is around 96% CPU time)

[x] API design

[x] Documentation

Closes #17
opened by veselink1 5

list failed in linux ( Deserialization failed)

I have a pack that was created on Windows, and I wanted to check its contents on my Ubuntu 20.04 machine. If I run list, I can see my pack:

PACK                                                                       SNAPSHOTS SIZE     
loose/ba54343841542c1964b6b4064dc19982861fb310                             1         -        
windowspack                                                                1         0.057MiB

But if I want to list all of its snapshots, I run:

./elfshaker list windowspack

But I get this error:

[ERROR (main) 5.469347ms]: *FATAL*: Deserialization failed, corrupt pack index: cannot deserialize Windows OS string on Unix

Any ideas on how to use a Windows pack on Linux?

enhancement help wanted

opened by Lambourl 3

Workaround git log --since missing commits

Git's walking mechanism stops as soon as it sees a commit which is older than the date specified by --since.

This is problematic in LLVM's commit history in August 2022, because there is a commit in the middle of it with a CommitDate in 2021.

Work around this by considering a large number of commits and taking the oldest one matching the target date. I'll do the August pack build and confirm that everything is OK before merging this.
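The difference between the two strategies can be modelled with a short sketch. It does not reproduce the actual script: the Commit type and both function names are hypothetical, and the commit list stands in for the output of a history walk, newest first. The first function stops at the first commit older than the cutoff, which is the behaviour described for --since. The second scans the whole window and takes the furthest-back commit that still matches.

```rust
/// A commit as seen during a history walk (times are Unix seconds).
struct Commit {
    hash: String,
    commit_time: i64,
}

/// Models the problematic behaviour: stop walking at the first commit
/// older than `since`, and report the last commit seen before stopping.
fn oldest_with_early_stop(commits: &[Commit], since: i64) -> Option<&Commit> {
    commits.iter().take_while(|c| c.commit_time >= since).last()
}

/// Workaround: consider every commit in the window and take the
/// furthest-back one whose commit time is still within range.
fn oldest_in_window(commits: &[Commit], since: i64) -> Option<&Commit> {
    commits.iter().rev().find(|c| c.commit_time >= since)
}
```

With a misdated commit in the middle of the range, the first function returns a commit that is too recent, while the second still reaches the true start of the range.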

Fix #89.

Signed-off-by: Peter Waller [email protected]

opened by peterwaller-arm 1

Releases(v0.9.0)

v0.9.0(Nov 18, 2021)
The first release

elfshaker is a storage system inspired by git for binary files. As an example of its use, it can store ~2,000 builds of clang in a single 100MiB file and give random access to any of those commits with a few seconds of CPU time.

Accessing the same thing would ordinarily take significant time to build, and storing 2,000 builds would usually cost hundreds of megabytes each, so the effective compression ratio is in excess of 1,000x.

New Contributors

@JohannesEbke made their first contribution in https://github.com/elfshaker/elfshaker/pull/52

Full Changelog: https://github.com/elfshaker/elfshaker/commits/v0.9.0
Source code(tar.gz)
Source code(zip)
elfshaker_v0.9.0_aarch64-unknown-linux-musl.tar.gz(2.01 MB)
elfshaker_v0.9.0_aarch64-unknown-linux-musl.tar.gz.sha256sum(117 bytes)
elfshaker_v0.9.0_x86_64-unknown-linux-musl.tar.gz(2.00 MB)
elfshaker_v0.9.0_x86_64-unknown-linux-musl.tar.gz.sha256sum(116 bytes)

