elfshaker is a low-footprint, high-performance version control system fine-tuned for binaries.

Overview

elfshaker

400 GiB -> 100 MiB, with 1s access time†; when applied to clang builds.

  • elfshaker is a CLI tool written in the Rust programming language.

  • It stores snapshots of directories into highly-compressed pack files and provides fast on-demand access to the stored files. It is particularly good for storing lots of similar files, for example object files from an incremental build.

  • It allows few-second access to any commit of clang with the manyclangs project. For example, this accelerates bisection of LLVM by a factor of 60x! This is done by extracting builds of LLVM on-demand from locally stored elfshaker packs, each of which contains ~1,800 builds and is about 100 MiB in size, even though the full originals would take TiBs to store! Extracting a single build takes 2-4s on modern hardware.

†Applicability

Or, "how on earth do you get such a phenomenal result?".

It works particularly well for our presented use case because storing pre-link object files has these properties:

  • There are many files,
  • Most of them don't change very often so there are a lot of duplicate files,
  • When they do change, the deltas of the binaries are not huge.

We achieve this in manyclangs by compiling object code with the -ffunction-sections and -fdata-sections compiler flags. This has the effect that if you 'insert' a function into a translation unit, the insertion does not cause all of the addresses to change across the whole object file.

If you look at the binary delta of the linked executable after such a change, it is large, because all of the absolute addresses after the insertion change, and so do the references to those addresses. Compression algorithms do not handle these address changes well, resulting in a poor compression ratio. The effect is large: if you compress many revisions of clang executables together, you will see a compression ratio of something like 20%. This is pretty good! But elfshaker achieves a ratio closer to 0.01% (a 10,000x reduction), amortized across many builds.
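
To put these figures in the same units, here is a small sketch of the arithmetic. The function and constant names are ours, purely for illustration. A ratio quoted as a percentage of the original size converts to a reduction factor by dividing 100 by it, so 20% is 5x and 0.01% is 10,000x; the headline 400 GiB -> 100 MiB figure works out to roughly 4,096x.

```rust
/// Convert "compressed size as a percentage of the original" into a
/// reduction factor, e.g. 20% of the original size is a 5x reduction.
fn percent_to_factor(percent_of_original: f64) -> f64 {
    100.0 / percent_of_original
}

/// Reduction factor between an original and a compressed size, in bytes.
fn reduction_factor(original_bytes: u64, compressed_bytes: u64) -> f64 {
    original_bytes as f64 / compressed_bytes as f64
}

const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;
```

For example, `reduction_factor(400 * GIB, 100 * MIB)` gives the factor for the headline figure.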

Installation guide

Usage guide

Quickstart

  1. Consult our installation and usage guides, and make sure you know what you're doing.
  2. elfshaker store -- capture the state of the current working directory into a named snapshot.
  3. elfshaker pack -- capture all 'loose' snapshots into a single pack file (this is what gets you the compression win).
  4. elfshaker extract -- restore the state of a previous snapshot into the current working directory.

For more detail, take a look at our workflow documentation.

System Compatibility

The following platforms are used for our CI tests:

  • Ubuntu 20.04 LTS

But we aim to support all popular Linux platforms, macOS and Windows in production.

We officially support the following architectures:

  • AArch64
  • x86-64

Current Status

The file format and directory structure are stable. We intend that pack files created with the current elfshaker version will remain compatible with future versions. Please kick the tyres and do your own validation, and file bugs if you find any. We have done a fair amount of validation for our use cases but there may be things we haven't needed yet, so please start a discussion and file issues/patches.

Contributing

Contributions are highly appreciated. Refer to our Contributing guide.

Contact

The best way to reach us is to join the elfshaker/community on Gitter. The original authors of elfshaker are Peter Waller (@peterwaller-arm) <[email protected]> and Veselin Karaganev (@veselink1) <[email protected]>, and you may also contact us via email.

Security

Refer to our Security policy.

License

elfshaker is licensed under the Apache License 2.0.

Comments
  • Missing/unclear docs in contrib/manyclangs

    • [x] Maybe put something about how to build and install compdb2line
    • [x] Missing requirement on git-list-between - ../elfshaker/contrib/manyclangs-build-month: line 26: git-list-between: command not found
    • [x] pv command not found
    • [x] needs git that's new enough (the one with 18.04 wasn't new enough)
    • [x] needs a ninja-jobserver version
    • [ ] document where the output goes, and what architecture it builds for
    • [ ] time command not found

    (these are being found on a fresh docker 18.04 image)

    opened by mattgodbolt 20
  • `elfshaker list` output changes proposal

    At the moment 'elfshaker list' shows packs by default. This is inconvenient as the output is not something you can feed into 'elfshaker extract' and other tools.

    @veselink1 and I discussed this briefly and here are some considerations:

    • The output should be sorted.
      • The sort incurs a delay/cost, but we think it would be worse to have something which 'often appears to be sorted but is not', which would be the alternative. I might revisit this decision if the sorting turned out to be expensive, because we have to read in the full set of snapshots across all of the packs.
    • The output should be in canonical PACK:SNAPSHOT form.
      • Should we elide [PACK:] where possible? Probably not because then we might be back to the 'looks like sorted order problem'. Unless we think it's clear enough to have the snapshot name as the sort key.
    • Only the fully-qualified snapshot name should be shown by default.
      • Extra information about snapshots (files? size?) is convenient to have but should not come at the expense of breaking machine readability by default, so should probably be available via a flag. --format?
    • UI inputs
      • elfshaker list shows all snapshots.
      • elfshaker list <pack> shows snapshots within a pack.
      • elfshaker list <snapshot> could show files within a snapshot.
    • UI outputs
      • Machine readability applies to all of the above, they should share behaviours when it comes to --format considerations.
      • I think possibly we should just always drop the header, and document what it is. I think it's also somewhat discoverable in the sense that it is 'somewhat obvious' what it is. Having the header print to stderr is a neat idea but nonstandard and more often than not I end up piping it to /dev/null. I'm flexible on this point.
    • Unique snapshots
      • What to do if a snapshot name is not unique? A common way this arises right now is if you have both a loose snapshot and a snapshot living within a pack. One observation I have here is that the snapshot is unique in the presence of duplicates if the hashes of the contents of the extract would be the same. I suspect this is not too expensive to compute/maintain. One possibility is to memoize it as another field to the .pack.idx file and compute it on demand if the information is required but not present.
    • Documentation
      • We'll need to update docs, probably.

    Another thing to consider is that we're talking about changing the interface here, which could be confusing to users not prepared for it. I don't think it's a big deal at this stage in the project, and the failures would be pretty obvious to users in this case, but it's not something I would do lightly in the future.
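
As a rough sketch of the proposed default output (sorted, canonical PACK:SNAPSHOT, no header), assuming the snapshots have already been read from all packs. `SnapshotRef` and `list_lines` are hypothetical names, not part of elfshaker, and sorting by pack first and snapshot second is just one of the sort keys discussed above.

```rust
/// One snapshot, identified by the pack that holds it and its own name.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct SnapshotRef {
    pack: String,
    snapshot: String,
}

/// Render every snapshot as a canonical `PACK:SNAPSHOT` line, sorted,
/// with no header, so the output can be fed straight into other tools.
fn list_lines(mut snapshots: Vec<SnapshotRef>) -> Vec<String> {
    // The derived `Ord` compares `pack` first, then `snapshot`.
    snapshots.sort();
    snapshots
        .iter()
        .map(|s| format!("{}:{}", s.pack, s.snapshot))
        .collect()
}
```

Each returned line is a fully-qualified name, which is the form that extract accepts.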

    opened by peterwaller-arm 8
  • Clang relies on __PRETTY_FUNCTION__ and asserts

    IIRC, we observed that use of __PRETTY_FUNCTION__ resulted in binaries which depended on the full path of the sources. In order to make the binaries reproducible, we have a workaround: https://github.com/elfshaker/elfshaker/blob/f69cb7bc70e7f7579669d5e0f873632d4406c453/contrib/manyclangs-cmake#L13-L14

    Unfortunately the resulting clang binary is completely broken and refuses to compile anything:

    https://llvm.org/doxygen/TypeName_8h_source.html

    It asserts on assert(!Name.empty() && "Unable to find the template parameter!");. I suspect that __func__ doesn't contain the template type parameter information in it.

    We'll need to find a workaround for this or stop overriding __PRETTY_FUNCTION__ and therefore have binaries which can only be reproduced if built at a specific absolute path.

    opened by peterwaller-arm 8
  • Can't expand snapshot from a pack if pack name is omitted

    Again, using the repository mentioned in #20, once I removed the loose objects, I can't extract a snapshot by snapshot name only:

    $ cat elfshaker_data/HEAD 
    loose/e55674b86a10e695cea2f45c0472402b97cc2dfb:e55674b86a10e695cea2f45c0472402b97cc2dfb
    $ elfshaker extract 3f585bdaa7f6fb02753ba7b4918f065357a6b7fd 
    [ERROR (main) 30.874291ms]: *FATAL*: couldn't copy /tmp/elfshaker/elfshaker_data/loose/ef/34/7decdd22a4157e9966fd86b7f0a1af5ab3a5 to /tmp/elfshaker/usr/local/lib/libssp_nonshared.a (NotFound)
    $ elfshaker extract all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd 
    A 	0 files
    D 	0 files
    M 	21 files
    Extracted 'all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd'
    $ cat elfshaker_data/HEAD 
    all:3f585bdaa7f6fb02753ba7b4918f065357a6b7fd
    

    @Mistuke

    opened by marxin 6
  • elfshaker can't extract pack w/o loose objects

    I've created 100 loose objects and then packed all of them into the all pack. Later, I removed the loose objects (rm -rf elfshaker_data/loose/), and now I can't extract a revision:

    $ ls
    elfshaker_data
    $ elfshaker extract all:ff6b2a3e70562f4250504fc10aa83eeb98e653d7
    [WARN (main) 10.36703ms]: Expected file "/tmp/elfshaker/usr/local/share/man/man1/gcov-tool.1" to be present!
    [ERROR (main) 10.766437ms]: *FATAL*: Some files in the repository have been removed or modified unexpectedly! You can use --force to skip this check, but this might result in DATA LOSS!
    

    Elfshaker repo: https://splichal.eu/tmp/elfshaker-repo.tar

    opened by marxin 6
  • `elfshaker clone`

    Motivation

    The motivation for adding this command is to enable automatic fetching of remote packs.

    clone

    Usage

    elfshaker clone <url> <directory>

    Example

    elfshaker clone https://github.com/elfshaker/manyclangs/releases/download/v0.9.0/aarch64-ubuntu2004-manyclangs.esi manyclangs
    

    Implementation

    1. Create a directory <directory>
    2. Fetch the .esi (ElfShaker Index) file (via HTTP GET)
    3. Store the file in <directory>/elfshaker_data/remotes/origin.esi (creating missing directories)
    4. Fetch the .pack.idx of all packs listed in packs and store in elfshaker_data/packs/main

    In case any of the steps 1-3 fails, <directory> is removed before the process exits.


    update

    Usage

    elfshaker update

    Implementation

    1. Open elfshaker_data/remotes/*.esi
    2. Read the property url
    3. Fetch origin via HTTP GET (Headers: If-Modified-Since: <now> GMT)
    4. Overwrite the .esi file with the response if Status: OK, exit if Status: Not modified, error if other
    5. Fetch the .pack.idx listed in packs and overwrite the files elfshaker_data/packs/origin
      • For all .pack.idx which are not available locally
      • For all .pack.idx whose checksum on-disk does not match the checksum in the .esi

    The above sequence of operations is carried out for all .esi files in the directory.

    Any error is reported on stderr and cancels the operation for the target .esi, but not for any other indexes. The new .esi and .pack.idx are kept, the old ones are lost.
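
Step 5 of update can be sketched as a pure selection function: given the pack rows from the refreshed .esi and the checksums of the .pack.idx files already on disk, decide which indexes to fetch. This is only a sketch under assumptions; `RemotePack`, `indexes_to_fetch` and the in-memory map are hypothetical, and the actual HTTP fetch is left out.

```rust
use std::collections::HashMap;

/// One pack row from a remote index (hypothetical in-memory form).
struct RemotePack {
    name: String,
    index_checksum: String,
}

/// Return the names of packs whose `.pack.idx` should be (re)fetched:
/// those missing locally, and those whose on-disk checksum differs from
/// the checksum recorded in the `.esi`.
/// `local` maps a pack name to the checksum of its local `.pack.idx`.
fn indexes_to_fetch(remote: &[RemotePack], local: &HashMap<String, String>) -> Vec<String> {
    remote
        .iter()
        // `None` (missing) and `Some(other)` (stale) both fail this equality.
        .filter(|p| local.get(&p.name) != Some(&p.index_checksum))
        .map(|p| p.name.clone())
        .collect()
}
```

Keeping the decision separate from the network code makes the two fetch conditions easy to test in isolation.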


    Changes to existing commands

    The addition of clone changes the behaviour of existing commands.


    extract

    elfshaker extract [<remote>/<pack>]:<snapshot>

    extract is extended to automatically fetch .pack files when those are available from a remote. <remote> and <pack> below are resolved in the usual way (by reading available .pack.idx). If a matching pack cannot be found, the process exits with an appropriate error message.

    If elfshaker_data/packs/<remote>/<pack>.pack is not found

    1. Find <pack>.pack in the list of packs in elfshaker_data/remotes/<remote>.esi
    2. Fetch <pack>.pack, verify its checksum, and store to elfshaker_data/packs/<remote>/<pack>.pack
    3. Extract <pack>:<snapshot> with the usual semantics

    Otherwise

    Proceed with the usual semantics of extract (whether success or error).
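
The branch above could look roughly like this, using the elfshaker_data/packs/<remote>/<pack>.pack location from step 2. `PackSource`, `pack_path` and `resolve_pack` are hypothetical names for illustration; the existence check is passed in as a closure so the sketch does not touch the filesystem, and the fetch and checksum verification are not shown.

```rust
use std::path::{Path, PathBuf};

/// What `extract` has to do before unpacking (hypothetical type).
#[derive(Debug, PartialEq)]
enum PackSource {
    /// The pack is already on disk at this path.
    Local(PathBuf),
    /// The pack must be fetched, verified, and stored at this path first.
    FetchThenStore(PathBuf),
}

/// Location of `<remote>/<pack>` under the data directory.
fn pack_path(data_dir: &Path, remote: &str, pack: &str) -> PathBuf {
    data_dir
        .join("packs")
        .join(remote)
        .join(format!("{}.pack", pack))
}

/// Decide whether the pack can be used as-is or needs fetching.
fn resolve_pack(
    data_dir: &Path,
    remote: &str,
    pack: &str,
    exists: impl Fn(&Path) -> bool,
) -> PackSource {
    let path = pack_path(data_dir, remote, pack);
    if exists(&path) {
        PackSource::Local(path)
    } else {
        PackSource::FetchThenStore(path)
    }
}
```

In real use the closure would simply be `|p| p.exists()`.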

    Incompatibilities

    • Since we are using elfshaker_data/packs/<remote> to store the packs, users should not create a directory with the same name to store packs.

    .esi file format

    The elfshaker index format is a plain text file. Values are tab-separated.

    It starts with the line meta v1. The second line starts with url followed by the URL of the .esi file on the hosting server, which is used to refresh the .esi during elfshaker update. The following lines are tab-separated pack checksum, pack index checksum and URL (relative to url or absolute) from which to fetch the pack file. Pack indexes must be obtainable by appending .idx to the strings in packs.  

    meta    v1
    url    https://github.com/elfshaker/manyclangs/releases/download/v0.9.0/aarch64-ubuntu2004-manyclangs.esi
    039c501ac8dfcac91c6f05601cee876e1cc07e17    91768d65e5095a85472378f6dece7c5fe2524e90    aarch64-ubuntu2004-manyclangs-202102.pack
    cfd7585fe30db8a6690cb4425b94fbaeaeceb483    7871d5a9eb7d92cf5825dff75127b7d8ebf15dd7    aarch64-ubuntu2004-manyclangs-202103.pack
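
A minimal parser for the format described above might look like the following. It is a sketch under assumptions: the type and function names are hypothetical, fields are split on literal tab characters, and no checksum or URL validation is attempted.

```rust
/// One pack row of an `.esi` file (hypothetical type).
#[derive(Debug, PartialEq)]
struct PackEntry {
    pack_checksum: String,
    index_checksum: String,
    /// URL of the pack file, relative to `url` or absolute.
    location: String,
}

/// Parsed `.esi` contents (hypothetical type).
#[derive(Debug, PartialEq)]
struct EsiIndex {
    url: String,
    packs: Vec<PackEntry>,
}

fn parse_esi(text: &str) -> Result<EsiIndex, String> {
    // Skip blank lines so a trailing newline is harmless.
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());

    // Line 1: `meta<TAB>v1`.
    if lines.next() != Some("meta\tv1") {
        return Err("first line must be meta<TAB>v1".to_string());
    }

    // Line 2: `url<TAB><address of the .esi file>`.
    let url = lines
        .next()
        .and_then(|l| l.strip_prefix("url\t"))
        .ok_or("second line must be url<TAB><address>")?
        .to_string();

    // Remaining lines: pack checksum, pack index checksum, pack location.
    let mut packs = Vec::new();
    for line in lines {
        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            [pack_checksum, index_checksum, location] => packs.push(PackEntry {
                pack_checksum: pack_checksum.to_string(),
                index_checksum: index_checksum.to_string(),
                location: location.to_string(),
            }),
            _ => return Err(format!("malformed pack line: {:?}", line)),
        }
    }
    Ok(EsiIndex { url, packs })
}
```

The location of a pack index would then be the `location` string with .idx appended.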
    

    Future work

    The design allows for multiple remotes to be supported in the future, by having multiple .esi files and corresponding sub-directories under elfshaker_data/packs/. This makes the likelihood of a name clash between the names of the remotes and user-created directories in elfshaker_data/packs/ greater, but since those are user-defined identifiers, the expectation is that users would be able to resolve these clashes manually, by naming remotes accordingly.

    The operations above are defined in terms of operation on files in elfshaker_data/remotes and should work the same regardless of the number of remotes added. (update updates all remotes, extract looks up all .pack.idx)

    opened by veselink1 5
  • Can we consider storing the test pack another way?

    1eae76179128ab1fd7cf6961765c838201634e2f added a test pack.

    The function of the test pack is to ensure that future updates don't break our ability to read old files, and that's great.

    I don't really like having a ~3MiB binary in the repository though. From experience I have always ended up regretting putting binaries of any significant size in a git repository, even if we don't currently expect it to change in the future.

    Could we attach it in the releases and wget it, instead?

    opened by peterwaller-arm 5
  • Instructions on building `compdb2line`

    When running manyclangs-build-month it says "Note that compdb2line is part of this project and needs installing." - I can't find any reference on how to do this.

    There are some .go files in contrib/compdb2line. I installed elfshaker from source with cargo install --path . but there's no other reference to compdb2line in the repo that I can find. How might I build and install this tool?

    opened by mattgodbolt 4
  • Decouple loose indexes

    ~Depends on #29, so beware full diff contains that diff.~

    This builds on #29 (remove repository index).

    • It removes the notion of a PackId::Loose. I intend to go a bit further here and get rid of PackId as an enum but that can be done later in a no-functional-changes PR.
    • It removes --pack|-P from the command line interface, simplifying the user interface a bit.
      • There is a new API, repo.find_snapshot(maybe_canonical_snapshot) -> SnapshotId. This API returns an error if the given string does not uniquely identify a snapshot within a single index.
      • Users can specify a pack with the syntax pack:snapshot, or just snapshot, which feeds through this API.
      • Use of : as a packid:snapshot_name separator enables / to be used for namespacing packs by using directories on the filesystem, and for / to appear in snapshot in the future too, if we wanted.
    • Store is reimplemented. It now no longer has to do work which scales with the number of snapshots which makes it quite a bit cheaper (it was taking 10s of seconds per snapshot when you have lots of snapshots, but now it's a flat 100ms).
    • Stores are now independent so we can execute multiple of them in parallel sharing the object store safely.
    • Stores save the snapshot to an index called packs/loose/<snapshot_name>.idx. So a loose snapshot is uniquely identified with loose/snapshot_name:snapshot_name, for the time being, if snapshot_name exists in other packs, or just with the string snapshot_name otherwise.
    • Packing is generalized to take multiple indexes as input. If the indexes are not specified, all loose indexes are taken.
      • [x] TODO: Check that I remembered to sort indexes by name.
      • Including an existing pack during pack creation is not yet supported, but I have a route in mind to implement this: #35.
    • #34 For now loose packs are identified as index files lacking a pack. I intend to introduce a bit to the beginning of an index file identifying it as loose, so we can have 'indices' without having the packs.
    • After this PR, loose indices are not removed during pack creation, nor are loose objects cleaned. #17 is therefore a little more important. And probably we want to delete the loose indices by default.
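
To illustrate the resolution rule described for repo.find_snapshot, here is a standalone sketch. It is not the real API: `resolve_snapshot` is a hypothetical stand-in that works on a plain map from pack name to snapshot names, and it returns an error unless the string identifies exactly one snapshot.

```rust
use std::collections::BTreeMap;

/// Resolve `pack:snapshot` or a bare `snapshot` to a (pack, snapshot) pair.
/// `indexes` maps a pack name (which may contain '/') to its snapshot names.
fn resolve_snapshot(
    indexes: &BTreeMap<String, Vec<String>>,
    spec: &str,
) -> Result<(String, String), String> {
    // Fully-qualified form: the pack is named explicitly before the ':'.
    if let Some((pack, snapshot)) = spec.split_once(':') {
        return match indexes.get(pack) {
            Some(names) if names.iter().any(|n| n == snapshot) => {
                Ok((pack.to_string(), snapshot.to_string()))
            }
            _ => Err(format!("no snapshot '{}' in pack '{}'", snapshot, pack)),
        };
    }

    // Bare form: the name must appear in exactly one index.
    let matches: Vec<&String> = indexes
        .iter()
        .filter(|(_, names)| names.iter().any(|n| n == spec))
        .map(|(pack, _)| pack)
        .collect();
    match matches.as_slice() {
        [pack] => Ok((pack.to_string(), spec.to_string())),
        [] => Err(format!("snapshot '{}' not found", spec)),
        _ => Err(format!("snapshot '{}' is ambiguous; use pack:snapshot", spec)),
    }
}
```

With a loose index loose/a and a pack all that both contain a snapshot a, the bare name a is rejected as ambiguous, while loose/a:a resolves.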

    There is one slightly confusing nomenclature issue with this PR, in that we refer to loose packs as packs, when they aren't yet packed. So the loose snapshots live in a 'loose/name.pack.idx'.

    Resolves #15.

    opened by peterwaller-arm 4
  • Remove repository index

    ~Draft.~

    We plan to remove the RepositoryIndex for now and make it fast enough to consult the packs directly.

    This removes the notion of update-index from the UI, and gives us a single source of truth: the pack indices themselves.

    ~Depends: #27, #28 (only look at last commit).~

    opened by peterwaller-arm 4
  • Poor 'file not found' errors

    Unfortunately when a File::open fails due to a missing file, it does not report an informative error about which file is missing.

    I think we should wrap all File::open calls with something that decorates such errors with the filename, so that the user knows what the problem is. I've hit this a couple of times and haven't been able to work out what is wrong without editing the code.
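
One possible shape for such a wrapper, using only the standard library. The name `open_with_context` is hypothetical; the point is that the returned error keeps the original kind but carries the path in its message.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Open a file, decorating any failure with the path that was attempted.
fn open_with_context(path: &Path) -> io::Result<File> {
    File::open(path).map_err(|e| {
        // Preserve the original error kind so callers can still match on it.
        io::Error::new(
            e.kind(),
            format!("couldn't open {}: {}", path.display(), e),
        )
    })
}
```

Call sites would swap File::open(path) for open_with_context(path) and otherwise stay the same.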

    good first issue help wanted 
    opened by peterwaller-arm 4
  • WIP: PackIndex version 2

    This change adds the new version of the pack index:

    • file paths are stored as unicode strings (platform independent)
    • file mode is saved in the object metadata (defaults to rw-r--r--)
    • snapshot checksums are stored

    Closes #95, #93.

    opened by veselink1 0
  • Refactor: Version the PackIndex structure

    Separated the PackIndex definition from its implementation into a PackIndex trait and PackIndexV1 struct.

    Also added a VerPackIndex (versioned pack index) enum which will allow us to evolve the pack index format in the future, by handling the differences between the versions of the PackIndex in the PackIndex impl of the VerPackIndex enum.

    The intention is to make it easier to evolve the pack index.
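
The overall shape of this refactor can be sketched as follows. The names PackIndex, PackIndexV1 and VerPackIndex come from the description above; the single `snapshot_names` method and the struct field are assumptions made up for this example and do not reflect the real trait.

```rust
/// Operations every pack index version must support.
/// (The method shown is a hypothetical stand-in for the real interface.)
trait PackIndex {
    fn snapshot_names(&self) -> Vec<String>;
}

/// Version 1 of the index format.
struct PackIndexV1 {
    snapshots: Vec<String>,
}

impl PackIndex for PackIndexV1 {
    fn snapshot_names(&self) -> Vec<String> {
        self.snapshots.clone()
    }
}

/// Versioned wrapper: one variant per on-disk format version.
enum VerPackIndex {
    V1(PackIndexV1),
}

impl PackIndex for VerPackIndex {
    fn snapshot_names(&self) -> Vec<String> {
        match self {
            // A future V2 variant would get its own arm here, keeping
            // version differences out of the calling code.
            VerPackIndex::V1(index) => index.snapshot_names(),
        }
    }
}
```

Callers hold a VerPackIndex and only ever talk to the PackIndex trait, so adding a format version means adding a variant and a match arm.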

    opened by veselink1 3
  • `elfshaker gc`

    • [x] snapshot deduplication
    • [x] object graph tracing
    • [x] fs-based lock
    • [x] performance improvements (compute_snapshot_checksums is around 96% CPU time)
    • [x] API design
    • [x] Documentation

    Closes #17

    opened by veselink1 5
  • list failed in linux (Deserialization failed)

    I have a pack that was created on Windows, and I wanted to check its contents on my Ubuntu 20.04 machine. When I list, I can see my pack:

    PACK                                                                       SNAPSHOTS SIZE     
    loose/ba54343841542c1964b6b4064dc19982861fb310                             1         -        
    windowspack                             1         0.057MiB 
    

    But if I want to list all the snapshots in it, I run:

    ./elfshaker list windowspack
    

    But I get this error:

    [ERROR (main) 5.469347ms]: *FATAL*: Deserialization failed, corrupt pack index: cannot deserialize Windows OS string on Unix
    

    Any idea how I can use a Windows pack on Linux?

    enhancement help wanted 
    opened by Lambourl 3
  • Workaround git log --since missing commits

    Git's walking mechanism stops as soon as it sees a commit which is older than the date specified by --since.

    This is problematic in LLVM's commit history in August 2022, because there is a commit in the middle of it with a CommitDate in 2021.

    Work around this by considering a large number of commits and taking the oldest one matching the target date. I'll do the August pack build and confirm that everything is OK before merging this.
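
The difference between the two strategies can be shown on a toy history. This is an illustration only, not the script change itself: `Commit`, `naive_boundary` and `windowed_boundary` are hypothetical, commits are listed in walk order (newest first), and "matching" is taken to mean a commit time at or after the target.

```rust
/// A commit as seen while walking history from newest to oldest.
struct Commit {
    hash: String,
    commit_time: i64,
}

/// Mimics the early stop: walking ends at the first commit older than
/// `since`, so anything behind an out-of-order commit is never seen.
fn naive_boundary(commits: &[Commit], since: i64) -> Option<&str> {
    commits
        .iter()
        .take_while(|c| c.commit_time >= since)
        .last()
        .map(|c| c.hash.as_str())
}

/// Workaround: scan the whole window and keep the furthest-back commit
/// that still matches the target date.
fn windowed_boundary(commits: &[Commit], since: i64) -> Option<&str> {
    commits
        .iter()
        .rev()
        .find(|c| c.commit_time >= since)
        .map(|c| c.hash.as_str())
}
```

With one stray old-dated commit in the middle of the window, the naive walk stops just before it, while the windowed scan still reaches the true start of the range.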

    Fix #89.

    Signed-off-by: Peter Waller [email protected]

    opened by peterwaller-arm 1
Releases(v0.9.0)