Overview

rsync-sjtug

WIP: This project is still under development, and is not ready for production use.

rsync-sjtug is an open-source project designed to provide an efficient method of mirroring remote repositories to s3 storage, with atomic updates and periodic garbage collection.

This project implements the rsync wire protocol and is compatible with rsync protocol version 27; rsyncd servers running version 2.6.0 or newer are supported.

Features

  • Atomic repository update: users never see a partially updated repository.
  • Periodic garbage collection: old versions of files can be removed from the storage.
  • Delta transfer: only the changed parts of files are transferred. Please see the Delta Transfer section below for details.

Commands

  • rsync-fetcher - fetches the repository from the remote rsync server and uploads it to s3.
  • rsync-gateway - serves the mirrored repository from s3 over HTTP.
  • rsync-gc - periodically removes old versions of files from s3.

Example

  1. Sync an rsync repository to S3.

    $ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \
      rsync-fetcher \
        --src rsync://upstream/path \
        --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix repo_name \
        --redis redis://localhost --redis-namespace repo_name \
        --repository repo_name \
        --gateway-base http://localhost:8081/repo_name
  2. Serve the repository over HTTP.

    $ cat > config.toml <<-EOF
    bind = ["localhost:8081"]
    
    [endpoints."out"]
    redis = "redis://localhost"
    redis_namespace = "test"
    s3_website = "http://localhost:8080/test/test-prefix"
    
    EOF
    
    $ RUST_LOG=info RUST_BACKTRACE=1 rsync-gateway <optional config file>
  3. GC old versions of files periodically.

    $ RUST_LOG=info RUST_BACKTRACE=1 AWS_ACCESS_KEY_ID=<ID> AWS_SECRET_ACCESS_KEY=<KEY> \
      rsync-gc \
        --s3-url https://s3_api_endpoint --s3-region region --s3-bucket bucket --s3-prefix repo_name \
        --redis redis://localhost --redis-namespace repo_name \
        --keep 2

    It's recommended to keep at least 2 versions of files in case a gateway is still using an old revision.

Design

File data and their metadata are stored separately.

Data

Files are stored in S3 storage, named by their blake2b-160 hash (<namespace>/<hash>).

Listing HTML pages are stored at <namespace>/listing-<timestamp>/<path>/index.html.
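
The key layout can be illustrated with a short sketch (not taken from the codebase; it assumes the blake2 0.10 and hex crates, and the namespace, timestamp, and path values are made up):

    // Sketch of the S3 object-key scheme described above.
    use blake2::digest::{Update, VariableOutput};
    use blake2::Blake2bVar;

    fn content_key(namespace: &str, data: &[u8]) -> String {
        // blake2b with a 160-bit (20-byte) digest
        let mut hasher = Blake2bVar::new(20).expect("valid output size");
        hasher.update(data);
        let mut digest = [0u8; 20];
        hasher.finalize_variable(&mut digest).expect("buffer matches digest size");
        format!("{namespace}/{}", hex::encode(digest))
    }

    fn listing_key(namespace: &str, timestamp: u64, path: &str) -> String {
        format!("{namespace}/listing-{timestamp}/{path}/index.html")
    }

    fn main() {
        println!("{}", content_key("repo_name", b"example file contents"));
        println!("{}", listing_key("repo_name", 1_700_000_000, "dists/stable"));
    }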

Metadata

Metadata is stored in Redis for fast access.

Note that there is more than one file index in Redis.

  • <namespace>:index:<timestamp> - an index of the repository synced at <timestamp>.
  • <namespace>:partial - a partial index that is still being updated and not committed yet.
  • <namespace>:partial-stale - a temporary index that is used to store outdated files when updating the partial index. This might happen if you interrupt a synchronization, restart it, and some files downloaded in the first run are already outdated. It's ready to be garbage collected.
  • <namespace>:stale:<timestamp> - an index that is taken out of production, and is ready to be garbage collected.

Not all files referenced by stale or partial indices can be removed during garbage collection. For example, if a file exists both in a stale index and in a "live" index, it must not be removed.
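
As a rough illustration of this rule, garbage collection boils down to a set difference over content hashes. The sketch below is simplified and does not reflect the actual Redis index layout:

    use std::collections::HashSet;

    /// Hashes that may be deleted from S3: everything referenced only by
    /// stale/partial-stale indices and by no live index.
    fn collectable<'a>(
        stale: &'a HashSet<String>,
        live: &'a HashSet<String>,
    ) -> HashSet<&'a str> {
        stale
            .iter()
            .filter(|h| !live.contains(*h))
            .map(String::as_str)
            .collect()
    }

    fn main() {
        let stale: HashSet<String> = ["aaa", "bbb"].map(String::from).into();
        let live: HashSet<String> = ["bbb", "ccc"].map(String::from).into();
        // Only "aaa" is safe to delete; "bbb" is still referenced by a live index.
        assert_eq!(collectable(&stale, &live), HashSet::from(["aaa"]));
    }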

Delta Transfer

rsync-sjtug implements the delta-transfer algorithm described in the rsync protocol specification, which can reduce the amount of data transferred from the remote server.

However, because the basis file is not available locally, it has to be fetched from S3 before a delta can be calculated. What's more, S3 doesn't support random writes, so the patched file must be re-uploaded in full.

Therefore, if your S3 storage is not close to your server (e.g. on the same network), you may want to disable delta transfer.
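
For background, the delta algorithm locates reusable blocks with a cheap weak rolling checksum before confirming matches with a strong hash. The sketch below shows the textbook rsync formulation of that rolling checksum; it is illustrative only and not the project's implementation:

    /// Weak rsync checksum: a = sum of bytes mod 2^16,
    /// b = position-weighted sum mod 2^16, s = a | (b << 16).
    fn weak_checksum(block: &[u8]) -> u32 {
        let n = block.len() as u32;
        let (mut a, mut b) = (0u32, 0u32);
        for (j, &x) in block.iter().enumerate() {
            a = (a + x as u32) & 0xffff;
            b = (b + (n - j as u32) * x as u32) & 0xffff;
        }
        a | (b << 16)
    }

    /// Slide the window one byte to the right without rescanning the block.
    fn roll(sum: u32, n: usize, out_byte: u8, in_byte: u8) -> u32 {
        let (a, b) = (sum & 0xffff, sum >> 16);
        let a = a.wrapping_add(in_byte as u32).wrapping_sub(out_byte as u32) & 0xffff;
        let b = b
            .wrapping_sub((n as u32).wrapping_mul(out_byte as u32))
            .wrapping_add(a)
            & 0xffff;
        a | (b << 16)
    }

    fn main() {
        let data = b"the quick brown fox jumps over the lazy dog";
        let n = 16; // block length (illustrative)
        let mut sum = weak_checksum(&data[0..n]);
        for k in 0..data.len() - n {
            sum = roll(sum, n, data[k], data[k + n]);
            assert_eq!(sum, weak_checksum(&data[k + 1..k + 1 + n]));
        }
        println!("rolling checksum matches full recomputation");
    }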

Comments
  • Idea: alternative file serving

    S3 backends often have a convenient HTTP endpoint for end users, so the current implementation reuses that and redirects requests.

    If we are to support more storage backends, we must consider two problems:

    1. If the backend does not provide an HTTP endpoint, should we support them and have our gateway serve the files? We may use OpenDAL to simplify the implementation if we choose to do so.

    2. Some backends do not have a stable URL for a file. For example, this is the case for S3 buckets with public access disabled, which enforce pre-signed URLs with an expiration time. Another case is IPFS storage, which may have a stable key but whose URL is not freely specifiable. This causes problems for: a) redirecting, because we currently generate URLs by simple string concatenation (prefix + hash); b) listing generation, because we must know the URL in advance.

      To save traffic, accessing files (and jumping to subdirectories) currently does not go through the gateway; this is implemented with relative URLs. That approach breaks for dynamic keys. A possible fix is to direct all links to the gateway, which would increase traffic but should be able to handle the cases above.

    opened by PhotonQuantum 1
  • Fix: symlink implementation

    Due to an early design flaw (directory entries were not saved in the metadata server), the current implementation of symlink resolution is expensive.

    New procedure:

    1. Try to look up the path directly in the hashtable.
    2. If that fails, fall back to standard POSIX logic: starting from the first component, walk the filesystem tree.

    This was not feasible before because, without directory entries stored in the metadata server, it is hard to tell whether a path is a directory or simply does not exist; backtracking is needed, and the time complexity is unacceptable. A sketch of the proposed lookup follows.
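
    A minimal sketch of the proposed lookup, using a hypothetical, simplified metadata model (cycle detection and error handling omitted; not the project's actual types):

        use std::collections::HashMap;

        enum Entry {
            Regular,
            Directory,
            Symlink { target: String },
        }

        // Resolve `path` against a flat path -> Entry table: direct lookup first,
        // then a POSIX-style component walk that follows symlinks.
        fn resolve<'a>(index: &'a HashMap<String, Entry>, path: &str) -> Option<&'a Entry> {
            // 1. Fast path: direct lookup in the hashtable.
            if let Some(entry) = index.get(path) {
                return Some(entry);
            }
            // 2. Fallback: walk the components, following symlinks as we go.
            let parts: Vec<&str> = path.split('/').filter(|c| !c.is_empty()).collect();
            let mut current = String::new();
            for (i, part) in parts.iter().enumerate() {
                if !current.is_empty() {
                    current.push('/');
                }
                current.push_str(part);
                match index.get(&current)? {
                    Entry::Symlink { target } => {
                        // Splice the link target in and restart resolution.
                        let rest = parts[i + 1..].join("/");
                        let next = if rest.is_empty() {
                            target.clone()
                        } else {
                            format!("{target}/{rest}")
                        };
                        return resolve(index, &next);
                    }
                    entry if i + 1 == parts.len() => return Some(entry),
                    Entry::Directory => {}
                    _ => return None, // a regular file used as a directory
                }
            }
            None
        }

        fn main() {
            let mut index = HashMap::new();
            index.insert("pool".to_string(), Entry::Directory);
            index.insert("pool/pkg.rpm".to_string(), Entry::Regular);
            index.insert("latest".to_string(), Entry::Symlink { target: "pool".to_string() });
            // Not present as a literal key, but resolvable through the symlink.
            assert!(matches!(resolve(&index, "latest/pkg.rpm"), Some(Entry::Regular)));
        }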

    Appendix. current implementation

    1. Try to trace the path recursively, following redirections. This works only if there is no symlinked directory among the ancestors of the given path; otherwise it gets stuck.
    2. For each ancestor, check whether it is a directory or a symlink. If it is a symlink, follow it once, then concatenate the target directory with the remaining components.
    3. Go to step 1.
    opened by PhotonQuantum 0
  • Docs: update docs

    • [ ] update gateway design doc: a trailing slash is no longer needed to list a directory
    • [ ] clarify intended usage: separation of the data plane and the control plane (if this separation is not needed, JuiceFS is a better choice)
    opened by PhotonQuantum 0
  • Idea: hardlink support

    Currently, all hard links are resolved as regular files, so redundant bytes may be fetched from the remote server. This can be problematic for repositories that make heavy use of hard links, e.g. fedora.

    Rsync has hard link support and can accurately transfer hard links between servers. Dev and inode ids are transmitted through the wire on the file list transfer stage. The client may recognize duplicated dev and ino pairs and initiate file content transfer only for the first instance.

    Hard links are non-directional, so it is better to see them as a cluster rather than as individual links. One naive approach is to pick a source file by some heuristic and treat the rest as symlinks. However, there are problems with this implementation: if the "virtual" source is removed later, a new source must be chosen, and all other files that originally shared the same inode have to be rewritten to point to the new source. Furthermore, detecting this without changing the metadata format (to book-keep hard links) is expensive, because we would have to reverse-track all entries pointing at the source. Reusing the existing symlink handling is therefore not a good choice.

    Another possible implementation is to use hard-link information only as an optimization. When the generator requests a file, first check whether another file with the same dev & ino has already been requested. If so, do not request this file again and reuse that file's hash (remember, files are content-addressed by hash). The only extra cost (besides receiving and storing the dev & ino fields in FileEntry) is a hash table from (dev, ino) to file index, as sketched below.
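
    A rough sketch of that bookkeeping (type and field names here are illustrative, not the actual FileEntry layout):

        use std::collections::HashMap;

        /// Illustrative only: minimal stand-in for rsync's file-list entries.
        struct FileEntry {
            idx: usize,
            dev: u64,
            ino: u64,
        }

        /// Decide which entries actually need a content transfer. Entries that share
        /// a (dev, ino) pair with an earlier entry reuse that entry's content hash.
        fn plan_transfers(entries: &[FileEntry]) -> (Vec<usize>, HashMap<usize, usize>) {
            let mut seen: HashMap<(u64, u64), usize> = HashMap::new();
            let mut to_fetch = Vec::new(); // indices we must download
            let mut reuse = HashMap::new(); // idx -> idx whose hash we reuse
            for entry in entries {
                match seen.get(&(entry.dev, entry.ino)) {
                    Some(&first) => {
                        reuse.insert(entry.idx, first);
                    }
                    None => {
                        seen.insert((entry.dev, entry.ino), entry.idx);
                        to_fetch.push(entry.idx);
                    }
                }
            }
            (to_fetch, reuse)
        }

        fn main() {
            let entries = [
                FileEntry { idx: 0, dev: 1, ino: 42 },
                FileEntry { idx: 1, dev: 1, ino: 42 }, // hard link to idx 0
                FileEntry { idx: 2, dev: 1, ino: 7 },
            ];
            let (to_fetch, reuse) = plan_transfers(&entries);
            assert_eq!(to_fetch, vec![0, 2]);
            assert_eq!(reuse[&1], 0);
        }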

    opened by PhotonQuantum 0
  • Idea: Integration with OpenDAL to extend the storage backend to all storage services.

    Nice project!

    Maybe we can work together to integrate opendal and extend the storage backend to all storage services? I think there are many people looking for solutions for rsync to different storage platforms.

    opened by Xuanwo 3
Releases: v0.2.8

Owner: SJTUG (Shanghai Jiao Tong University *nix User Group)