Concatenate Amazon S3 files remotely using flexible patterns

Overview

S3 Concat

Crates.io Build Status

This tool has been migrated into s3-utils, please use that crate for future updates.

A small utility to concatenate files in AWS S3. Designed to be simple and quick, this tool uses the Multipart Upload API provided by AWS to concatenate files. This avoids the need to download files to the local machines, although it does come with caveats. S3 interaction is controlled by rusoto_s3, so check out those docs for authorization practices.

Installation

You can install s3-concat from either this repository, or from Crates (once it's published):

# install from Cargo
$ cargo install s3-concat

# install the latest from GitHub
$ cargo install --git https://github.com/whitfin/s3-concat.git

Usage

Credentials can be configured by following the instructions on the AWS Documentation, although examples will use environment variables for the sake of clarity.

You can concatenate files in a basic manner just by providing a source pattern, and a target file path:

$ AWS_ACCESS_KEY_ID=MY_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=MY_SECRET_ACCESS_KEY \
    AWS_DEFAULT_REGION=us-west-2 \
    s3-concat my.bucket.name 'archives/*.gz' 'archive.gz'

If the case you're working with long paths, you can add a prefix on the bucket name to avoid having to type it all out multiple times. In the following case, *.gz and archive.gz are relative to the my/annoyingly/nested/path/ prefix.

$ AWS_ACCESS_KEY_ID=MY_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=MY_SECRET_ACCESS_KEY \
    AWS_DEFAULT_REGION=us-west-2 \
    s3-concat my.bucket.name/my/annoyingly/nested/path/ '*.gz' 'archive.gz'

You can also use pattern matching (driven by the official regex crate), to use segments of the source paths in your target paths. Here is an example of mapping a date hierarchy (YYYY/MM/DD) to a flat structure (YYYY-MM-DD):

$ AWS_ACCESS_KEY_ID=MY_ACCESS_KEY_ID \
    AWS_SECRET_ACCESS_KEY=MY_SECRET_ACCESS_KEY \
    AWS_DEFAULT_REGION=us-west-2 \
    s3-concat my.bucket.name 'date-hierachy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'

In this case, all files in 2018/01/01/* would be mapped to 2018-01-01.gz. Don't forget to add single quotes around your expressions to avoid any pesky shell expansions!

For any other functionality, check out the help menu (although this example below might be outdated):

$ s3-concat -h
s3-concat 1.0.0
Isaac Whitfield <[email protected]>
Concatenate Amazon S3 files remotely using flexible patterns

USAGE:
    s3-concat [FLAGS] <bucket> <source> <target>

FLAGS:
    -c, --cleanup    Removes source files after concatenation
    -d, --dry-run    Only print out the calculated writes
    -h, --help       Prints help information
    -q, --quiet      Only prints errors during execution
    -V, --version    Prints version information

ARGS:
    <bucket>    An S3 bucket prefix to work within
    <source>    A source pattern to use to locate files
    <target>    A target pattern to use to concatenate files into

Limitations

In order to concatenate files remotely (i.e. without pulling them to your machine), this tool uses the Multipart Upload API of S3. This means that all limitations of that API are inherited by this tool. Usually, this isn't an issue, but one of the more noticeable problems is that files smaller than 5MB cannot be concatenated. To avoid wasted AWS calls, this is currently caught in the client layer and will result in a client side error. Due to the complexity in working around this, it's currently unsupported to join files with a size smaller than 5MB.

You might also like...
Like jq, but for HTML. Uses CSS selectors to extract bits content from HTML files.

Like jq, but for HTML. Uses CSS selectors to extract bits content from HTML files. Mozilla's MDN has a good reference for CSS selector syntax.

Rust crate for reading SER files used in astrophotography

Rust crate for reading SER files used in astrophotography.

Spot coupling by finding out which files are always in the same commit

git moves-together This tells you when files in the repository frequently move together. This lets you identify where the coupling is in the system. C

File Tree Fuzzer allows you to create a pseudo-random directory hierarchy filled with some number of files.

FTZZ File Tree Fuzzer allows you to create a pseudo-random directory hierarchy filled with some number of files. Installation $ cargo +nightly install

Teleport is a simple application for sending files from Point A to Point B

Teleporter Teleporter is a small utility in the vein of netcat to send files quickly from point A to point B. It is more convenient than netcat in tha

Clean up the lines of files in your code repository

lineman Clean up the lines of files in your code repository NOTE: While lineman does have tests in place to ensure it operates in a specific way, I st

A set of bison skeleton files that can be used to generate a Bison grammar that is written in Rust.

rust-bison-skeleton A set of bison skeleton files that can be used to generate a Bison grammar that is written in Rust. Technically it's more like a B

A git hook to manage derivative files automatically.

git-derivative A git hook to manage derivative files automatically. For example if you checked out to a branch with different yarn.lock, git-derivativ

Prints the absolute path of all regular files in an unmounted btrfs filesystem image.

btrfs-walk-tut Prints the absolute path of all regular files in an unmounted btrfs filesystem image. Learning about btrfs: Btrfs Basics Series This re

Comments
  • UnExpected Header repeated in destination merged file

    UnExpected Header repeated in destination merged file

    Header is repeating for every subfiles, finally merged file has header multiple times i.e each subfile loaded in merged file with header.

    For example we have to merge 5 files and every file has 10 rows then Merged/concatenated file will have 55 rows with header with interval of 11 rows. Header will be written five times in concatenated file.

    opened by szeshanoffrs 2
  • Update deprecated methods

    Update deprecated methods

    Rust string library has deprecated the use of trim_left_matches and trim_right_matches in favor of trim_start_matches and trim_end_matches. This PR updates this.

    opened by TimOrme 0
Releases(v1.1.0)
Owner
Isaac Whitfield
Fan of all things automated. OSS when applicable. Author of Cachex for Elixir. Senior Software Engineer at Axway. Intelligence wanes without practice.
Isaac Whitfield
Quickly mint NFT(able)s port forwards remotely

Easy expose A really simple way to expose some service behind a NAT, similar to rathole and frp. WARNING: This does not secure the channel, or even do

Ben Simms 3 Dec 3, 2022
Utilities and tools based around Amazon S3 to provide convenience APIs in a CLI

s3-utils Utilities and tools based around Amazon S3 to provide convenience APIs in a CLI. This tool contains a small set of command line utilities for

Isaac Whitfield 47 Dec 15, 2022
Minimal, flexible framework for implementing solutions to Advent of Code in Rust

This is advent_of_code_traits, a minimal, flexible framework for implementing solutions to Advent of Code in Rust.

David 8 Apr 17, 2022
Demonstration of flexible function calls in Rust with function overloading and optional arguments

Table of Contents Table of Contents flexible-fn-rs What is this trying to demo? How is the code structured? Named/Unnamed and Optional arguments Mecha

Tien Duc (TiDu) Nguyen 81 Nov 3, 2022
A flexible, simple to use, immutable, clone-efficient String replacement for Rust

A flexible, simple to use, immutable, clone-efficient String replacement for Rust. It unifies literals, inlined, and heap allocated strings into a single type.

Scott Meeuwsen 119 Dec 12, 2022
Flexible snowflake generator, reference snoyflake and leaf.

Flexible snowflake generator, reference snoyflake and leaf.

Egccri 2 May 6, 2022
Minimal, flexible & user-friendly X and Wayland tiling window manager with rust

SSWM Minimal, flexible & user-friendly X and Wayland tiling window manager but with rust. Feel free to open issues and make pull requests. [Overview]

Linus Walker 19 Aug 28, 2023
Garden monitoring system using m328p Arduino Uno boards. 100% Rust [no_std] using the avr hardware abstraction layer (avr-hal)

uno-revive-rs References Arduino Garden Controller Roadmap uno-revive-rs: roadmap Components & Controllers 1-2 Uno R3 m328p Soil moisture sensor: m328

Ethan Gallucci 1 May 4, 2022
Czkawka is a simple, fast and easy to use app to remove unnecessary files from your computer.

Multi functional app to find duplicates, empty folders, similar images etc.

RafaƂ Mikrut 9.2k Jan 4, 2023
Rust crate which provides direct access to files within a Debian archive

debarchive This Rust crate provides direct access to files within a Debian archive. This crate is used by our debrep utility to generate the Packages

Pop!_OS 11 Dec 18, 2021