Ribzip2 - A bzip2 implementation in pure Rust.

ribzip2 - a comprehensible bzip2 implementation

ribzip2 is a command-line utility providing bzip2 compression and decompression, written in pure Rust. It is currently considered work in progress and falls short of the original implementation in several respects, including:

  • it achieves a worse compression ratio
  • it is slower (by at least a factor of 2)

Design Goals

Goals

  • "enterprise style" comprehensible code equipped with tests explaining the algorithms involved
  • pure Rust
  • safe code
  • state-of-the-art algorithms with optimal asymptotic performance
  • efficient multithreading
  • ergonomic CLI

Long-Term-Goals

  • ergonomic library crate
  • drop-in replacement for the bzip2 crate
  • drop-in replacement for the C library libbz2
  • drop-in replacement for the C bzip2/bunzip2 binary

Contribute!

Contributions are very welcome (issues, pull requests, and comments). Find an issue to work on under "issues". The code here (excluding the samples used for compression tests) is published under the MIT license.

Comments
  • Improve CLI ergonomy

    Solves #5

    • [ ] Compress file x.y to x.y.bz2 and decompress file x.y.bz2 to x.y
    • [ ] Check if the output files already exist and fail with an error message
    • [ ] Handle I/O failures with error messages
    opened by caglaryucekaya 7
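
The three checklist items above could be approached roughly as follows. This is a minimal sketch using only the Rust standard library; the function names `compressed_name`, `decompressed_name`, and `create_output` are hypothetical and not taken from the ribzip2 code base.

```rust
use std::ffi::OsStr;
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: x.y -> x.y.bz2 (appends the suffix, keeps the old extension).
fn compressed_name(input: &Path) -> PathBuf {
    let mut name = input.as_os_str().to_owned();
    name.push(".bz2");
    PathBuf::from(name)
}

/// Hypothetical helper: x.y.bz2 -> x.y; returns None if the input does not end in .bz2.
fn decompressed_name(input: &Path) -> Option<PathBuf> {
    if input.extension() == Some(OsStr::new("bz2")) {
        Some(input.with_extension(""))
    } else {
        None
    }
}

/// Opens the output file for writing and fails with `ErrorKind::AlreadyExists`
/// if it is already present, so the caller can print a message instead of panicking.
fn create_output(path: &Path) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}
```

Using `create_new(true)` lets the operating system perform the existence check and the creation in one step, which avoids a race between a separate "does it exist?" test and the actual open.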
  • Check formatting of rust files in CI

    Check that all files in the workspace are formatted properly using rustfmt --check, and fail the CI otherwise. In addition, the following could be improved with respect to formatting:

    • add a vscode workspace file which recommends the rust analyzer plugin
    • set "format on save" as default setting
    good first issue difficulty easy 
    opened by torfmaster 7
  • Run RLE before distributing work on threads

    Solves https://github.com/torfmaster/ribzip2/issues/4

    • Calculate RLE data in the main thread
    • Send calculated data to generate_block_data
    • Move rle::rle from block_encoder to stream
    opened by NorbertGarfield 5
  • Improve CLI ergonomy

    Improve ergonomy of CLI

    • [ ] follow bzip2's naming logic: decompress a file named x.y.bz2 to x.y and compress x.y to x.y.bz2
    • [ ] check whether output file exists and fail with an error message
    • [ ] handle I/O failures with meaningful error messages
    good first issue difficulty easy 
    opened by torfmaster 5
  • Automatically use number of virtual CPUs as default for number of threads

    Solves https://github.com/torfmaster/ribzip2/issues/1

    • Add num_cpus crate.
    • Mark --threads type as Option<usize>.
    • Check whether --threads option has been set from command line. If not, request num_cpus::get().
    opened by NorbertGarfield 4
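
The fallback logic described in this pull request can be sketched as below. Note the substitution: the pull request uses `num_cpus::get()`, whereas this sketch uses the standard library's `std::thread::available_parallelism` so that it compiles without extra dependencies. The function name `effective_threads` is hypothetical.

```rust
use std::thread;

/// Hypothetical helper: use the value of --threads if it was given (and non-zero),
/// otherwise fall back to the parallelism reported by the standard library.
/// The pull request itself calls num_cpus::get() at this point instead.
fn effective_threads(requested: Option<usize>) -> usize {
    requested.filter(|&n| n > 0).unwrap_or_else(|| {
        thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1) // last resort if the platform cannot report a value
    })
}
```

Modelling the option as `Option<usize>` keeps "not set" distinct from any concrete number, so the default can be computed at run time rather than hard-coded.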
  • Automatically use number of virtual CPUs as default for number of threads

    Currently, the number of threads defaults to 1 if it is not provided. As compression profits significantly from multi-threading, it would make sense to use the num_cpus crate to choose a higher default for the number of threads.

    good first issue difficulty easy 
    opened by torfmaster 3
  • Compute crc32 checksum before sending work to threads

    Following up this discussion https://github.com/torfmaster/ribzip2/pull/17/files#r835048506

    Namely, that part: compute the checksum outside generate_block_data based on the amount of data read during RLE.

    opened by NorbertGarfield 1
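
For reference, the checksum in question is the bzip2 variant of CRC-32 (polynomial 0x04C11DB7, processed most-significant bit first, initial value and final XOR of 0xFFFFFFFF). The following is a deliberately simple bitwise sketch that can be fed incrementally as data is read; the type name `Crc32` is hypothetical, and a real implementation would more likely use a lookup table.

```rust
/// Hypothetical incremental CRC-32 in the bzip2 flavour (MSB-first, not reflected).
struct Crc32 {
    state: u32,
}

impl Crc32 {
    fn new() -> Self {
        Crc32 { state: 0xFFFF_FFFF }
    }

    /// Feed more uncompressed bytes; can be called once per chunk read.
    fn update(&mut self, data: &[u8]) {
        for &byte in data {
            self.state ^= u32::from(byte) << 24;
            for _ in 0..8 {
                self.state = if self.state & 0x8000_0000 != 0 {
                    (self.state << 1) ^ 0x04C1_1DB7
                } else {
                    self.state << 1
                };
            }
        }
    }

    /// Final checksum: the bitwise complement of the running state.
    fn finish(&self) -> u32 {
        !self.state
    }
}
```

Because the state is updated chunk by chunk, the checksum can be computed in the reading thread alongside RLE and handed to the worker together with the block data.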
  • Change libzip2 to libbzip2 in README.md

    I am not an expert - but I expect the lib is libbzip2 and not libzip2 - a quick google seems to suggest I am right :) Feel free to merge - or to discard!

    opened by markmmm 1
  • Run RLE before distributing work on threads

    Currently, all steps of compression are distributed across threads. However, RLE consumes less than 1% of the time, and because the output size of RLE is not known in advance and can blow up a block by 20%, the input per block is bounded at 720,000 bytes. This hurts compression efficiency, since the format allows blocks of up to 900,000 bytes. It is therefore reasonable to run RLE in a single thread and only then distribute the work to worker threads.

    good first issue difficulty medium 
    opened by torfmaster 1
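
To make the size argument concrete: in bzip2's initial run-length step, a run of 4 to 255 equal bytes is written as four copies of the byte plus one count byte, so in the worst case every 4 input bytes become 5 output bytes, and 720,000 input bytes become exactly 900,000. The sketch below illustrates that step under this reading of the format; it is not the `rle::rle` function from ribzip2, and the name `rle1_encode` is hypothetical.

```rust
/// Hypothetical sketch of bzip2's first run-length step:
/// runs of 4..=255 equal bytes become 4 copies of the byte plus (run length - 4).
fn rle1_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + input.len() / 4);
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        // Measure the run starting at i, capped at 255 bytes.
        let mut run = 1;
        while run < 255 && i + run < input.len() && input[i + run] == byte {
            run += 1;
        }
        if run >= 4 {
            out.extend_from_slice(&[byte; 4]);
            out.push((run - 4) as u8);
        } else {
            out.extend(std::iter::repeat(byte).take(run));
        }
        i += run;
    }
    out
}
```

Since the encoded length depends on the data, running this step in the reading thread makes it possible to stop consuming input exactly when a block is full, instead of conservatively limiting the input per block.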
  • Draft: add read implementation instead of function

    An attempt to solve #18. It seems to be a feasible goal. Further steps:

    • [ ] improve docs and make the Read implementation the preferred way
    • [ ] apply the same concepts to decompression. Open question: how to deal with domain-specific errors? Logging? Create extra recovery/analysis-related methods for decompression?
    • [ ] attempt implementations based on the Write trait as well to get full compatibility

    Also fyi @caglaryucekaya.

    opened by torfmaster 0
  • Write an interface compatible to the bzip2 crate

    The bzip2 crate is a wrapper around the original C bzip2 implementation and has a Read/Write interface. The goal of this issue is to write such an interface for libribzip2 as well, so that it could serve as a drop-in replacement for this library.

    In a first step I would not make the block size configurable (as it is in the original implementation). The reason for the original design decision is that the complexity of the algorithm used there is n log n (coming from fat-pivot radix sort). We are using only linear-time algorithms, so run time should not be an issue (I would skip discussions about memory for now). You could either

    • ignore the Compression struct in bzip2 completely
    • re-interpret it as carrying information about the Huffman code optimization taking place

    whichever you prefer.

    This issue is mentored, and I will try to give you the best support possible (on top of my everyday job as a developer).

    good first issue difficulty medium 
    opened by torfmaster 2
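
One possible shape for such a Read-based interface is sketched below: a wrapper that pulls one block of raw data from an inner reader, encodes it, and serves the encoded bytes to the caller. Everything here is an assumption for illustration. `BlockEncoder` is a hypothetical name, and `encode_block` is a stand-in that copies its input unchanged where the real block compressor would be called.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for the real block compressor; it only copies the input.
fn encode_block(raw: &[u8]) -> Vec<u8> {
    raw.to_vec()
}

/// Hypothetical adapter exposing block-wise encoding through `std::io::Read`.
pub struct BlockEncoder<R: Read> {
    inner: R,
    block_size: usize,
    pending: Vec<u8>, // encoded bytes not yet handed to the caller
    pos: usize,       // read position inside `pending`
    done: bool,       // set once the inner reader is exhausted
}

impl<R: Read> BlockEncoder<R> {
    pub fn new(inner: R, block_size: usize) -> Self {
        BlockEncoder { inner, block_size, pending: Vec::new(), pos: 0, done: false }
    }

    /// Read up to one block from the inner reader and encode it.
    fn refill(&mut self) -> io::Result<()> {
        let mut raw = Vec::with_capacity(self.block_size);
        self.inner
            .by_ref()
            .take(self.block_size as u64)
            .read_to_end(&mut raw)?;
        if raw.is_empty() {
            self.done = true;
        }
        self.pending = encode_block(&raw);
        self.pos = 0;
        Ok(())
    }
}

impl<R: Read> Read for BlockEncoder<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        while self.pos == self.pending.len() && !self.done {
            self.refill()?;
        }
        let n = buf.len().min(self.pending.len() - self.pos);
        buf[..n].copy_from_slice(&self.pending[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

With this shape, callers can combine the encoder with `std::io::copy` or `read_to_end` like any other reader, which is the property a drop-in replacement for the bzip2 crate's reader types would need.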
  • Write tests documenting the usage of the library

    The library usage is currently poorly documented. Moreover, the only existing tests exercise compression/decompression end to end. Let's solve both problems at once!

    • write round-trip unit tests for libribzip2 doing compression/decompression/compression. You can, for example, use the contents of cli/samples/pepper.txt, which tests some of the features. You can also produce some static data larger than the block size of 900k to test that use case as well
    • write tests for decompression using data compressed with the original bzip2. You can store a bz2 file in the source code using https://doc.rust-lang.org/std/macro.include_bytes.html
    • write simple doc tests for the public library functions to explain the usage. Also run the doc tests in CI
    good first issue difficulty easy 
    opened by torfmaster 2
  • Improve error handling in libribzip2

    Provide error types for compression and decompression. There should be

    • a type for compression basically wrapping IO errors
    • a type for decompression basically wrapping IO errors or a decompression domain specific error

    As decompression errors are usually not recoverable the decompression domain specific error could be a flat enum containing the following cases (to be completed)

    • selected non-existing Huffman table
    • data exceeds block size of 900k (e.g. RLE or ZLE produces unexpectedly long blocks)
    • crc errors
    • orig_pointer out of bounds

    and many more. The errors should come with descriptive Display implementations. Ideally, the CLI should display them instead of just panicking.

    help wanted difficulty medium 
    opened by torfmaster 0
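
The error types outlined in this issue might look roughly like the sketch below. All type, variant, and field names are hypothetical, as is the wording of the messages; the helper `check_crc` only demonstrates how a domain error converts into the outer type.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical flat enum for unrecoverable, format-level decompression failures.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DomainError {
    UnknownHuffmanTable { selector: usize },
    BlockTooLarge { limit: usize },
    CrcMismatch { expected: u32, actual: u32 },
    OrigPointerOutOfBounds { pointer: usize, block_len: usize },
}

impl fmt::Display for DomainError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DomainError::UnknownHuffmanTable { selector } => {
                write!(f, "selector refers to non-existing Huffman table {}", selector)
            }
            DomainError::BlockTooLarge { limit } => {
                write!(f, "block data exceeds the limit of {} bytes", limit)
            }
            DomainError::CrcMismatch { expected, actual } => {
                write!(f, "CRC mismatch: expected {:#010x}, found {:#010x}", expected, actual)
            }
            DomainError::OrigPointerOutOfBounds { pointer, block_len } => {
                write!(f, "orig_pointer {} is outside the block of length {}", pointer, block_len)
            }
        }
    }
}

/// Hypothetical error type for decompression: an I/O error or a domain error.
#[derive(Debug)]
pub enum DecompressionError {
    Io(io::Error),
    Domain(DomainError),
}

impl fmt::Display for DecompressionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DecompressionError::Io(e) => write!(f, "I/O error: {}", e),
            DecompressionError::Domain(e) => write!(f, "invalid bzip2 data: {}", e),
        }
    }
}

impl Error for DecompressionError {}

impl From<io::Error> for DecompressionError {
    fn from(e: io::Error) -> Self {
        DecompressionError::Io(e)
    }
}

impl From<DomainError> for DecompressionError {
    fn from(e: DomainError) -> Self {
        DecompressionError::Domain(e)
    }
}

/// Hypothetical check showing the conversion from a domain error.
fn check_crc(expected: u32, actual: u32) -> Result<(), DecompressionError> {
    if expected != actual {
        return Err(DomainError::CrcMismatch { expected, actual }.into());
    }
    Ok(())
}
```

A compression error type would follow the same pattern with only the `Io` variant. The `From` implementations allow library code to use the `?` operator, and the CLI can print the `Display` output instead of panicking.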
  • Performance of compression: The clone wars

    ribzip2 uses a very naive representation of Huffman codes and also writes them in a naive way: it uses dynamically allocated arrays of enums. In other places, too, the habit of representing bits as arrays of enums has large costs: in total, at least 5% of the time is spent cloning these arrays. This can be eliminated in the following places by a better internal representation, e.g. a 32-bit integer (bzip2 Huffman codes are length-limited to 17 bits when writing and 20 bits when reading anyway).

    • [ ] replace the representation of Huffman codes by 32-bit integers to avoid cloning of arrays
    • [ ] replace the bitwriter's internal representation by bytes or integers
    • [ ] use bit arrays (represented by integers, for example) instead of arrays of enums
    • [ ] store block data more efficiently instead of just using arrays of Bit enums
    help wanted difficulty medium 
    opened by torfmaster 7
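
As an illustration of the first two checklist items, a Huffman code can be stored as a small `Copy` value (the bits in a `u32` plus a length) and written through an integer accumulator, so no arrays are allocated or cloned. This is a sketch under the assumption that codes are at most 32 bits long; the names `Code` and `BitWriter` are hypothetical and do not refer to the existing ribzip2 types.

```rust
/// Hypothetical compact code: the lowest `len` bits of `bits`, written MSB first.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Code {
    bits: u32,
    len: u8, // must be at most 32
}

/// Hypothetical bit writer backed by an integer accumulator instead of Vec<Bit>.
struct BitWriter {
    out: Vec<u8>,
    acc: u64,   // the lowest `nbits` bits are pending output
    nbits: u32, // always below 8 between calls
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { out: Vec::new(), acc: 0, nbits: 0 }
    }

    fn write(&mut self, code: Code) {
        debug_assert!(code.len <= 32);
        let mask = if code.len == 32 { u32::MAX } else { (1u32 << code.len) - 1 };
        self.acc = (self.acc << code.len) | u64::from(code.bits & mask);
        self.nbits += u32::from(code.len);
        // Flush every complete byte, most significant pending bits first.
        while self.nbits >= 8 {
            self.nbits -= 8;
            self.out.push((self.acc >> self.nbits) as u8);
        }
    }

    /// Pad the last partial byte with zero bits and return the output.
    fn finish(mut self) -> Vec<u8> {
        if self.nbits > 0 {
            let pad = 8 - self.nbits;
            self.out.push((self.acc << pad) as u8);
        }
        self.out
    }
}
```

Because `Code` is `Copy`, passing codes around costs no more than passing two integers, which is the kind of saving the issue is after.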
  • Automatic publishing to crates.io

    Allow automatic publication of the crate to crates.io on merge. One proposal would be to use https://github.com/kettleby/semantic-release-rust and the semantic commit format. One important requirement would be introducing a commit message lint for pull requests.

    good first issue difficulty easy component ci 
    opened by torfmaster 0