A low-level MVCC file format for storing blobs.

Overview

Sediment

This repository isn't ready for public consumption. It has just reached a stage where I wanted to start sharing ideas with others, as well as to start using the issue tracker.

An experimental storage format designed for ACID-compliant, bulk blob storage. This crate is designed as a replacement for the append-only format that Nebari began with.

Each blob stored is assigned a unique GrainId. Once a blob is no longer needed, its GrainId can be archived. Once the database's log has been checkpointed far enough, the blob will be fully freed, and its storage will be able to be reused.

Goals of this format

The ultimate goal is for this crate to sit below Nebari. These are the specific features this crate is aiming to provide to achieve that goal:

  • ACID-compliant, single-file storage.
  • Allows storing arbitrary chunks of data, returning an opaque ID that can be used to look that data up again in the future.
  • Supports reusing reclaimed storage space.
  • Supports simultaneous writers with automatic batching of commits.
  • MVCC transactions with the ability to roll back modifications instead of committing them.
  • Embedding a header: The database's header will contain an optional GrainId that can be used to point to an embedded header.
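
As a rough illustration of the surface this crate aims to provide, here is a hypothetical sketch of the GrainId lifecycle described above. None of the names below are confirmed API; they are placeholders for illustration only.

```rust
use std::io;

/// Hypothetical, simplified view of the interface this crate aims to provide.
/// Names and signatures here are illustrative, not Sediment's actual API.
pub trait BlobStore {
    type GrainId: Copy;
    type BatchId: Copy + Ord;

    /// Stores a blob and returns the opaque id used to read it back later.
    fn push(&mut self, data: &[u8]) -> io::Result<Self::GrainId>;

    /// Reads a previously stored blob.
    fn read(&self, id: Self::GrainId) -> io::Result<Vec<u8>>;

    /// Marks a blob as no longer needed. Its storage is only reused once the
    /// log has been checkpointed past the batch that archived it.
    fn archive(&mut self, id: Self::GrainId) -> io::Result<Self::BatchId>;

    /// Releases storage for everything archived at or before `batch`.
    fn checkpoint_to(&mut self, batch: Self::BatchId) -> io::Result<()>;
}
```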

One-fsync ACID writes

This file format stores several pieces of information in pairs. Each batch written is assigned a new, monotonically increasing BatchId, so within any pair, the copy with the higher BatchId is the more recently written one.

The file header is stored twice at the start of the file. When opening an existing database, the header with the largest BatchId is validated. If it cannot be validated, the in-memory state is initialized with the older version. When writing a new version of the header, it will overwrite the version that was not used to load from disk.
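
A minimal sketch of that selection logic, assuming hypothetical FileHeader and BatchId types and a placeholder validation routine:

```rust
/// Hypothetical types -- the real header carries much more state.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct BatchId(u64);

struct FileHeader {
    batch_id: BatchId,
    // ... basin list, commit log head, checkpoint id, etc.
}

impl FileHeader {
    /// Stand-in for the CRC/consistency checks described below.
    fn validate(&self) -> bool {
        true
    }
}

/// Picks which of the two on-disk headers to load: prefer the newer copy, but
/// fall back to the older one if the newer copy fails validation. Also
/// returns the slot index (0 or 1) the next commit should overwrite, i.e. the
/// slot that was not used to load from disk.
fn select_header(first: FileHeader, second: FileHeader) -> (FileHeader, usize) {
    let (newer, older, newer_slot) = if first.batch_id >= second.batch_id {
        (first, second, 0)
    } else {
        (second, first, 1)
    };
    if newer.validate() {
        (newer, 1 - newer_slot)
    } else {
        (older, newer_slot)
    }
}
```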

The file header contains a list of "basins", which are also stored in pairs. The file header will note which of the pair is active, allowing for basin headers to only be rewritten when changed.

The file header also points to the head page of a commit log. The commit log page contains an entry for each BatchId written, and a pointer to a previous location in the file if there are more batches before the current page. Each pointed-to log entry contains a list of grain operations performed.

To validate the file header, each of the grains in the most recent commit must be validated in addition to all of the headers touched during the commit. If any parts cannot be validated, the entire commit must be rolled back and the bad data scrubbed.

This design allows the commit operation to queue up all of its writes, perform a single fsync, and still recover safely from a power failure or crash that occurs before or during that fsync.

Theoretical Limits

  • Up to 254 Basins per file
  • Up to 254 Strata per Basin
  • Up to 64,516 Strata per file
  • Up to 281 trillion Grains per Stratum
  • Up to 18 quintillion Grains per file
  • Up to 4GB (2^32 bytes) per Grain
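
A back-of-the-envelope check of how these limits compose. The 2^48 grains-per-stratum figure is an assumption inferred from the stated numbers, not a documented constant:

```rust
fn main() {
    // Assumed building blocks, inferred from the figures above.
    let basins_per_file: u64 = 254;
    let strata_per_basin: u64 = 254;
    let grains_per_stratum: u128 = 1 << 48; // ~281 trillion

    let strata_per_file = basins_per_file * strata_per_basin;
    assert_eq!(strata_per_file, 64_516);

    // ~1.8 * 10^19, i.e. roughly 18 quintillion grains.
    let grains_per_file = strata_per_file as u128 * grains_per_stratum;
    println!("grains per file: {grains_per_file}");
}
```
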
Comments
  • Implement embedded header reading/writing


    The current embedded header should be able to be read from the database.

    When committing a write session, a new grain id for the embedded header should be able to be specified. An error should be raised if an attempt to commit a write session occurs while another write session is queued that also modifies the header.

    opened by ecton 0
  • Implement Isolation


    Currently it's possible to read GrainIds that haven't been committed. This really isn't a big deal given the way this file format is designed and its limited API -- the only way to get a GrainId to read is by writing to it. However, allowing the read to succeed isn't true isolation.

    The read operation should check the batch id and ensure it's been committed before returning.

    A new function should be added to WriteSession that allows reading grains written during the session.
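
    A hedged sketch of the visibility check the read path could perform; all of the names here are placeholders:

```rust
use std::io;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct BatchId(u64);

struct GrainInfo {
    written_at: BatchId,
    // ... allocation status, length, CRC, etc.
}

/// Sketch: refuse to serve grains whose writing batch has not been committed,
/// unless the caller is the write session that produced them.
fn check_visible(
    grain: &GrainInfo,
    last_committed: BatchId,
    reading_session_batch: Option<BatchId>,
) -> io::Result<()> {
    let visible = grain.written_at <= last_committed
        || reading_session_batch == Some(grain.written_at);
    if visible {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::NotFound,
            "grain has not been committed",
        ))
    }
}
```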

    opened by ecton 0
  • Refactor IO abstraction to support parallel writes


    The Committer builds a list of structure write operations that all need to happen before the fsync command is issued. Currently, these are written serially, but they should be able to be written using different strategies depending on the IO backend.

    Two ways I have envisioned this so far:

    • A function that takes a list of IoBuffers and the offsets to write them at. It returns when all writes are completed.
    • Moving file IO to a thread pool and allowing write operations to be queued with a handle to wait for completion.

    The latter sounds like it could be better, but it is also going to be more complex, since the file handles must be juggled across multiple threads.

    Either way, one last goal of this is that Database should implement Clone, and currently the File abstraction is preventing this.
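
    A minimal sketch of the first option above, expressed as a trait an IO backend could implement. The IoBuffer type and the method names are hypothetical:

```rust
use std::io;

/// Placeholder for whatever owned buffer type the crate uses internally.
pub struct IoBuffer(pub Vec<u8>);

/// One pending structure write: where it goes and what to write there.
pub struct WriteOp {
    pub offset: u64,
    pub buffer: IoBuffer,
}

/// Hypothetical backend hook: given every write that must land before the
/// commit's fsync, the backend may issue them serially, in parallel, or via a
/// thread pool, as long as all of them have completed when this returns.
pub trait WriteBackend {
    fn write_all_at(&mut self, ops: Vec<WriteOp>) -> io::Result<()>;

    /// Flushes everything durably once the batched writes are complete.
    fn sync(&mut self) -> io::Result<()>;
}
```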

    opened by ecton 0
  • Add ability to query the commit log


    Internally, we need this to implement recovery/validation (#1). Externally, access to the commit log would allow users to build a replication log.

    We need APIs for:

    • Querying the current range of BatchIds in the commit log.
    • Querying for individual commit log entries.
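
    Roughly, those two APIs could look like the sketch below; every name here is hypothetical:

```rust
use std::io;
use std::ops::RangeInclusive;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BatchId(pub u64);

#[derive(Clone, Copy)]
pub struct GrainId(pub u64);

/// One recorded grain operation; the real entries likely carry more detail.
pub enum GrainOperation {
    Allocated(GrainId),
    Archived(GrainId),
    Freed(GrainId),
}

pub struct CommitLogEntry {
    pub batch: BatchId,
    pub operations: Vec<GrainOperation>,
}

pub trait CommitLog {
    /// The oldest and newest BatchIds still present in the commit log.
    fn batch_range(&self) -> io::Result<RangeInclusive<BatchId>>;

    /// Looks up a single entry, or None if it has been checkpointed away.
    fn entry(&self, batch: BatchId) -> io::Result<Option<CommitLogEntry>>;
}
```
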
    opened by ecton 0
  • Implement Checkpointing


    Checkpointing is the process that allows Sediment to release archived data and reuse the storage space. To allow for a flexible MVCC paradigm, Sediment does not immediately free data, but rather marks the grain as archived during a commit.

    Checkpointing the database is easy: provide the BatchId of the commit from which you are ready to allow data to be released. (Blocked by #2.) Using this BatchId (the checkpoint id):

    • Find all commit log entries for BatchIds less than or equal to the checkpoint id
      • Note each archived grain in each commit log entry.
    • Initiate a commit that:
      • Updates all grain maps and grain map pages to mark the archived grains as free
      • Updates the header with the new checkpoint id
      • Marks any disk space used by commit log pages older than the checkpoint as free
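
    A self-contained sketch of planning that work over an in-memory view of the commit log; the types are simplified placeholders:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct BatchId(u64);

#[derive(Clone, Copy, PartialEq, Eq)]
struct GrainId(u64);

/// Simplified commit log entry: just the grains this batch archived.
struct LogEntry {
    archived: Vec<GrainId>,
}

/// The changes the checkpoint commit must apply, per the steps above.
struct CheckpointCommit {
    grains_to_free: Vec<GrainId>,
    new_checkpoint: BatchId,
    log_pages_to_free: Vec<BatchId>,
}

/// Collects the checkpoint work from commit log entries keyed by BatchId.
/// The real implementation would read log pages from disk and fold this
/// result into an ordinary commit.
fn plan_checkpoint(
    log: &BTreeMap<BatchId, LogEntry>,
    checkpoint_id: BatchId,
) -> CheckpointCommit {
    let mut grains_to_free = Vec::new();
    let mut log_pages_to_free = Vec::new();
    for (&batch, entry) in log.range(..=checkpoint_id) {
        grains_to_free.extend(entry.archived.iter().copied());
        log_pages_to_free.push(batch);
    }
    CheckpointCommit {
        grains_to_free,
        new_checkpoint: checkpoint_id,
        log_pages_to_free,
    }
}
```
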
    opened by ecton 0
  • Implement archiving data


    We need the ability to archive one or more GrainIds. This will mark the data as ready to archive, but the grain will not be reused until the database "checkpoint" (#3) has moved to or beyond the BatchId of the archive operation.

    opened by ecton 0
  • Implement Recovery


    All headers except the GrainMap have a way to be toggled from the root. But for grain maps, we want to be able to change grain allocations without rewriting basins or strata.

    This means that when we're recovering a file and rolling back the last change, we must mark the bad version of the GrainInfo as incorrect, probably by clearing its sequence. This ensures that when loading information in the future, we can always assume the highest sequence is the current sequence.

    To finish recovery, we should probably do a two-stage fsync on recovery:

    • Delete broken GrainInfos, forcing the use of the current GrainInfo.
    • fsync
    • Roll FileHeader back, allowing the next load to not enter recovery mode.
    • fsync
    • Proceed with opening the database.

    Rolling back a GrainInfo is safe because a grain's contents are immutable. So, the state change is either going from:

    • Free to Assigned: The revert just changes the status back to Free.
    • Assigned to Archived: The revert just changes the status back to Assigned. Because archiving does not alter the stored data, the existing data will still be available.
    • Archived to Free: Again, the revert just changes the status. Because the data will not be able to be used until the next commit after the grain is marked as free, the existing data will still be intact.
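
    A small sketch of that status-only revert, using a simplified stand-in for the GrainInfo state:

```rust
/// Simplified grain allocation states.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum GrainState {
    Free,
    Assigned,
    Archived,
}

/// Given the state the rolled-back commit moved a grain into, returns the
/// state it is restored to. Only the three legal transitions
/// (Free -> Assigned -> Archived -> Free) can appear, and because grain
/// contents are immutable, every revert is a pure status change; no stored
/// bytes need to be touched.
fn revert(state_after_bad_commit: GrainState) -> GrainState {
    match state_after_bad_commit {
        // Free -> Assigned rolled back: the grain is Free again.
        GrainState::Assigned => GrainState::Free,
        // Assigned -> Archived rolled back: archiving never altered the data,
        // so the grain is simply Assigned (and readable) again.
        GrainState::Archived => GrainState::Assigned,
        // Archived -> Free rolled back: the data cannot be reused until a
        // commit after it was freed, so it is still intact; mark it Archived.
        GrainState::Free => GrainState::Archived,
    }
}
```
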
    opened by ecton 0
  • Implement oversized blob support


    One of the downsides of the current approach to grain allocation is that a value can only be split into 255 consecutive grains. The result is that a 4GB write must be done in a grain map whose grain size is 2^24 bytes (16MB) or larger.

    Another downside (or benefit, when spun in a different light) is that each grain map page contains 4,084 grains. That means that to store a 4GB value with a 16MB grain size, roughly 65GB will be allocated for the grain map page's contents.

    This sounds completely egregious, but it is simply a 1:16 ratio. At smaller sizes, it doesn't sound that crazy; at larger sizes, it's definitely overkill. Rather than let users shoot themselves in the foot with these massive allocations, @daxpedda and I discussed this the other day and thought combining these two things would be a good approach:

    • Implement a configuration option for maximum grain size. This would prevent grain maps being allocated larger than a specified size.
    • Implement a large blob storage that allocates ranges of pages for values larger than the maximum grain size. The stratum needs to be configured such that it can be recognized as a special type of stratum. This can encode page ranges and number of pages in 10 bytes, allowing for 409 oversized blobs per page.

    The downside of this approach to oversized blobs is that each page of these blob headers will need to be scanned on load to understand what regions of the file are currently allocated.
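
    A sketch of one way a 10-byte record could be laid out. The split into an 8-byte starting page and a 2-byte page count is an assumption; the format only needs the range to fit in 10 bytes (4,096 / 10 = 409 records per page):

```rust
/// Hypothetical on-disk record for one oversized blob.
struct OversizedBlob {
    first_page: u64,
    page_count: u16,
}

impl OversizedBlob {
    /// 8 bytes of starting page + 2 bytes of page count = 10 bytes.
    fn encode(&self) -> [u8; 10] {
        let mut bytes = [0u8; 10];
        bytes[..8].copy_from_slice(&self.first_page.to_le_bytes());
        bytes[8..].copy_from_slice(&self.page_count.to_le_bytes());
        bytes
    }

    fn decode(bytes: [u8; 10]) -> Self {
        Self {
            first_page: u64::from_le_bytes(bytes[..8].try_into().unwrap()),
            page_count: u16::from_le_bytes(bytes[8..].try_into().unwrap()),
        }
    }
}
```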

    opened by ecton 0
  • Implement a defragment function


    If a lot of data is written and then removed, currently Sediment will never shrink the file. We should implement a function that attempts to relocate blocks of data from the end of the file to earlier locations in the file, allowing us to truncate the file once everything is copied.

    We may want to tackle #8 before this, as it would enable even more space reclamation.

    This process may require multiple commits to proceed, and reservations cannot be made against grain maps that are in the process of being relocated.

    opened by ecton 0
  • Implement larger grain map handling


    Currently Sediment is limited to one grain map per stratum, and each grain map is currently hard-coded to 1 page of grains. These two restrictions limit Sediment to ~11 million grains per file.

    The format is designed to support two approaches for growing to larger sizes:

    1. Expand a stratum to have more pages per grain map
    2. Add another grain map to the stratum

    My current vision is to grow the page count first until a sensible limit, and then switch to adding more grain maps.

    Challenges in solving this:

    • Expanding a stratum to have more pages will always require copying the data to a new location in the file, because additional space is needed for the extra grain data and page data. We need to ensure that no reservations are made for this grain map while this operation is happening.
    • Unit testing these situations is going to be fun. The best idea I've come up with is to allow an artificial limit on the number of strata and basins a file can contain, specifically to be able to test these algorithms on smaller databases.
    opened by ecton 0
  • Allow skipping grain maps


    Pretend someone fills a database such that many grain maps are allocated in a single stratum. Now, imagine all except one grain is freed, and the grain is in the last page of the last map.

    If we support an operation to rewrite the database to reclaim disk space, we would currently be limited by not being able to skip all of the freed grains.

    The easiest implementation is to allow storing 0 for a grain map offset. When encountered, all grains will be assumed to be free.

    This does not solve how to reduce the number of pages in a stratum -- but that may not be solvable. Realistically, though, who is going to grow a database to such a high order of magnitude, delete all of that data, and then never reload a large amount of data again?

    Just implementing the ability to deallocate grain maps when they're completely free seems like a big win that will likely be enough for all practical use cases.

    opened by ecton 0
  • Add a metadata header


    The file header is pretty tightly packed already, and I want to include at least a version number. A magic code would be excellent as well.

    It seems wasteful to add an entire page at the start of the file for this, however, and I don't want other structures to lose page alignment. So I had the idea to call it a metadata header, to try to inspire some ideas of what might be stored in it.

    My thoughts are to add a single page at the start of the file that contains two copies of this metadata header. By counting on powersafe overwrite, we can just rewrite the page whenever it changes. The format could be simple:

    • 4 bytes magic code
    • 4 bytes version code
    • 4 bytes CRC of following bytes
    • Repeated for remaining 2036 bytes:
      • 1 byte: metadata tag
      • 1 byte: metadata length
      • [metadata]

    This would limit metadata entries to 255 bytes, but the type of metadata I'm thinking of wanting to store would work just fine in there. For example, a unique database ID probably wouldn't be more than 16 bytes of random data.
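
    A sketch of parsing one 2,048-byte copy of that layout. The magic code value and the zero-tag terminator are placeholders, and CRC verification is omitted; a real reader would check the CRC before trusting the entries:

```rust
/// One parsed entry from the tag/length/value list.
struct MetadataEntry {
    tag: u8,
    value: Vec<u8>,
}

/// Parses one 2,048-byte copy of the metadata header: a 4-byte magic code, a
/// 4-byte version, a 4-byte CRC, then repeated [1-byte tag][1-byte length]
/// [value] records in the remaining 2,036 bytes.
fn parse_metadata_copy(copy: &[u8; 2048]) -> Option<(u32, Vec<MetadataEntry>)> {
    const MAGIC: &[u8; 4] = b"sedi"; // hypothetical magic code
    if &copy[..4] != MAGIC {
        return None;
    }
    let version = u32::from_le_bytes(copy[4..8].try_into().unwrap());
    let _crc = u32::from_le_bytes(copy[8..12].try_into().unwrap());

    let mut entries = Vec::new();
    let mut offset = 12;
    while offset + 2 <= copy.len() {
        let tag = copy[offset];
        let len = copy[offset + 1] as usize;
        if tag == 0 || offset + 2 + len > copy.len() {
            break;
        }
        entries.push(MetadataEntry {
            tag,
            value: copy[offset + 2..offset + 2 + len].to_vec(),
        });
        offset += 2 + len;
    }
    Some((version, entries))
}
```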

    opened by ecton 0
  • Refactor grain reservation


    The current grain reservation "algorithm" is basically brute force.

    There should be a better way to keep track of free blocks on a per-grain-size basis. This solution should scale to a large number of grain maps such that only a handful of maps need to be kept in memory per grain size.

    Having such a structure also makes it more likely that allocations will come from the same grain maps, minimizing the number of writes per commit.

    Another problem to potentially solve: Reserving a grain is one location where currently there is no parallelism: each writer reserving a grain will block others from reserving a grain. Once the reservation is complete, each writer can write simultaneously. It would be amazing if the refactor could remove this synchronization point.
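
    One possible shape for that bookkeeping: a small per-grain-size index of maps that still have free grains, sketched below with hypothetical names:

```rust
use std::collections::{BTreeMap, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct GrainMapId(u64);

/// Hypothetical per-grain-size free index. For each grain size (in bytes),
/// keep a short queue of grain maps known to contain free grains, so a
/// reservation rarely has to scan maps and allocations tend to cluster in the
/// same maps, minimizing the headers rewritten per commit.
#[derive(Default)]
struct FreeIndex {
    by_grain_size: BTreeMap<u32, VecDeque<GrainMapId>>,
}

impl FreeIndex {
    /// Picks a map to reserve from, using the smallest grain size that fits.
    fn map_for(&self, length: u32) -> Option<GrainMapId> {
        let (_, candidates) = self.by_grain_size.range(length..).next()?;
        candidates.front().copied()
    }

    /// Records that `map` (holding grains of `grain_size`) has free grains.
    fn mark_has_free(&mut self, grain_size: u32, map: GrainMapId) {
        let queue = self.by_grain_size.entry(grain_size).or_default();
        if !queue.contains(&map) {
            queue.push_back(map);
        }
    }

    /// Drops a map from the index once it has no free grains left.
    fn mark_full(&mut self, grain_size: u32, map: GrainMapId) {
        if let Some(queue) = self.by_grain_size.get_mut(&grain_size) {
            queue.retain(|m| *m != map);
        }
    }
}
```
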

    opened by ecton 1