Garbage-collecting on-disk object store, supporting higher-level KV stores and databases.

Overview

marble

Garbage-collecting disk-based object store. See examples/kv.rs for a minimal key-value store built on top of this.

Supports 4 methods (a usage sketch follows this list):

  • read: designed for low-latency, concurrent reads of objects
  • write_batch: designed for bulky, high-throughput writes of large batches of objects, ideal for compacting write-ahead logs into a set of updates to specific objects. Does not block calls to read except for brief moments where metadata is being updated.
  • maintenance: compacts backing storage files that have become fragmented. Blocks concurrent calls to write_batch but does not block readers any more than write_batch does. Returns the number of successfully rewritten objects.
  • file_statistics: returns statistics about live and total objects in the backing storage files.
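
A rough sketch of how these four calls compose, paraphrased from the list above and examples/kv.rs. The exact signatures (including the precise ObjectId type) are assumptions and may differ between crate versions:

```rust
// Sketch only: object ids are shown as plain u64s for brevity; the
// docs describe keys as non-zero u64s.
fn demo() -> std::io::Result<()> {
    let marble = marble::open("heap")?;

    // write_batch: bulk-apply a batch; Some(..) writes, None deletes.
    marble.write_batch([
        (1_u64, Some(b"raw object bytes".as_slice())),
        (2_u64, None),
    ])?;

    // read: low-latency point read of a single object.
    let value = marble.read(1)?;
    assert_eq!(value.as_deref(), Some(b"raw object bytes".as_slice()));
    assert_eq!(marble.read(2)?, None);

    // maintenance: defragment backing files; returns the number of
    // objects that were successfully rewritten.
    let rewritten = marble.maintenance()?;
    println!("maintenance rewrote {rewritten} objects");

    // file_statistics: live vs. total objects in the backing files.
    let _stats = marble.file_statistics();

    Ok(())
}
```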

Defragmentation is always generational, and will group rewritten objects together. Written objects can be further sharded based on a configured partition_function, which lets you shard objects by ObjectId and by the size of the object's raw bytes.

Marble solves a pretty basic problem in database storage: storing arbitrary bytes on disk, getting them back, and defragmenting files.

You can think of it as a KV store where keys are non-zero u64s and values are arbitrary blobs of raw bytes.

Writes are meant to be performed in bulk by some background process. Each call to Marble::write_batch creates at least one new file that stores the objects being written. Multiple calls to fsync occur for each call to write_batch. It is blocking. Object metadata is added to the backing wait-free pagetable incrementally, not atomically, so if you rely on batch atomicity, you should serve the batch's objects directly from a cache of your own until write_batch returns. However, upon crash, batches are recovered atomically.
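
For instance, a flusher that needs runtime batch atomicity can publish the batch into its own cache before calling write_batch, and evict only after it returns. The WriteCache type and the flush_batch/read_through helpers below are hypothetical, not part of marble's API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical write cache; a None value records a pending deletion.
type WriteCache = Arc<Mutex<HashMap<u64, Option<Vec<u8>>>>>;

fn flush_batch(
    marble: &marble::Marble,
    cache: &WriteCache,
    batch: Vec<(u64, Option<Vec<u8>>)>,
) -> std::io::Result<()> {
    let ids: Vec<u64> = batch.iter().map(|(oid, _)| *oid).collect();

    // Publish first, so concurrent readers see the whole batch even
    // while marble installs its metadata incrementally.
    {
        let mut guard = cache.lock().unwrap();
        for (oid, value) in &batch {
            guard.insert(*oid, value.clone());
        }
    }

    marble.write_batch(batch)?;

    // write_batch returned: Marble::read now serves the new versions
    // (and the batch recovers atomically after a crash), so the pinned
    // copies can be released. A real implementation would version the
    // entries to avoid racing with later batches for the same ids.
    let mut guard = cache.lock().unwrap();
    for oid in ids {
        guard.remove(&oid);
    }
    Ok(())
}

fn read_through(
    marble: &marble::Marble,
    cache: &WriteCache,
    oid: u64,
) -> std::io::Result<Option<Vec<u8>>> {
    // A cached entry (including a pending deletion) wins over disk.
    if let Some(v) = cache.lock().unwrap().get(&oid) {
        return Ok(v.clone());
    }
    Ok(marble.read(oid)?.map(|buf| buf.to_vec()))
}
```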

Reads can continue mostly unblocked while batch writes and maintenance are being handled.

You are responsible for:

  • calling Marble::maintenance at appropriate intervals to defragment storage files (a background-loop sketch follows this list).
  • choosing appropriate configuration tunables for your desired space and write amplification.
  • ensuring the Config.partition_function is set to a function that appropriately shards your objects based on their ObjectId and/or size. Ideally, objects that are expected to die at similar times are colocated in a shard, so that the work spent copying live objects during defragmentation is minimized.
  • allocating and managing free ObjectId's.
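
A minimal background maintenance loop might look like the hypothetical sketch below, assuming Marble can be shared across threads (its concurrent-read design implies this); the fixed interval is an illustrative tunable, not something marble prescribes:

```rust
use std::{sync::Arc, thread, time::Duration};

// Hypothetical defragmentation loop: call Marble::maintenance on a
// fixed interval and log how much work it did.
fn spawn_maintenance(marble: Arc<marble::Marble>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match marble.maintenance() {
            Ok(rewritten) if rewritten > 0 => {
                eprintln!("maintenance rewrote {rewritten} objects")
            }
            Ok(_) => {}
            Err(e) => eprintln!("maintenance failed: {e}"),
        }
        thread::sleep(Duration::from_secs(60));
    })
}
```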

If you want to create an industrial database on top of Marble, you will probably also want to add:

  • logging and a write cache for accumulating updates that occasionally get flushed to Marble via write_batch. Remember, each call to write_batch creates at least one new file and fsyncs multiple times, so you should batch calls appropriately. Once the log or write cache has reached an appropriate size, you can have a background thread write a corresponding batch of objects to its storage, and once write_batch returns, the corresponding log segments and write cache can be deleted, as the objects will be available via Marble::read.
  • an appropriate read cache. Marble::read always reads directly from disk.
  • for maximum SSD friendliness, your own log should be configurable to be written to a separate storage device, to avoid commingling writes that have vastly different expected death times.
  • dictionary-based compression for efficiently compressing objects that may be smaller than 64k.
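
For the compression point, one common shape is a shared dictionary trained on representative samples, applied before bytes are handed to write_batch. This sketch uses the zstd crate's dictionary API as an assumption; marble itself does not prescribe a codec, and the trained dictionary must be persisted somewhere so objects can be decompressed later:

```rust
// Hypothetical helper using the `zstd` crate; marble stores whatever
// bytes it is given and knows nothing about compression.
fn compress_batch(
    samples: &[Vec<u8>],
    objects: Vec<(u64, Vec<u8>)>,
) -> std::io::Result<Vec<(u64, Option<Vec<u8>>)>> {
    // Train a small shared dictionary; works best when `samples` are
    // representative of the mostly-smaller-than-64k objects stored.
    let dict = zstd::dict::from_samples(samples, 16 * 1024)?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;

    // Remember to persist `dict` itself, or nothing can be decompressed.
    objects
        .into_iter()
        .map(|(oid, raw)| Ok((oid, Some(compressor.compress(&raw)?))))
        .collect()
}
```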

Ideas for getting great garbage collection performance (a combined sketch follows this list):

  • give certain kinds of objects a certain ObjectId range. For example, tree index nodes can live at or above 1<<63 and tree leaf nodes below that point. The Config.partition_function can then return shard 0 for leaf nodes and shard 1 for index nodes, and the two kinds will always be written to separate files.
  • WiscKey-style sharding of large items from other items, based on the size of the object. Assign a shard ID based on which power of 2 the object size is.
  • Basically any sharding strategy that tends to group items together that exhibit some amount of locality in terms of expected mutations or overall lifespan.
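
Combining the first two ideas, a partition function could look like the sketch below; the exact parameter and return types expected by Config.partition_function should be checked against the crate docs:

```rust
// Sketch: index nodes get a dedicated shard, and leaf objects are
// spread across the remaining shards WiscKey-style by size class
// (this spreads leaves over many shards, unlike the simple 0/1 split
// described above).
fn partition_function(object_id: u64, object_size: usize) -> u8 {
    if object_id >= 1u64 << 63 {
        // Tree index nodes: never commingled with leaf data.
        0
    } else {
        // Leaves: group by power-of-two size class, so objects with
        // similar sizes (and often similar lifespans) share files.
        1 + object_size.next_power_of_two().trailing_zeros() as u8
    }
}
```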

In short, you get to focus on a bunch of the fun parts of building your own database, without so much effort spent on boring file garbage collection.

You might also like...
TUI input library supporting multiple backends

tui-input WARNING: Most of the functionality is only human tested. A TUI input library supporting multiple backends. This crate can be used with tui-r

Supporting code for the paper "Optimized Homomorphic Evaluation of Boolean Functions" submitted to Eurocrypt 2024

This repository contains the code related to the paper Optimized Homomorphic Evaluation of Boolean Functions. The folder search_algorithm contains the

Terminal disk space navigator 🔭

Given a path on your hard-drive (which could also be the root path, e.g. /), diskonaut scans it and indexes its metadata to memory so that you can explore its contents (even while still scanning!).

Executables on Disk? Bleh 🤮

Executables on Disk? Preposterous! Saving executables to disk is like telling EDRs that "Hey! Take a look at this thing I just fetched from the Intern

A user-friendly, lightweight TUI for disk imaging

Caligula Burning Tool Caligula is a user-friendly, lightweight TUI for imaging disks. $ caligula burn -h Burn an image to a disk Usage: caligula burn

💾Saving Your Hard Disk, One Project at a Time

nodemore 💾 Nodemore Recursively Searches Directories for Unused Projects Contents Why? Installation Usage Configuration Why? NodeJS has a horrible wa

A simple disk benchmark tool.

simple-disk-benchmark A simple disk benchmark tool. Operating Systems Currently, macOS and Linux are tested. Windows may work but is not tested. Devel

My solutions for the 2021 edition of the Advent of Code, using Rust and SOM (Simple Object Machine)

Advent of Code 2021 These are my solutions for the 2021 edition of the Advent of Code. The solutions are all implemented using both Rust and SOM (Simp

Pure Rust implementation of Javascript Object Signing and Encryption (JOSE)

RustCrypto: JOSE Pure Rust implementation of Javascript Object Signing and Encryption (JOSE) License All crates licensed under either of Apache Licens

Comments
  • How to use `stable_logical_sequence_number`?

    The docs say it can be used in logs to avoid double-recovery, which makes sense. But what value should actually be written into a log? I assume it can't be the stable LSN, since by definition of writing it into a log, it's not stable in marble yet. So is it just stable LSN + 1?

    I can't tell if it's safe to use +1 since there might be concurrent commits and checkpoints, meaning the checkpoint into marble might bump the LSN before the commit that's using it can finish, and then suddenly it's not correct.

    opened by bonsairobo 1
  • What does non-runtime-atomicity of Marble::write_batch imply?

    This function is crash-atomic but NOT runtime atomic. If you are concurrently serving reads, and require atomic batch semantics, you should serve reads out of an in-memory cache until this function returns.

    • Does it still guarantee that the value returned from Marble::read for any key during Marble::write_batch will be either the pre-write_batch or the post-write_batch one, not an "out-of-thin-air" value or a "torn" value with the first half of the buffer old and the second half new?
    • If yes, can a value intermittently turn None before switching to the post-write_batch one?
    • If no, can it also temporarily disturb unrelated keys (not mentioned in write_batch)? What is the worst that can happen from an attempt to read an entry that is being modified?
    opened by vi 0
  • Project status

    Hi, would you recommend using this library somewhere, or is it some kind of experimental/research project? Can it be considered a sled replacement (for some use cases), and what are your thoughts about sled's future?

    opened by madmaxio 4
  • `Marble::read` buffer alignment

    I think it would be useful to guarantee some minimum alignment of the buffer returned from Marble::read, which would allow the buffer to be used with zero-copy deserialization like rkyv. I know that sled aligns IVecs to 16 bytes (on the main branch). I hacked this into my fork of marble here.

    opened by bonsairobo 2
Owner
Komora
A tool for collecting rollup blocks from the Aztec Connect rollup, and exporting them to csv

Aztec Connect Data Gobbler The Aztec Connect Data gobbler is a tool made for extracting data from the Aztec Connect system using only L1 as its source

Lasse Herskind 6 Feb 17, 2023
Wrapper around atspi-code to provide higher-level at-spi Rust bindings

atspi Wrapper around atspi-codegen to provide higher-level at-spi Rust bindings. Contributions Take a look at our atspi-codegen crate, and try inpleme

Odilia 3 Feb 7, 2022
MidenIR for compiling to Miden Assembly from higher-level languages

Miden IR This repository provides a compiler for the Miden VM, specifically for Miden Assembly. It does this by lowering from a higher-level intermedi

Polygon Miden 10 Mar 30, 2023
A crate providing a MemoryCell struct, which stores a current and previous value.

memcell What is a MemoryCell? A MemoryCell is a struct containing both a current and optional previous value. Definition #[derive(Debug, Clone)] pub s

Imajin 9 Nov 21, 2022
Mirroring remote repositories to s3 storage, with atomic updates and periodic garbage collection.

rsync-sjtug WIP: This project is still under development, and is not ready for production use. rsync-sjtug is an open-source project designed to provi

SJTUG 57 Feb 22, 2023
A mini paste bin and url shortener written in rust without databases.

pb Build $ cargo build --release Environment Variables PB_DATA: /some/path (Default: ./pb_data) PB_SITE: Url of your site. (Default: http://localhost:

Edward P 5 Jul 26, 2022
Source code for our paper "Higher-order finite elements for embedded simulation"

Higher-order Finite Elements for Embedded Simulation This repository contains the source code used to produce the results for our paper: Longva, A., L

Interactive Computer Graphics 18 Sep 30, 2022
A beautiful, tiny traceback and logging library supporting #![no_std] rust.

breadcrumbs Breadcrumbs is a beautiful, tiny traceback and logging library for Rust that offers seamless integration with #![no_std], multi-threading

Michael 18 Nov 21, 2023
Zenith - sort of like top or htop but with zoom-able charts, CPU, GPU, network, and disk usage

Benjamin Vaisvil 1.6k Jan 4, 2023
A library for loading and executing PE (Portable Executable) from memory without ever touching the disk

memexec A library for loading and executing PE (Portable Executable) from memory without ever touching the disk This is my own version for specific pr

FssAy 5 Aug 27, 2022