An LSM storage engine designed to significantly reduce I/O amplification written in safe rust (Under active development)

Related tags

Database velarixdb
Overview

codecov Tests Crates.io Documentation Clippy Contributor Covenant License: MIT

VelarixDB is an LSM-based storage engine designed to significantly reduce IO amplification, resulting in better performance and durability for storage devices.

Introduction

VelarixDB: Designed to reduce I/O amplification

VelarixDB is an ongoing project (not production ready) designed to optimize data movement during load times and compaction. Inspired by the WiscKey paper, WiscKey: Separating Keys from Values in SSD-conscious Storage, velarixdb aims to significantly enhance performance over traditional key-value stores.

Problem

During compaction in LevelDB or RocksDB, in the worst case, up to 10 SSTable files needs to be read, sorted and re-written since keys are not allowed to overlapp across all the sstables from Level 1 downwards. Suppose after merging SSTables in one level, the next level exceeds its threshold, compaction can cascade from Level 0 all the way to Level 6 meaning the overall write amplification can be up to 50 (ignoring the first compaction level).[ Reference -> Official LevelDB Compaction Process Docs ]. This repetitive data movement can cause significant wear on SSDs, reducing their lifespan due to the high number of write cycles. The goal is to minimize the amount of data moved during compaction, thereby reducing the amount of data re-written and extending the device's lifetime.

Solution

To address this, we focus on whether a key has been deleted or updated. Including values in the compaction process (which are often larger than keys) unnecessarily amplifies the amount of data read and written. Therefore, we store keys and values separately. Specifically, we map value offsets to the keys, represented as 32-bit integers.

This approach reduces the amount of data read, written, and moved during compaction, leading to improved performance and less wear on storage devices, particularly SSDs. By minimizing the data movement, we not only enhance the efficiency of the database but also significantly extend the lifespan of the underlying storage hardware.

Performance Benefits

According to the benchmarks presented in the WiscKey paper, implementations can outperform LevelDB and RocksDB by:

  • 2.5x to 111x for database loading
  • 1.6x to 14x for random lookups

Addressing major concerns

  • Range Query: Since keys are separate from values, won't that affect range queries performance. Well, we now have internal parallelism in SSDs, as we fetch the keys from the LSM tree we can fetch the values in parallel from the vlog file. This benchmark from the Wisckey Paper shows how for request size ≥ 64KB, the aggregate throughput of random reads with 32 threads matches the sequential read throughput.
  • More Disk IO for Reads: Since keys are now seperate from values, we have to make extra disk IO to fetch values? Yes, but since the key density now increases for each level (since we are only storing keys and value offsets in the sstable), we will most likely search fewer levels compared to LevelDB or RocksDB for thesame query. A significant portion of the LSM tree can also be cached in memory.

Designed for asynchronous runtime (unstable)

Based on the introduction and efficiency of async IO at the OS kernel level e.g io_uring for the Linux kernel, VelarixDB is designed for asynchronous runtime. In this case Tokio runtime. Tokio allows for efficient and scalable asynchronous operations, making the most of modern multi-core processors. Frankly, most OS File System does not provide async API currently but Tokio uses a thread pool to offload blocking file system operations. This means that even though the file system operations themselves are blocking at the OS level, Tokio can handle them without blocking the main async task executor. Tokio might adopt io_uring in the future. (We haven't benchmarked the async version therefore this is unstable and might be removed in a future version)

Disclaimer

Please note that velarixdb is still under development and is not yet production-ready.

NOTE

v2 is the most recent version (not experimental) and under active development, the src modules are for the experimental version

Basic Features

  • Atomic Put(), Get(), Delete(), and Update() operations
  • 100% safe & stable Rust
  • Separation of keys from values, reducing the amount of data moved during compaction (i.e., reduced IO amplification)
  • Garbage Collector
  • Lock-free memtable with Crossbeam SkipMap (no Mutex)
  • Tokio Runtime for efficient thread management
  • Bloom Filters for fast in-memory key searches
  • Crash recovery using the Value Log
  • Index to improve searches on Sorted String Tables (SSTs)
  • Key Range to store the largest and smallest keys in an SST
  • Sized Tier Compaction Strategy (STCS)

TODO

  • Snapshot Isolation
  • Block Cache
  • Batched Writes
  • Range Query
  • Snappy Compression
  • Value Buffer to keep values in memory and only flush in batches to reduce IO (under investigation)
  • Checksum to detect data corruption
  • Leveled Compaction (LCS), Time-Window Compaction (TCS), and Unified Compaction (UCS)
  • Monitoring module to continuously monitor and generate reports

It is not:

  • A standalone server
  • A relational database
  • A wide-column database: it has no notion of columns

Constraint

  • Keys are limited to 65,536 bytes, and values are limited to 2^32 bytes. Larger keys and values have a bigger performance impact.
  • Like any typical key-value store, keys are stored in lexicographic order. If you are storing integer keys (e.g., timeseries data), use the big-endian form to adhere to locality.

Basic usage

cargo add velarixdb
use velarixdb::db::DataStore;
# use tempfile::tempdir;

#[tokio::main]
async fn main() {
    let root = tempdir().unwrap();
    let path = root.path().join("velarix");
    let mut store = DataStore::open("big_tech", path).await.unwrap(); // handle IO error

    store.put("apple", "tim cook").await;
    store.put("google", "sundar pichai").await;
    store.put("nvidia", "jensen huang").await;
    store.put("microsoft", "satya nadella").await;
    store.put("meta", "mark zuckerberg").await;
    store.put("openai", "sam altman").await;


    let entry1 = store.get("apple").await.unwrap(); // Handle error
    let entry2 = store.get("google").await.unwrap();
    let entry3 = store.get("nvidia").await.unwrap();
    let entry4 = store.get("microsoft").await.unwrap();
    let entry5 = store.get("meta").await.unwrap();
    let entry6 = store.get("openai").await.unwrap();
    let entry7 = store.get("***not_found_key**").await.unwrap();

    assert_eq!(std::str::from_utf8(&entry1.unwrap().val).unwrap(), "tim cook");
    assert_eq!(std::str::from_utf8(&entry2.unwrap().val).unwrap(), "sundar pichai");
    assert_eq!(std::str::from_utf8(&entry3.unwrap().val).unwrap(), "jensen huang");
    assert_eq!(std::str::from_utf8(&entry4.unwrap().val).unwrap(), "satya nadella");
    assert_eq!(std::str::from_utf8(&entry5.unwrap().val).unwrap(), "mark zuckerberg");
    assert_eq!(std::str::from_utf8(&entry6.unwrap().val).unwrap(), "sam altman");
    assert!(entry7.is_none());

    // Remove an entry
    store.delete("apple").await.unwrap();

    // Update an entry
    let success = store.update("microsoft", "elon musk").await;
    assert!(success.is_ok());
}

Store JSON

use serde::{Deserialize, Serialize};
use serde_json;
use velarixdb::db::DataStore;
use tempfile::tempdir;

#[tokio::main]
async fn main() {
    let root = tempdir().unwrap();
    let path = root.path().join("velarix");
    let mut store = DataStore::open("big_tech", path).await.unwrap(); // handle IO error

    #[derive(Serialize, Deserialize)]
    struct BigTech {
        name: String,
        rank: i32,
    }
    let new_entry = BigTech {
        name: String::from("Google"),
        rank: 50,
    };
    let json_string = serde_json::to_string(&new_entry).unwrap();

    let res = store.put("google", json_string).await;
    assert!(res.is_ok());

    let entry = store.get("google").await.unwrap().unwrap();
    let entry_string = std::str::from_utf8(&entry.val).unwrap();
    let big_tech: BigTech = serde_json::from_str(&entry_string).unwrap();

    assert_eq!(big_tech.name, new_entry.name);
    assert_eq!(big_tech.rank, new_entry.rank);
}

Examples

See here for practical examples

You might also like...
OBKV Table Client is Rust Library that can be used to access table data from OceanBase storage layer.

OBKV Table Client is Rust Library that can be used to access table data from OceanBase storage layer. Its access method is different from JDBC, it skips the SQL parsing layer, so it has significant performance advantage.

A mini kv database demo that using simplified bitcask storage model with rust implementation

A mini kv database demo that using simplified bitcask storage model with rust implementation.

A Key-Value data storage system. - dorea db

Dorea DB 🛰 Dorea is a key-value data storage system. It is based on the Bitcask storage model Documentation | Crates.io | API Doucment 简体中文 | English

Forage is for Storage
Forage is for Storage

Forage Forage is for Storage Remote storage: Open storage channels to a remote storage provider over Tor Lightweight: Platform-optimized using Blake3-

Zenith substitutes PostgreSQL storage layer and redistributes data across a cluster of nodes

Zenith substitutes PostgreSQL storage layer and redistributes data across a cluster of nodes

Routing - specialised storage DHT

PLEASE NOTE THAT THIS REPOSITORY HAS NOW BEEN ARCHIVED All routing development is now via the safe_network repository sn_routing sn_routing - a specia

Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres.

SDB - SignatureDB Distributed, version controlled, SQL database with cryptographically verifiable storage, queries and results. Think git for postgres

SQLite compiled to WASM with pluggable data storage

wasm-sqlite SQLite compiled to WASM with pluggable data storage. Useful to save SQLite in e.g. Cloudflare Durable Objects (example: https://github.com

Storage metadata service.

Storage API This projects hosts the storage metadata API, a backend component which hosts metadata for storage snapshots, the actual data is housed in

Owner
gifted_dl
Adewumi Sunkammi D.
gifted_dl
A LSM-based Key-Value Store in Rust

CobbleDB A LSM-based Key-Value Store in Rust Motivation There is no open-source LSM-based key-value store in Rust natively. Some crates are either a w

Yizheng Jiao 2 Oct 25, 2021
Plugin for macro-, mini-quad (quads) to save data in simple local storage using Web Storage API in WASM and local file on a native platforms.

quad-storage This is the crate to save data in persistent local storage in miniquad/macroquad environment. In WASM the data persists even if tab or br

ilya sheprut 9 Jan 4, 2023
Cassandra (CQL) driver for Rust, using the DataStax C/C++ driver under the covers.

cassandra-cpp This is a maintained Rust project that exposes the DataStax cpp driver at https://github.com/datastax/cpp-driver/ in a somewhat-sane cra

null 93 Jan 7, 2023
ForestDB - A Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie

ForestDB is a key-value storage engine developed by Couchbase Caching and Storage Team, and its main index structure is built from Hierarchic

null 1.2k Dec 26, 2022
A high-performance storage engine for modern hardware and platforms.

PhotonDB A high-performance storage engine for modern hardware and platforms. PhotonDB is designed from scratch to leverage the power of modern multi-

PhotonDB 466 Jun 22, 2023
Appendable and iterable key/list storage, backed by S3, written in rust

klstore Appendable and iterable key/list storage, backed by S3. General Overview Per key, a single writer appends to underlying storage, enabling many

Eric Thill 3 Sep 29, 2022
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Authors: Sanjay Ghem

Google 31.5k Jan 1, 2023
A very WIP RISCV64 OS written in Rust to learn about low-level and OS development

river A very WIP Rust-based RISCV64 OS for learning. The name is based off of the name RISCV with some added letters: "riscv" + er Make sure you have

James [Undefined] 5 Dec 18, 2022
X-Engine: A SQL Engine built from scratch in Rust.

XNGIN (pronounced "X Engine") This is a personal project to build a SQL engine from scratch. The project name is inspired by Nginx, which is a very po

Jiang Zhe 111 Dec 15, 2022
A user crud written in Rust, designed to connect to a MySQL database with full integration test coverage.

SQLX User CRUD Purpose This application demonstrates the how to implement a common design for CRUDs in, potentially, a system of microservices. The de

null 78 Nov 27, 2022