klstore

Appendable and iterable key/list storage, backed by S3.

General Overview

Per key, a single writer appends to underlying storage, enabling many concurrent readers to iterate through pages of elements. Writers are able to perform batching and nonce checking to enable high-throughput, exactly-once persistence of data. Readers are stateless and able to deterministically page backwards and forwards through a list, which is uniquely identified by a key per keyspace. Readers can begin iterating from any point in a list, searchable by either offset, timestamp, or nonce.

Terminology

Term        Description
keyspace    A collection of keys
key         A unique key in a keyspace
list        An iterable collection of elements, identified by a key
offset      The position of a single element in a list
nonce       An ever-growing number per key, used for de-duplication of records

Rust API

Reader

The StoreReader trait expresses the API around reading from an S3-backed key/list store:

pub trait StoreReader {
    /// read metadata for a keyspace, returning an error if the keyspace does not exist
    fn read_keyspace_metadata(&self, keyspace: &str) -> Result<KeyspaceMetadata, StoreError>;

    /// read metadata for the given key, returning None if the key does not exist
    fn read_key_metadata(
        &self,
        keyspace: &str,
        key: &str,
    ) -> Result<Option<KeyMetadata>, StoreError>;

    /// read the next page from a list, returning an empty list if the key does not exist.
    /// when no filter params are given and continuation is empty, iteration starts from the first available offset.
    /// continuation can be used to read from the next page.
    /// if specified, max_size determines the max elements to be returned, otherwise a configured default is used.
    /// filter options can be specified to start reading from an item other than the first available offset.
    fn read_page(
        &self,
        keyspace: &str,
        key: &str,
        order: IterationOrder,
        max_size: Option<u64>,
        filter: Option<ItemFilter>,
        continuation: Option<String>,
    ) -> Result<ItemList, StoreError>;
}

An optional ItemFilter can be defined to begin iteration from an arbitrary element in a key's list:

pub struct ItemFilter {
    pub start_offset: Option<u64>,
    pub start_timestamp: Option<i64>,
    pub start_nonce: Option<u128>,
}
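
For example, to begin iterating at the first element whose timestamp is at or after a given value (the epoch-milliseconds unit here is an assumption, not stated by the crate):

// start reading from the first element at or after this timestamp
let filter = ItemFilter {
    start_offset: None,
    start_timestamp: Some(1_700_000_000_000), // illustrative epoch-millis value
    start_nonce: None,
};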

A returned item list represents a single page of results. Iteration can continue using the returned continuation token:

pub struct ItemList {
    pub keyspace: String,
    pub key: String,
    pub items: Vec<Item>,
    pub continuation: Option<String>,
    pub stats: ListStats,
}
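
A minimal forward-paging sketch built on the trait and structs above. The IterationOrder::Ascending variant name, the Item handling, and the assumption that an exhausted list returns continuation = None are illustrative, not confirmed by the crate:

// page through an entire list, 500 elements at a time
fn dump_list(reader: &impl StoreReader, keyspace: &str, key: &str) -> Result<(), StoreError> {
    let mut continuation: Option<String> = None;
    loop {
        let page = reader.read_page(
            keyspace,
            key,
            IterationOrder::Ascending, // assumed variant name
            Some(500),                 // max elements per page
            None,                      // no ItemFilter: start from the first available offset
            continuation.take(),
        )?;
        for item in &page.items {
            // process each element (Item fields are not shown in this document)
            let _ = item;
        }
        match page.continuation {
            Some(token) => continuation = Some(token), // more pages remain
            None => break,                             // assumed end-of-list signal
        }
    }
    Ok(())
}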

Writer

The StoreWriter trait expresses the API around writing to an S3-backed key/list store:

pub trait StoreWriter {
    /// create a new keyspace, returning an error on failure or if the keyspace already existed
    fn create_keyspace(&self, keyspace: &str) -> Result<(), StoreError>;

    /// append items to a list, creating a new key if necessary.
    /// in some implementations, this may be dispatched and executed asynchronously.
    fn append_list(
        &self,
        keyspace: &str,
        key: &str,
        items: Vec<Insertion>,
    ) -> Result<(), StoreError>;

    /// flush pending writes for a specific key
    fn flush_key(&self, keyspace: &str, key: &str) -> Result<(), StoreError>;

    /// flush all pending asynchronous operations
    fn flush_all(&self) -> Result<(), StoreError>;

    /// should be called periodically for implementations that require it.
    /// this will trigger scheduled operations, like flushing a pending batch.
    fn duty_cycle(&self) -> Result<(), StoreError>;
}

The append_list function allows writing a vector of items, each of which will be individually nonce-checked. The user has the option of specifying a timestamp; when left undefined, the StoreWriter implementation will use the system clock.

pub struct Insertion {
    pub value: Vec<u8>,
    pub nonce: Option<u128>,
    pub timestamp: Option<i64>,
}
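
A minimal append sketch against the trait above; the keyspace and key names are illustrative, and the explicit flush_key call simply forces any pending writes for that key:

// append two nonce-checked records to a single key
fn append_two(writer: &impl StoreWriter) -> Result<(), StoreError> {
    let items = vec![
        Insertion {
            value: b"first record".to_vec(),
            nonce: Some(1),  // ever-growing per key; used for de-duplication
            timestamp: None, // None: the writer implementation uses the system clock
        },
        Insertion {
            value: b"second record".to_vec(),
            nonce: Some(2),
            timestamp: None,
        },
    ];
    writer.append_list("my_keyspace", "my_key", items)?;
    writer.flush_key("my_keyspace", "my_key")?; // flush pending writes for this key
    Ok(())
}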

S3

The S3StoreWriter and S3StoreReader offer writer and reader functionality backed by an S3 store. Both can be configured from the S3StoreConfig object, but some configuration parameters are only used by one of the two implementations.

Object Keys

The following structure is used for object keys to enable bi-directional iteration and O(log(n)) binary searchability by offset, timestamp and nonce.

{prefix}{keyspace}/{key}/data_o{firstOffset}-o{lastOffset}_t{minTimestamp}-t{maxTimestamp}_n{firstNonce}-n{nextNonce}_s{sizeInBytes}_p{priorBatchStartOffset}.bin
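
For example, a flushed batch covering offsets 100 through 149 might be persisted under an object key of the following shape (all values illustrative):

{prefix}my_keyspace/my_key/data_o100-o149_t1700000000000-t1700000005000_n100-n150_s16384_p50.bin

Because the offset, timestamp, and nonce bounds are encoded in the key itself, a reader can presumably locate the batch containing a target element by binary searching the listed object keys rather than downloading objects.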

Shared Config

The following parameters are used to specify S3-connection details:

/// object prefix, defaults to an empty string, which would put the keyspace at the root of the bucket
object_prefix: String

/// required, the bucket name
bucket_name: Option<String>

/// optional, used to override the S3 endpoint
endpoint: Option<String>

/// optional, defaults to us-east-1
region: String

/// enable path-style endpoints, defaults to false
path_style: bool

/// use default credentials instead of given access keys, defaults to true
use_default_credentials: bool

/// optional, used when use_default_credentials=false
access_key: Option<String>

/// optional, used when use_default_credentials=false
secret_key: Option<String>

/// optional, used when use_default_credentials=false
security_token: Option<String>

/// optional, used when use_default_credentials=false
session_token: Option<String>

/// optional, used when use_default_credentials=false
profile: Option<String>

Reader-Specific Config

The following parameters are used to specify reader default behavior when not defined in a request:

/// set the default maximum number of results returned when max_size is not defined in the request
default_max_results: u64

Writer-Specific Config

The following parameters are used to specify writer cache and compaction behavior:

/// set the maximum number of cached keys kept in memory in the writer, defaults to 100k
max_cached_keys: usize

/// set the item count threshold to trigger object compaction of a complete batch
compact_items_threshold: u64

/// set the size threshold to trigger object compaction of a complete batch, defaults to 1MiB
compact_size_threshold: u64

/// set the object count threshold to trigger compaction of a partial batch, defaults to 100
compact_objects_threshold: u64

Batching

The BatchingStoreWriter implements the StoreWriter trait and wraps an underlying StoreWriter implementation to enable batching of insertions to optimize throughput. Multiple threads can be utilized to further increase write throughput; individual keys will always be batched by the same writer thread. The BatchingStoreWriterConfig allows the user to configure the following batching parameters:

/// set the number of writer threads. defaults to 1.
/// each key will always be written by the same thread, so consider the number of keys when determining the number of threads to use.
writer_thread_count: usize

/// set the capacity of the queue for each writer thread. defaults to unbounded.
/// this is the maximum number of writes that can be queued per thread.
/// consider this parameter, a typical write size, and the number of writer threads to determine memory requirements.
/// when set to None, there will be no capacity limitations.
writer_thread_queue_capacity: Option<usize>

/// set the interval at which to check to flush batches. defaults to 100 milliseconds.
batch_check_interval_millis: u64

/// set the interval at which to flush batches, regardless of size. defaults to 1 second.
/// this interval plus processing time is the maximum additional delay introduced by batching.
batch_flush_interval_millis: u64

/// flush a batch upon reaching a specific item count. defaults to u64::MAX items.
batch_flush_item_count_threshold: u64

/// flush a batch upon reaching a specific batch size. defaults to 1MB.
/// each key builds a separate batch, so this parameter and flush_interval+throughput will help determine memory requirements.
batch_flush_size_threshold: u64
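
As a rough worked example (all numbers assumed, not defaults): with writer_thread_count=4, writer_thread_queue_capacity=4096, and a typical write of 8 KiB, the per-thread queues alone can hold up to 4 x 4096 x 8 KiB = 128 MiB; since each key builds a separate batch, in-progress batches can add up to batch_flush_size_threshold bytes per active key on top of that.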

Kafka Bridge

A KafkaConsumerBridge couples an S3StoreWriter and BatchingStoreWriter with a KafkaConsumer. It allows for efficient and configurable batching of data sourced from Kafka via a consumer group.

Configuration

Here is an example KafkaConsumerBridge configuration:

[s3]
endpoint="http://localhost:4566"
bucket_name="my-bucket"
region="us-east-1"
path_style=true
use_default_credentials=false
access_key="test"
secret_key="test"
security_token="test"
session_token="test"
profile="test"
compact_size_threshold=1000000
compact_items_threshold=100
compact_objects_threshold=10

[batcher]
writer_thread_count=1
writer_thread_queue_capacity=4096
batch_check_interval_millis=100
batch_flush_interval_millis=1000
batch_flush_item_count_threshold=100000
batch_flush_size_threshold=1000000

[kafka]
topic="inbound"
offset_commit_interval_seconds=60
group.id="test_group"
bootstrap.servers="127.0.0.1:9092"
auto.offset.reset="earliest"

[parser]
nonce_parser="RecordOffset"
timestamp_parser="None"
keyspace_parser="Static(my_keyspace)"
key_parser="RecordPartition"

Consumer Group Offsets

The offset_commit_interval_seconds property indicates how often the batcher will be flushed and offsets will be committed for the consumer group. Note that enable.auto.commit will always be set to false and enable.auto.offset.store will always be set to true so that the Kafka Bridge can deterministically commit offsets after writes.

UTF-8 Parsers

key_parser and keyspace_parser can be configured as follows:

Value                           Description
Static(some_string)             Use the given string
RecordHeader(my_header_name)    Parse the given header as UTF-8 from each record
RecordKey                       Parse each record key as UTF-8
RecordPartition                 Use the record partition

Number Parsers

nonce_parser and timestamp_parser can be configured as follows:

Value                                       Description
None                                        Always set to None
RecordHeaderBigEndian(my_header_name)       Parse the given header as a big-endian number from each record
RecordHeaderLittleEndian(my_header_name)    Parse the given header as a little-endian number from each record
RecordHeaderUtf8(my_header_name)            Parse the given header as UTF-8 converted to a number from each record
RecordKeyBigEndian                          Parse each record key as a big-endian number
RecordKeyLittleEndian                       Parse each record key as a little-endian number
RecordKeyUtf8                               Parse each record key as UTF-8 converted to a number
RecordOffset                                Use the record offset
RecordPartition                             Use the record partition
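
For instance, a hypothetical deployment that keys each list by a device-id record header and orders elements by an event-time header (keyspace name and header names illustrative) might configure the parsers as:

[parser]
keyspace_parser="Static(telemetry)"
key_parser="RecordHeader(device-id)"
nonce_parser="RecordOffset"
timestamp_parser="RecordHeaderBigEndian(event-time)"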