Overview

AgateDB

AgateDB is an embeddable, persistent and fast key-value (KV) database written in pure Rust. It is designed as an experimental engine for the TiKV project, and will bring aggressive optimizations for TiKV specifically.

Project Status

AgateDB is still in early, heavy development; you can check the development progress on the GitHub Project.

The overall plan is to first port badger to Rust, and then port the optimizations that have been made in unistore.

AgateDB is under active development on the develop branch. Currently, it can be used as a key-value store with MVCC. It implements most of the functionality of badger's managed mode.
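
For a feel of the API, here is a minimal usage sketch: open the store, run one write transaction, then read the value back. It is modeled on the badger managed-mode API that agatedb ports; the exact names (AgateOptions, new_transaction_at, commit_at) are assumptions rather than a documented interface, so check the repository examples for the real one.

    // Sketch only: names are modeled on badger's managed mode, which
    // agatedb ports; they are assumptions, not agatedb's documented API.
    use agatedb::AgateOptions;
    use bytes::Bytes;

    fn main() -> agatedb::Result<()> {
        let mut opts = AgateOptions::default(); // assumed options struct
        let db = opts.open("/tmp/agatedb-demo")?;

        // Writes happen inside an MVCC transaction; in managed mode the
        // caller supplies the read and commit timestamps.
        let mut txn = db.new_transaction_at(1, true);
        txn.set(Bytes::from("hello"), Bytes::from("world"))?;
        txn.commit_at(2)?;

        // A later read_ts sees the committed value as a consistent snapshot.
        let txn = db.new_transaction_at(3, false);
        let item = txn.get(&Bytes::from("hello"))?;
        println!("{:?}", item.value());
        Ok(())
    }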

Why not X?

X is a great project! The motivation of this project is to ultimately land the optimizations we have made to unistore in TiKV. Unistore is based on badger, so we started with badger.

We are using Rust because it brings memory safety out of the box, which is important during rapid development. TiKV is also written in Rust, so the two will be easier to integrate, e.g. supporting async/await, sharing global thread pools, etc.

Comments
  • max_version support

    Signed-off-by: wangnengjie [email protected]

    Support max_version and initialize next_txn_ts when opening the database.

    In this PR, max_version is the maximum version across the memtable, the immutable memtables, and all SSTs, i.e. the largest version that has been written to the database.

    max_version is somewhat like the LSN in rocksdb, but not equivalent. In agatedb, we get commit_ts from the oracle (or the user sets it) when committing a transaction, and that commit_ts is set as the version of the entries in the transaction.

    When reopening (or recovering) an existing db, we need max_version to initialize next_txn_ts, because read_ts is next_txn_ts - 1 (in non-managed mode), just like initializing the LSN in rocksdb. Otherwise we get a read_ts of 0 after reopening, and no existing data can be seen. Moreover, since next_txn_ts would increase from 1 again, old data might be overwritten at the same version, and data could magically 'appear' and 'disappear' as next_txn_ts increases. Without max_version, agatedb is only a one-off database.
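
    A minimal sketch of the recovery rule (the helper names are illustrative, not the PR's actual code):

    fn max_version(memtable_max: u64, immutable_max: &[u64], sst_max: &[u64]) -> u64 {
        // Largest version that has been written to the database, taken over
        // the memtable, the immutable memtables, and all SSTs.
        std::iter::once(memtable_max)
            .chain(immutable_max.iter().copied())
            .chain(sst_max.iter().copied())
            .max()
            .unwrap_or(0)
    }

    fn init_next_txn_ts(max_version: u64) -> u64 {
        // next_txn_ts must start above every persisted version so that
        // read_ts = next_txn_ts - 1 can still see all existing data.
        max_version + 1
    }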

    opened by wangnengjie 9
  • txn: add watermark

    Based on

    • https://github.com/tikv/agatedb/pull/57
    • https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/y/watermark.go

    Signed-off-by: GanZiheng [email protected]
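
    For context, the watermark in the badger design linked above counts begin/done marks per timestamp, so readers can wait until every transaction at or below a timestamp has finished. A minimal single-threaded sketch of that bookkeeping (illustrative only, not this PR's implementation):

    use std::collections::BTreeMap;

    // Counts outstanding operations per timestamp and tracks the highest
    // timestamp whose entire prefix has completed.
    #[derive(Default)]
    struct WaterMark {
        pending: BTreeMap<u64, i64>, // ts -> outstanding begins
        done_until: u64,
    }

    impl WaterMark {
        fn begin(&mut self, ts: u64) {
            *self.pending.entry(ts).or_insert(0) += 1;
        }

        fn done(&mut self, ts: u64) {
            *self.pending.entry(ts).or_insert(0) -= 1;
            // Advance done_until across every fully completed prefix.
            loop {
                let front = match self.pending.first_key_value() {
                    Some((&ts, &count)) => (ts, count),
                    None => break,
                };
                if front.1 != 0 {
                    break;
                }
                self.pending.remove(&front.0);
                self.done_until = front.0;
            }
        }
    }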

    opened by GanZiheng 9
  • fix test and bench panic

    Signed-off-by: Alex Chi [email protected]

    Benchmark failure

    This is fixed by simply changing the capacity of the skiplist while running this unit test. The long-term solution is to change the iteration count of Criterion.rs in each bench case.

    test_one_key failure

    From the GitHub Actions log I found that the 100th (of 200) messages timed out. I guess this is because one pool (randomly) never spawns a thread. So I forced the thread pool to shut down in this test, to make sure all threads are terminated before we check for the response.

    However, even when I shut down the pool, there is still a possibility of not receiving any message.

    thread 'test_one_key' panicked at 'timeout on receiving r0 w0 msg', skiplist/tests/tests.rs:157:23
    

    Then I guessed this was caused by the two pools sharing the same prefix name, leading to threads with the same name. I changed the prefix names of the two pools to distinct ones, read and write. That didn't work.

    Finally, I merged the two pools into one, and that works.

    This bug won't affect other parts of agatedb, as yatp is currently only used in tests.

    opened by skyzh 7
  • [dev] panic on drop causes process to abort

    As seen in the failed CI in https://github.com/tikv/agatedb/pull/88, agatedb will abort when we panic on drop. I implemented a partial solution where we ignore all failures (e.g. a poisoned lock, or a failure to remove a file) and simply print them. We could later find a better way to solve this issue.
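
    The shape of the workaround, as a sketch (Agate::close here is a hypothetical stand-in for the real cleanup; the actual patch is in the linked PR):

    use std::io;

    struct Agate;

    impl Agate {
        // Hypothetical cleanup that may fail, e.g. removing lock files.
        fn close(&mut self) -> io::Result<()> {
            Ok(())
        }
    }

    impl Drop for Agate {
        fn drop(&mut self) {
            // A panic during another panic's unwind aborts the process,
            // so log cleanup failures here instead of unwrapping them.
            if let Err(err) = self.close() {
                eprintln!("error while closing agatedb: {:?}", err);
            }
        }
    }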

    bug 
    opened by skyzh 5
  • An error I can't figure out: No such file or directory (os error 2)

    The code is on the cli branch of my fork: https://github.com/rmw-lib/agatedb/tree/cli/examples

    My idea is to write a command-line tool for agatedb, which could also serve as demo code for learning agatedb.

    The code builds on https://github.com/tikv/agatedb/pull/178 (simple agatedb get put).

    But I'm currently stuck on the error No such file or directory (os error 2). Since the error carries no source-line information, I can't figure out why it happens.

    A screenshot of the error is attached in the original issue.

    Also, every run leaves an extra .mem file under /tmp/adb, which doesn't seem right either.

    opened by gcxfd 4
  • chore: update badger link

    Dgraph has dissolved and the original badger repo is not active anymore. badger is now maintained under the newly founded org; see https://github.com/outcaste-io/badger

    opened by Connor1996 4
  • txn: finish oracle

    Based on

    • https://github.com/tikv/agatedb/pull/67
    • https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/txn.go

    Signed-off-by: GanZiheng [email protected]

    opened by GanZiheng 4
  • Add manifest

    Based on

    • https://github.com/tikv/agatedb/pull/60
    • https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/manifest.go

    Signed-off-by: GanZiheng [email protected]

    opened by GanZiheng 4
  • compact_prios exceeds max_levels

    This panic is not guaranteed to be triggered. I will look into the details later.

    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Any', /home/skyzh/.cargo/git/checkouts/yatp-e704b73c3ee279b6/6bbea16/src/pool.rs:46:26
    stack backtrace:
       0: rust_begin_unwind
                 at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/std/src/panicking.rs:495:5
       1: core::panicking::panic_fmt
                 at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/core/src/panicking.rs:92:14
       2: core::option::expect_none_failed
                 at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/core/src/option.rs:1268:5
       3: core::result::Result<T,E>::unwrap
                 at /home/skyzh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:973:23
       4: yatp::pool::ThreadPool<T>::shutdown
                 at /home/skyzh/.cargo/git/checkouts/yatp-e704b73c3ee279b6/6bbea16/src/pool.rs:46:17
       5: <agatedb::db::Agate as core::ops::drop::Drop>::drop
                 at ./src/db.rs:98:9
       6: core::ptr::drop_in_place
                 at /home/skyzh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:175:1
       7: agatedb::db::tests::with_agate_test::{{closure}}
                 at ./src/db.rs:617:9
    note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
    
    bug 
    opened by skyzh 4
  • add sanitizer

    Signed-off-by: Alex Chi [email protected]

    Fixes https://github.com/tikv/agatedb/issues/10

    The sanitizer only runs when there is a change in the skiplist module. The first successful run was triggered along with the split sanitizer commit.

    In the future, we could apply the sanitizer to other modules.

    opened by skyzh 4
  • Merge branch develop into master

    Development Task

    Merge develop into master part by part.

    Development Progress

    The pull requests merged into the develop branch are as follows; I will try to apply them to master.

    • vlog

      • [x] https://github.com/tikv/agatedb/pull/84 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/143
      • [x] https://github.com/tikv/agatedb/pull/88 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/127
    • memtable

      • [x] https://github.com/tikv/agatedb/pull/45 @GanZiheng https://github.com/tikv/agatedb/pull/126
      • [x] https://github.com/tikv/agatedb/pull/46 @GanZiheng https://github.com/tikv/agatedb/pull/126
    • levels

      • [x] https://github.com/tikv/agatedb/pull/47 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/138
      • [x] https://github.com/tikv/agatedb/pull/50 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/148
      • [x] https://github.com/tikv/agatedb/pull/52 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/54 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/75 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/73 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/77 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/78 @GanZiheng https://github.com/tikv/agatedb/pull/129
      • [x] https://github.com/tikv/agatedb/pull/101 @GanZiheng https://github.com/tikv/agatedb/pull/129
    • txn

      • [x] https://github.com/tikv/agatedb/pull/57 @GanZiheng https://github.com/tikv/agatedb/pull/132
      • [x] https://github.com/tikv/agatedb/pull/67 @GanZiheng
        • https://github.com/tikv/agatedb/pull/136
        • https://github.com/tikv/agatedb/pull/143
      • [x] https://github.com/tikv/agatedb/pull/72 @GanZiheng https://github.com/tikv/agatedb/pull/143
    • agate

      • [x] https://github.com/tikv/agatedb/pull/58 @GanZiheng https://github.com/tikv/agatedb/pull/127
      • [x] https://github.com/tikv/agatedb/pull/62 @GanZiheng https://github.com/tikv/agatedb/pull/127
      • [x] https://github.com/tikv/agatedb/pull/60 @GanZiheng https://github.com/tikv/agatedb/pull/123
      • [x] https://github.com/tikv/agatedb/pull/87 @GanZiheng https://github.com/tikv/agatedb/pull/170
      • [x] https://github.com/tikv/agatedb/pull/91 @GanZiheng https://github.com/tikv/agatedb/pull/170
    • bench

      • [x] https://github.com/tikv/agatedb/pull/86 @GanZiheng https://github.com/tikv/agatedb/pull/169
      • [x] https://github.com/tikv/agatedb/pull/90 @GanZiheng https://github.com/tikv/agatedb/pull/169
      • [x] https://github.com/tikv/agatedb/pull/93 @GanZiheng https://github.com/tikv/agatedb/pull/167
      • [x] https://github.com/tikv/agatedb/pull/94 @GanZiheng https://github.com/tikv/agatedb/pull/167
      • [x] https://github.com/tikv/agatedb/pull/96 @GanZiheng https://github.com/tikv/agatedb/pull/167
      • [x] https://github.com/tikv/agatedb/pull/99 @GanZiheng https://github.com/tikv/agatedb/pull/167
      • [x] https://github.com/tikv/agatedb/pull/102 @GanZiheng https://github.com/tikv/agatedb/pull/167
    opened by GanZiheng 3
  • Proposal: Ingest External SST

    This issue describes a possible way to ingest external SST files into a running agatedb instance, just as rocksdb does.

    Background

    TiKV uses the external SST ingestion feature for some use cases. To integrate agatedb into TiKV, we need this feature.

    According to Creating and Ingesting SST files · facebook/rocksdb Wiki, ingesting an SST takes the following steps:

    1. copy or link the file(s) into the DB directory
    2. block (not skip) writes to the DB and assign a correct sequence number to the ingested file
    3. flush the memtable if ranges overlap (the SST file's key range overlaps with the memtable)
    4. assign the file to a level in the LSM-tree

    Things are different in agatedb.

    Analysis

    Step 1 and step 4 are the same.

    For step 3, it is unnecessary for agatedb to flush the memtable on overlap. Agatedb reads all levels on get and picks the latest version (less than or equal to read_ts), so it's okay to have new keys below old keys in the LSM-tree. This makes the whole process much more efficient. (Please point it out if I'm wrong.)

    Step 2 is the most difficult: how do we preserve ACID during and after ingestion?

    Agatedb has ACID transactions (SSI). The architecture is totally different from rocksdb: we have a read_ts and a commit_ts, and we run a conflict check on commit.

    Possible Implementation

    Main idea: regard ingesting SST files as writing a large range of keys, and make the whole process a transaction.

    Ingesting files is expected to be an expensive job.

    Time Point: before commit

    1. Add files: fetch new file ids from the LevelsController, then move or copy the files into the DB dir under the allocated file ids (names).
    2. Verify the file checksums (optional) and get file information (just open the Table?).
    3. Make a new transaction and mark update as true (add an ingest flag to the transaction?).
    4. (TBD) other checks…

    Check files before starting a txn.

    Time Point: commit

    1. Same as now, but add an ingest range field to CommittedTxn in CommitInfo, containing the smallest and largest keys of each ingested file. This enables a fast but inaccurate conflict check; otherwise we would need to calculate the hash of every key in the SST, which really takes time.
    2. Set commit_ts as the Table's global_version.
    3. Send to the write channel as usual (see below).
    4. Process the ingest: find a suitable level for each file. (This takes the write lock of the LevelHandler and will block reads from other txns.)
    5. Write the manifest file. The file's global_version is stored in the manifest.
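
    A rough sketch of these commit-time steps (every type and name below is illustrative, not agatedb's real API):

    // Hypothetical sketch of the commit path for ingestion.
    struct Table {
        global_version: u64,
    }

    fn commit_ingest(
        next_commit_ts: &mut u64,
        target_level: &mut Vec<Table>, // stand-in for the chosen LevelHandler
        manifest: &mut Vec<String>,    // stand-in for the manifest log
        mut tables: Vec<Table>,
    ) {
        // Steps 1-2: allocate a commit_ts and stamp it as each file's
        // global_version instead of rewriting every key in the file.
        *next_commit_ts += 1;
        let commit_ts = *next_commit_ts;
        for t in &mut tables {
            t.global_version = commit_ts;
        }
        // Steps 4-5: assign the files to a level and persist the decision,
        // including global_version, in the manifest.
        for t in tables {
            manifest.push(format!("add table, global_version={}", t.global_version));
            target_level.push(t);
        }
    }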

    A question here: I see we keep the order of Requests sent to the write channel the same as the commit_ts order, and I wonder why. Is it to keep the WAL in order?

    If the ingest job can be out of order, I think it's possible and better to run it on the current thread instead of sending it to the write channel.

    You can see that the whole commit process involves only one small I/O: appending to the manifest. But picking a level takes time if there are many tables in that level, and this happens under the write lock of the LevelHandler.

    Conflict Check

    There are two ways to do the conflict check, as described in Time Point: commit (step 1).

    1. Calculate the hash of every key in the ingested files; the conflict check then works the same as it does now. To mitigate the performance cost, we can calculate the key hashes before the transaction starts.
    2. Add an ingest range field to CommittedTxn in CommitInfo. For any update txn, when adding read keys, also track the smallest and largest read key of the txn. When checking conflicts, besides checking hashes, check whether the read range overlaps with the ingest range (see the sketch below). This is quite inaccurate and will cause many spurious txn conflicts, but I think it makes sense.

    The second way requires much more refactoring of the transaction code.
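
    The range check in the second way reduces to an interval-intersection test on byte-string keys; a sketch:

    // Sketch of the coarse check: a txn's read range conflicts with an
    // ingest range iff the two closed intervals intersect.
    fn ranges_overlap(read: (&[u8], &[u8]), ingest: (&[u8], &[u8])) -> bool {
        read.0 <= ingest.1 && ingest.0 <= read.1
    }

    fn main() {
        assert!(ranges_overlap((b"a", b"m"), (b"k", b"z")));
        assert!(!ranges_overlap((b"a", b"c"), (b"k", b"z")));
    }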

    (A simpler way is to just break ACID… heh.)

    Global Version

    This concept is the same as in rocksdb. In an ingested file, all keys have wrong versions, and it's unreasonable to rewrite each one. When a file has a global version, every key in the file is treated as having that version.

    In the latest rocksdb, this value is stored in the manifest to avoid random writes to the ingested files.

    We need to add a field to meta.proto and update the Table iterator (and the block iterator) to support this feature.

    The name global_version may be ambiguous; maybe pick another name when implementing.
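
    The effect on reads can be stated in one line (a hypothetical helper; the real change would live in the table and block iterators plus meta.proto):

    // Sketch: the version an iterator should report for an entry. For an
    // ingested file the per-key version is meaningless, so the value
    // recorded in the manifest wins.
    fn entry_version(key_version: u64, global_version: Option<u64>) -> u64 {
        global_version.unwrap_or(key_version)
    }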

    Discuss

    1. Should we add a new transaction struct only for ingest, or refactor the current one?
    2. Which way should we check conflicts? (Or is there a better idea?)
    3. Should the job be sent to the write channel, or done on the current thread?
    4. Any good ideas for implementing this feature?
    5. What strategy should pick the level? I think checking for overlap from the bottom level up is nice.
    6. What if the external files (in one job) have overlapping ranges? Rocksdb puts all such files at level 0.
    opened by wangnengjie 8
  • Read lock on LevelHandler while get

    When getting a value from the LSM-tree, agatedb iterates over the LevelHandlers and calls LevelHandler::get() under a RwLockReadGuard, which puts the whole read I/O path for that level under a read lock. It is never a good idea to do I/O under a lock, even a read lock.

    See: https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/levels.rs#L931

    I think the read lock is only needed to fetch the associated Tables from the LevelHandler; the actual read work can be done outside the lock.

    This logic is the same as in badger: https://github.com/dgraph-io/badger/blob/7d159dd923ac9e470ffa76365f9b29ccbd038b0d/levels.go#L1602 The comment says it will call h.RLock() and h.RUnlock(), but in fact they are called in getTableForKey(), which only locks while fetching the tables: https://github.com/dgraph-io/badger/blob/master/level_handler.go#L241-L243
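
    A sketch of the suggested restructuring (illustrative types; the Table handles are assumed to be cheaply clonable, Arc-style, as in badger):

    use std::sync::{Arc, RwLock};

    struct Table;

    impl Table {
        fn get(&self, _key: &[u8]) -> Option<Vec<u8>> {
            // The real table search does block I/O; keep it off the lock.
            None
        }
    }

    struct LevelHandler {
        tables: RwLock<Vec<Arc<Table>>>,
    }

    impl LevelHandler {
        fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
            // Hold the read lock only long enough to snapshot the handles...
            let tables = self.tables.read().unwrap().clone();
            // ...then do the actual (possibly I/O-bound) search unlocked.
            tables.iter().find_map(|t| t.get(key))
        }
    }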

    BTW, a Table's file might be removed (I/O) when the struct drops, and this might happen while modifying the LevelHandler under the write lock. So this might also affect reads? I'm not sure whether that's right.

    opened by wangnengjie 4
  • [DNM] cache: add block cache for table

    DO NOT MERGE! This is just an example implementation of a block cache. The LRU design is copied from leveldb. (I'm a Rust beginner, so the code is quite ugly.) ~~Currently, as agatedb opens every table in mmap mode, the block cache is nearly useless.~~ (Sorry, I misunderstood mmap.) I'll add some tests for the block cache later. Related issues: https://github.com/tikv/agatedb/issues/153 https://github.com/tikv/agatedb/issues/164

    opened by wangnengjie 0
  • iterator: unify behavior and implement `prev` and `to_last` to all Iterators

    Signed-off-by: GanZiheng [email protected]

    Unify all kinds of iterators which implement the AgateIterator trait.

    Suppose the data the iterator traverses has two extra sentinel elements: one before the original first element, called before first, and one after the original last element, called after last. The behavior is then as follows (see the sketch after the list):

    1. If we are at the first element and call prev, we arrive at before first, and valid will return false;
    2. If we are at the last element and call next, we arrive at after last, and valid will return false;
    3. At before first, calling prev changes nothing, and calling next moves to the first element;
    4. At after last, calling next changes nothing, and calling prev moves to the last element.
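
    A sketch of these sentinel semantics over a slice (illustrative names, not the AgateIterator implementations themselves):

    struct SketchIter<'a, T> {
        data: &'a [T],
        // -1 means "before first"; data.len() means "after last".
        pos: isize,
    }

    impl<'a, T> SketchIter<'a, T> {
        fn valid(&self) -> bool {
            self.pos >= 0 && (self.pos as usize) < self.data.len()
        }

        fn next(&mut self) {
            // Saturates at "after last" (rules 2 and 4).
            if self.pos < self.data.len() as isize {
                self.pos += 1;
            }
        }

        fn prev(&mut self) {
            // Saturates at "before first" (rules 1 and 3).
            if self.pos >= 0 {
                self.pos -= 1;
            }
        }
    }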

    When entries are set but not yet committed, they stay in the pending_writes of the transaction, and if we build a PendingWritesIterator to iterate over them, the timestamp of each entry is set to the read timestamp of the transaction. So we may get duplicate keys in the MergeIterator of the Iterator generated by the Transaction. ~~However, the logic of prev in MergeIterator is complicated if duplicate keys exist. So I also updated the timestamp appended to the key: when writing an entry, we append commit timestamp * 10 to the key, and we use read timestamp * 10 + 5 to search for an entry or build the PendingWritesIterator.~~ Since all keys are distinct within the PendingWritesIterator and within the other iterators respectively, for a specific key there can be at most one pair of identical keys across all iterators. We can simplify the logic of MergeIterator with this restriction.

    opened by GanZiheng 8
  • Transaction's set(key, value) does not bump the version number: the current process reads the new data, but after closing the process and reopening, the old data is read

    All reads go through:

    pub fn get(&self, key: impl Into<BytesMut>) -> Result<Value> {
        // Looks up the key at the maximum timestamp, so the newest
        // version visible in the LSM-tree wins.
        let key = key_with_ts(key.into(), std::u64::MAX);
        self.core.get(&key)
    }
    
    opened by gcxfd 3
Owner

TiKV Project