AgateDB is an embeddable, persistent and fast key-value (KV) database written in pure Rust

TiKV Project

Last update: Jan 9, 2023

Related tags

Database agatedb

Overview

AgateDB

AgateDB is an embeddable, persistent and fast key-value (KV) database written in pure Rust. It is designed as an experimental engine for the TiKV project, and will bring aggressive optimizations for TiKV specifically.

Project Status

AgateDB is still under early heavy development, you can check the development progress at the GitHub Project.

The whole plan is to port badger in Rust first and then port the optimizations that have been made in unistore.

AgateDB is under active development on develop branch. Currently, it can be used as a key-value store with MVCC. It implements most of the functionalities of badger managed mode.

Why not X?

X is a great project! The motivation of this project is to ultimately land the optimizations we have been made to unistore in TiKV. Unistore is based on badger, so we start with badger.

We are using Rust because it can bring memory safety out of box, which is important during rapid development. TiKV is also written in Rust, so it will be easier to integrate with each other like supporting async/await, sharing global thread pools, etc.

Comments

max_version support

Signed-off-by: wangnengjie [email protected]

Support max_version and initialize next_txn_ts when open.

In this PR, the max_version is the max version in memtable, immutable memtable and all SST, means the max version that has been written to database.

The max_version is some what like lsn in rocksdb but not equivalent. In agatedb, we get (or user set) commit_ts from oracle when commiting transaction and the commit_ts is set as version of entries in this transaction.

When reopening (or recovering from) existed db, we need max_version to initializenext_txn_ts because the read_ts is next_txn_ts - 1 (non manage mode). Just like init lsn in rocksdb. Otherwise, we get read_ts that is 0 after reopen and no exist data can be seen. Moreover, as next_txn_ts will increase from 1 again, old data might be overwritten with same version and some data can magically 'appear' and 'disappear' with next_txn_ts increasing. Without max_version, agatedb is only a one-off database.

opened by wangnengjie 9
txn: add watermark
Based on

https://github.com/tikv/agatedb/pull/57

https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/y/watermark.go

Signed-off-by: GanZiheng [email protected]
opened by GanZiheng 9
fix test and bench panic
Signed-off-by: Alex Chi [email protected]

Benchmark failure

This is done by simply changing the capacity of skiplist while running this unit test. The long term solution will be changing iteration number of Criterion.rs in each bench case.

test_one_key failure

From GitHub Action log I found that the 100th(/200) message timed out. I guess this is because that one pool (randomly) never spawns thread. So I forced thread pool to shutdown in this test, to make sure all threads are terminated before we check for the response.

However, even when I shutdown the pool, there will still be possibility of not receiving any message.

thread 'test_one_key' panicked at 'timeout on receiving r0 w0 msg', skiplist/tests/tests.rs:157:23

And then I guess this is caused by pools with the same prefix name, leading to threads of the same name. I changed prefix name of two pools to different ones read and write. This doesn't work.

Finally I merged two pools into one. And this works.

This bug won't affect other part of agatedb, as currently, yatp is only used in tests.
opened by skyzh 7
[dev] panic on drop causes process to abort

As seen in failed CI in https://github.com/tikv/agatedb/pull/88 , agatedb will abort when we panic on drop. I implemented a partial solution the we ignore all failures (e.g. PoisonedLock, failed to remove file) and simply print them. We could later find a better way to solve this issue.
bug

opened by skyzh 5
一个我搞不定的错误 No such file or directory (os error 2)

代码见我的fork的cli分支 https://github.com/rmw-lib/agatedb/tree/cli/examples

我的想法是给 agatedb 写一个命令行工具，也可以当做学习agatedb的demo代码

代码是在 https://github.com/tikv/agatedb/pull/178 (simple agatedb get put) 基础上继续开发的

但是目前卡在了这个错误上 No such file or directory (os error 2)，因为这个错误没代码行提示，我也搞不清楚为啥出错

截图如下

另外，每次运行 /tmp/adb 下面都会多一个 .mem文件，感觉这也不太正常

opened by gcxfd 4
chore: update badger link

Dgraph is dissolved and the original badger repo is not active anymore. Now the badger is maintained under the newly found org, see https://github.com/outcaste-io/badger

opened by Connor1996 4
txn: finish oracle
Based on

https://github.com/tikv/agatedb/pull/67

https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/txn.go

Signed-off-by: GanZiheng [email protected]
opened by GanZiheng 4
Add manifest
Based on

https://github.com/tikv/agatedb/pull/60

https://github.com/dgraph-io/badger/blob/45bca18f24ef5cc04701a1e17448ddfce9372da0/manifest.go.

Signed-off-by: GanZiheng [email protected]
opened by GanZiheng 4

compact_prios exceeds max_levels

This panic is not guaranteed to be triggered. Will look into details later.

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Any', /home/skyzh/.cargo/git/checkouts/yatp-e704b73c3ee279b6/6bbea16/src/pool.rs:46:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/std/src/panicking.rs:495:5
   1: core::panicking::panic_fmt
             at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/core/src/panicking.rs:92:14
   2: core::option::expect_none_failed
             at /rustc/f5230fbf76bafd86ee4376a0e26e551df8d17fec/library/core/src/option.rs:1268:5
   3: core::result::Result<T,E>::unwrap
             at /home/skyzh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:973:23
   4: yatp::pool::ThreadPool<T>::shutdown
             at /home/skyzh/.cargo/git/checkouts/yatp-e704b73c3ee279b6/6bbea16/src/pool.rs:46:17
   5: <agatedb::db::Agate as core::ops::drop::Drop>::drop
             at ./src/db.rs:98:9
   6: core::ptr::drop_in_place
             at /home/skyzh/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:175:1
   7: agatedb::db::tests::with_agate_test::{{closure}}
             at ./src/db.rs:617:9
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

bug

opened by skyzh 4

add sanitizer

Signed-off-by: Alex Chi [email protected]

fix https://github.com/tikv/agatedb/issues/10

Sanitizer will only run when there's any change in skiplist module. First successful run is triggered along with split sanitizer commit.

In the future, we could apply sanitizer to other modules.

opened by skyzh 4
Merge branch develop into master
Development Task

Merge develop into master part by part.

Development Progress

The pull requests merged at develop branch are as follows, I would try to apply them into master.

vlog

[x] https://github.com/tikv/agatedb/pull/84 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/143

[x] https://github.com/tikv/agatedb/pull/88 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/127

memtable

[x] https://github.com/tikv/agatedb/pull/45 @GanZiheng https://github.com/tikv/agatedb/pull/126

[x] https://github.com/tikv/agatedb/pull/46 @GanZiheng https://github.com/tikv/agatedb/pull/126

levels

[x] https://github.com/tikv/agatedb/pull/47 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/138

[x] https://github.com/tikv/agatedb/pull/50 @zhangjinpeng1987 https://github.com/tikv/agatedb/pull/148

[x] https://github.com/tikv/agatedb/pull/52 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/54 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/75 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/73 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/77 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/78 @GanZiheng https://github.com/tikv/agatedb/pull/129

[x] https://github.com/tikv/agatedb/pull/101 @GanZiheng https://github.com/tikv/agatedb/pull/129

txn

[x] https://github.com/tikv/agatedb/pull/57 @GanZiheng https://github.com/tikv/agatedb/pull/132

[x] https://github.com/tikv/agatedb/pull/67 @GanZiheng

https://github.com/tikv/agatedb/pull/136

https://github.com/tikv/agatedb/pull/143

[x] https://github.com/tikv/agatedb/pull/72 @GanZiheng https://github.com/tikv/agatedb/pull/143

agate

[x] https://github.com/tikv/agatedb/pull/58 @GanZiheng https://github.com/tikv/agatedb/pull/127

[x] https://github.com/tikv/agatedb/pull/62 @GanZiheng https://github.com/tikv/agatedb/pull/127

[x] https://github.com/tikv/agatedb/pull/60 @GanZiheng https://github.com/tikv/agatedb/pull/123

[x] https://github.com/tikv/agatedb/pull/87 @GanZiheng https://github.com/tikv/agatedb/pull/170

[x] https://github.com/tikv/agatedb/pull/91 @GanZiheng https://github.com/tikv/agatedb/pull/170

bench

[x] https://github.com/tikv/agatedb/pull/86 @GanZiheng https://github.com/tikv/agatedb/pull/169

[x] https://github.com/tikv/agatedb/pull/90 @GanZiheng https://github.com/tikv/agatedb/pull/169

[x] https://github.com/tikv/agatedb/pull/93 @GanZiheng https://github.com/tikv/agatedb/pull/167

[x] https://github.com/tikv/agatedb/pull/94 @GanZiheng https://github.com/tikv/agatedb/pull/167

[x] https://github.com/tikv/agatedb/pull/96 @GanZiheng https://github.com/tikv/agatedb/pull/167

[x] https://github.com/tikv/agatedb/pull/99 @GanZiheng https://github.com/tikv/agatedb/pull/167

[x] https://github.com/tikv/agatedb/pull/102 @GanZiheng https://github.com/tikv/agatedb/pull/167
opened by GanZiheng 3
Proposal: Ingest External SST
This issue describe a possible way to ingest external SST file to a running agatedb instance just as rocksdb.

Background

TiKV use ingest external SST feature for some use case. To integrate agatedb into TiKV, we need this feature.

From Creating and Ingesting SST files · facebook/rocksdb Wiki (github.com), ingesting a SST needs follow steps:

copy or link the file(s) into the DB directory

block (not skip) writes to the DB and assign correct sequence number to ingested file

flush memtable if range overlap (SST file's key range overlap with memtable)

assign the file to LSM-tree

Things go different in agatedb.

Analysis

Step 1 and step 4 are the same.

For step 3, it is unnecessary to flush memtable when overlap for agatedb. Agatedb reads all levels when get and picks the latest one (less or equal to read_ts), so it's okay to have new keys below old keys in LSM-tree. This really makes the whole process efficient. (Please point out if I'm wrong)

For step 2, it is the most difficult: How to protect ACID during and after ingestion?

Agatedb has ACID transaction (SSI). The architecture is totally different from rocksdb. We have a read_ts and a commit_ts and we have conflict check when commit.

Possible Implementation

Main idea: Regard ingesting SST files as writing large range of keys and make the whole process a transaction.

Ingesting files shall be an expensive job.

Time Point: before commit

Add files, fetch new file id from LevelsController then move or copy files to DB dir with allocated file id(name).

Verify file checksum (optional) and get file infomation (just open Table?).

Make a new transaction, mark update as true (add a ingest flag for transaction?).

(TBD) other checks…

Check files before starting a txn.

Time Point: commit

Same as current, but add ingest range field to CommittedTxn in CommitInfo which contains smallest and largest keys in each ingest files. This is for fast but inaccurate conflict check, otherwise we need to calc every key hash in SST which really takes time.

Set commit_ts as Table's global_version

Send to write channel as usual (see below)

Process Ingest, find suitable level for each files. (Which will take write lock of LevelHandler , and will block read process of other txn)

Write manifest file. The file's global_version is stored in manifest.

A question here. I see we protect the order of Requests send to write channel the same as commit_ts and I wonder why? For WAL in order?

If ingest job can be out-of-order, I think it's possible and better to make ingest job running in current thread and not sending to write channel.

You can see that the whole commit process has only one small I/O — append to manifest. But when picking level, it takes time if there are many tables in this level and this happens under a write lock of LevelHandler .

Conflict Check

There are two ways for conflict check as I described in TP: commit (step 1) .

Calculate every key's hash in ingested files and then conflict check works as same right now. For performance problem, we can calculate key hash before transaction.

Add ingest range field to CommittedTxn in CommitInfo . And for any update txn, when adding read keys, mark smallest and largest read key for this txn. When checking conflict, beside checking hash, also check if read range overlap with ingest range. This is really inaccurate and will cause many txn conflict but I think it makes sense.

The second way needs to refactor transaction much more.

(A more simple way is to break ACID…hhh)

Global Version

This concept is the same as rocksdb. For ingested files, all keys has wrong version and it's unreasonable to modify each. When file has a global version, every key in this file has this version.

In latest rocksdb, this value is stored in manifest to avoid random write to ingested files.

We need to add a field in meta.proto and update Table iterator (also block iterator) to fit this feature.

global_version may have ambiguity. Maybe another name when impl.

Discuss

A new transaction struct only for ingest or refactor current?

Which way to check conflict? (or any better idea?)

Send to write channel or done in current thread?

Any good idea to impliment this feature?

Strategy to pick level. I think from the bottom to check if there is overlap is nice.

What if external files (in one job) has overlap range? Rocksdb will put all files at level0.
opened by wangnengjie 8
Read lock on LevelHandler while get

When getting value from LSM, agatedb iters on levelhandlers and calls LevelHandler::get() under a RwLockReadGuard which will cause the whole read I/O path on this level under a read lock. It is always not a good idea to do I/O under lock even it is a read lock.

See: https://github.com/tikv/agatedb/blob/475e17bce85cf5985161004c068616a594923365/src/levels.rs#L931

I think such read lock is only necessary for fetching associated Tables from LevelHandler and the exact read work can be done outside the lock.

This logic is same as what it is in badger. https://github.com/dgraph-io/badger/blob/7d159dd923ac9e470ffa76365f9b29ccbd038b0d/levels.go#L1602 The comment says it will call h.RLock() and h.RUnlock(), but exactly they are called in getTableForKey() which only locked when getting tables. https://github.com/dgraph-io/badger/blob/master/level_handler.go#L241-L243

BTW, as Table might be removed (I/O) when the struct drop and this might happened when modifying LevelHandler under write lock. So this might influence read? I' mot sure whether it's right?

opened by wangnengjie 4
[DNM] cache: add block cache for table

DO NOT MERGE! This is just an example impl for block cache. The design of LRU is copied from leveldb. (I'm a Rust beginner. So the code is quite ugly.) ~~Currently, as agatedb opens every table in mmap mode, the block cache is nearly useless.~~(sorry I misunderstand mmap) I'll add some test for block cache later. relate issues: https://github.com/tikv/agatedb/issues/153 https://github.com/tikv/agatedb/issues/164

opened by wangnengjie 0
iterator: unify behavior and implement `prev` and `to_last` to all Iterators
Signed-off-by: GanZiheng [email protected]

Unify all kinds of iterators which implement the AgateIterator trait.

Suppose we have another two extra elements in the data which the iterator iterates over, one is before the original first element, called before first, and another is after the last original element, namely after last.

If we arrives the first element, and call prev, we will arrive before first, and valid will return false;

If we arrives the last element, and call next, we will arrive after last, and valid will return false;

At before first, if we call prev, nothing will change, and if we call next, we will move to the first element;

At after last, if we call next, nothing will change, and if we call prev, we will move to the last element;

When entries are set but not committed yet, they will stay in pending_writes of the transaction, and if we build PendingWritesIterator to iterate over them, the timestamp of each entry is set to read timestamp of the transaction. So, we may get duplicate keys in MergeIterator of Iterator generated by Transaction. ~~However, the logic of prev in MergeIterator is complicated if there exists duplicated keys. So I also update the timestamp appended to key. When writing an entry, we will append commit timestamp * 10 to the key and we will use read timestamp * 10 + 5 to search an entry or build PendingWritesIterator.~~ Since all keys are different in PendingWritesIterator and other iterators respectively, to a specific key, there may be at most one pair of identical keys in all iterators. We can simplify logic of MergeIterator with this limitation.
opened by GanZiheng 8
Transaction 的 set(key,value) 不会增加版本号，当前进程读取是新数据，关闭进程后重新读取就会读取旧数据
Transaction 的 set(key,value) 不会增加版本号，当前进程读取是新数据，关闭进程后重新读取就会读取旧数据

读取都是用的

pub fn get(&self, key: impl Into<BytesMut>) -> Result<Value> { let key = key_with_ts(key.into(), std::u64::MAX); self.core.get(&key) }
opened by gcxfd 3

Owner

TiKV Project

GitHub

General basic key-value structs for Key-Value based storages

0 Dec 3, 2022

A fast and simple in-memory database with a key-value data model written in Rust

Segment Segment is a simple & fast in-memory database with a key-value data model written in Rust. Features Dynamic keyspaces Keyspace level control o

61 Jan 5, 2023

A "blazingly" fast key-value pair database without bloat written in rust

A fast key-value pair in memory database. With a very simple and fast API. At Xiler it gets used to store and manage client sessions throughout the pl

16 Dec 16, 2022

Ultra fast, persistent database supporting Redis API

What is SableDb? SableDb is a key-value NoSQL database that utilizes RocksDb as its storage engine and is compatible with the Redis protocol. It aims

406 Aug 9, 2024

Immutable Ordered Key-Value Database Engine

PumpkinDB Build status (Linux) Build status (Windows) Project status Usable, between alpha and beta Production-readiness Depends on your risk toleranc

1.3k Jan 2, 2023

Distributed transactional key-value database, originally created to complement TiDB

Website | Documentation | Community Chat TiKV is an open-source, distributed, and transactional key-value database. Unlike other traditional NoSQL sys

12.4k Jan 3, 2023

RefineDB - A strongly-typed document database that runs on any transactional key-value store.

375 Jan 4, 2023

A simple key value database for storing simple structures.

Perdia-DB A simple key value database for storing simple structures. No nesting of structures is supported, but may be implemented in the future. Toke

4 May 24, 2022

An embedded, in-memory, immutable, copy-on-write, key-value database engine

An embedded, in-memory, immutable, copy-on-write, key-value database engine. Features In-memory database Multi-version concurrency control Rich transa

7 May 5, 2024

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Authors: Sanjay Ghem

31.5k Jan 1, 2023

AgateDB is an embeddable, persistent and fast key-value (KV) database written in pure Rust

Related tags

Overview

AgateDB

Project Status

Why not X?

Comments

Development Task

Development Progress

Background

Analysis

Possible Implementation

Time Point: before commit

Time Point: commit

Conflict Check

Global Version

Discuss

Owner

TiKV Project

General basic key-value structs for Key-Value based storages

A fast and simple in-memory database with a key-value data model written in Rust

A "blazingly" fast key-value pair database without bloat written in rust

Ultra fast, persistent database supporting Redis API

Immutable Ordered Key-Value Database Engine

Distributed transactional key-value database, originally created to complement TiDB

RefineDB - A strongly-typed document database that runs on any transactional key-value store.

A simple key value database for storing simple structures.

An embedded, in-memory, immutable, copy-on-write, key-value database engine

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

RedisLess is a fast, lightweight, embedded and scalable in-memory Key/Value store library compatible with the Redis API.

A Distributed SQL Database - Building the Database in the Public to Learn Database Internals

ForestDB - A Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie

A simple embedded key-value store written in rust as a learning project

PickleDB-rs is a lightweight and simple key-value store. It is a Rust version for Python's PickleDB

Log structured append-only key-value store from Rust In Action with some enhancements.

A LSM-based Key-Value Store in Rust

A rust Key-Value store based on Redis.

CLI tool to work with Sled key-value databases.