Experimental blockchain database

A database for the blockchain.

Design considerations


API

The database is a universal key-value store that supports transactions. It does not support iteration or prefix-based retrieval.


State-optimized

90% of blockchain data and I/O is trie nodes. The database should allow for efficient storage and retrieval of state data first.

Single writer

The database should support multiple concurrent readers. It is sufficient to allow a single concurrent writer.
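As an illustration of this contract (not parity-db's actual locking, which is more fine-grained), `std::sync::RwLock` provides exactly the single-writer, many-reader discipline:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative only: a reader-writer lock captures the same contract
// (many concurrent readers, at most one writer).
fn readers_agree() -> bool {
    let db = RwLock::new(HashMap::<Vec<u8>, Vec<u8>>::new());
    {
        let mut w = db.write().unwrap(); // exclusive access: the single writer
        w.insert(b"key".to_vec(), b"value".to_vec());
    } // write guard dropped before readers acquire the lock
    let r1 = db.read().unwrap(); // shared access: readers can coexist
    let r2 = db.read().unwrap();
    r1.get(b"key".as_slice()) == r2.get(b"key".as_slice())
}
```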

No cache

Low-level LRU caching of blockchain data, such as individual trie nodes, has proven inefficient. Caching should be done at a higher level of abstraction, i.e. storage items or block headers.

Transaction isolation

Transactions are applied atomically. Queries cannot retrieve partially committed data.


Durability

The database should be restored to a consistent state if I/O is interrupted at any point.


Data structure

Data is organized into columns, each column serving a particular type of data, e.g. state or headers. A column consists of an index and a set of 16 value tables for varying value sizes.


Index

The index is an mmap-backed, dynamically sized probing hash table organized into 512-byte pages of 64 8-byte entries. Each 64-bit entry contains 32 bits of value address, 4 bits of value table index and a 28-bit partial key c derived from k. Here k is a 256-bit key derived from the original key and uniformly distributed, and n is the current index bit-size: the first n bits of k map k to a page, and c is computed by skipping those n high bits of k and taking the next 28 bits. Entries inside a page are unsorted. An empty entry is denoted with a zero value. An empty database starts with n = 16, which allows storing just 240 bits of k in the value table, since the first bits are implied by the entry's position in the index.
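The bit arithmetic can be sketched as follows (illustrative code, not parity-db's; it assumes n ≤ 36 so that the page bits and c both fit in the first 8 bytes of k):

```rust
// Illustrative sketch of how k is split: the first n bits select the
// index page, the next 28 bits form the partial key c kept in the
// 64-bit index entry.
fn index_position(k: &[u8; 32], n: u32) -> (u64, u32) {
    // Interpret the first 8 bytes of k as a big-endian u64.
    let head = u64::from_be_bytes(k[..8].try_into().unwrap());
    let page = head >> (64 - n);               // first n bits
    let c = ((head << n) >> (64 - 28)) as u32; // next 28 bits
    (page, c)
}
```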

Value tables

A value table is a linear array of fixed-size entries that can grow as necessary. Each entry may contain one of the following:

  • A filled entry, containing 240 bits of k, a 16-bit value size and the actual value.
  • A tombstone entry, containing the index of the previous tombstone, forming a linked list of available entries.
  • A multipart entry. This is much like a filled entry, but additionally holds the address of the next entry containing the continuation of the data.

15 of the 16 value tables only allow values up to the entry size. An additional table with an 8kb entry size is designated for large values and allows multipart entries.
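The tombstone free list can be sketched like this (illustrative types, not parity-db's actual on-disk layout; multipart entries are omitted):

```rust
// Illustrative sketch of a fixed-size entry table whose free slots form
// a linked list threaded through tombstone entries.
enum Entry {
    /// Partial key bits plus the value (value size is implicit here).
    Filled { partial_key: u64, value: Vec<u8> },
    /// Index of the previous tombstone; u64::MAX terminates the free list.
    Tombstone { prev_free: u64 },
}

struct ValueTable {
    entries: Vec<Entry>,
    last_free: u64, // head of the tombstone free list
}

impl ValueTable {
    fn new() -> Self {
        ValueTable { entries: Vec::new(), last_free: u64::MAX }
    }

    fn insert(&mut self, partial_key: u64, value: Vec<u8>) -> u64 {
        let entry = Entry::Filled { partial_key, value };
        if self.last_free != u64::MAX {
            // Reuse the most recently freed slot.
            let index = self.last_free;
            match std::mem::replace(&mut self.entries[index as usize], entry) {
                Entry::Tombstone { prev_free } => self.last_free = prev_free,
                _ => unreachable!("free list points at a live entry"),
            }
            index
        } else {
            // No free slots: grow the table.
            self.entries.push(entry);
            (self.entries.len() - 1) as u64
        }
    }

    fn remove(&mut self, index: u64) {
        self.entries[index as usize] = Entry::Tombstone { prev_free: self.last_free };
        self.last_free = index;
    }
}
```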



Lookup

Compute k, then find the index page using the first n bits. Search the page for an entry with matching c bits. Use the address in the entry to query the partial k and value from a value table. Confirm that the stored k matches the expected one.
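A sketch of that lookup path, with illustrative in-memory stand-ins for the index page and value table:

```rust
// Illustrative sketch: scan the (unsorted) page for entries whose
// partial key matches c, then confirm the full key against the value
// table before returning the value.
fn lookup<'a>(
    page: &[(u32, usize)],                  // (c, value table address)
    value_table: &'a [([u8; 32], Vec<u8>)], // (stored k, value)
    k: &[u8; 32],
    c: u32,
) -> Option<&'a Vec<u8>> {
    for &(entry_c, addr) in page {
        if entry_c == c {
            let (stored_k, value) = &value_table[addr];
            // c can collide; the stored key disambiguates.
            if stored_k == k {
                return Some(value);
            }
        }
    }
    None
}
```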


Insertion

If an insertion is attempted into a full index page, a reindex is triggered. With pages of 64 index entries, this happens once the load factor reaches about 50%.


Reindex

When a collision can't be resolved, a new index table is created with twice the capacity, and insertion continues immediately in the new table. A background process is started that moves entries from the old table to the new one. All queries during that process check both tables.
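Doubling the index means n grows by one, so an entry in old page p lands in new page 2p or 2p + 1, selected by the next bit of k (illustrative):

```rust
// Illustrative: with one more index bit, old page p splits into new
// pages 2p and 2p + 1; the extra bit of k picks between them.
fn new_page(old_page: u64, next_bit_of_k: bool) -> u64 {
    (old_page << 1) | next_bit_of_k as u64
}
```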

Transaction pipeline

On commit, all data is first moved to an in-memory overlay, making it immediately available for queries. The commit is then added to the commit queue, which allows the commit function to return as early as possible. The commit queue is processed by a commit worker that collects the data that would be modified in the tables and writes it to the available log file. All modified index and value table pages are placed in the in-memory overlay. The file is then handed off to another background thread that flushes it to disk and adds it to the finalization queue. Finally, a third thread processes the finalization queue: it reads the file, applies all changes to the tables, and clears the page overlay.
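The overlay's effect on the read path can be modeled like this (an illustrative single-threaded stand-in; the real pipeline uses background workers and log files):

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative model of the commit overlay: committed data is readable
// immediately, while the commit queue is drained asynchronously.
struct Pipeline {
    overlay: HashMap<Vec<u8>, Option<Vec<u8>>>, // None marks a pending deletion
    commit_queue: VecDeque<Vec<(Vec<u8>, Option<Vec<u8>>)>>,
    tables: HashMap<Vec<u8>, Vec<u8>>, // stand-in for the on-disk tables
}

impl Pipeline {
    fn new() -> Self {
        Pipeline {
            overlay: HashMap::new(),
            commit_queue: VecDeque::new(),
            tables: HashMap::new(),
        }
    }

    // Returns as soon as the data is in the overlay and queued.
    fn commit(&mut self, tx: Vec<(Vec<u8>, Option<Vec<u8>>)>) {
        for (k, v) in &tx {
            self.overlay.insert(k.clone(), v.clone());
        }
        self.commit_queue.push_back(tx);
    }

    // Queries consult the overlay first, so queued commits are visible
    // before they ever reach the tables.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.overlay.get(key) {
            Some(v) => v.clone(),
            None => self.tables.get(key).cloned(),
        }
    }
}
```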

On startup, if log files exist, they are validated for corruption and enacted upon the tables.

Potential issues

  • Memory-mapped IO won't be able to support 32-bit systems once the index grows to 2GB.
  • Size amplification. The index grows up to about 50% capacity before a reindex is triggered, which means only about half of the allocated index space holds occupied entries. Additionally, each value table entry is only partially filled with actual data.
  • iter_column_while doesn't return correct number of elements

    Simple test case:

    fn iter_column_while() {
        for n in 0..10000 {
            let dir = tempfile::TempDir::new().unwrap();
            let options = parity_db::Options::with_columns(dir.path(), 1);
            let db = parity_db::Db::open_or_create(&options).unwrap();
            let keys = vec![[0u8; 32], [1u8; 32], [2u8; 32], [3u8; 32], [4u8; 32]];
            let tx = keys.iter().map(|key| (0, key, Some(key.to_vec())));
            db.commit(tx).unwrap();
            for key in &keys {
                assert!(db.get(0, key).unwrap().is_some());
            }
            let mut keys_count = 0;
            db.iter_column_while(0, |_iter_state| {
                keys_count += 1;
                true
            })
            .unwrap();
            assert_eq!(keys_count, 5, "iteration {}", n);
        }
    }

    Fails with error like this:

    ---- object_mappings::tests::iter_column_while stdout ----
    thread 'object_mappings::tests::iter_column_while' panicked at 'assertion failed: `(left == right)`
      left: `4`,
     right: `5`: iteration 29', crates/subspace-farmer/src/object_mappings/tests.rs:152:9

    The number of elements read can vary; for instance, with these 5 keys, hashing disabled and the salt set to zero, the iterator returns no elements at all.

    opened by nazar-pc 14
  • `btree_index` in existing database

    If I change the configuration for a column from the default (btree_index = false) to btree_index = true and the db already exists, will it automatically index that column when I open the db?

    opened by tgmichel 12
  • Add reverse iteration for btree db

    This PR adds the possibility to iterate backward with the btree iterator.

    Reviewing by commits would make more sense, as the first two are minor refactoring. The rest of the commits propagate the reverse iterator from bottom to top.

    There are some minor problems with this code though. It repeats itself a lot, because I started off by just negating the invariants of the next function. Also, there are clearly not enough tests for this; I will add them soon.

    opened by i1i1 12
  • Corruption: Missing btree entry at addr 45:61327

    Here is one of our users reporting "Corruption: Missing btree entry at addr 45:61327" on a ParityDb database. Keys and values are both 8 bytes in one column, plus one 11-byte key with an 8-byte value in a second column.

    I suspect the application was killed due to shutting down slowly under systemd, which is the type of interruption ParityDb claims to handle:

    Database is restored to consistent state if IO is interrupted at any point.

    I don't have many more details at the moment, but I hope this is helpful.

    opened by nazar-pc 10
  • Btree Indexing

    This PR contains my initial indexing code. I created a draft PR as things start to work properly and I am looking a bit more at the performance, but I welcome initial feedback. The PR is a bit overly complex, but there is some perf/design I wanted to test, so I added some abstraction to check that (not done yet).

    Also, I did use a lot of code from a different branch, which explains why I started with a separate file for the btree node storage. I wanted constant-size node storage, which I am not convinced is interesting: storage of big keys is pretty costly (the key is stored with the value :(). So storing in value tables (or its own value tables) may end up being better (untested yet).

    Also, this exposes a specific API: in one column we can have both hash-indexed and btree-indexed content. Specializing a column to either hash-indexed or btree-indexed could be an option too (I could restore some migration optimization then).

    • In column I did split the write plan; there is still some code from a different branch where locks are used as parameters, which is not strictly needed. It is very slightly better for not recursively searching the index when reindexing, but that is not really relevant.

    • There is a race in tests, deadlocking sometimes. I think it is from the logic of the db test (see TODO linked pr).

    • Key as a trait is also something from a different branch; it could be replaced by just an enum, or by specialized table functions (for_parts contains a lot of logic).

    Main open questions at this point of the PR:

    • switch to a column model specialized for indexed content?
    • switch to variable-size nodes stored in value tables?
    • switch CommitSet back to something similar to master?
    opened by cheme 10
  • Makes fuzzer call explicitly I/O operations

    Makes fuzzing way more deterministic.

    Allows a more careful model of which state is already persisted and which is not.

    Also does some refactoring of the Db test-related API to be able to call it from the fuzzer.

    opened by Tpt 6
  • Slice indexation error after BTree header write failure

    First bug uncovered thanks to #113.

    When a partial write happens after opening the database with a single BTree column, iteration fails on the next db opening with 'range end index 8 out of range for slice of length 0' inside Entry::read_slice, nested in BTreeTable::btree_header.

    Reproduction test (requires #113 and enabling the instrumentation feature):

    	fn test_from_fuzzer() {
    		let tmp = tempdir().unwrap();
    		let mut options = Options::with_columns(tmp.path(), 1);
    		options.columns[0].btree_index = true;
    		assert!(Db::open_or_create(&options).is_err()); // It's going to fail
    		let db = Db::open_or_create(&options).unwrap();
    		db.iter(0).unwrap(); // Read is panicking here
    	}
    opened by Tpt 6
  • Fix stack overflow during btree iteration

    Migrating from rocksdb to paritydb (here) I discovered that the stack randomly overflows during BTreeIterator::{next, prev} calls.

    It seems the rust compiler failed to apply TCO to some stateless recursive functions (even in release mode). Like all optimizations, TCO is best-effort and not guaranteed in any way. More on its history in the compiler here.

    So this PR basically rewrites the recursive functions using loops, so that memory is reused and no new stack frames are allocated.

    It is better to review the PR commit by commit; for the first one it helps to ignore the indentation changes.

    opened by i1i1 6
  • Last state iterator

    This is a small refactor of the iterator logic when switching between backward and forward iteration. It mainly removes a bit of redundant code and ensures that we do not return the same value when changing iteration direction.

    Currently a draft, as I need to reread it and run clippy on it, but @Tpt it may be interesting to run your fuzzer on this branch.

    opened by cheme 5
  • Testing from db.

    This PR extracts some changes I was using in https://github.com/cheme/parity-db/tree/fifo_and_set tests.

    These allow stopping the db in a given state to specifically test the commit overlay, log overlay or db file. The tests I added are not very useful for the current code base, but are here as an example.

    In https://github.com/paritytech/parity-db/commit/72b1e1d856731fac0f9f2520f2756f2019f183fb I tried to use it on the stress tests, but it only works for a small number of commits (I did not debug thoroughly, as this was somewhat expected from the way it is implemented). Not sure it is really useful to have (it did help me, but it can be kept outside of master; the same goes for the PR, actually).

    opened by cheme 5
  • v0.2.3 has a broken Cargo.toml

    I'm trying to follow https://substrate.dev/docs/en/tutorials/create-your-first-substrate-chain/setup

    but my build fails with:

    $ cargo build --release
    error: failed to download `parity-db v0.2.3`
    Caused by:
      unable to get packages from source
    Caused by:
      failed to parse manifest at `/home/bear/.cargo/registry/src/github.com-1ecc6299db9ec823/parity-db-0.2.3/Cargo.toml`
    Caused by:
      failed to parse the version requirement `0.11	` for dependency `parking_lot`
    Caused by:
      expected comma after minor version number, found '\t'

    I downloaded parity-db==0.2.3 via cargo-download and found that Cargo.toml indeed has an extra \t in the parking_lot declaration:

    # When uploading crates to the registry Cargo will automatically
    # "normalize" Cargo.toml files for maximal compatibility
    # with all versions of Cargo and also rewrite `path` dependencies
    # to registry (e.g., crates.io) dependencies
    # If you believe there's an error in this file please file an
    # issue against the rust-lang/cargo repository. If you're
    # editing this file be aware that the upstream Cargo.toml
    # will likely look very different (and much more reasonable)
    [package]
    edition = "2018"
    name = "parity-db"
    version = "0.2.3"
    authors = ["Parity Technologies <[email protected]>"]
    description = "Key-value database for the blockchain"
    homepage = "https://substrate.dev"
    license = "MIT OR Apache-2.0"
    repository = "https://github.com/paritytech/parity-db/"

    [profile.release]
    lto = "fat"
    codegen-units = 1
    debug = true
    panic = "abort"

    # ... other dependency sections elided ...

    [dependencies.parking_lot]
    version = "0.11\t"
    opened by bernardoaraujor 5
  • 2nd level index

    Introduce a second-level hash index. This will be used to store smart contract tries, child tries, etc. The main requirement here is that a sub-index can be dropped in amortized constant time.

    API requirements: The hash column can be configured to use a second level index. All insert/get commands for such column accept two keys: K1 and K2 where K1 identifies the index, and K2 is the value key in that index. Additionally, there's a drop command that accepts a single key and deletes all values in the index identified by the key.

    Implementation ideas are welcome. As far as I can see, the structure could be made similar to the existing index, with all values in a single index. Deletion could be done by a background process, similar to re-indexing: when an index is dropped, it is added to the deletion queue, and a background process checks all index entries and deletes any values that belong to the indexes being dropped.

    opened by arkpar 0
  • Optimize memory usage

    Under certain write workloads, when there are a lot of commits with many small value insertions, the memory used by the index overlay can be blown up by a factor of 10 or more compared to the actual data inserted. There's a good example in #93. This is because the overlay keeps the whole index page in memory, even though only a single entry in the page may be modified. I can think of two ways to fix that:

    1. Don't keep the whole page, but rather only modified entries.

    2. Investigate existing mechanisms in linux to control page flushing for memory mapped files. If it is possible to control when exactly individual pages are flushed to disk, we don't need the overlay in the first place.

    In any case, this is only worth fixing if the solution does not introduce any performance hits.

    opened by arkpar 0
  • Basic ClusterFuzzLite setup

    For now we use the GitHub Actions storage.

    • On PRs, run for 2 minutes.
    • Run a weekly 1-hour fuzzing session every Sunday, then minimize artifacts and compute code coverage.
    opened by Tpt 1
  • Properly fuzz concurrency behaviors

    Currently the fuzzers only write and read using a single thread and do not control the background threads concurrency. It might be nice to use a tool like loom to properly cover possible concurrency issues.

    It should also make the debugging of bugs like #124 easier (no more reproduction issues).

    opened by Tpt 2
  • High memory usage on default allocator

    After switching from rocksdb to parity-db we discovered high memory usage with the system allocator:

    |       | system | jemalloc |
    |-------|--------|----------|
    | btree | 5.93G  | 4.34G    |
    | kv    | 7.27G  | 5.49G    |

    Here is the code for reproducing the issue:

    #[global_allocator]
    static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

    fn main() {
        let batch_size = 1024 * 1024;
        let handles = (0..4)
            .map(|i| {
                let p = format!("some-path/{i}");
                let _ = std::fs::remove_dir_all(&p);
                let opts = parity_db::Options {
                    path: p.into(),
                    columns: vec![parity_db::ColumnOptions {
                        preimage: false,
                        btree_index: true,
                        uniform: false,
                        ref_counted: false,
                        compression: parity_db::CompressionType::NoCompression,
                        compression_threshold: 4096,
                    }],
                    sync_wal: true,
                    sync_data: true,
                    stats: false,
                    salt: None,
                };
                std::thread::spawn(move || {
                    let db = parity_db::Db::open_or_create(&opts).unwrap();
                    for range in (0..100u64).map(|i| i * batch_size..(i + 1) * batch_size) {
                        db.commit(range.map(|i| (0, i.to_be_bytes(), Some(i.to_le_bytes().to_vec()))))
                            .unwrap();
                    }
                })
            })
            .collect::<Vec<_>>();
        for handle in handles {
            handle.join().unwrap();
        }
    }
    opened by i1i1 4