Experimental blockchain database

A database for the blockchain.

Design considerations


API

The database is a universal key-value store that supports transactions. It does not support iteration or prefix-based retrieval.


State-optimized

90% of blockchain data and I/O is trie nodes. The database should allow for efficient storage and retrieval of state data first.

Single writer

The database should support multiple concurrent readers. It is sufficient to allow a single concurrent writer.
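As an illustration of this contract (not parity-db's actual locking, which is more fine-grained), `std::sync::RwLock` provides exactly the single-writer, many-reader discipline:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative only: a reader-writer lock captures the same contract
// (many concurrent readers, at most one writer).
fn readers_agree() -> bool {
    let db = RwLock::new(HashMap::<Vec<u8>, Vec<u8>>::new());
    {
        let mut w = db.write().unwrap(); // exclusive access: the single writer
        w.insert(b"key".to_vec(), b"value".to_vec());
    } // write guard dropped before readers acquire the lock
    let r1 = db.read().unwrap(); // shared access: readers can coexist
    let r2 = db.read().unwrap();
    r1.get(b"key".as_slice()) == r2.get(b"key".as_slice())
}
```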

No cache

Low-level LRU caching of blockchain data, such as individual trie nodes, has proven inefficient. Caching should be done at a higher level of abstraction, i.e. storage items or block headers.

Transaction isolation

Transactions are applied atomically. Queries cannot retrieve partially committed data.


Durability

The database should be restored to a consistent state if I/O is interrupted at any point.


Data structure

Data is organized into columns, each column serving a particular type of data, e.g. state or headers. A column consists of an index and a set of 16 value tables for varying value sizes.


Index

The index is an mmap-backed, dynamically sized probing hash table organized into 512-byte pages of 64 8-byte entries. Each 64-bit entry contains 32 bits of value address, 4 bits of value table index and a 28-bit partial key c derived from k. Here k is a 256-bit key derived from the original key and uniformly distributed, and n is the current index bit-size: the first n bits of k map k to a page, and c is computed by skipping those n high bits of k and taking the next 28 bits. Entries inside a page are unsorted. An empty entry is denoted with a zero value. An empty database starts with n = 16, which allows storing just 240 bits of k in the value table, since the first bits are implied by the entry's position in the index.
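The bit arithmetic can be sketched as follows (illustrative code, not parity-db's; it assumes n ≤ 36 so that the page bits and c both fit in the first 8 bytes of k):

```rust
// Illustrative sketch of how k is split: the first n bits select the
// index page, the next 28 bits form the partial key c kept in the
// 64-bit index entry.
fn index_position(k: &[u8; 32], n: u32) -> (u64, u32) {
    // Interpret the first 8 bytes of k as a big-endian u64.
    let head = u64::from_be_bytes(k[..8].try_into().unwrap());
    let page = head >> (64 - n);               // first n bits
    let c = ((head << n) >> (64 - 28)) as u32; // next 28 bits
    (page, c)
}
```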

Value tables

A value table is a linear array of fixed-size entries that can grow as necessary. Each entry may contain one of the following:

  • A filled entry, containing 240 bits of k, a 16-bit value size and the actual value.
  • A tombstone entry, containing the index of the previous tombstone, forming a linked list of available entries.
  • A multipart entry. This is much like a filled entry, but additionally holds the address of the next entry containing the continuation of the data.

15 of the 16 value tables only allow values up to the entry size. An additional table with an 8kb entry size is designated for large values and allows multipart entries.
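The tombstone free list can be sketched like this (illustrative types, not parity-db's actual on-disk layout; multipart entries are omitted):

```rust
// Illustrative sketch of a fixed-size entry table whose free slots form
// a linked list threaded through tombstone entries.
enum Entry {
    /// Partial key bits plus the value (value size is implicit here).
    Filled { partial_key: u64, value: Vec<u8> },
    /// Index of the previous tombstone; u64::MAX terminates the free list.
    Tombstone { prev_free: u64 },
}

struct ValueTable {
    entries: Vec<Entry>,
    last_free: u64, // head of the tombstone free list
}

impl ValueTable {
    fn new() -> Self {
        ValueTable { entries: Vec::new(), last_free: u64::MAX }
    }

    fn insert(&mut self, partial_key: u64, value: Vec<u8>) -> u64 {
        let entry = Entry::Filled { partial_key, value };
        if self.last_free != u64::MAX {
            // Reuse the most recently freed slot.
            let index = self.last_free;
            match std::mem::replace(&mut self.entries[index as usize], entry) {
                Entry::Tombstone { prev_free } => self.last_free = prev_free,
                _ => unreachable!("free list points at a live entry"),
            }
            index
        } else {
            // No free slots: grow the table.
            self.entries.push(entry);
            (self.entries.len() - 1) as u64
        }
    }

    fn remove(&mut self, index: u64) {
        self.entries[index as usize] = Entry::Tombstone { prev_free: self.last_free };
        self.last_free = index;
    }
}
```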



Lookup

Compute k, then find the index page using the first n bits. Search the page for an entry with matching c bits. Use the address in the entry to query the partial k and value from a value table. Confirm that the stored k matches the expected one.
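A sketch of that lookup path, with illustrative in-memory stand-ins for the index page and value table:

```rust
// Illustrative sketch: scan the (unsorted) page for entries whose
// partial key matches c, then confirm the full key against the value
// table before returning the value.
fn lookup<'a>(
    page: &[(u32, usize)],                  // (c, value table address)
    value_table: &'a [([u8; 32], Vec<u8>)], // (stored k, value)
    k: &[u8; 32],
    c: u32,
) -> Option<&'a Vec<u8>> {
    for &(entry_c, addr) in page {
        if entry_c == c {
            let (stored_k, value) = &value_table[addr];
            // c can collide; the stored key disambiguates.
            if stored_k == k {
                return Some(value);
            }
        }
    }
    None
}
```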


Insertion

If an insertion is attempted into a full index page, a reindex is triggered. With pages of 64 index entries, this happens once the load factor reaches about 50%.


Reindex

When a collision can't be resolved, a new index table is created with twice the capacity, and insertion continues immediately in the new table. A background process is started that moves entries from the old table to the new one. All queries during that process check both tables.
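Doubling the index means n grows by one, so an entry in old page p lands in new page 2p or 2p + 1, selected by the next bit of k (illustrative):

```rust
// Illustrative: with one more index bit, old page p splits into new
// pages 2p and 2p + 1; the extra bit of k picks between them.
fn new_page(old_page: u64, next_bit_of_k: bool) -> u64 {
    (old_page << 1) | next_bit_of_k as u64
}
```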

Transaction pipeline

On commit, all data is first moved to an in-memory overlay, making it immediately available for queries. The commit is then added to the commit queue, which allows the commit function to return as early as possible. The commit queue is processed by a commit worker that collects the data that would be modified in the tables and writes it to the available log file. All modified index and value table pages are placed in the in-memory overlay. The file is then handed off to another background thread that flushes it to disk and adds it to the finalization queue. Finally, a third thread processes the finalization queue: it reads the file, applies all changes to the tables, and clears the page overlay.
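The overlay's effect on the read path can be modeled like this (an illustrative single-threaded stand-in; the real pipeline uses background workers and log files):

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative model of the commit overlay: committed data is readable
// immediately, while the commit queue is drained asynchronously.
struct Pipeline {
    overlay: HashMap<Vec<u8>, Option<Vec<u8>>>, // None marks a pending deletion
    commit_queue: VecDeque<Vec<(Vec<u8>, Option<Vec<u8>>)>>,
    tables: HashMap<Vec<u8>, Vec<u8>>, // stand-in for the on-disk tables
}

impl Pipeline {
    fn new() -> Self {
        Pipeline {
            overlay: HashMap::new(),
            commit_queue: VecDeque::new(),
            tables: HashMap::new(),
        }
    }

    // Returns as soon as the data is in the overlay and queued.
    fn commit(&mut self, tx: Vec<(Vec<u8>, Option<Vec<u8>>)>) {
        for (k, v) in &tx {
            self.overlay.insert(k.clone(), v.clone());
        }
        self.commit_queue.push_back(tx);
    }

    // Queries consult the overlay first, so queued commits are visible
    // before they ever reach the tables.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.overlay.get(key) {
            Some(v) => v.clone(),
            None => self.tables.get(key).cloned(),
        }
    }
}
```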

On startup, if log files exist, they are validated for corruption and enacted upon the tables.

Potential issues

  • Memory-mapped IO won't be able to support 32-bit systems once the index grows to 2GB.
  • Size amplification. The index grows up to about 50% capacity before a reindex is triggered, which means only about half of the allocated index space holds occupied entries. Additionally, each value table entry is only partially filled with actual data.
  • iter_column_while doesn't return correct number of elements

    Simple test case:

    fn iter_column_while() {
        for n in 0..10000 {
            let dir = tempfile::TempDir::new().unwrap();
            let options = parity_db::Options::with_columns(dir.path(), 1);
            let db = parity_db::Db::open_or_create(&options).unwrap();
            let keys = vec![[0u8; 32], [1u8; 32], [2u8; 32], [3u8; 32], [4u8; 32]];
            let tx = keys.iter().map(|key| (0, key, Some(key.to_vec())));
            db.commit(tx).unwrap();
            for key in &keys {
                assert!(db.get(0, key).unwrap().is_some());
            }
            let mut keys_count = 0;
            db.iter_column_while(0, |_iter_state| {
                keys_count += 1;
                true
            })
            .unwrap();
            assert_eq!(keys_count, 5, "iteration {}", n);
        }
    }

    Fails with error like this:

    ---- object_mappings::tests::iter_column_while stdout ----
    thread 'object_mappings::tests::iter_column_while' panicked at 'assertion failed: `(left == right)`
      left: `4`,
     right: `5`: iteration 29', crates/subspace-farmer/src/object_mappings/tests.rs:152:9

    The number of elements read can vary; for instance, with these 5 keys, hashing disabled and the salt set to zero, the iterator returns no elements at all.

    opened by nazar-pc 14
  • `btree_index` in existing database

    If I change the configuration for a column from the default (btree_index = false) to btree_index = true and the db already exists, will it automatically index that column when I open the db?

    opened by tgmichel 12
  • Add reverse iteration for btree db

    This PR adds the possibility to iterate backward with the btree iterator.

    Reviewing by commits would make more sense, as the first two are minor refactoring. The rest of the commits propagate the reverse iterator from bottom to top.

    There are some minor problems with this code though. It repeats itself a lot, because I started off by just negating the invariants of the next function. Also, there are clearly not enough tests for this; I will add them soon.

    opened by i1i1 12
  • Corruption: Missing btree entry at addr 45:61327

    Here is one of our users reporting "Corruption: Missing btree entry at addr 45:61327" on a ParityDb database. Keys and values are both 8 bytes in one column, plus one 11-byte key with an 8-byte value in a second column.

    I suspect the application was killed due to shutting down slowly under systemd, which is the type of interruption ParityDb claims to handle:

    Database is restored to consistent state if IO is interrupted at any point.

    I don't have many more details at the moment, but I hope this is helpful.

    opened by nazar-pc 10
  • Btree Indexing

    This PR contains my initial indexing code. I created a draft PR as things start to work properly and I am looking a bit more at the performance, but I welcome initial feedback. The PR is a bit overly complex, but there is some perf/design I wanted to test, so I added some abstraction to check that (not done yet).

    Also, I did use a lot of code from a different branch, which explains why I started with a separate file for the btree node storage. I wanted constant-size node storage, which I am not convinced is interesting: storage of big keys is pretty costly (the key is stored with the value :(). So storing in value tables (or its own value tables) may end up being better (untested yet).

    Also, this exposes a specific API: in one column we can have both hash-indexed and btree-indexed content. Specializing a column to either hash-indexed or btree-indexed could be an option too (I could restore some migration optimization then).

    • In column I did split the write plan; there is still some code from a different branch where locks are used as parameters, which is not strictly needed. It is very slightly better for not recursively searching the index when reindexing, but that is not really relevant.

    • There is a race in tests, deadlocking sometimes. I think it is from the logic of the db test (see TODO linked pr).

    • Key as a trait is also something from a different branch; it could be replaced by just an enum, or by specialized table functions (for_parts contains a lot of logic).

    Main open questions at this point of the PR:

    • switch to a column model specialized for indexed content?
    • switch to variable-size nodes stored in value tables?
    • switch CommitSet back to something similar to master?
    opened by cheme 10
  • Makes fuzzer call explicitly I/O operations

    Makes fuzzing way more deterministic.

    Allows a more careful model of which state is already persisted and which is not.

    Also does some refactoring of the Db test-related API to be able to call it from the fuzzer.

    opened by Tpt 6
  • Slice indexation error after BTree header write failure

    First bug uncovered thanks to #113.

    When a partial write happens after opening the database with a single BTree column, iteration fails on the next db opening with 'range end index 8 out of range for slice of length 0' inside Entry::read_slice, nested in BTreeTable::btree_header.

    Reproduction test (requires #113 and enabling the instrumentation feature):

    	fn test_from_fuzzer() {
    		let tmp = tempdir().unwrap();
    		let mut options = Options::with_columns(tmp.path(), 1);
    		options.columns[0].btree_index = true;
    		assert!(Db::open_or_create(&options).is_err()); // It's going to fail
    		let db = Db::open_or_create(&options).unwrap();
    		db.iter(0).unwrap(); // Read is panicking here
    	}
    opened by Tpt 6
  • Fix stack overflow during btree iteration

    Migrating from rocksdb to paritydb (here) I discovered that the stack randomly overflows during BTreeIterator::{next, prev} calls.

    It seems the rust compiler failed to apply TCO to some stateless recursive functions (even in release mode). Like all optimizations, TCO is best-effort and not guaranteed in any way. More on its history in the compiler here.

    So this PR basically rewrites the recursive functions using loops, so that memory is reused and no new stack frames are allocated.

    It is better to review the PR commit by commit; for the first one it helps to ignore the indentation changes.

    opened by i1i1 6
  • Last state iterator

    This is a small refactor of the iterator logic when switching between backward and forward iteration. It mainly removes a bit of redundant code and ensures that we do not return the same value when changing iteration direction.

    Currently a draft, as I need to reread it and run clippy on it, but @Tpt it may be interesting to run your fuzzer on this branch.

    opened by cheme 5
  • Testing from db.

    This PR extracts some changes I was using in https://github.com/cheme/parity-db/tree/fifo_and_set tests.

    These allow stopping the db in a given state to specifically test the commit overlay, log overlay or db file. The tests I added are not very useful for the current code base, but are here as an example.

    In https://github.com/paritytech/parity-db/commit/72b1e1d856731fac0f9f2520f2756f2019f183fb I tried to use it on the stress tests, but it only works for a small number of commits (I did not debug thoroughly, as this was somewhat expected from the way it is implemented). Not sure it is really useful to have (it did help me, but it can be kept outside of master; the same goes for the PR, actually).

    opened by cheme 5
  • v0.2.3 has a broken Cargo.toml

    I'm trying to follow https://substrate.dev/docs/en/tutorials/create-your-first-substrate-chain/setup

    but my build fails with:

    $ cargo build --release
    error: failed to download `parity-db v0.2.3`
    Caused by:
      unable to get packages from source
    Caused by:
      failed to parse manifest at `/home/bear/.cargo/registry/src/github.com-1ecc6299db9ec823/parity-db-0.2.3/Cargo.toml`
    Caused by:
      failed to parse the version requirement `0.11	` for dependency `parking_lot`
    Caused by:
      expected comma after minor version number, found '\t'

    I downloaded parity-db==0.2.3 via cargo-download and found that Cargo.toml indeed has an extra \t in the parking_lot declaration:

    # When uploading crates to the registry Cargo will automatically
    # "normalize" Cargo.toml files for maximal compatibility
    # with all versions of Cargo and also rewrite `path` dependencies
    # to registry (e.g., crates.io) dependencies
    # If you believe there's an error in this file please file an
    # issue against the rust-lang/cargo repository. If you're
    # editing this file be aware that the upstream Cargo.toml
    # will likely look very different (and much more reasonable)
    [package]
    edition = "2018"
    name = "parity-db"
    version = "0.2.3"
    authors = ["Parity Technologies <[email protected]>"]
    description = "Key-value database for the blockchain"
    homepage = "https://substrate.dev"
    license = "MIT OR Apache-2.0"
    repository = "https://github.com/paritytech/parity-db/"

    [profile.release]
    lto = "fat"
    codegen-units = 1
    debug = true
    panic = "abort"

    # ... other dependency sections elided ...

    [dependencies.parking_lot]
    version = "0.11\t"
    opened by bernardoaraujor 5
  • 2nd level index

    Introduce a second-level hash index. This will be used to store smart contract tries, child tries, etc. The main requirement here is that a sub-index can be dropped in amortized constant time.

    API requirements: The hash column can be configured to use a second level index. All insert/get commands for such column accept two keys: K1 and K2 where K1 identifies the index, and K2 is the value key in that index. Additionally, there's a drop command that accepts a single key and deletes all values in the index identified by the key.

    Implementation ideas are welcome. As far as I can see, the structure could be made similar to the existing index, with all values in a single index. Deletion could be done by a background process, similar to re-indexing: when an index is dropped, it is added to the deletion queue, and a background process checks all index entries and deletes any values that belong to the indexes being dropped.

    opened by arkpar 0
  • Optimize memory usage

    Under certain write workloads, when there are a lot of commits with many small value insertions, the memory used by the index overlay can be blown up by a factor of 10 or more compared to the actual data inserted. There's a good example in #93. This is because the overlay keeps the whole index page in memory, even though only a single entry in the page may be modified. I can think of two ways to fix that:

    1. Don't keep the whole page, but rather only modified entries.

    2. Investigate existing mechanisms in linux to control page flushing for memory mapped files. If it is possible to control when exactly individual pages are flushed to disk, we don't need the overlay in the first place.

    In any case, this is only worth fixing if the solution does not introduce any performance hits.

    opened by arkpar 0
  • Basic ClusterFuzzLite setup

    For now we use the GitHub Actions storage.

    • On PRs, run for 2 minutes.
    • Run a weekly 1-hour fuzzing session every Sunday, then minimize artifacts and compute code coverage.
    opened by Tpt 1
  • Properly fuzz concurrency behaviors

    Currently the fuzzers only write and read using a single thread and do not control the background threads concurrency. It might be nice to use a tool like loom to properly cover possible concurrency issues.

    It should also make the debugging of bugs like #124 easier (no more reproduction issues).

    opened by Tpt 2
  • High memory usage on default allocator

    After switching from rocksdb to parity-db we discovered high memory usage with the system allocator:

    |       | system | jemalloc |
    |-------|--------|----------|
    | btree | 5.93G  | 4.34G    |
    | kv    | 7.27G  | 5.49G    |

    Here is the code for reproducing the issue:

    #[global_allocator]
    static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

    fn main() {
        let batch_size = 1024 * 1024;
        let handles = (0..4)
            .map(|i| {
                let p = format!("some-path/{i}");
                let _ = std::fs::remove_dir_all(&p);
                let opts = parity_db::Options {
                    path: p.into(),
                    columns: vec![parity_db::ColumnOptions {
                        preimage: false,
                        btree_index: true,
                        uniform: false,
                        ref_counted: false,
                        compression: parity_db::CompressionType::NoCompression,
                        compression_threshold: 4096,
                    }],
                    sync_wal: true,
                    sync_data: true,
                    stats: false,
                    salt: None,
                };
                std::thread::spawn(move || {
                    let db = parity_db::Db::open_or_create(&opts).unwrap();
                    for range in (0..100u64).map(|i| i * batch_size..(i + 1) * batch_size) {
                        db.commit(range.map(|i| (0, i.to_be_bytes(), Some(i.to_le_bytes().to_vec()))))
                            .unwrap();
                    }
                })
            })
            .collect::<Vec<_>>();
        for handle in handles {
            handle.join().unwrap();
        }
    }
    opened by i1i1 4