terminusdb-store, a tokio-enabled data store for triple data

Overview

This library implements a way to store triple data - data that consists of a subject, a predicate and an object, where the object can be either a value or a node (a string that can appear in both subject and object position).

An example of triple data is:

cow says value(moo).
duck says value(quack).
cow likes node(duck).
duck hates node(cow).

In cow says value(moo), cow is the subject, says is the predicate, and value(moo) is the object.

In cow likes node(duck), cow is the subject, likes is the predicate, and node(duck) is the object.

terminusdb-store allows you to store a large number of such facts and search through them efficiently.

This library is intended as a common base for anyone who wishes to build a database containing triple data. It makes very few assumptions about what valid data is, focusing only on the actual storage aspect.

This library is tokio-enabled. All I/O and locking happens through futures, and as a result many of the functions in this library return futures. These futures are intended to run on a tokio runtime, and many of them will fail outside of one. If you do not wish to use tokio, there's a small sync wrapper in store::sync which embeds its own tokio runtime and exposes a purely synchronous API.

Usage

Add this to your Cargo.toml:

[dependencies]
terminus-store = "0.16"

Create a directory where you want the store to be, then open that store with:

let store = terminus_store::open_directory_store("/path/to/store").await.unwrap();

Or use the sync wrapper:

let store = terminus_store::open_sync_directory_store("/path/to/store").unwrap();
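
To give a feel for the sync API, here is a minimal sketch of creating a database and committing a few triples. The exact calls and paths (create, create_base_layer, add_string_triple, StringTriple::new_value/new_node, commit, set_head) are assumptions based on the crate's documented API and may differ between versions:

    use terminus_store::*;
    use terminus_store::layer::StringTriple;

    fn main() -> std::io::Result<()> {
        // Open (or create) a store in an existing directory.
        let store = open_sync_directory_store("/path/to/store")?;

        // A named database, plus a base layer to hold the first triples.
        let database = store.create("example")?;
        let builder = store.create_base_layer()?;

        // "cow says value(moo)" and "cow likes node(duck)" from the example above.
        builder.add_string_triple(StringTriple::new_value("cow", "says", "moo"))?;
        builder.add_string_triple(StringTriple::new_node("cow", "likes", "duck"))?;

        // Commit the builder into an immutable layer and make it the database head.
        let layer = builder.commit()?;
        database.set_head(&layer)?;
        Ok(())
    }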

For more information, visit the documentation on docs.rs.

Roadmap

We are constantly developing terminusdb-store to make it a high-quality, versioned data-storage layer built on succinct graph representations. To help facilitate understanding of our aims for this project, we have laid out a Roadmap. If you would like to assist in the development of terminusdb-store, or think something should be added to the roadmap, please contact us.

License

terminus-store is licensed under Apache 2.0.

Contributing

See CONTRIBUTING.md

See also

  • The Terminus database, for which this library was written: Website - GitHub
  • Our prolog bindings for this library: terminus_store_prolog
  • The HDT format, which the terminusdb-store layer format is based on: Website
Issues and pull requests
  • src/structure: format and minimize trait bounds

    There are two related changes here. Neither one involves a change to the function of the code. Both are localized to the src/structure directory to reduce the number of changes to review.

    The first change is to run rustfmt over all of the source files in the directory. This is mostly a whitespace change.

    The second is to reduce the usage of trait bounds to only where they are needed. The main changes, illustrated by the sketch after this list, were:

    • In general, the structs did not need trait bounds. I removed all except one, which was necessary.
    • There were a large number of Clone bounds that were not needed. I removed all except the necessary ones, which I attached to functions instead of the implementation.
    • I reduced the number of AsRef<[u8]> bounds to make it clear which functions used it and which did not.
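
    As a generic illustration of the second change (hypothetical types, not the actual ones in src/structure), moving a bound from the struct to the functions that need it looks like this:

    // Before: the bound rides along on the struct and every impl block.
    struct LogArray<M: AsRef<[u8]> + Clone> {
        data: M,
    }

    // After: the struct itself is unconstrained; only functions that
    // actually read the buffer ask for AsRef<[u8]>.
    struct LogArray2<M> {
        data: M,
    }

    impl<M> LogArray2<M> {
        fn len_bytes(&self) -> usize
        where
            M: AsRef<[u8]>,
        {
            self.data.as_ref().len()
        }
    }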
    opened by spl 9
  • Use sled as a storage backend

    https://github.com/spacejam/sled

    Sled is an ACID key-value database with LSM/B+Tree performance characteristics, written in Rust. It uses prefix encoding for keys, which strikes me as very well aligned with terminusdb's storage strategy. It is currently in beta, though some of its caveats wouldn't really apply to tdb currently; e.g. its GC isn't very advanced, but that doesn't matter for tdb since we don't have a GC yet either. :) Sled's readme has a great overview of the project's goals and implementation.

    I think it would be interesting to explore using sled as a (/ an alternative) storage backend for terminusdb-store.

    wontfix 
    opened by infogulch 7
  • How to represent a sequence of bytes

    I'm opening this issue to discuss the appropriate representation for a buffer (i.e. an arbitrary contiguous sequence of bytes) in terminus-store. This discussion will help me get an understanding of the motivation and mechanics of the current approach and probe for reactions to an alternative approach, which I propose at the end. Please feel free to comment on anything or to correct my understanding if necessary.

    Currently, the predominant view of a buffer appears to be M: AsRef<[u8]>. This type implies two things:

    1. A given data: M has the operation data.as_ref() that returns &[u8]. This gives a read-only view of a buffer that can be shared between threads without the option of writing to it.
    2. The struct containing the data: M owns the value referencing the buffer. There is no borrowing of references here.

    This appears to have been changed from a previously predominant view of a buffer as a slice: data: &'a [u8] (1deedbf3e73d99012e70d96fe821cbc7b26d454e, bf6416bc6cf4c83b9a179e0a6a41dd21b000c69b, ad7dd425e85f75d9bdb7336a54a5e3560acbfed2, e5a50a00338ee5304327e424b72ca7eaf584eccc, c6a14f90b505efba9571da19e6f7f3faa84be5f7). This view meant:

    1. The data: &'a [u8] cannot be shared between threads.
    2. The view into the data lasts no longer than the buffer's owner, who has the 'a lifetime.

    Now, given that the buffers currently seem to be backed by one of the two following structs:

    • pub struct SharedVec(pub Arc<Vec<u8>>);
    • pub struct SharedMmap(Option<Arc<FileBacking>>);

    which both have Arc, I presume that the data is being shared read-only between threads. (I'm actually not yet clear on where the sharing is occurring, so if you want to enlighten me, I'd appreciate it!) If there were no sharing, I think the slice approach would be better, since (a) there is less runtime work to manage usage of the buffers and (b) the type system keeps track of the lifetimes.

    I think using M: AsRef<[u8]> is somewhat painful as a schema for typing a buffer. It's too general and leads to trait bounds such as M: 'static + AsRef<[u8]> + Clone + Send + Sync in many places.

    After doing some research, I think something like Bytes from the bytes crate would work better. Bytes is a thread-shareable container representing a contiguous sequence of bytes. It satisfies 'static + AsRef<[u8]> + Clone + Send + Sync. It also supports operations like split_to and split_off, which I think would work well when you want to segment a buffer into different representations. Replacing data: M with data: Bytes would make many of the trait bounds disappear.
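
    For illustration, here is Bytes in action (standard bytes-crate API):

    use bytes::Bytes;

    // Bytes satisfies the full set of bounds that M currently needs.
    fn check<T: 'static + AsRef<[u8]> + Clone + Send + Sync>(_: &T) {}

    fn main() {
        let mut data = Bytes::from(vec![1u8, 2, 3, 4, 5, 6]);
        check(&data);

        // Zero-copy segmentation: head takes the first two bytes, data
        // keeps the rest; both handles share one underlying allocation.
        let head = data.split_to(2);
        assert_eq!(head.as_ref(), &[1u8, 2][..]);
        assert_eq!(data.as_ref(), &[3u8, 4, 5, 6][..]);
    }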

    Unfortunately, Bytes does not support memmap::Mmap, which means it would not suit terminus-store's current usage of AsRef<[u8]>. However, I've already implemented an adaptation of Bytes that does support memmap::Mmap. Others have, too. See https://github.com/tokio-rs/bytes/issues/359.

    Here are some questions prompted by the above:

    • What's the best way to represent a contiguous sequence of bytes in terminus-store?
    • Does it need to be read-only?
    • Does it need to be shared between threads?
    • Would it be useful to use a less general type than AsRef<[u8]>? Could that type be a struct instead of a set of trait bounds?
    opened by spl 5
  • Replace AsRef<[u8]> with Bytes, stop using mmap

    • Transition all uses of AsRef<[u8]> to Bytes
    • Use clone in LogArrayIterator, remove OwnedLogArrayIterator
    • Track first element in LogArray, remove LogArraySlice
    • Use Bytes in FileLoad, remove Map associated type
    • Read files into memory, remove memmap::Mmap

    Closes #39

    opened by spl 4
  • Various LogArray improvements

    • More documentation
    • More tests
    • Consistent naming
    • More error checking
    • Better error reporting
    • Cleaned up and unified entry and decode functions
    • Cleaned up future-based functions (removed Box/dyn, used combinators)
    • Increased code coverage in tests
    • Fixed monotonic check
    opened by spl 4
  • Migrate from tokio 0.1 to 0.2

    At some point, I suppose terminus-store should be migrated from tokio 0.1 to 0.2.

    See the following related items:

    • https://users.rust-lang.org/t/failed-to-port-mononoke-to-tokio-0-2-experience-report/32478
    • https://qiita.com/zenixls2/items/c594c87bf208a7094905
    • https://www.ncameron.org/blog/migrating-a-crate-from-futures-0-1-to-0-3/
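
    For flavor, the shape of the change (a generic illustration, with futures 0.1 imported under the futures01 rename often used during migration; not actual terminus-store code):

    use futures01::Future;

    // tokio 0.1 style: combinator chains over Future<Item, Error>, boxed
    // because the concrete future types are unnameable.
    fn double_01<F>(f: F) -> Box<dyn Future<Item = u64, Error = std::io::Error> + Send>
    where
        F: Future<Item = u64, Error = std::io::Error> + Send + 'static,
    {
        Box::new(f.map(|n| n * 2))
    }

    // tokio 0.2 style: std futures and async/await.
    async fn double_02<F>(f: F) -> std::io::Result<u64>
    where
        F: std::future::Future<Output = std::io::Result<u64>>,
    {
        Ok(f.await? * 2)
    }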
    opened by spl 4
  • MemoryBackedStore cloned on every FileLoad::map

    Consider this snippet of the implementation of the FileLoad trait for MemoryBackedStore (src/storage/memory.rs):

    pub struct MemoryBackedStore {
        vec: Arc<sync::RwLock<Vec<u8>>>,
    }
    
    impl FileLoad for MemoryBackedStore {
        // ...
        fn map(&self) -> Box<dyn Future<Item = SharedVec, Error = std::io::Error> + Send> {
            let vec = self.vec.clone();
            Box::new(future::lazy(move || {
                // Clones the entire Vec<u8> out of the RwLock into a fresh Arc.
                future::ok(SharedVec(Arc::new(vec.read().unwrap().clone())))
            }))
        }
    }
    

    In the expression vec.read().unwrap().clone() here, if my understanding is correct, it appears that the Vec<u8> underlying the RwLock is being cloned.

    I understand that MemoryBackedStore may be primarily intended for testing and, therefore, for relatively small vectors. However, that may not always be the case, and I'd guess that a clone like this could result in excessive memory usage and possibly even a surprise out-of-memory error. (Then again, I could be blowing things out of proportion, pun intended! 💥)

    Would it be a good idea to avoid the .clone() here?
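
    One way to avoid the copy would be to hold the finished contents behind a cheaply clonable handle such as bytes::Bytes (a sketch under that assumption, not the crate's actual fix):

    use bytes::Bytes;
    use std::sync::{Arc, RwLock};

    struct MemoryFile {
        contents: Arc<RwLock<Bytes>>,
    }

    impl MemoryFile {
        fn map(&self) -> Bytes {
            // Cloning a Bytes bumps a reference count; the underlying
            // buffer is shared, not copied.
            self.contents.read().unwrap().clone()
        }
    }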

    opened by spl 4
  • Lookup refactor

    Lookups have been refactored.

    For each lookup type (SubjectLookup, SubjectPredicateLookup, PredicateLookup, ObjectLookup) there is now a 'Layer' version (LayerSubjectLookup, LayerSubjectPredicateLookup, LayerPredicateLookup, LayerObjectLookup) which represents a lookup in either the layer's additions or the layer's removals, without taking parent layers into account.

    The original implementations of SubjectLookup, SubjectPredicateLookup, PredicateLookup and ObjectLookup have been removed and replaced by single generic implementations: GenericSubjectLookup, GenericSubjectPredicateLookup, GenericPredicateLookup and GenericObjectLookup. These are implemented in terms of the new layer lookup traits.

    Unlike the old Lookups these are not implemented recursively, so stack exhaustion should no longer be an issue.
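
    Schematically, the new shape is something like this (hypothetical signatures, not the actual traits):

    // Layer-local lookup: one layer's additions or removals only.
    trait LayerSubjectLookup {
        fn triples(&self) -> Box<dyn Iterator<Item = (u64, u64, u64)> + '_>;
    }

    // The generic lookup walks the layer stack iteratively instead of
    // each lookup recursing into its parent, so a deep stack of layers
    // cannot exhaust the call stack.
    struct GenericSubjectLookup {
        layers: Vec<Box<dyn LayerSubjectLookup>>, // newest layer first
    }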

    opened by matko 4
  • Implement layer export and import more generically

    This implements layer export and import for the PersistentLayerStore, which currently backs both the memory and the directory layer store (yes it is a bit of a misnomer).

    This closes #58. It is also an initial step towards #62, implementing the pack functionality as a separate trait in its own file instead of as part of a bigger trait. It also makes use of the async_trait macro as mentioned in #56.

    opened by matko 3
  • Is terminusdb-store able to handle a hundred billion triples in its current version?

    We have a large triple store (over 50 billion triples) in Virtuoso and would like to know whether terminusdb can handle graphs of this size, or plans to. Thanks!

    opened by leonqli 3
  • Rewrite vbyte.rs

    • Remove unnecessary VByte struct
    • Add error checking to decode
    • Optimize encode loop
    • Add slice wrapper for encoding that does error checking
    • Add Vec wrapper for encoding that ensures the right size buffer
    • Fix encoding_len for 0
    • Lots of documentation
    • Lots of tests
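
    For reference, the general shape of variable-byte encoding (a generic sketch; the crate's actual byte layout, such as which bit marks the final byte, may differ):

    // Encode 7 bits per byte; the high bit says whether more bytes follow.
    fn vbyte_encode(mut n: u64, out: &mut Vec<u8>) {
        loop {
            let byte = (n & 0x7f) as u8;
            n >>= 7;
            if n == 0 {
                out.push(byte); // final byte: continuation bit clear
                return;
            }
            out.push(byte | 0x80); // more bytes follow
        }
    }

    // Bytes needed to encode n. Note that 0 still occupies one byte,
    // which is the edge case "Fix encoding_len for 0" refers to.
    fn encoding_len(n: u64) -> usize {
        let bits = (64 - n.leading_zeros() as usize).max(1);
        (bits + 6) / 7
    }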
    opened by spl 3
  • Typed dicts

    This pull request implements typed dictionaries, a new dictionary type for terminusdb values which uses a different efficient encoding for each type, is able to retrieve by type, and is able to do ranged queries over ordered types. It avoids linear-time operations, making every lookup constant-time or logarithmic-time. It is intended to replace the existing pfc implementation, at least for the values, but probably also for the nodes and predicates.

    This is a work in progress.
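
    One plausible shape for such per-type encodings (purely hypothetical, not the PR's actual layout):

    // Each datatype gets its own compact, order-preserving byte encoding,
    // so a typed dictionary can binary-search within a type and answer
    // range queries over ordered types.
    enum TypedValue {
        String(String),
        Int64(i64),   // e.g. stored big-endian with the sign bit flipped
        Float64(f64), // e.g. stored via an order-preserving bit transform
    }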

    opened by matko 1
  • Refactor to go fully in-memory first, and persistent second

    A lot of the data structures are currently written around the idea that we're persisting layers directly to some persistent storage. The memory store is then implemented as a special kind of persistence which just saves things in a hashmap.

    This is kinda weird and the wrong way around. We actually almost always want to build up stuff in memory first, and only then (possibly after some more checks, like schema checking) persist to disk.

    Therefore, we should consider refactoring terminusdb-store to be in-memory first. That means that none of the data structures, nor the layer structures, should have to know about an underlying storage method (FileLoad, FileStore, etc), but instead directly work on Bytes. Builders likewise should not have to know anything about storage, but build directly into Bytes.

    Doing all this will also reduce the complexity of the typing in store, as a lot of that is now complicated by the fact that in many places we want to work with builders without knowing this underlying storage method. If all actual persisting logic is moved into the various LayerStores, we probably don't have to include FileLoad or FileStore in any of the typing anymore.
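
    Schematically, a storage-agnostic builder would write straight into a bytes buffer (a hypothetical sketch using bytes::BytesMut, not a planned API):

    use bytes::{BufMut, Bytes, BytesMut};

    // The builder knows nothing about files; it produces an immutable
    // Bytes buffer that a LayerStore may choose to persist later.
    struct StructureBuilder {
        buf: BytesMut,
    }

    impl StructureBuilder {
        fn push_u64(&mut self, n: u64) {
            self.buf.put_u64(n); // appends big-endian
        }

        fn finalize(self) -> Bytes {
            self.buf.freeze()
        }
    }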

    triage 
    opened by matko 0
  • open files concurrently while creating a new layer

    When creating a new base or child layer, the PersistentLayerStore opens the files sequentially. They should be opened concurrently to prevent unnecessary waiting.
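
    For illustration, opening a batch of files concurrently with tokio and the futures crate (a sketch; the store's actual file-handling code differs):

    use futures::future::try_join_all;
    use tokio::fs::File;

    // Start every open at once and await the whole batch, instead of
    // awaiting each open before starting the next.
    async fn open_all(names: &[&str]) -> std::io::Result<Vec<File>> {
        try_join_all(names.iter().map(File::open)).await
    }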

    triage 
    opened by matko 1
  • Support wildcard and superfluous deletes

    Currently deletes are per-triple. This means that deleting all triples for a particular subject has to be done by querying that subject's triples, and deleting them all individually. Besides being an annoying way of deleting, it also leads to a duplication of all the data being deleted, and we lose semantic value, as we do not save the fact that we deleted everything related to a particular subject.

    I propose we implement wildcard deletes. Wildcard deletes would assert that for a particular subject, predicate, or object, or for a particular subject-predicate or predicate-object pair, all triples have been deleted in a particular layer.

    Additionally, it should be possible to insert such a wildcard delete even if there's no previous insert that'd match the wildcard. This would allow us to shortcut many queries searching for data by specifying at a certain layer that this data will not be found no matter how deep the query drills into the layer stack.

    I'm not sure if we can implement wildcard deletes with the present structures or if a new set of structures will be needed.
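
    As a sketch, the patterns a wildcard delete would have to record (a hypothetical representation over dictionary ids):

    // A wildcard delete asserts "everything matching this pattern is
    // deleted as of this layer", regardless of what parent layers hold.
    enum DeletePattern {
        Subject(u64),
        Predicate(u64),
        Object(u64),
        SubjectPredicate(u64, u64),
        PredicateObject(u64, u64),
    }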

    triage 
    opened by matko 2
  • New recursive iteration strategy

    Our current iterators drill down the entire layer stack to find triples, returning them in a stable order. This order is stable even if you do delta rollups, which allows fast comparisons in some cases.

    In the majority of cases though, the benefits of this approach are irrelevant, and we get to our first result much more slowly than we otherwise would. It'd be good to have an iteration strategy which recursively descends through the layers, returning triples as they are found.

    This is especially useful for cases where we're really only interested in a single result, such as when there can be only one result due to a cardinality constraint enforced externally.

    Implement this new iteration strategy as a new set of iterators accessible through a new set of methods in layer.
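
    Schematically (a hypothetical sketch over in-memory layers, ignoring removals), the difference is yielding triples as each layer produces them instead of merging everything up front:

    // Walk the layer stack top-down and stream additions as found; the
    // first result arrives without paying for a full merge.
    fn triples_eager(
        layers: &[Vec<(u64, u64, u64)>], // oldest first
    ) -> impl Iterator<Item = (u64, u64, u64)> + '_ {
        layers.iter().rev().flat_map(|layer| layer.iter().copied())
    }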

    planning-week 
    opened by matko 0