terminusdb-store, a tokio-enabled data store for triple data

Overview

This library implements a way to store triple data - data that consists of a subject, a predicate and an object, where the object can be either a value or a node (a string that can appear in both subject and object position).

An example of triple data is:

cow says value(moo).
duck says value(quack).
cow likes node(duck).
duck hates node(cow).

In cow says value(moo), cow is the subject, says is the predicate, and value(moo) is the object.

In cow likes node(duck), cow is the subject, likes is the predicate, and node(duck) is the object.

terminusdb-store allows you to store a large number of such facts and search through them efficiently.

This library is intended as a common base for anyone who wishes to build a database containing triple data. It makes very few assumptions about what valid data is, focusing only on the actual storage aspect.

This library is tokio-enabled. All I/O and locking happens through futures, and as a result many of the functions in this library return futures. These futures are intended to run on a tokio runtime, and many of them will fail outside of one. If you do not wish to use tokio, there's a small sync wrapper in store::sync which embeds its own tokio runtime and exposes a purely synchronous API.

Usage

Add this to your Cargo.toml:

[dependencies]
terminus-store = "0.16"

Create a directory where you want the store to be, then open that store with:

let store = terminus_store::open_directory_store("/path/to/store").await.unwrap();

Or use the sync wrapper:

let store = terminus_store::open_sync_directory_store("/path/to/store").unwrap();
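
To give a feel for the sync API, here is a minimal sketch of creating a database and committing a few triples. The exact calls and paths (create, create_base_layer, add_string_triple, StringTriple::new_value/new_node, commit, set_head) are assumptions based on the crate's documented API and may differ between versions:

    use terminus_store::*;
    use terminus_store::layer::StringTriple;

    fn main() -> std::io::Result<()> {
        // Open (or create) a store in an existing directory.
        let store = open_sync_directory_store("/path/to/store")?;

        // A named database, plus a base layer to hold the first triples.
        let database = store.create("example")?;
        let builder = store.create_base_layer()?;

        // "cow says value(moo)" and "cow likes node(duck)" from the example above.
        builder.add_string_triple(StringTriple::new_value("cow", "says", "moo"))?;
        builder.add_string_triple(StringTriple::new_node("cow", "likes", "duck"))?;

        // Commit the builder into an immutable layer and make it the database head.
        let layer = builder.commit()?;
        database.set_head(&layer)?;
        Ok(())
    }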

For more information, visit the documentation on docs.rs.

Roadmap

We are constantly developing terminusdb-store to make it a high-quality, versioned data-storage layer built on succinct graph representations. To help facilitate understanding of our aims for this project, we have laid out a Roadmap. If you would like to assist in the development of terminusdb-store, or think something should be added to the roadmap, please contact us.

License

terminus-store is licensed under Apache 2.0.

Contributing

See CONTRIBUTING.md

See also

  • The Terminus database, for which this library was written: Website - GitHub
  • Our prolog bindings for this library: terminus_store_prolog
  • The HDT format, which the terminusdb-store layer format is based on: Website
Issues and pull requests
  • src/structure: format and minimize trait bounds

    There are two related changes here. Neither one involves a change to the function of the code. Both are localized to the src/structure directory to reduce the number of changes to review.

    The first change is to run rustfmt over all of the source files in the directory. This is mostly a whitespace change.

    The second is to reduce the usage of trait bounds to only where they are needed. The main changes, illustrated by the sketch after this list, were:

    • In general, the structs did not need trait bounds. I removed all except one, which was necessary.
    • There were a large number of Clone bounds that were not needed. I removed all except the necessary ones, which I attached to functions instead of the implementation.
    • I reduced the number of AsRef<[u8]> bounds to make it clear which functions used it and which did not.
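
    As a generic illustration of the second change (hypothetical types, not the actual ones in src/structure), moving a bound from the struct to the functions that need it looks like this:

    // Before: the bound rides along on the struct and every impl block.
    struct LogArray<M: AsRef<[u8]> + Clone> {
        data: M,
    }

    // After: the struct itself is unconstrained; only functions that
    // actually read the buffer ask for AsRef<[u8]>.
    struct LogArray2<M> {
        data: M,
    }

    impl<M> LogArray2<M> {
        fn len_bytes(&self) -> usize
        where
            M: AsRef<[u8]>,
        {
            self.data.as_ref().len()
        }
    }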
    opened by spl 9
  • Use sled as a storage backend

    https://github.com/spacejam/sled

    Sled is an ACID key-value database with LSM/B+Tree performance characteristics, written in Rust. It uses prefix encoding for keys, which strikes me as very well aligned with terminusdb's storage strategy. It is currently in beta, though some of its caveats wouldn't really apply to tdb currently; e.g. its GC isn't very advanced, but that doesn't matter for tdb since we don't have a GC yet either. :) Sled's readme has a great overview of the project's goals and implementation.

    I think it would be interesting to explore using sled as a (/ an alternative) storage backend for terminusdb-store.

    wontfix 
    opened by infogulch 7
  • How to represent a sequence of bytes

    I'm opening this issue to discuss the appropriate representation for a buffer (i.e. an arbitrary contiguous sequence of bytes) in terminus-store. This discussion will help me get an understanding of the motivation and mechanics of the current approach and probe for reactions to an alternative approach, which I propose at the end. Please feel free to comment on anything or to correct my understanding if necessary.

    Currently, the predominant view of a buffer appears to be M: AsRef<[u8]>. This type implies two things:

    1. A given data: M has the operation data.as_ref() that returns &[u8]. This gives a read-only view of a buffer that can be shared between threads without the option of writing to it.
    2. The struct containing the data: M owns the value referencing the buffer. There is no borrowing of references here.

    This appears to have been changed from a previously predominant view of a buffer as a slice: data: &'a [u8] (1deedbf3e73d99012e70d96fe821cbc7b26d454e, bf6416bc6cf4c83b9a179e0a6a41dd21b000c69b, ad7dd425e85f75d9bdb7336a54a5e3560acbfed2, e5a50a00338ee5304327e424b72ca7eaf584eccc, c6a14f90b505efba9571da19e6f7f3faa84be5f7). This view meant:

    1. The data: &'a [u8] cannot be shared between threads.
    2. The view into the data lasts no longer than the buffer's owner, who has the 'a lifetime.

    Now, given that the buffers currently seem to be backed by one of the two following structs:

    • pub struct SharedVec(pub Arc<Vec<u8>>);
    • pub struct SharedMmap(Option<Arc<FileBacking>>);

    which both have Arc, I presume that the data is being shared read-only between threads. (I'm actually not yet clear on where the sharing is occurring, so if you want to enlighten me, I'd appreciate it!) If there were no sharing, I think the slice approach would be better, since (a) there is less runtime work to manage usage of the buffers and (b) the type system keeps track of the lifetimes.

    I think using M: AsRef<[u8]> is somewhat painful as a schema for typing a buffer. It's too general and leads to trait bounds such as M: 'static + AsRef<[u8]> + Clone + Send + Sync in many places.

    After doing some research, I think something like Bytes from the bytes crate would work better. Bytes is a thread-shareable container representing a contiguous sequence of bytes. It satisfies 'static + AsRef<[u8]> + Clone + Send + Sync. It also supports operations like split_to and split_off, which I think would work well when you want to segment a buffer into different representations. Replacing data: M with data: Bytes would make many of the trait bounds disappear.
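
    For illustration, here is Bytes in action (standard bytes-crate API):

    use bytes::Bytes;

    // Bytes satisfies the full set of bounds that M currently needs.
    fn check<T: 'static + AsRef<[u8]> + Clone + Send + Sync>(_: &T) {}

    fn main() {
        let mut data = Bytes::from(vec![1u8, 2, 3, 4, 5, 6]);
        check(&data);

        // Zero-copy segmentation: head takes the first two bytes, data
        // keeps the rest; both handles share one underlying allocation.
        let head = data.split_to(2);
        assert_eq!(head.as_ref(), &[1u8, 2][..]);
        assert_eq!(data.as_ref(), &[3u8, 4, 5, 6][..]);
    }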

    Unfortunately, Bytes does not support memmap::Mmap, which means it would not suit terminus-store's current usage of AsRef<[u8]>. However, I've already implemented an adaptation of Bytes that does support memmap::Mmap. Others have, too. See https://github.com/tokio-rs/bytes/issues/359.

    Here are some questions prompted by the above:

    • What's the best way to represent a contiguous sequence of bytes in terminus-store?
    • Does it need to be read-only?
    • Does it need to be shared between threads?
    • Would it be useful to use a less general type than AsRef<[u8]>? Could that type be a struct instead of a set of trait bounds?
    opened by spl 5
  • Replace AsRef<[u8]> with Bytes, stop using mmap

    • Transition all uses of AsRef<[u8]> to Bytes
    • Use clone in LogArrayIterator, remove OwnedLogArrayIterator
    • Track first element in LogArray, remove LogArraySlice
    • Use Bytes in FileLoad, remove Map associated type
    • Read files into memory, remove memmap::Mmap

    Closes #39

    opened by spl 4
  • Various LogArray improvements

    • More documentation
    • More tests
    • Consistent naming
    • More error checking
    • Better error reporting
    • Cleaned up and unified entry and decode functions
    • Cleaned up future-based functions (removed Box/dyn, used combinators)
    • Increased code coverage in tests
    • Fixed monotonic check
    opened by spl 4
  • Migrate from tokio 0.1 to 0.2

    At some point, I suppose terminus-store should be migrated from tokio 0.1 to 0.2.

    See the following related items:

    • https://users.rust-lang.org/t/failed-to-port-mononoke-to-tokio-0-2-experience-report/32478
    • https://qiita.com/zenixls2/items/c594c87bf208a7094905
    • https://www.ncameron.org/blog/migrating-a-crate-from-futures-0-1-to-0-3/
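
    For flavor, the shape of the change (a generic illustration, with futures 0.1 imported under the futures01 rename often used during migration; not actual terminus-store code):

    use futures01::Future;

    // tokio 0.1 style: combinator chains over Future<Item, Error>, boxed
    // because the concrete future types are unnameable.
    fn double_01<F>(f: F) -> Box<dyn Future<Item = u64, Error = std::io::Error> + Send>
    where
        F: Future<Item = u64, Error = std::io::Error> + Send + 'static,
    {
        Box::new(f.map(|n| n * 2))
    }

    // tokio 0.2 style: std futures and async/await.
    async fn double_02<F>(f: F) -> std::io::Result<u64>
    where
        F: std::future::Future<Output = std::io::Result<u64>>,
    {
        Ok(f.await? * 2)
    }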
    opened by spl 4
  • MemoryBackedStore cloned on every FileLoad::map

    Consider this snippet of the implementation of the FileLoad trait for MemoryBackedStore (src/storage/memory.rs):

    pub struct MemoryBackedStore {
        vec: Arc<sync::RwLock<Vec<u8>>>,
    }
    
    impl FileLoad for MemoryBackedStore {
        // ...
        fn map(&self) -> Box<dyn Future<Item = SharedVec, Error = std::io::Error> + Send> {
            let vec = self.vec.clone();
            Box::new(future::lazy(move || {
                // Clones the entire Vec<u8> out of the RwLock into a fresh Arc.
                future::ok(SharedVec(Arc::new(vec.read().unwrap().clone())))
            }))
        }
    }
    

    In the expression vec.read().unwrap().clone() here, if my understanding is correct, it appears that the Vec<u8> underlying the RwLock is being cloned.

    I understand that MemoryBackedStore may be primarily intended for testing and, therefore, for relatively small vectors. However, that may not always be the case, and I'd guess that a clone like this could result in excessive memory usage and possibly even a surprise out-of-memory error. (Then again, I could be blowing things out of proportion, pun intended! 💥)

    Would it be a good idea to avoid the .clone() here?
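
    One way to avoid the copy would be to hold the finished contents behind a cheaply clonable handle such as bytes::Bytes (a sketch under that assumption, not the crate's actual fix):

    use bytes::Bytes;
    use std::sync::{Arc, RwLock};

    struct MemoryFile {
        contents: Arc<RwLock<Bytes>>,
    }

    impl MemoryFile {
        fn map(&self) -> Bytes {
            // Cloning a Bytes bumps a reference count; the underlying
            // buffer is shared, not copied.
            self.contents.read().unwrap().clone()
        }
    }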

    opened by spl 4
  • Lookup refactor

    Lookups have been refactored.

    For each lookup type (SubjectLookup, SubjectPredicateLookup, PredicateLookup, ObjectLookup) there is now a 'Layer' version (LayerSubjectLookup, LayerSubjectPredicateLookup, LayerPredicateLookup, LayerObjectLookup) which represents a lookup in either the layer's additions or the layer's removals, without taking parent layers into account.

    The original implementations of SubjectLookup, SubjectPredicateLookup, PredicateLookup and ObjectLookup have been removed and replaced by single generic implementations: GenericSubjectLookup, GenericSubjectPredicateLookup, GenericPredicateLookup and GenericObjectLookup. These are implemented in terms of the new layer lookup traits.

    Unlike the old Lookups these are not implemented recursively, so stack exhaustion should no longer be an issue.
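
    Schematically, the new shape is something like this (hypothetical signatures, not the actual traits):

    // Layer-local lookup: one layer's additions or removals only.
    trait LayerSubjectLookup {
        fn triples(&self) -> Box<dyn Iterator<Item = (u64, u64, u64)> + '_>;
    }

    // The generic lookup walks the layer stack iteratively instead of
    // each lookup recursing into its parent, so a deep stack of layers
    // cannot exhaust the call stack.
    struct GenericSubjectLookup {
        layers: Vec<Box<dyn LayerSubjectLookup>>, // newest layer first
    }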

    opened by matko 4
  • Implement layer export and import more generically

    This implements layer export and import for the PersistentLayerStore, which currently backs both the memory and the directory layer store (yes it is a bit of a misnomer).

    This closes #58. It is also an initial step towards #62, implementing the pack functionality as a separate trait in its own file instead of as part of a bigger trait. It also makes use of the async_trait macro as mentioned in #56.

    opened by matko 3
  • Is terminusdb-store able to handle a hundred billion triples in its current version?

    We have a large triple store (over 50 billion triples) in Virtuoso and would like to know whether terminusdb can handle graphs of this size, or plans to. Thanks!

    opened by leonqli 3
  • Rewrite vbyte.rs

    • Remove unnecessary VByte struct
    • Add error checking to decode
    • Optimize encode loop
    • Add slice wrapper for encoding that does error checking
    • Add Vec wrapper for encoding that ensures the right size buffer
    • Fix encoding_len for 0
    • Lots of documentation
    • Lots of tests
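
    For reference, the general shape of variable-byte encoding (a generic sketch; the crate's actual byte layout, such as which bit marks the final byte, may differ):

    // Encode 7 bits per byte; the high bit says whether more bytes follow.
    fn vbyte_encode(mut n: u64, out: &mut Vec<u8>) {
        loop {
            let byte = (n & 0x7f) as u8;
            n >>= 7;
            if n == 0 {
                out.push(byte); // final byte: continuation bit clear
                return;
            }
            out.push(byte | 0x80); // more bytes follow
        }
    }

    // Bytes needed to encode n. Note that 0 still occupies one byte,
    // which is the edge case "Fix encoding_len for 0" refers to.
    fn encoding_len(n: u64) -> usize {
        let bits = (64 - n.leading_zeros() as usize).max(1);
        (bits + 6) / 7
    }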
    opened by spl 3
  • Typed dicts

    This pull request implements typed dictionaries, a new dictionary type for terminusdb values which uses a different efficient encoding for each type, is able to retrieve by type, and is able to do ranged queries over ordered types. It avoids linear-time operations, making every lookup constant-time or logarithmic-time. It is intended to replace the existing pfc implementation, at least for the values, but probably also for the nodes and predicates.

    This is a work in progress.
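
    One plausible shape for such per-type encodings (purely hypothetical, not the PR's actual layout):

    // Each datatype gets its own compact, order-preserving byte encoding,
    // so a typed dictionary can binary-search within a type and answer
    // range queries over ordered types.
    enum TypedValue {
        String(String),
        Int64(i64),   // e.g. stored big-endian with the sign bit flipped
        Float64(f64), // e.g. stored via an order-preserving bit transform
    }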

    opened by matko 1
  • Refactor to go fully in-memory first, and persistent second

    A lot of the data structures are currently written around the idea that we're persisting layers directly to some persistent storage. The memory store is then implemented as a special kind of persistence which just saves things in a hashmap.

    This is kinda weird and the wrong way around. We actually almost always want to build up stuff in memory first, and only then (possibly after some more checks, like schema checking) persist to disk.

    Therefore, we should consider refactoring terminusdb-store to be in-memory first. That means that none of the data structures, nor the layer structures, should have to know about an underlying storage method (FileLoad, FileStore, etc), but instead directly work on Bytes. Builders likewise should not have to know anything about storage, but build directly into Bytes.

    Doing all this will also reduce the complexity of the typing in store, as a lot of that is now complicated by the fact that in many places we want to work with builders without knowing this underlying storage method. If all actual persisting logic is moved into the various LayerStores, we probably don't have to include FileLoad or FileStore in any of the typing anymore.
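
    Schematically, a storage-agnostic builder would write straight into a bytes buffer (a hypothetical sketch using bytes::BytesMut, not a planned API):

    use bytes::{BufMut, Bytes, BytesMut};

    // The builder knows nothing about files; it produces an immutable
    // Bytes buffer that a LayerStore may choose to persist later.
    struct StructureBuilder {
        buf: BytesMut,
    }

    impl StructureBuilder {
        fn push_u64(&mut self, n: u64) {
            self.buf.put_u64(n); // appends big-endian
        }

        fn finalize(self) -> Bytes {
            self.buf.freeze()
        }
    }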

    triage 
    opened by matko 0
  • open files concurrently while creating a new layer

    When creating a new base or child layer, the PersistentLayerStore opens the files sequentially. They should be opened concurrently to prevent unnecessary waiting.
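
    For illustration, opening a batch of files concurrently with tokio and the futures crate (a sketch; the store's actual file-handling code differs):

    use futures::future::try_join_all;
    use tokio::fs::File;

    // Start every open at once and await the whole batch, instead of
    // awaiting each open before starting the next.
    async fn open_all(names: &[&str]) -> std::io::Result<Vec<File>> {
        try_join_all(names.iter().map(File::open)).await
    }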

    triage 
    opened by matko 1
  • Support wildcard and superfluous deletes

    Currently deletes are per-triple. This means that deleting all triples for a particular subject has to be done by querying that subject's triples, and deleting them all individually. Besides being an annoying way of deleting, it also leads to a duplication of all the data being deleted, and we lose semantic value, as we do not save the fact that we deleted everything related to a particular subject.

    I propose we implement wildcard deletes. Wildcard deletes would assert that for a particular subject, predicate, or object, or for a particular subject-predicate or predicate-object pair, all triples have been deleted in a particular layer.

    Additionally, it should be possible to insert such a wildcard delete even if there's no previous insert that'd match the wildcard. This would allow us to shortcut many queries searching for data by specifying at a certain layer that this data will not be found no matter how deep the query drills into the layer stack.

    I'm not sure if we can implement wildcard deletes with the present structures or if a new set of structures will be needed.
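
    As a sketch, the patterns a wildcard delete would have to record (a hypothetical representation over dictionary ids):

    // A wildcard delete asserts "everything matching this pattern is
    // deleted as of this layer", regardless of what parent layers hold.
    enum DeletePattern {
        Subject(u64),
        Predicate(u64),
        Object(u64),
        SubjectPredicate(u64, u64),
        PredicateObject(u64, u64),
    }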

    triage 
    opened by matko 2
  • New recursive iteration strategy

    Our current iterators drill down the entire layer stack to find triples, returning them in a stable order. This order is stable even if you do delta rollups, which allows fast comparisons in some cases.

    In the majority of cases though, the benefits of this approach are irrelevant, and we get to our first result much more slowly than we otherwise would. It'd be good to have an iteration strategy which recursively descends through the layers, returning triples as they are found.

    This is especially useful for cases where we're really only interested in a single result, such as when there can be only one result due to a cardinality constraint enforced externally.

    Implement this new iteration strategy as a new set of iterators accessible through a new set of methods in layer.
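
    Schematically (a hypothetical sketch over in-memory layers, ignoring removals), the difference is yielding triples as each layer produces them instead of merging everything up front:

    // Walk the layer stack top-down and stream additions as found; the
    // first result arrives without paying for a full merge.
    fn triples_eager(
        layers: &[Vec<(u64, u64, u64)>], // oldest first
    ) -> impl Iterator<Item = (u64, u64, u64)> + '_ {
        layers.iter().rev().flat_map(|layer| layer.iter().copied())
    }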

    planning-week 
    opened by matko 0