Uindex - Universal index

Uindex is a data store, for data that can be parsed as sentences in some context-free language. With it, it is possible to create databases (db), add data to them, and query that data. The shape of each db (or, we might say, its schema) is given by a parsing expression grammar (PEG), so that each database holds data in the form of unicode strings structured according to the top production of the provided PEG.

Queries to uindex also take the forms specified in the PEG: they are just sentences in the language described by the PEG. In addition, in queries we can use variables, in place of any of the productions in the PEG, to retrieve unknown values from the database. We can also use more than a single sentence in the queries.

Uindex stores data in such a manner that adding new sentences to a db is O(1) with respect to the size of the db, and that allows queries with no unknowns to be resolved in O(1) also wrt the size of the db.

Example

As an example, we will build a very simple database of triples, subject-verb-object. Words (be they subjects, verbs or objects) will have the form of strings of alphanumeric characters, and sentences will consist in 3 such words separated by spaces. Uindex provides at the moment a sentence termination symbol (either <> or ◊), though the plan is to allow the user to provide it. So an example sentence in this db could be susan likes oranges ◊. The PEG for this would be:

fact        = { word ~ word ~ word }

word        = @{ ASCII_ALPHANUMERIC+ }

WHITESPACE  = { " " | "\t" | "\r" | "\n" }

Uindex uses Pest to deal with PEGs, so look up the pest documentation for the specific syntax for PEGs used by uindex. In particular, the WHITESPACE rule is special for Pest, and if provided, it will be inserted between any 2 other chained or repeated productions (except within atomic productions, which are marked with @.)

The only uindex specific thing in the grammar above is naming the top production fact; uindex requires it.

Now we want to specify which of our productions can be unknowns in our queries. We transform our grammar as follows:

var         = @{ "X" ~ ('0'..'9')+ }

fact        = { word ~ word ~ word }

v_word      = @{ ASCII_ALPHANUMERIC+ }

word        = _{ var | v_word }

WHITESPACE  = { " " | "\t" | "\r" | "\n" }

In short, we provide a production for variables, we prefix the production we want variables to be able to match with v_, and then we provide a new production, with the old name (without the v_ prefix), which is the sum of the old production and var. We can mark as many productions like this as we want, and they can be terminal or not.

To use this grammar, we need to set up some boilerplate. At this moment, uindex can only be used from Rust.

So we store the code above in a file named grammar.pest, which we place at the root of our rust package.

We must add some dependencies to our Cargo.toml:

[dependencies]
uindex = "0.1.1"
uindex_derive = "0.1.1"
pest = "2.1.3"
pest_derive = "2.1.0"
log = "0.4"
env_logger = "0.7.1"

Then, we build our knowledge base based on the grammar, adding this code in a rust module:

use crate::uindex::kbase::DBGen;
use crate::uindex::kbase::DataBase;

extern crate uindex
#[macro_use]
extern crate uindex_derive;

extern crate pest;
#[macro_use]
extern crate pest_derive;

#[derive(DBGen)]
#[grammar = "grammar.pest"]
pub struct DBGenerator;

This provides us with a struct DBGenerator, whose only responsibility is to create databases that can hold sentences according to grammar.pest. So now we can build a database:

let db = DBGenerator::gen_db();

We can add data to it:

db.tell("susan likes oranges ◊");
db.tell("susan likes apples ◊");
db.tell("john likes oranges ◊");
db.tell("john hates apples ◊");

Finally we can query the system like:

true db.ask("john likes apples ◊"); // -> false db.ask("susan likes X1 ◊"); // -> [{X1: oranges}, {X1: apples}] db.ask("X1 likes oranges ◊ X1 likes apples ◊"); // -> [{X1: susan}] db.ask("susan likes X1 ◊ john likes X1 ◊"); // -> [{X1: oranges}] db.ask("susan X1 apples ◊ john X1 apples ◊"); // -> [] ">

db.ask("john likes oranges ◊");  // -> true
db.ask("john likes apples ◊");  // -> false
db.ask("susan likes X1 ◊");  // -> [{X1: oranges}, {X1: apples}]
db.ask("X1 likes oranges ◊ X1 likes apples ◊");  // -> [{X1: susan}]
db.ask("susan likes X1 ◊ john likes X1 ◊");  // -> [{X1: oranges}]
db.ask("susan X1 apples ◊ john X1 apples ◊");  // -> []

And that's it.

Indexing

Uindex keeps the data in a tree structure (with some loops), so querying implies searching a path in the tree. The tokens that are parsed out of the data entered into uindex are placed in the tree in the order they are found, from root to leaf. So it is not convenient to have variables at the start of your queries. It is fine to have them at the begining of sentences when they are not the first sentence in the query, and the variable in question has already been narrowed down. Ill show an example.

Lets imagine a directory of phone numbers, assigning a number to each distnct pair of given-name and surname. If we arrange our db to hold data as "number given-name surname", and we query "X1 john smith", uindex will check almost all the tree to find the answer. However, if we arrange the db to hold data as "surname given-name number", the query "john smith X1" would check a very narrow part of the tree.

If there were the need to get data fast in both ways, i.e. getting the number from the name, and getting the name from the number, it might make sense to have both types of sentence, at the cost of some data redundancy.

API

Grammar

DB Generator

tell

ask

Complexity

Data structures and algorithms

Benchmarks

Here we compare the performance of uindex with the performance of in memory SQLite (driven from python). This is not to say that uindex can be considered an alternative to SQLite, but simply to show that uindex performs acceptably, i.e. that its costs, in terms of both time and space, are sensible and grow sensibly.

Also note that in terms of space, there is work to be done. Typically, a uindex db would take between 2 and 3 times the memory of a SQLite db with the same data.

Simple db, simple query

For this benchmark we used data with a very simple structure, just a set of triples like in the above example, and fully qualified queries that would just retrieve a single row / sentence, to obtain a yes/no answer. For SQLite, we used a single table with 3 varchar columns, with a single index using all 3 columns. We added up to 10.000.000 entries, and measured the time taken to add new entries and to resolve single answer queries. In this benchmark, the performance in both cases did not degrade with the size of the db.

	uindex	SQLite
insert	4.36 +/- 0.45 μs	12.52 +/- 1.26 μs
query	4.00 +/- 0.53 μs	7.94 +/- 0.91 μs

Code for the uindex benchmark

Code for the SQLite benchmark

Simple db, query with intersection

For this benchmark we used the same data as in the previous benchmark, and queries that would extract a value common to 2 rows / sentences. In this benchmark, the performance in both cases also did not degrade with the size of the db.

	uindex	SQLite
query	9.69 +/- 0.73 μs	21.79 +/- 2.55 μs

Code for the uindex benchmark

Code for the SQLite benchmark

Simple db, query returning multiple rows

For this benchmark we used the same data as in the previous benchmarks, and queries that would extract a number of rows, from 100 to 1000. Adding data had the same cost as in the previous benchmark. The cost of querying data grew a bit more with the number of hits for SQLite:

Code for the uindex benchmark

Code for the SQLite benchmark

Db with 3 tables, query joining all 3

Here we set up a db with 3 tables, one of them with foreign keys to the other 2, and query for data in one of the boundary tables providing data from the other. There was no degradation of performance adding up to 1.000.000 entries in each table, neither for uindex nor for SQLite.

	uindex	SQLite
insert	10.25 +/- 0.86 μs	47.56 +/- 8.01 μs
query	16.94 +/- 1.06 μs	12.82 +/- 0.75 μs

Note that for SQLite, in this case we wanted to check for duplicates before inserting, affecting the performance. For uindex this is given.

Code for the uindex benchmark

Code for the SQLite benchmark

Recursive db.

In this benchmark we set up a store of trees of varying width (number of children per branch) and depth (number of branches from the root to a leaf). An example data point in this scheme:

(6 (60 617 64) (31 493 538))

Each parenthesis is a branch, in which the 1st entry is the name of the branch and the rest are the children; so the previous is a depth 2 width 2 tree.

With SQLite, I have tried with 3 tables, Branch Leaf and Child, where Child would hold a "parent" foreign key to Branch and a "child" forign key to either Child or Branch. I haven not found a combination of indexes that gave a performance anywhere near acceptable, so I am not including SQLite results here. It could take 10 and 12 seconds to find a tree in a db with 20.000 trees of depth 2 and width 2; obviously SQLite is not meant for this kind of load.

However, I want to show that uindex has no problem with this kind of structure, So here is the performance of uindex with trees of depth 2 and width 2, depth 2 and width 3, and depth 3 and witdth 3, querying just for the presence of a particular tree (no unknowns in the query). Performance did not degrade with the size of the db, having tested this with 1.000.000 trees in the db:

	insert	query
2-2	10.73 +/- 2.19 μs	7.75 +/- 1.30 μs
2-3	17.40 +/- 3.14 μs	12.26 +/- 1.47 μs
3-3	55.73 +/- 5.47 μs	38.79 +/- 3.26 μs

Code for the uindex benchmark

Code for the SQLite benchmark

TODO

Note that this is a work in progress. At the moment uindex does not even have persistence; it only exists in memory. There is also room for improvement in the sizes of the dbs, and queries would benefit by using some parallellism. There is also work being done to add (numeric and string) constraints to the query variables.

cooptex provides deadlock-free Mutexes.

cooptex provides deadlock-free Mutexes. The [CoopMutex::lock] method wraps the [std::sync::Mutex] return value with a Result that will request

12 Oct 10, 2022

A lock-free thread-owned queue whereby tasks are taken by stealers in entirety via buffer swapping

Swap Queue A lock-free thread-owned queue whereby tasks are taken by stealers in entirety via buffer swapping. This is meant to be used [thread_local]

20 Sep 9, 2022

Quinine is a Rust library that implements atomic, lock-free, but write-once versions of containers like `Box` or `Arc`

4 Feb 19, 2022

tr-lang is a language that aims to bring programming language syntax closer to Turkish.

tr-lang Made with ❤️ in 🇹🇷 tr-lang is a language that aims to bring programming language syntax closer to Turkish. tr-lang is a stack based language

10 Apr 2, 2022

A language server implementation for the WGSL shading language

wgsl-analyzer wgsl-analyzer is a language server plugin for the WGSL Shading language. It comes with a VS Code plugin located in ./editors/code, but d

155 Jan 2, 2023

Rust library that can be reset if you think it's slow

GoodbyeKT Rust library that can be reset if you think it's slow

39 Jun 16, 2022

Rustcat is a port listener that can be used for different purposes.

⚙️ Rustcat ⚙️ -- Basic Netcat Alternative -- About Rustcat is a port listener that can be used for different purposes.

489 Dec 28, 2022

QueingSimulator is an application that can be used to build intuitions about behavior of synchronous request/reply systems

Queueing Simulator QueingSimulator is an application that can be used to build intuitions about behavior of synchronous request/reply systems (such as

7 Sep 11, 2022

`fugit` provides a comprehensive library of `Duration` and `Instant` for the handling of time in embedded systems, doing all it can at compile time.

fugit fugit provides a comprehensive library of Duration and Instant for the handling of time in embedded systems, doing all it can at compile time. T

40 Oct 2, 2022

Uindex is a data store, for data that can be parsed as sentences in some context-free language.

Related tags

Overview

Uindex - Universal index

Example

Indexing

API

Grammar

DB Generator

tell

ask

Complexity

Data structures and algorithms

Benchmarks

Simple db, simple query

Simple db, query with intersection

Simple db, query returning multiple rows

Db with 3 tables, query joining all 3

Recursive db.

TODO

You might also like...

cooptex provides deadlock-free Mutexes.

A lock-free thread-owned queue whereby tasks are taken by stealers in entirety via buffer swapping

Quinine is a Rust library that implements atomic, lock-free, but write-once versions of containers like `Box` or `Arc`

tr-lang is a language that aims to bring programming language syntax closer to Turkish.

A language server implementation for the WGSL shading language

Rust library that can be reset if you think it's slow

Rustcat is a port listener that can be used for different purposes.

QueingSimulator is an application that can be used to build intuitions about behavior of synchronous request/reply systems

`fugit` provides a comprehensive library of `Duration` and `Instant` for the handling of time in embedded systems, doing all it can at compile time.

Owner

Enrique Pérez Arnaud

This is a simple Telegram bot with interface to Firefly III to process and store simple transactions.

osu-link is a program which links osu!stable beatmaps to osu!lazer's new store format, saving you disk space.

A repo to store all my 6-Month Rust learning steps

Gathering some metrics about github projects

Ointers is a library for representing pointers where some bits have been stolen so that they may be used by the programmer for something else

Some tools for streaming frames to rpi-rgb-led-matrix using ZeroMQ, written in Rust.

Solutions of Advent of Code 2021 in Rust, and some other languages.

A rust program to try and detect some types of Hardware Keyloggers.

This is a Telegram bot I'm working on in my free time to learn Rust.

🦜 A hassle-free, highly performant, host it yourself Discord music bot built with Serenity in Rust. Powered by youtube-dl and Genius.