Cleora AI is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Synerise

Last update: Dec 20, 2022


Overview

Cleora

Cleora is a genus of moths in the family Geometridae. Their scientific name derives from the Ancient Greek geo γῆ or γαῖα "the earth", and metron μέτρον "measure" in reference to the way their larvae, or "inchworms", appear to "measure the earth" as they move along in a looping fashion.

Cleora is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Read the whitepaper "Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"

Cleora embeds entities in n-dimensional spherical spaces using extremely fast, stable, iterative random projections, which allows for unparalleled performance and scalability.

Types of data which can be embedded include for example:

heterogeneous undirected graphs
heterogeneous undirected hypergraphs
text and other categorical array data
any combination of the above

Key competitive advantages of Cleora:

more than 197x faster than DeepWalk
~4x-8x faster than PyTorch-BigGraph (depends on use case)
star expansion, clique expansion, and no expansion support for hypergraphs
quality of results outperforming or competitive with other embedding frameworks like PyTorch-BigGraph, GOSH, DeepWalk, LINE
can embed extremely large graphs & hypergraphs on a single machine

Embedding times - example:

Algorithm	FB dataset	RoadNet dataset	LiveJournal dataset
Cleora	00:00:43 h	00:21:59 h	01:31:42 h
PyTorch-BigGraph	00:04:33 h	00:31:11 h	07:10:00 h

Link Prediction results - example:

	FB dataset		RoadNet dataset		LiveJournal dataset
Algorithm	MRR	HitRate@10	MRR	HitRate@10	MRR	HitRate@10
Cleora	0.072	0.172	0.929	0.942	0.586	0.627
PyTorch-BigGraph	0.035	0.072	0.850	0.866	0.565	0.672

Cleora design principles

Cleora is built as a multi-purpose "just embed it" tool, suitable for many different data types and formats.

Cleora ingests a relational table of rows representing a typed and undirected heterogeneous hypergraph, which can contain multiple:

typed categorical columns
typed categorical array columns

For example a relational table representing shopping baskets may have the following columns:

user <\t> product <\t> store

With the input file containing values:

user_id <\t> product_id product_id product_id <\t> store_id

Every column has a type, which is used to determine whether spaces of identifiers between different columns are shared or distinct. It is possible for two columns to share a type, which is the case for homogeneous graphs:

user <\t> user
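As a rough illustration of the row layout described above, the following Python sketch builds rows for the shopping-basket table. The helper name `format_row` and the sample identifiers are hypothetical and not part of Cleora; the only thing taken from the text is that columns are tab-separated and array items are space-separated.

```python
def format_row(user, products, store):
    """Build one TSV row: columns joined by tabs, array items joined by spaces."""
    return "\t".join([user, " ".join(products), store])


# Two hypothetical baskets; identifiers are made up for illustration.
rows = [
    format_row("u1", ["p1", "p2", "p3"], "s1"),
    format_row("u2", ["p2"], "s1"),
]
tsv_text = "\n".join(rows) + "\n"
```

A file with this content would presumably be described with a column specification along the lines of "user complex::product store", since the product column holds several identifiers per row.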

Based on the column format specification, Cleora performs:

Star decomposition of hyper-edges
Creation of pairwise graphs for all pairs of entity types
Embedding of each graph

The final output of Cleora consists of multiple files, one for each (undirected) pair of entity types in the table.

Those embeddings can then be utilized in a novel way thanks to their dim-wise independence property, which is described further below.

Key technical features of Cleora embeddings

The embeddings produced by Cleora are different from those produced by Node2vec, Word2vec, DeepWalk or other systems in this class by a number of key properties:

efficiency - Cleora is two orders of magnitude faster than Node2vec or DeepWalk
inductivity - as Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly
updatability - refreshing a Cleora embedding for an entity is a very fast operation allowing for real-time updates without retraining
stability - all starting vectors for entities are deterministic, which means that Cleora embeddings on similar datasets will end up being similar. Methods like Word2vec, Node2vec or DeepWalk return different results with every run.
cross-dataset compositionality - thanks to stability of Cleora embeddings, embeddings of the same entity on multiple datasets can be combined by averaging, yielding meaningful vectors
dim-wise independence - thanks to the process producing Cleora embeddings, every dimension is independent of the others. This property allows for an efficient, low-parameter method of combining multi-view embeddings with Conv1d layers.
extreme parallelism and performance - Cleora is written in Rust utilizing thread-level parallelism for all calculations except input file loading. In practice this means that the embedding process is often faster than loading the input data.
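The inductivity property can be sketched in a few lines of Python. This is a minimal illustration, not Cleora's API: it assumes that a new entity's vector is obtained by averaging the already-computed vectors of the entities it interacts with and then L2-normalizing the result. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_new_entity(neighbor_vectors):
    """Average the neighbours' vectors dimension by dimension, then normalize."""
    dim = len(neighbor_vectors[0])
    count = len(neighbor_vectors)
    mean = [sum(v[i] for v in neighbor_vectors) / count for i in range(dim)]
    return l2_normalize(mean)


# Hypothetical, already-computed embeddings of two products.
known = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}

# A new user who interacted with p1 and p2 gets a vector without retraining.
new_user = embed_new_entity([known["p1"], known["p2"]])
```

The same averaging step is one plausible way to combine embeddings of the same entity computed on several datasets, as mentioned under cross-dataset compositionality.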

Key usability features of Cleora embeddings

The technical properties described above imply good production-readiness of Cleora, which from the end-user perspective can be summarized as follows:

heterogeneous relational tables can be embedded without any artificial data pre-processing
mixed interaction + text datasets can be embedded with ease
cold start problem for new entities is non-existent
real-time updates of the embeddings do not require any separate solutions
multi-view embeddings work out of the box
temporal, incremental embeddings are stable out of the box, with no need for re-alignment, rotations or other methods
extremely large datasets are supported and can be embedded within seconds / minutes

Data formats supported by Cleora

Cleora supports 2 input file formats:

TSV (tab-separated values)
JSON

For TSV datasets containing composite fields (categorical arrays), multiple items within a field are separated by spaces.

The specification of an input format is as follows:

--columns="[column modifiers, ::]<column_name> [column modifiers, ::]<column_name> [column modifiers, ::]<column_name> ..."

The allowed column modifiers are:

transient - the field is virtual: it is considered during the embedding process, but no entity is written for the column
complex - the field is composite, containing multiple entity identifiers separated by space in TSV or an array in JSON
reflexive - the field is reflexive, which means that it interacts with itself, additional output file is written for every such field
ignore - the field is ignored, no output file is written for the field

Allowed combinations of modifiers are:

transient
complex
transient::complex
reflexive::complex

Combinations which don't make sense are:

reflexive - this would represent an identity relation
transient::reflexive - this would generate no output
reflexive::transient::complex - this would generate no output
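The lists above can be turned into a small pre-flight check. The sketch below only mirrors the combinations named in the text; it is not Cleora's own argument parsing, which may accept or reject specifications differently. Treating a column with no modifiers, or with `ignore` alone, as valid is an assumption of this sketch, and the helper names are hypothetical.

```python
# Modifier sets listed as allowed in the text, compared as sets so that
# "reflexive::complex" and "complex::reflexive" are treated alike.
ALLOWED = [
    set(),                       # assumption: a plain column is valid
    {"ignore"},                  # assumption: ignore used on its own
    {"transient"},
    {"complex"},
    {"transient", "complex"},
    {"reflexive", "complex"},
]


def parse_column(token):
    """Split 'mod::mod::name' into (name, set of modifiers)."""
    *modifiers, name = token.split("::")
    return name, set(modifiers)


def is_sensible(token):
    """True if the token's modifier set is one of the listed combinations."""
    _, modifiers = parse_column(token)
    return modifiers in ALLOWED


checks = {
    token: is_sensible(token)
    for token in ["reflexive::complex::a", "b", "reflexive::a", "transient::reflexive::a"]
}
```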

Running

You can download a binary from the releases or build it yourself; see the Building section below.

Command line options (for more info use --help as program argument):

-i --input (name of the input file)
-t --type (type of the input file)
-o --output-dir (output directory for files with embeddings)
-r --relation-name (name of the relation, for output filename generation)
-d --dimension (number of dimensions for output embeddings)
-n --number-of-iterations (number of iterations for the algorithm, usually 3 or 4 works well)
-s --seed (seed integer for embedding initialization)
-c --columns (column format specification)
-p --prepend-field-name (prepend field name to entity in output)
-l --log-every-n (log output every N lines)
-e --in-memory-embedding-calculation (calculate embeddings in memory or with memory-mapped files)
-f --output-format (either textfile (default) or numpy)
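When Cleora is driven from a script, the options above can be assembled into an argument list. The following is a sketch under stated assumptions: `build_cleora_command` is a hypothetical helper, the binary path is a placeholder, and only long flags from the list above are used. It builds the command without executing it.

```python
def build_cleora_command(binary, input_path, columns, dimension, iterations,
                         output_dir=None, relation_name=None):
    """Assemble an argv list from the long options listed above."""
    cmd = [
        binary,
        "--input", input_path,
        "--columns=" + columns,
        "--dimension", str(dimension),
        "--number-of-iterations", str(iterations),
    ]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if relation_name is not None:
        cmd += ["--relation-name", relation_name]
    return cmd


cmd = build_cleora_command(
    "./cleora",                               # placeholder path to the binary
    "files/samples/edgelist_1.tsv",
    "complex::reflexive::a b complex::c",
    dimension=128,
    iterations=4,
    relation_name="just_a_test",
)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with the spaces inside the column specification.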

An example input file for Cleora (stored in files/samples/edgelist_1.tsv):

a ba bac <\t> abb <\t> r rrr rr
a ab bca <\t> bcc <\t> rr r
ba ab a aa <\t> abb <\t> r rrr

An example of how to run Cleora in order to calculate embeddings:

./cleora -i files/samples/edgelist_1.tsv --columns="complex::reflexive::a b complex::c" -d 128 -n 4 --relation-name=just_a_test -p 0

It generates the following output files for textfile:

just_a_test__a__a.out
just_a_test__a__b.out
just_a_test__a__c.out
just_a_test__b__c.out

containing embedding vectors for respective pairs of columns.
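The file list above can be reproduced from the column specification with a short sketch. It assumes that one file is written per unordered pair of columns, plus one extra file for each reflexive column paired with itself, and that ignored columns are dropped. The effect of `transient` on the set of files is deliberately not modelled here, and `expected_outputs` is a hypothetical helper, not part of Cleora.

```python
def parse_columns(spec):
    """Turn 'mod::name other' into a list of (name, set of modifiers)."""
    columns = []
    for token in spec.split():
        *modifiers, name = token.split("::")
        columns.append((name, set(modifiers)))
    return columns


def expected_outputs(spec, relation_name):
    """List output file names for a spec (transient columns not modelled)."""
    columns = [(n, m) for n, m in parse_columns(spec) if "ignore" not in m]
    names = []
    for i, (a, mods_a) in enumerate(columns):
        if "reflexive" in mods_a:
            # A reflexive column is additionally paired with itself.
            names.append(f"{relation_name}__{a}__{a}.out")
        for b, _ in columns[i + 1:]:
            names.append(f"{relation_name}__{a}__{b}.out")
    return names


files = expected_outputs("complex::reflexive::a b complex::c", "just_a_test")
```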

For the numpy output format, each pair of embedded entities is stored in three files: .entities, a list of entities in JSON format; .npy, a NumPy array containing the embeddings; and .occurences, a NumPy array containing entity occurrence counts.

just_a_test__a__a.out{.entities, .npy, .occurences}
just_a_test__a__b.out{.entities, .npy, .occurences}
just_a_test__a__c.out{.entities, .npy, .occurences}
just_a_test__b__c.out{.entities, .npy, .occurences}

Building

Install Rust - https://www.rust-lang.org/tools/install
execute cargo build --release

More details

Algorithm

Hypergraph Expansion. Cleora needs to break down all existing hyperedges into edges as the algorithm relies on the pairwise notion of node transition. Hypergraph expansion to graph is done using two alternative strategies:

Clique Expansion - each hyperedge is transformed into a clique - a subgraph where each pair of nodes is connected with an edge. Space/time complexity of this approach is O(|V| x d + |E| x k^2) where |E| is the number of hyperedges. With cliques, the number of created edges can be significant, but the expansion guarantees better fidelity to the original hyperedge relationship. We apply this scheme to smaller graphs.
Star Expansion - an extra node is introduced which links to the original nodes contained by a hyperedge. Space/time complexity of this approach is O((|V|+|E|) x d + |E|k). Here we must count in the time and space needed to embed an extra entity for each hyperedge, but we save on the number of created edges, which would be only k for each hyperedge. This approach is suited for large graphs.
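The difference in edge counts between the two strategies can be seen on a single hyperedge. The sketch below is illustrative only: the function names and the artificial node name "star_0" are hypothetical, and the node identifiers are plain strings rather than hashes.

```python
from itertools import combinations


def clique_expansion(hyperedge):
    """Connect every pair of nodes in the hyperedge: k*(k-1)/2 edges."""
    return list(combinations(hyperedge, 2))


def star_expansion(hyperedge, star_node):
    """Link one extra node to each member of the hyperedge: k edges."""
    return [(star_node, node) for node in hyperedge]


hyperedge = ["a", "b", "c", "d"]                   # k = 4
clique_edges = clique_expansion(hyperedge)         # 6 edges
star_edges = star_expansion(hyperedge, "star_0")   # 4 edges, 1 extra node
```

For k = 4 the clique produces 6 edges against 4 for the star; the gap grows quadratically with k, which is why the star variant is described as better suited to large graphs.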

Implementation

Cleora is implemented in Rust in a highly parallel architecture, using multithreading and adequate data arrangement for fast CPU access. We exemplify the embedding procedure in Figure, using a very general example of multiple relations in one graph.

For maximum efficiency we created a custom implementation of a sparse matrix data structure - the SparseMatrix struct. It follows the sparse matrix coordinate format (COO). Its purpose is to save space by holding only the coordinates and values of nonzero entries.
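To make the coordinate format concrete, here is a minimal Python sketch of a COO container. It is not the SparseMatrix struct itself: the class name `CooMatrix`, its fields, and the lookup dictionary used to merge repeated coordinates are assumptions made for illustration.

```python
class CooMatrix:
    """Coordinate-format sparse matrix: parallel lists of row, col, value."""

    def __init__(self):
        self.rows = []
        self.cols = []
        self.values = []
        self._index = {}  # (row, col) -> position in the three lists

    def add(self, row, col, value):
        """Add value at (row, col), merging with an existing entry if any."""
        key = (row, col)
        pos = self._index.get(key)
        if pos is None:
            self._index[key] = len(self.values)
            self.rows.append(row)
            self.cols.append(col)
            self.values.append(value)
        else:
            self.values[pos] += value


m = CooMatrix()
m.add(0, 1, 0.25)
m.add(0, 1, 0.25)   # same coordinates: accumulated, not duplicated
m.add(2, 0, 1.0)
```

Only the two nonzero cells are stored, regardless of how large the full matrix would be.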

Embedding is done in 2 basic steps: graph construction and training.

Let's assume that the basic configuration of the program looks like this:

--input files/samples/edgelist_2.tsv --columns="users complex::products complex::brands" --dimension 3 --number-of-iterations 4

Every SparseMatrix is created based on the program argument --columns. For our example, there will be three SparseMatrix'es that will only read data from the columns:

users and brands by M1
products and brands by M2
users and products by M3

Graph construction. Graph construction starts with the creation of a helper matrix P object as a regular 2-D Rust array, which is built according to the selected expansion method. An example involving clique expansion is presented in Figure - a Cartesian product (all combinations) of all columns is created. Each entity identifier from the original input file is hashed with xxhash (https://cyan4973.github.io/xxHash/) - a fast and efficient hashing method. We hash the identifiers to store them in a unified, small data format. From the first line of our example:

u1  p1 p2   b1 b2

we get 4 combinations produced by the Cartesian product:

[4, u1_hash, p1_hash, b1_hash]
[4, u1_hash, p1_hash, b2_hash]
[4, u1_hash, p2_hash, b1_hash]
[4, u1_hash, p2_hash, b2_hash]

At the beginning we insert the total number of combinations (in this case 4). Then we add another 3 rows representing combinations from the second row of the input.
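The construction of these rows can be sketched with a Cartesian product over the fields of one input line. In this illustration the "_hash" suffix merely stands in for the xxhash value, every field is split on spaces for simplicity, and `cartesian_rows` is a hypothetical helper rather than Cleora code.

```python
from itertools import product


def cartesian_rows(line):
    """Expand one TSV line into rows of [total_combinations, hash, hash, ...]."""
    fields = [field.split(" ") for field in line.split("\t")]
    combos = list(product(*fields))
    # "<id>_hash" is a placeholder for the real hashed identifier.
    return [[len(combos), *(f"{entity}_hash" for entity in combo)]
            for combo in combos]


rows = cartesian_rows("u1\tp1 p2\tb1 b2")
```

With one user, two products and two brands, this yields 1 x 2 x 2 = 4 rows, each prefixed with the total count 4.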

Subsequently, for each relation pair from matrix P we create a separate matrix M as a SparseMatrix struct (the matrices M will usually hold mostly zeros). Each matrix M object is produced in a separate thread in a stepwise fashion. The rows of matrix P object are broadcast to all matrix M objects, and each matrix M object reads the buffer selecting the appropriate values, updating its content. For example, M3 (users and products) reads the hashes from indexes 1 and 2. After reading the first vector:

[4, u1_hash, p1_hash, b1_hash]

the edge value for u1_hash <-> p1_hash equals 1/4 (1 divided by the total number of Cartesian products). After reading the next vector:

[4, u1_hash, p1_hash, b2_hash]

the edge value for u1_hash <-> p1_hash updates to 1/2 (1/4 + 1/4). After reading the next two, we finally have:

u1_hash <-> p1_hash = 1/2
u1_hash <-> p2_hash = 1/2
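The accumulation just described can be checked with exact fractions. This sketch hard-codes the four rows from the example and adds 1 divided by the total number of combinations for every row; `accumulate` is a hypothetical helper and a plain dictionary stands in for the sparse matrix.

```python
from collections import defaultdict
from fractions import Fraction

rows = [
    [4, "u1_hash", "p1_hash", "b1_hash"],
    [4, "u1_hash", "p1_hash", "b2_hash"],
    [4, "u1_hash", "p2_hash", "b1_hash"],
    [4, "u1_hash", "p2_hash", "b2_hash"],
]


def accumulate(rows, idx_a, idx_b):
    """Sum 1/total for each occurrence of the pair at positions idx_a, idx_b."""
    weights = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for row in rows:
        weights[(row[idx_a], row[idx_b])] += Fraction(1, row[0])
    return dict(weights)


# M3 relates users (index 1) and products (index 2).
m3 = accumulate(rows, 1, 2)
```

Both pairs end up with weight 1/2, matching the values listed above.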

Training. In this step training proceeds separately for each matrix M, so we will now refer to a single object. The matrix M object is multiplied by a freshly initialized 2-D array representing matrix T_0. Multiplication is done against each column of matrix T_0 object in a separate thread. The obtained columns of the new matrix T_1 object are subsequently merged into the full matrix. T_1 object is L2-normalized, again in a multithreaded fashion across matrix columns. The appropriate matrix representation lets us accelerate memory access taking advantage of CPU caching. Finally, depending on the target iteration number, the matrix object is either returned as program output and printed to file, or fed for next iterations of multiplication against the matrix M object.
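The multiply-then-normalize loop can be sketched in single-threaded Python. Several things here are assumptions made for illustration: a dense list-of-lists stands in for the sparse matrix M, the initial matrix `t0` is an arbitrary toy value rather than Cleora's deterministic initialization, and "L2-normalized" is read as scaling each entity's row vector to unit length.

```python
import math


def multiply(m, t):
    """Dense stand-in for the sparse product M x T."""
    inner, dim = len(t), len(t[0])
    return [[sum(m[i][k] * t[k][j] for k in range(inner)) for j in range(dim)]
            for i in range(len(m))]


def l2_normalize_rows(t):
    """Scale every row to unit L2 norm (zero rows are left unchanged)."""
    out = []
    for row in t:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out


def train(m, t0, iterations):
    """Repeat T <- normalize(M x T) for the requested number of iterations."""
    t = t0
    for _ in range(iterations):
        t = l2_normalize_rows(multiply(m, t))
    return t


m = [[0.5, 0.5], [1.0, 0.0]]      # toy transition weights for two entities
t0 = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical starting vectors
embeddings = train(m, t0, iterations=4)
```

There are no learned parameters or gradients in this loop, which is consistent with the small iteration counts (3 or 4) suggested for --number-of-iterations.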

Memory consumption

Every SparseMatrix object allocates space for:

|V| objects, each occupying 40 bytes,
2 x |E| objects (in undirected graphs we need to count an edge in both directions), each occupying 24 bytes.

During training we need an additional 2 x d x |V| objects, each occupying 4 bytes (this can be avoided by using memory-mapped files, see --in-memory-embedding-calculation argument for the program).
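These figures translate directly into a back-of-the-envelope estimate. The sketch below simply applies the byte counts stated above; the function names are hypothetical, and the entity and edge counts (6629 and 13985, with dimension 100) are taken from a log excerpt quoted elsewhere on this page.

```python
def sparse_matrix_bytes(num_entities, num_edges):
    """40 bytes per entity plus 24 bytes per edge, counted in both directions."""
    return 40 * num_entities + 24 * 2 * num_edges


def training_bytes(num_entities, dimension):
    """2 x d x |V| values of 4 bytes each, needed only for in-memory training."""
    return 4 * 2 * dimension * num_entities


structure = sparse_matrix_bytes(6629, 13985)   # 936440 bytes, under 1 MB
training = training_bytes(6629, 100)           # 5303200 bytes, about 5.3 MB
```

For a graph of this size the embedding buffers dominate the sparse structure, and the training term grows linearly with the chosen dimension.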

Comments

Add basic JSON support

I'd like to be able to use Cleora for JSON data, so this is an attempt to implement support. I'm not an experienced Rustacean, so there's definitely some cruft here to be removed, but it's basically working. There is now a file type parameter that can be passed in to parse input data as JSON. Note that I've removed any mention of CSV since it doesn't actually seem to be supported, but correct me if I'm wrong. If this seems like something you may eventually want to merge, I'm happy to take suggestions on improvements.

opened by michaelmior 7

Question about reproducing Dunhumby product complements/substitutes from white paper

Hello, Thanks for the cool project!

I was trying to reproduce the results of using Cleora to identify product complements/substitutes in the Dunhumby Complete Journey shown in the white paper but had a question about how the transaction data should be formatted and how the column types should be specified when running the Cleora binary:

I've formatted the transaction data by "basket" such that each row contains a user, and a sequence of product_ids

user_id <\t> product_id product_id product_id
user_id <\t> product_id product_id
user_id <\t> product_id product_id product_id product_id

and so on. and then I've run Cleora using...

for product complements: ./cleora-v1.1.1-x86_64-unknown-linux-gnu -i ./dunhumby_data/dh_clique.txt --columns="transient::user complex::products" -d 1024 -n 1

and for product substitutes: ./cleora-v1.1.1-x86_64-unknown-linux-gnu -i ./dunhumby_data/dh_clique.txt --columns="transient::user complex::products" -d 1024 -n 4

After this, I just compared cosine similarities and looked at the top 5 most similar "complements" and "substitutes" for one of the products from the white paper, "SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ". The complements and substitutes are not really similar to what is reported in the white paper. E.g. the top 3 substitutes are

BUTTER BUTTER 8OZ
CODIMENTS/SAUCES BBQ SAUCE 18OZ
STATIONERY & SCHOOL SUPPLIES TAPE & MAILING PRODUCTS 60CT

instead of

SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ
SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ
SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ

As reported in the white paper.

I guess that I've not set the data up correctly or have specified the column types incorrectly. Is there code somewhere that describes using Cleora for this type of problem? Otherwise any hints would be greatly appreciated. Thanks!
opened by tblazina 4

Interesting phenomenon

Hello Cleora team, a very interesting and clever solution for creating embeddings. However, I noticed a behavior that I cannot explain. When creating embeddings with one column (a category of a single node type) that contains both a start and an end node (simple edge list), nodes that are further away from each other generate vectors that are closer to each other. E.g.: (a) -> (b) -> (c) -> (d)

as an edge list:
a b
b c
c d

The vectors a and d are closer together than the vectors a and b (by cosine value).

Volume: approx. 5.5 million nodes and 41 million edges

I created the embeddings with the following call:

--columns='complex::reflexive::nodes' -d=128 -i='node.edgelist' -n=4

As I understand the pattern, the reflexive relationship in a column of a single type (complex) should cover an edge list with a category of node types. What am I doing wrong with the configuration or is this an issue?

A short tip would be very appreciated.

Best

opened by dstaehler 4

Check for malformed lines in input. Computation partially proceeds without warning instead of aborting on Windows.

I have spoken with Jack Dabrowski today regarding some problems with processing a large input file on Windows. The input and output files can be found below. Here is a sample command execution for reproduction purposes:

C:\Path\To\Repo\cleora\target\debug\cleora.exe --input edges.tsv --dimension 100 --number-of-iterations 10 --columns="media complex::tropes" --output-dir Output

This was run on a Windows 10 machine and yields the following debug information...

C:\Path\To\Repo\cleora>C:\Path\To\Repo\cleora\target\debug\cleora.exe --input edges.tsv --dimension 100 --number-of-iterations 10 --columns="media complex::tropes" --output-dir Output
[2022-06-24T16:47:38Z INFO  cleora] Reading args...
[src\main.rs:202] &config = Configuration {
    produce_entity_occurrence_count: true,
    embeddings_dimension: 100,
    max_number_of_iteration: 10,
    seed: None,
    prepend_field: false,
    log_every_n: 10000,
    in_memory_embedding_calculation: true,
    input: "edges.tsv",
    file_type: Tsv,
    output_dir: Some(
        "Output",
    ),
    output_format: TextFile,
    relation_name: "emb",
    columns: [
        Column {
            name: "media",
            transient: false,
            complex: false,
            reflexive: false,
            ignored: false,
        },
        Column {
            name: "tropes",
            transient: false,
            complex: true,
            reflexive: false,
            ignored: false,
        },
    ],
}
[2022-06-24T16:47:38Z INFO  cleora] Starting calculation...
[src\pipeline.rs:25] &sparse_matrices = [
    SparseMatrix {
        col_a_id: 0,
        col_a_name: "media",
        col_b_id: 1,
        col_b_name: "tropes",
        edge_count: 0,
        hash_2_id: {},
        id_2_hash: [],
        row_sum: [],
        pair_index: {},
        entries: [],
    },
]
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of entities: 6629
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of edges: 13985
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of entries: 27970
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Total memory usage by the struct ~ 0 MB
[2022-06-24T16:47:40Z INFO  cleora::pipeline] Number of lines processed: 10000
[2022-06-24T16:47:41Z INFO  cleora::pipeline] Number of lines processed: 20000
[2022-06-24T16:47:43Z INFO  cleora::pipeline] Number of lines processed: 30000
[2022-06-24T16:47:44Z INFO  cleora::pipeline] Number of lines processed: 40000
[2022-06-24T16:47:46Z INFO  cleora::pipeline] Number of lines processed: 50000
[2022-06-24T16:47:49Z INFO  cleora::pipeline] Number of lines processed: 60000
[2022-06-24T16:47:53Z INFO  cleora::pipeline] Number of lines processed: 70000
[2022-06-24T16:47:56Z INFO  cleora] Finished Sparse Matrices calculation in 18 sec
[2022-06-24T16:47:56Z INFO  cleora::embedding] Start initialization. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Done initializing. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Start propagating. Number of iterations: 10.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Done iter: 0. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 1. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 2. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 3. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 4. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 5. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 6. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 7. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 8. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 9. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done propagating.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Start saving embeddings.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done saving embeddings.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Finalizing embeddings calculations!
[2022-06-24T16:47:58Z INFO  cleora] Finished in 20 sec

C:\Path\To\Repo\cleora>

I was told the following:

1. my binary (gnu-linux) on your input file throws an exception during data loading phase and fails immediately:
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/entity.rs:227:29

2. This is caused by lines containing only a single entity, without any other corresponding entities (e.g. line number 101 in your input file).
Such inputs are meaningless (because they do not represent an edge in the graph), and we do not handle them currently.

3. The code should throw an error & abort, but apparently on Windows, the exception happens silently, and the code proceeds to the next phase, despite not having loaded all inputs into memory successfully.

We will introduce a proper workaround (handle the case without errors + display a warning that "such lines are meaningless and will be skipped").

If there are any other materials necessary for addressing this issue, please reach out. Thank you very much!

Output: Output.zip

Input: https://drive.google.com/file/d/1YjTSQ-DMEaOE5wRbO1bN4SuXv__PBE0W/view?usp=sharing

opened by PierceLBrooks 3

User-Item Embedding

Hello, Thank you very much for this work. The performance of your algorithm is stunning! We are testing Cleora for a user-items embedding task. I have run into a result and am wondering whether it is by design or my mistake. My TSV file is simple and follows the format of "user item"

u1 <\t> i1
u2 <\t> i1
u1 <\t> i2
u3 <\t> i2

As you can see, the relation between users and items is many to many. I'm running a simple embedding task: ./cleora --input ~/test.tsv --columns="users items" --dimension 32 --number-of-iterations 4

In the resulting embeddings it seems that users and items are "remote" from each other, as in the image below (cluster 0 is users and 1 is items). That is very different than cases in which we used simple matrix factorization, where we saw users are closer to the items they buy than other items, but here it seems that these relationships are somewhat lost. Does my question make sense? Is this result expected in this case?

Many thanks!

opened by a-agmon 3

To build cleora on Ubuntu 20.04 needs clang-11

Hello all, great project. The README is missing a build dependency: on Ubuntu 20.04, clang-11 is needed (for the simd-json dependency). I had to install it from the LLVM repo:

export LLVM_VERSION=12
sudo add-apt-repository "deb http://apt.llvm.org/focal/ llvm-toolchain-focal-12 main"
sudo apt-get install -y clang-$LLVM_VERSION lldb-$LLVM_VERSION lld-$LLVM_VERSION clangd-$LLVM_VERSION
opened by AlexMikhalev 3

Calculation of embedding for new entity on-the-fly

As this module claims that Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly.

Could you please help me with how to calculate embeddings for a new node? It will be really helpful if you can share a jupyter notebook on this as well.

opened by nik3211 3

Performance on homogeneous graphs

Hi! I was trying to measure the performance of the Cleora on the homogeneous graphs used in the YouTube original paper, with no success. Specifically, I was trying to evaluate Blogcatalog, as it is the smallest graph.

I used the ./target/release/cleora -c "node node" configuration to get a single embedding per node. This way, the performance is close to random; however, on a different dataset with far simpler structure (CoCit dataset from the VERSE paper), I was able to obtain poor but not random performance - 0.21 Macro-F1 avg. for 5% labelled nodes, compared with ±30 for DW/VERSE.

Could you clarify if I am doing something horribly wrong, or the method needs some other hyperparameters to run properly? I would imagine adding a homogeneous example used across literature would strengthen the repository greatly. :)

opened by xgfs 3

Remove binaries from code repository and instead use Github Releases

Usually it's not a good practice to keep binaries in git source repository. It bloats repository size forever, which in turn causes a lot more data to be transferred when repository is cloned. GitHub provides convenient solution for keeping and managing binary releases.

opened by kosciej 3

/lib64/libc.so.6: version `GLIBC_2.18' not found

I am trying to run Cleora on a simple dataset. My TSV file is simple and follows the format of "leads attributes"

l1 <\t> a1
l2 <\t> a1
l1 <\t> a2
l3 <\t> a2

I am trying to run an embedding task with the following command from one of the jupyter notebook examples. I am trying to run it on a linux machine.

#!/bin/bash ! ./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu --input DATA_PATH+'edges.tsv' --columns="leads attributes" --dimension 32 --number-of-iterations 4

I am getting the following error -

./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by ./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu)

Is there a solution to this ? Even with the latest version of the linux binary file, I get the same error.
opened by judas123 2

Turn off cleora logs

Hi Team,

I want to turn off the cleora line processing logs. Please help me with this. Thanks.

[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2960000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2970000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2980000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2990000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3000000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3010000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3020000

opened by sonalgarg-zomato 2

help in understanding output file format

Hi

I was running cleora using the command below:

cleora-v1.2.3-x86_64-apple-darwin --columns transient::cluster_id StarNode --dimension 1024 -n 5 --input fb_cleora_input_star.txt -o output

I got something similar to the following output: (I added some spacing just for better readability)

39361 1024
1 1 0.029419877 ..... -0.0073362226
16260 7 0.033474464 ..... -0.00906976
. . .
22459 1 0.010709517 ..... 0.026430061

I can't figure out what the 1st (1, 16260, ..., 22459) and the 2nd (1, 7, ..., 1) columns represent.

Thanks
opened by asafalina 1

pyo3 integration and adding support for parquet output and s3 stores

I'd like to add a few features.

Integration with pyo3 bindings, which would make it possible to publish the library as a Python package and use it without subprocess
Support for parquet output persistence: output_format="parquet". Because writing to parquet row by row is inefficient, an additional parameter is required to write in chunks: chunk_size=3000
Support for S3 as an input and output store

Example usage:

import cleora

output_dir = 's3://output'
fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

cleora.run(
    input=[fb_cleora_input_clique_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=10,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="complex::reflexive::CliqueNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)

opened by qooba 0

Calculating embeddings for new nodes after training

I am trying to run Cleora on a simple dataset. My TSV file is simple and follows the format of "leads attributes"

l1 <\t> a1
l2 <\t> a1
l1 <\t> a2
l3 <\t> a2

Leads are connected to some attributes.

I have Set A which is used to train embeddings for all nodes ( leads and attributes ) in the set.

For new nodes with the same format of "leads attributes" in Set B, I calculate embeddings by using the following 2 methods. Then I use the embeddings for all "leads" nodes of Set A to train XGBoost model and predict on "leads" nodes of Set B to calculate the AUC.

Method 1

I jointly train embeddings by combining Set A and Set B. I get the embeddings for all "leads" nodes. On Set B, the XGBoost model AUC (trained on "leads" embeddings of Set A) is ~0.8

Method 2

I used another method as suggested in another closed issue https://github.com/Synerise/cleora/issues/21 - where I train the embeddings only on Set A. Then for all "leads" nodes of Set B, I extract the embeddings of all the attributes a particular lead is connected to, average and do L2 normalization. Then with the XGBoost model trained on Set A "leads" embeddings, I predict on "leads" embeddings of Set B. The AUC drops to 0.65

Any reason why there is a drop in the AUC using Method 2 which was suggested to calculate embeddings for incoming nodes on the fly ? The alternative is method 1 where I have to retrain the graph by including new nodes every time.

Thanks

opened by judas123 2

[paper] definition of M vs Fig. 2.

Hi! Apologies if this is not the best place to ask this, but I've been reading your paper and I hope you could clarify one small thing that is confusing to me:

Section III.B offers the definition of matrix M, M_ab = e_ab/deg(v_a).

Figure 2. presents toy example that includes explicit M matrices.

Are those latter matrices in 2. meant to follow the definition from 1.? If so, I must be misinterpreting something about this definition. Based on it I was expecting the rows of Ms in Fig. 2. to be normalized. Can I ask for example what is the value of deg(v_a) for the third row (i.e. v_a = p2_hash, if I understand correctly) of M_3 in Fig. 2. It seems that the entries in this row were constructed as follows:

taking v_b=u1_hash, we have e_ab=2 and deg(v_a)=4, and the entry in M_3 reads 1/2;

and taking v_b=u2_hash, we have e_ab=1 and deg(v_a)=3, and the entry in M_3 reads 1/3; so it seems that deg(v_a) is not just a function of v_a, which - for me - is not captured in the presentation in Section III.

What am I missing? Thanks in advance for any hints! :)
opened by olszewskip 1
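
The expectation voiced in the issue above, that rows of M sum to one, follows from reading deg(v_a) as the sum of e_ab over all b. The sketch below works through that reading on a toy matrix. The edge counts are hypothetical and are not taken from Fig. 2, and the reading of deg(v_a) is an assumption, not a statement of what the paper or the Cleora implementation actually does.

```python
import numpy as np

# Hypothetical edge counts e_ab: row a is node v_a, column b is node v_b.
edge_counts = np.array([
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])

# Assumption: deg(v_a) is the sum of e_ab over all b (one value per row).
# Every row here has at least one edge, so no division by zero occurs.
degrees = edge_counts.sum(axis=1, keepdims=True)

# M_ab = e_ab / deg(v_a), applied to every entry at once.
M = edge_counts / degrees

# Under this reading every row of M is normalized.
row_sums = M.sum(axis=1)
```

With these counts the first row has deg(v_a) = 3, so its entries are 2/3 and 1/3. A row holding 1/2 and 1/3, as quoted in the issue, would need a different degree for each entry, which is the inconsistency the question points at.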

Releases(v1.2.3)

v1.2.3(Jun 29, 2022)
Changed

Bump libs (#60).

Fixed

Check for malformed lines in input (#59).

Source code(tar.gz)
Source code(zip)
cleora-v1.2.3-x86_64-apple-darwin(1.75 MB)
cleora-v1.2.3-x86_64-pc-windows-msvc(2.02 MB)
cleora-v1.2.3-x86_64-unknown-linux-gnu(1.88 MB)
cleora-v1.2.3-x86_64-unknown-linux-musl(1.88 MB)
v1.2.2(Jun 24, 2022)
Changed

Allow cleora to accept multiple input files as positional args. The named argument 'input' is being deprecated (#55).

Source code(tar.gz)
Source code(zip)
cleora-v1.2.2-x86_64-apple-darwin(1.75 MB)
cleora-v1.2.2-x86_64-pc-windows-msvc(2.02 MB)
cleora-v1.2.2-x86_64-unknown-linux-gnu(1.88 MB)
cleora-v1.2.2-x86_64-unknown-linux-musl(1.88 MB)
v1.2.1(Apr 13, 2022)
Changed

Optimize "--output-format numpy" mode, so it doesn't require additional memory when writing output file (#50).

Bump libs (#52).

Source code(tar.gz)
Source code(zip)
cleora-v1.2.1-x86_64-apple-darwin(1.75 MB)
cleora-v1.2.1-x86_64-pc-windows-msvc(2.03 MB)
cleora-v1.2.1-x86_64-unknown-linux-gnu(1.88 MB)
cleora-v1.2.1-x86_64-unknown-linux-musl(1.88 MB)
v1.2.0(Mar 17, 2022)
Added

Use default hasher for vector init. (#47).

Source code(tar.gz)
Source code(zip)
cleora-v1.2.0-x86_64-apple-darwin(1.54 MB)
cleora-v1.2.0-x86_64-pc-windows-msvc(1.71 MB)
cleora-v1.2.0-x86_64-unknown-linux-gnu(1.67 MB)
cleora-v1.2.0-x86_64-unknown-linux-musl(1.67 MB)
v1.1.1(May 14, 2021)
Added

Init embedding with seed during training (#27).

Source code(tar.gz)
Source code(zip)
cleora-v1.1.1-x86_64-apple-darwin(1.63 MB)
cleora-v1.1.1-x86_64-pc-windows-msvc(1.73 MB)
cleora-v1.1.1-x86_64-unknown-linux-gnu(1.73 MB)
cleora-v1.1.1-x86_64-unknown-linux-musl(1.73 MB)
v1.1.0(Dec 23, 2020)
Changed

Bumped env_logger to 0.8.2, smallvec to 1.5.1, removed fnv hasher (#11).

Added

Tests (snapshots) for in-memory and memory-mapped files calculations of embeddings (#12).

Support for NumPy output format (available via --output-format program argument) (#15).

Jupyter notebooks with experiments (#16).

Improved

Used vector for hash_to_id mappings, non-allocating cartesian product, ryu crate for faster write (#13).

Sparse Matrix refactor (cleanup, simplification, using iter, speedup). Use Cargo.toml data for clap crate (#17).

Unify and simplify embeddings calculation for in-memory and mmap matrices (#18).

Source code(tar.gz)
Source code(zip)
cleora-v1.1.0-x86_64-apple-darwin(1.61 MB)
cleora-v1.1.0-x86_64-pc-windows-msvc(1.77 MB)
cleora-v1.1.0-x86_64-unknown-linux-gnu(1.74 MB)
cleora-v1.1.0-x86_64-unknown-linux-musl(1.74 MB)
v1.0.1(Nov 23, 2020)
Fixed

Skip reading invalid UTF-8 line (#8).

Fix clippy warnings (#7).

Added

JSON support (#3).

Snapshot testing (#5).

Source code(tar.gz)
Source code(zip)
cleora-v1.0.1-x86_64-apple-darwin(1.57 MB)
cleora-v1.0.1-x86_64-pc-windows-msvc(1.72 MB)
cleora-v1.0.1-x86_64-unknown-linux-gnu(1.68 MB)
cleora-v1.0.1-x86_64-unknown-linux-musl(1.68 MB)
v1.0.0(Nov 23, 2020)

Initial release.
Source code(tar.gz)
Source code(zip)
cleora-v1.0.0-linux(3.10 MB)
cleora-v1.0.0-macos(1.71 MB)

Owner

Synerise

AI Driven Growth Operating System

GitHub https://cleora.ai

A Rust🦀 implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning

craftml-rs A Rust implementation of CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning (Siblini et al., 2018). Perf

15 Nov 6, 2022

Experimenting with Rust's fundamental data model

ferrilab Redefining the Rust fundamental data model bitvec funty radium Introduction The ferrilab project is a collection of crates that provide more

13 Dec 13, 2022

Flexible, reusable reinforcement learning (Q learning) implementation in Rust

Rurel Rurel is a flexible, reusable reinforcement learning (Q learning) implementation in Rust. Release documentation In Cargo.toml: rurel = "0.2.0"

60 Dec 29, 2022

General matrix multiplication with custom configuration in Rust. Supports no_std and no_alloc environments.

microgemm General matrix multiplication with custom configuration in Rust. Supports no_std and no_alloc environments. The implementation is based on t

4 Nov 6, 2023

scalable and fast unofficial osu! server implementation

gamma! the new bancho server for theta! built for scalability and speed configuration configuration is done either through gamma.toml, or through envi

3 Jan 7, 2023

A stable, linearithmic sort in constant space written in Rust

A stable, linearithmic sort in constant space written in Rust. Uses the method described in "Fast Stable Merging And Sorting In Constant Extra Space"

4 Mar 30, 2022

pyke Diffusers is a modular Rust library for optimized Stable Diffusion inference 🔮

pyke Diffusers is a modular Rust library for pretrained diffusion model inference to generate images, videos, or audio, using ONNX Runtime as a backen

12 Jan 5, 2023

Stable Diffusion v1.4 ported to Rust's burn framework

Stable-Diffusion-Burn Stable-Diffusion-Burn is a Rust-based project which ports the V1 stable diffusion model into the deep learning framework, Burn.

156 Aug 8, 2023

Stable Diffusion XL ported to Rust's burn framework

Stable-Diffusion-XL-Burn Stable-Diffusion-XL-Burn is a Rust-based project which ports stable diffusion xl into the Rust deep learning framework burn.

194 Sep 4, 2023

Dynamically get the suggested clusters in the data for unsupervised learning.

Python implementation of the Gap Statistic Purpose Dynamically identify the suggested number of clusters in a data-set using the gap statistic. Full e

163 Dec 9, 2022

Self Organizing Map (SOM) is a type of Artificial Neural Network (ANN) that is trained using an unsupervised, competitive learning to produce a low dimensional, discretized representation (feature map) of higher dimensional data.

som Self Organizing Map Pre-requisites Setup rust To download Rustup and install Rust, run the following in your terminal, then follow the on-screen i

5 Nov 4, 2020

Narwhal and Tusk A DAG-based Mempool and Efficient BFT Consensus.

This repo contains a prototype of Narwhal and Tusk. It supplements the paper Narwhal and Tusk: A DAG-based Mempool and Efficient BFT Consensus.

134 Dec 8, 2022

Masked Language Model on Wasm

Masked Language Model on Wasm This project is for OPTiM TECH BLOG. Please see below: Running a BERT model on the front end using WebAssembly Demo Usage Build image docker

20 Sep 23, 2022

Docker for PyTorch rust bindings `tch`. Example of pretrain model.

tch-rs-pretrain-example-docker Docker for PyTorch rust bindings tch-rs. Example of pretrain model. Docker files support the following install libtorch

5 Oct 7, 2022

This is a rewrite of the RAMP (Rapid Assistance in Modelling the Pandemic) model

RAMP from scratch This is a rewrite of the RAMP (Rapid Assistance in Modelling the Pandemic) model, based on the EcoTwins-withCommuting branch, in Rus

3 Oct 26, 2022

A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the loss function.

random_search A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the los

2 Apr 1, 2022

m2cgen (Model 2 Code Generator) - is a lightweight library which provides an easy way to transpile trained statistical models into a native code

Transform ML models into a native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies

2.3k Dec 31, 2022

Using OpenAI Codex's "davinci-edit" Model for Gradual Type Inference

OpenTau: Using OpenAI Codex for Gradual Type Inference Current implementation is focused on TypeScript Python implementation comes next Requirements r

11 Dec 18, 2022

Your one stop CLI for ONNX model analysis.

Your one stop CLI for ONNX model analysis. Featuring graph visualization, FLOP counts, memory metrics and more! ⚡️ Quick start First, download and ins

20 Dec 30, 2022
