Hamming Weight Tree from the paper Online Nearest Neighbor Search in Hamming Space

Overview

To understand how the data structure works, please see the docs.

Benchmarks

Most recent benchmark for 1-NN:

1-NN Benchmark

You can find benchmark output here.

If you would like to run the benchmarks yourself, just run cargo bench at the command line. I recommend running RUSTFLAGS='-C target-cpu=native' cargo bench instead, since both linear search and this tree are significantly faster when modern instructions are available.

Comments
  • Move features into tree for performance

    Move features into tree for performance

    The current performance is unacceptable. Profiling the code makes it clear that there are cache locality issues. Here is the closure taking up 71.85% of the total runtime:

    let lookup_distance = |index| (lookup(index) ^ feature).count_ones();
    

    This performs a single trivial operation, so it looks innocuous, but the way hwt lets the user store features outside of the tree is working against it: every distance computation must chase a pointer out of the tree, which makes it much slower than linear search (except perhaps on huge data sets of more than 1B features).

    The API should be changed to give back features and not indices. The user can use their features to look up their data afterwards, which should have negligible performance impact since they only need to do a few lookups (k lookups for k-NN).
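The proposed change can be sketched as follows. This is a minimal illustration with a linear scan standing in for the tree; the Hwt type and method names here are hypothetical, not the actual hwt API:

```rust
// Hypothetical sketch: the tree owns its features and search returns
// the nearest feature itself rather than an index.
struct Hwt {
    features: Vec<u128>,
}

impl Hwt {
    fn new() -> Self {
        Hwt { features: Vec::new() }
    }

    fn insert(&mut self, feature: u128) {
        self.features.push(feature);
    }

    /// Returns the nearest feature, not an index; the caller looks up
    /// associated data afterwards (k lookups for k-NN).
    fn nearest(&self, query: u128) -> Option<u128> {
        self.features
            .iter()
            .copied()
            .min_by_key(|&f| (f ^ query).count_ones())
    }
}

fn main() {
    let mut tree = Hwt::new();
    tree.insert(0b1111);
    tree.insert(0b0001);
    // 0b0011 is at distance 1 from 0b0001 and distance 2 from 0b1111.
    assert_eq!(tree.nearest(0b0011), Some(0b0001));
}
```

Returning the feature itself means the distance computation only ever touches memory the tree already owns; the caller maps the k returned features back to their own data afterwards.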

    This has two negative impacts:

    • This could make the tree slightly larger.
    • This will make it harder for the user to store the data however they want.

    Seeing as performance is the most important thing here, that takes precedence over these issues.

    opened by vadixidav 4
  • Use packed_simd to speed up FeatureHeap::add

    Use packed_simd to speed up FeatureHeap::add

    Profiling currently shows that a lot of time is spent in FeatureHeap::add doing XOR, popcount, and comparisons. We can speed this up with SIMD by doing the XOR, popcnt, and compares in parallel.
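As a rough illustration of the loop shape involved (packed_simd itself is nightly-only, so this scalar sketch only shows the branch-free XOR + popcount pass that SIMD lanes would parallelize; the function name is made up, not the real FeatureHeap code):

```rust
// Illustrative loop shape only: FeatureHeap::add spends its time on XOR,
// popcount, and compares. A single branch-free pass like this is what
// explicit SIMD (or the auto-vectorizer) can run several lanes at a time.
fn hamming_distances(features: &[u128], query: u128, out: &mut Vec<u32>) {
    out.clear();
    // XOR then popcount per feature, with no data-dependent branches.
    out.extend(features.iter().map(|&f| (f ^ query).count_ones()));
}

fn main() {
    let features = [0u128, 1, 3, u128::MAX];
    let mut dists = Vec::new();
    hamming_distances(&features, 1, &mut dists);
    assert_eq!(dists, vec![1, 0, 1, 127]);
}
```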

    opened by vadixidav 0
  • Support approximate nearest neighbor

    Support approximate nearest neighbor

    We can trivially support approximate nearest neighbor search by allowing neighbors at the current search radius + n. This means the result can be off by a distance of up to n. The performance gain should be high, and it is easy to control precisely how far off the result can be (at most n).
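The acceptance rule being proposed amounts to a one-line predicate (a hedged sketch; the names are illustrative):

```rust
// Sketch of the proposed knob: a candidate is accepted if it lies within
// the current search radius plus a slack of n, so n = 0 keeps the search
// exact and larger n trades accuracy for speed.
fn accept(distance: u32, current_radius: u32, n: u32) -> bool {
    distance <= current_radius + n
}

fn main() {
    assert!(!accept(5, 3, 0)); // exact search rejects distance 5 at radius 3
    assert!(accept(5, 3, 2)); // a slack of 2 accepts it
    assert!(accept(3, 3, 0)); // on-radius candidates are always accepted
}
```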

    opened by vadixidav 0
  • failed to select a version for ahash

    failed to select a version for ahash

    $ cargo build
        Updating crates.io index
    error: failed to select a version for the requirement `ahash = "^0.1.18"`
    candidate versions found which didn't match: 0.8.0, 0.7.6, 0.7.5, ...
    location searched: crates.io index
    required by package `hwt v0.4.2`
        ... which is depended on by `bfp v0.1.0 (/workplace/amooren/sandbox/src/Amooren/bfp)`
    

    A clean crate, just adding the latest version of hwt.

    opened by mooreniemi 0
  • Dynamic sized features

    Dynamic sized features

    I'm interested in using this crate to collate hashes produced by my img_hash library. I started work on an implementation of a multiple vantage-point tree, but I haven't had the time to fill out the tests or even finish the documentation, and I have a feeling it's going to perform pretty abysmally by default. I really prefer your approach in this crate, as I feel like we're working in the same problem space.

    However, one of the features of img_hash is that it can produce hashes of varying sizes depending on a user's preference between performance and accuracy. By default, I have it producing 64-bit hashes so representing them as u128 would be pretty wasteful, and I also want to experiment with hash sizes of 256 bits and maybe greater so in that case u128 would require discarding information from the hash.

    In the 3.0.0-alpha I have yet to release I have a generic representation of hash bytes which gives the option of either allocating them on the heap (for dynamic sizes) or inline (to save a misdirection and possibly improve codegen). I think this kind of generic approach would benefit hwt as well but I'm wondering what your opinion is of it.

    I don't entirely understand the HWT algorithm so I don't know if 128-bit features are integral to its function but I have a feeling it wouldn't be too difficult to generalize over any size feature as long as they're all the same size in the tree.
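One way the generic representation could look, assuming features of the same size are compared word by word (this trait is purely hypothetical, not img_hash's or hwt's actual API):

```rust
// Hypothetical trait: any same-sized bit string with a Hamming distance
// could serve as a feature, whether a fixed-size integer stored inline or
// a heap-allocated vector of words for dynamic sizes.
trait Feature {
    fn hamming(&self, other: &Self) -> u32;
}

// 64-bit hashes, stored inline.
impl Feature for u64 {
    fn hamming(&self, other: &Self) -> u32 {
        (self ^ other).count_ones()
    }
}

// Dynamically sized hashes (e.g. 256 bits and beyond), stored on the heap;
// the distance is summed word by word.
impl Feature for Vec<u64> {
    fn hamming(&self, other: &Self) -> u32 {
        debug_assert_eq!(self.len(), other.len());
        self.iter()
            .zip(other)
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}

fn main() {
    assert_eq!(1u64.hamming(&3), 1);
    assert_eq!(vec![1u64, 0].hamming(&vec![3, u64::MAX]), 65);
}
```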

    opened by abonander 10
  • Convert search module to use a hand-optimized iterator for each level of the tree

    Convert search module to use a hand-optimized iterator for each level of the tree

    Currently the search module is implemented in a convenient, but inefficient, way. It defines an iterator for 2-bit slices, then another for 4-bit slices that flat-maps two of the 2-bit ones, and so on. This makes the iterator type names grow out of control (a Rust compiler/type-length issue), which then requires Box to wrap the types. It is also very wasteful, since it performs up to 128 128-bit operations on 1-bit numbers. The search module should be refactored (possibly at the expense of code size) for speed, using SIMD where possible.
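A toy illustration of the type blow-up (not the actual search module code): each wider slice iterator flat-maps the narrower one, nesting the concrete type one level per width until boxing becomes unavoidable.

```rust
// Each wider slice iterator is built by flat-mapping the narrower one, so
// the concrete type gains a FlatMap<Map<...>> layer per width.
fn two_bit() -> impl Iterator<Item = u32> {
    0u32..4 // all 2-bit values
}

fn four_bit() -> Box<dyn Iterator<Item = u32>> {
    // One more layer like this per width: the type name grows until
    // Box<dyn Iterator> is forced to cap it.
    Box::new(two_bit().flat_map(|hi| two_bit().map(move |lo| (hi << 2) | lo)))
}

fn main() {
    assert_eq!(two_bit().count(), 4);
    assert_eq!(four_bit().count(), 16); // every 4-bit value, exactly once
}
```

A hand-optimized iterator per level would replace this composition with explicit loops, trading code size for speed as the issue suggests.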

    opened by vadixidav 0
  • Performance tracking issue

    Performance tracking issue

    Performance currently only beats linear search (which is incredibly fast on modern processors) for exact search at more than 1M features. This is probably beatable, but it will require some optimizations. Below is the list of optimizations that need to be made, with their issues:

    k-NN

    opened by vadixidav 1
  • Support Lowe's ratio retrieval

    Support Lowe's ratio retrieval

    Once the search distance for the second nearest neighbor exceeds the Lowe's ratio threshold, we know that the Lowe's ratio test has succeeded. We should return a Result<u128, (u128, u128)> for this operation. If the Lowe's ratio test succeeds, we return the nearest neighbor. If it fails, we return the nearest neighbor along with the feature that caused the test to fail (the second nearest neighbor).

    This will speed up photogrammetry algorithms, because we can stop searching as soon as the Lowe's ratio test has been satisfied.
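A sketch of the proposed return type, assuming Hamming distances are compared against a caller-supplied ratio (the function name and threshold are illustrative, not hwt's actual API):

```rust
// Sketch of the proposed Result API: Ok(nearest) when the Lowe's ratio
// test passes, Err((nearest, second)) with the offending second neighbor
// when it fails.
fn lowes_ratio_test(
    query: u128,
    nearest: u128,
    second: u128,
    ratio: f32,
) -> Result<u128, (u128, u128)> {
    let d1 = (query ^ nearest).count_ones() as f32;
    let d2 = (query ^ second).count_ones() as f32;
    if d1 < ratio * d2 {
        Ok(nearest)
    } else {
        Err((nearest, second))
    }
}

fn main() {
    // Nearest at distance 1, second at distance 4: 1 < 0.7 * 4, so the test passes.
    assert_eq!(lowes_ratio_test(0b0000, 0b0001, 0b1111, 0.7), Ok(0b0001));
    // Second neighbor also at distance 1: 1 >= 0.7 * 1, so the test fails.
    assert_eq!(lowes_ratio_test(0b0000, 0b0001, 0b0010, 0.7), Err((0b0001, 0b0010)));
}
```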

    opened by vadixidav 0
Owner
Rust Computer Vision
Dedicated to making computer vision easier than OpenCV and faster than C++
Rust Computer Vision
HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"

hnsw Hierarchical Navigable Small World Graph for fast ANN search Enable the serde feature to serialize and deserialize HNSW. Tips A good default for

Rust Computer Vision 93 Dec 30, 2022
Rust implementation of multi-index hashing for neighbor searches on binary codes in the Hamming space

mih-rs Rust implementation of multi-index hashing (MIH) for neighbor searches on binary codes in the Hamming space, described in the paper Norouzi, Pu

Shunsuke Kanda 8 Sep 23, 2022
Graph-based Approximate Nearest Neighbor Search

granne* granne (graph-based retrieval of approximate nearest neighbors) is a Rust library for approximate nearest neighbor search based on Hierarchica

null 283 Dec 21, 2022
🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

Hora-Search 2.3k Jan 3, 2023
C library for finding nearest (most similar) element in a set

VP-tree nearest neighbor search A relatively simple and readable Rust implementation of Vantage Point tree search algorithm. The VP tree algorithm doe

Kornel 28 Aug 20, 2022
Rust-port of spotify/annoy as a wrapper for Approximate Nearest Neighbors in C++/Python optimized for memory usage.

Rust-port of spotify/annoy as a wrapper for Approximate Nearest Neighbors in C++/Python optimized for memory usage.

Arthur·Thomas 13 Mar 10, 2022
Rust-port of spotify/annoy as a wrapper for Approximate Nearest Neighbors in C++/Python optimized for memory usage.

Fareast This library is a rust port of spotify/annoy , currently only index serving is supported. It also provides FFI bindings for jvm, dotnet and da

Arthur·Thomas 13 Mar 10, 2022
A simple url checker for finding fraud url(s) or nearest url

urlchecker A simple url checker for finding fraud url(s) or nearest url while being fast (threading) Eg:- use std::collections::HashMap; use urlchecke

Subconscious Compute 2 Aug 7, 2022
Instance Distance is a fast pure-Rust implementation of the Hierarchical Navigable Small Worlds paper

Fast approximate nearest neighbor searching in Rust, based on HNSW index

Instant Domain Search, Inc. 135 Dec 24, 2022
MO's Trading - an online contest for high frequency trading

MO's Trading - an online contest for high frequency trading

Runji Wang 29 Dec 14, 2022
An 8080 Space Invaders emulator in Rust

Space Invade.rs An 8080 Space Invaders emulator written in Rust This is an 8080 emulator running the 1978 Space Invaders game by Taito, written in Rus

Cedric Beust 23 Dec 27, 2022
A stable, linearithmic sort in constant space written in Rust

A stable, linearithmic sort in constant space written in Rust. Uses the method described in "Fast Stable Merging And Sorting In Constant Extra Space"

Dylan MacKenzie 4 Mar 30, 2022
TopK algorithm implementation in Rust (Filtered Space-Saving)

TopK TopK algorithm implementation in Rust. This crate currently provides the Filtered Space-Saving algorithm. Version numbers follow the semver conve

null 6 Feb 24, 2023
K-dimensional tree in Rust for fast geospatial indexing and lookup

kdtree K-dimensional tree in Rust for fast geospatial indexing and nearest neighbors lookup Crate Documentation Usage Benchmark License Usage Add kdtr

Rui Hu 154 Jan 4, 2023
An Implementation of the Context Tree Weighting (CTW) Sequence Prediction Algorithm

Context Tree Weighting (CTW) CTW is a lightweight, practical and well performing sequence prediction algorithm discovered by Frans Willems, Yuri Shtar

null 7 Dec 23, 2022
Qdrant - vector similarity search engine with extended filtering support

Vector Similarity Search Engine with extended filtering support Qdrant (read: quadrant ) is a vector similarity search engine. It provides a productio

qdrant 3.5k Dec 30, 2022
A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the loss function.

random_search A neural network model that can approximate any non-linear function by using the random search algorithm for the optimization of the los

ph04 2 Apr 1, 2022
Machine learning framework for building object trackers and similarity search engines

Similari Similari is a framework that helps build sophisticated tracking systems. The most frequently met operations that can be efficiently implement

In-Sight 71 Dec 28, 2022
Nearest neighbor search algorithms including a ball tree and a vantage point tree.

petal-neighbors Nearest neighbor search algorithms including a ball tree and a vantage point tree. Examples The following example shows how to find tw

Petabi 6 Oct 19, 2022