HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"

Overview

hnsw

Hierarchical Navigable Small World Graph for fast ANN search

Enable the serde feature to serialize and deserialize HNSW.
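For example, a minimal hedged sketch of a round trip (this assumes the feature derives serde's Serialize and Deserialize on Hnsw, that the point type is itself serializable, and that rand_pcg's serde support is enabled for Pcg64; bincode is used purely as an illustrative format and is not a dependency of this crate):

use hnsw::Hnsw;
use rand_pcg::Pcg64;
use serde::{de::DeserializeOwned, Serialize};
use space::MetricPoint;

// `P` is a hypothetical point type; any serde format would work here.
fn round_trip<P: MetricPoint + Serialize + DeserializeOwned>(
    hnsw: &Hnsw<P, Pcg64, 12, 24>,
) -> bincode::Result<Hnsw<P, Pcg64, 12, 24>> {
    let bytes = bincode::serialize(hnsw)?;
    bincode::deserialize(&bytes)
}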

Tips

A good default for the M and M0 parameters is 12 and 24, respectively. According to the paper, M0 should always be double M, but you can change both freely.
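For instance, a minimal sketch (mirroring the example below) of an index built with M = 16 and M0 = 32, where M and M0 are the trailing const-generic parameters of Hnsw:

use hnsw::{Hnsw, Searcher};
use rand_pcg::Pcg64;
use space::MetricPoint;

// A hypothetical 1-byte point type, mirroring the Hamming example below.
struct Byte(u8);

impl MetricPoint for Byte {
    type Metric = u8;

    fn distance(&self, other: &Self) -> u8 {
        (self.0 ^ other.0).count_ones() as u8
    }
}

// M = 16 paired with M0 = 32 keeps M0 at double M, per the paper's advice.
fn custom_index() -> (Hnsw<Byte, Pcg64, 16, 32>, Searcher<u8>) {
    (Hnsw::new(), Searcher::default())
}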

Example

Binary feature search using hamming distance

use hnsw::{Hnsw, Searcher};
use rand_pcg::Pcg64;
use space::{MetricPoint, Neighbor};

struct Hamming(u8);

impl MetricPoint for Hamming {
    type Metric = u8;

    fn distance(&self, other: &Self) -> u8 {
        (self.0 ^ other.0).count_ones() as u8
    }
}

fn test_hnsw_discrete() -> (Hnsw<Hamming, Pcg64, 12, 24>, Searcher<u8>) {
    let mut searcher = Searcher::default();
    let mut hnsw = Hnsw::new();

    let features = [
        0b0001, 0b0010, 0b0100, 0b1000, 0b0011, 0b0110, 0b1100, 0b1001,
    ];

    for &feature in &features {
        hnsw.insert(Hamming(feature), &mut searcher);
    }

    (hnsw, searcher)
}

#[test]
fn insertion_discrete() {
    test_hnsw_discrete();
}

#[test]
fn nearest_neighbor_discrete() {
    let (hnsw, mut searcher) = test_hnsw_discrete();
    let mut neighbors = [Neighbor {
        index: !0,
        distance: !0,
    }; 8];

    hnsw.nearest(&Hamming(0b0001), 24, &mut searcher, &mut neighbors);
    // Distance 1
    neighbors[1..3].sort_unstable();
    // Distance 2
    neighbors[3..6].sort_unstable();
    // Distance 3
    neighbors[6..8].sort_unstable();
    assert_eq!(
        neighbors,
        [
            Neighbor {
                index: 0,
                distance: 0
            },
            Neighbor {
                index: 4,
                distance: 1
            },
            Neighbor {
                index: 7,
                distance: 1
            },
            Neighbor {
                index: 1,
                distance: 2
            },
            Neighbor {
                index: 2,
                distance: 2
            },
            Neighbor {
                index: 3,
                distance: 2
            },
            Neighbor {
                index: 5,
                distance: 3
            },
            Neighbor {
                index: 6,
                distance: 3
            }
        ]
    );
}

Please refer to the space documentation for the distance trait and related types. It also contains special Bits128 - Bits4096 tuple structs that wrap an array of bytes and enable SIMD. The provided benchmarks use these SIMD implementations.
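As a rough, hedged sketch only (it assumes a Bits512 type exists within that range, is constructed from a [u8; 64], and implements the distance trait directly, none of which is shown here), an index over 512-bit binary features might look like:

use hnsw::{Hnsw, Searcher};
use rand_pcg::Pcg64;
use space::Bits512;

fn bits_index() -> Hnsw<Bits512, Pcg64, 12, 24> {
    let mut searcher = Searcher::default();
    let mut hnsw = Hnsw::new();
    // Insert an all-zero and an all-one 512-bit feature (hamming distance 512 apart).
    hnsw.insert(Bits512([0u8; 64]), &mut searcher);
    hnsw.insert(Bits512([0xFF; 64]), &mut searcher);
    hnsw
}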

Floating-point search using euclidean distance

After a recent refactor, an implementation is not currently provided for euclidean distance; hamming distance was more relevant at the time, so it was prioritized. To implement euclidean distance, do something roughly like the following:

use space::MetricPoint;

struct Euclidean<'a>(&'a [f32]);

impl MetricPoint for Euclidean<'_> {
    type Metric = u32;

    fn distance(&self, rhs: &Self) -> u32 {
        // Compute the euclidean distance, then map it to an ordered unsigned
        // integer via the bit pattern of the non-negative f32 result.
        self.0
            .iter()
            .zip(rhs.0.iter())
            .map(|(&a, &b)| (a - b).powi(2))
            .sum::<f32>()
            .sqrt()
            .to_bits()
    }
}

Note that the above implementation may accumulate numerical error at high dimensionality; in that case, use a Kahan sum instead. It also may not utilize SIMD, but using a fixed-size array may help with that.
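As a hedged sketch of that suggestion (using the same assumed trait shape as above, with the plain summation swapped for a Kahan-compensated loop):

use space::MetricPoint;

struct KahanEuclidean<'a>(&'a [f32]);

impl MetricPoint for KahanEuclidean<'_> {
    type Metric = u32;

    fn distance(&self, rhs: &Self) -> u32 {
        // Kahan summation: carry a running compensation for lost low-order bits.
        let mut sum = 0.0f32;
        let mut compensation = 0.0f32;
        for (&a, &b) in self.0.iter().zip(rhs.0.iter()) {
            let term = (a - b).powi(2) - compensation;
            let next = sum + term;
            compensation = (next - sum) - term;
            sum = next;
        }
        // Map the non-negative distance to an ordered unsigned integer.
        sum.sqrt().to_bits()
    }
}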

Benchmarks

Here is a recall graph that you can compare to its alternatives:

Recall Graph

For more benchmarks and how to benchmark, see benchmarks.md.

Implementation

This is based on the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs" by Yu. A. Malkov and D. A. Yashunin. That paper builds on the original NSW paper; the authors wrote multiple papers on NSW, which preceded HNSW.

For more details about parameters and details of the implementation, see implementation.md.

Credit

This is in no way a direct copy or reimplementation of the original implementation. It was made purely from the paper, without reference to the original headers. The paper is very well written and easy to understand, with some minor exceptions. Thank you to the authors for your valuable contribution.

Questions? Contributions? Excited?

Please visit the Rust CV Discord.

Comments
  • Dependency on packed_simd

    Hnsw depends on packed_simd, which is no longer in development.

    On my mac, hnsw is showing the same issues seen with packed_simd in other libraries: https://github.com/rust-lang/cargo/issues/9139

    It would be great to maybe move to https://github.com/rust-lang/stdsimd.

    packed_simd v0.3.3
    └── hnsw v0.3.1
    
    opened by bwindsor22 9
  • ndarray example?

    I wanted to implement distance as ndarray's dot but I'm finding the type errors inscrutable so far. Are there any examples anywhere of using this crate and ndarray together?

    I implemented:

    struct Embedding;
    impl Metric<ArrayView1<'_, f32>> for Embedding {
        type Unit = u64;
    
        fn distance(&self, a: &ArrayView1<f32>, b: &ArrayView1<f32>) -> Self::Unit {
            a.dot(b) as u64
        }
    }
    

    But when I try to insert like so:

        let mut searcher: Searcher<_> = Searcher::default();
        let mut index = Hnsw::new(Embedding);
        for i in 0..n {
            let row: ArrayView1<f32> = v.row(i);
            index.insert(row, &mut searcher);
        }
    

    I hit:

    error[E0599]: the method `insert` exists for struct `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>`, but its trait bounds were not satisfied
      --> src/hnsw.rs:64:15
       |
    13 | struct Embedding;
       | ----------------- doesn't satisfy `Embedding: space::Metric<_>`
    ...
    64 |         index.insert(row, &mut searcher);
       |               ^^^^^^ method cannot be called on `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>` due to unsatisfied trait bounds
       |
       = note: the following trait bounds were not satisfied:
               `Embedding: space::Metric<_>`
    note: the following trait must be implemented
      --> /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/space-0.17.0/src/lib.rs:50:1
       |
    50 | / pub trait Metric<P> {
    51 | |     type Unit: Unsigned + Ord + Copy;
    52 | |
    53 | |     fn distance(&self, a: &P, b: &P) -> Self::Unit;
    54 | | }
       | |_^
    
    

    It's not jumping out to me what's missing given I see no bounds on P and the bounds of the Self::Unit are met.

    opened by mooreniemi 5
  • improve readme

    The readme currently features a lot of benchmark pictures but no code example.

    Adding a separate benchmark file (linked in the readme) and scaling down to one picture plus one or two typical use case examples in the readme would make the library easier to use.

    opened by nestordemeure 3
  • Benchmark different filters and hash functions to use for previously visited nodes

    Currently a HashSet<u32> is being used to check if a node is in the set of seen nodes. The u32 in the set is the index in the zero layer which is unique for all elements. The u32 is very small and we need to be able to perform a hash on it very quickly. Additionally, the data structure for this filter should likely be changed for performance and memory consumption reasons.

    Since HNSW performs an ANN search, we can use things like bloom filters which produce some false positives, so long as there are no false negatives (could introduce infinite cycles in the graph).

    I would like to investigate and benchmark the following filters and hash algorithms (more will be added to this list as needed):

    Filters:

    • Bloom filter
    • Cuckoo filter

    Hash functions:

    • SipHash
    • Anything in this crate
    • aHash or other AES-based keyed-hashes

    All of these should be benchmarked in every combination with each other, specifically for the purposes of the HNSW nearest-neighbor search. Whatever gives the best performance with negligible impact on recall will be chosen.

    opened by vadixidav 2
  • Example in README doesn't work

    HNSW was renamed to Hnsw in version 0.7, so the readme is out of date in that respect. Also, I went to the docs looking for recommended values for M and M0, as these values now need to be set, but I don't see any relevant documentation for them. I guess I should look to the original paper, although it would be better if it were mentioned within the crate docs.

    opened by xd009642 1
  • Make it easier to use arrays with HNSW

    In https://github.com/LukeMathWalker/linfa/issues/15, it was decided that a new interface needed to be developed to make it easier to add arrays to HNSW. Due to a lack of abstractions surrounding numbers (mostly because const generics are not yet available), we will need to support specific array lengths. Each array length supported should be trivially convertible into an aligned SIMD type for usage in HNSW.

    opened by vadixidav 1
  • Create non-discrete HNSW

    Right now this crate only has DiscreteHNSW, which is incredibly useful for some specific tasks, particularly in performance computer vision. However, HNSW has nothing to do with discrete spaces at all, and the DiscreteHNSW is just a bunch of optimizations specific to discrete spaces. A general HNSW should be created that can operate on any two feature vectors for which distance can be computed where the computed distance is an f32. A trait should be created for this distance type, and it should also be implemented on things that support the DiscreteDistance trait.

    opened by vadixidav 1
  • BoW interface

    This provides enough insight into the HNSW layers to create a BoW implementation. Previously, it wasn't possible to publicly access the API for searching the inner layers of the HNSW. This adds a convenient API for doing that and also for extracting the features and zero-layer items of the items returned by the search_layer method.

    opened by vadixidav 0
  • Allow M_MAX and M_MAX0 to be parameterizable

    For the time being, this will need to be done with generic-array. Ultimately, const generics will be used to do this, but it may be a few months before that is even possible on nightly.

    opened by vadixidav 0
  • Main thread panics in examples

    I was trying out one of the examples in the examples folder when the execution failed due to: thread 'main' panicked at 'source slice length (1) does not match destination slice length (2)'

    The error refers to line 308 in src/hnsw/hnsw_const.rs. This happens because the minimum array length you get from the K-NN search is less than the default one set on the Opt object to hold nearest neighbors. Therefore copy_from_slice() panics.

    I managed to solve this by decreasing by 1 the default number of nearest neighbors (initially set to 2) in the Opt structure, since I think there will always be at least 1 node in the graph.

    I believe there's a more sophisticated way to fix this, but I haven't yet found it 🙃

    This is the full execution log

    Generating 100 random bitstrings...
    Done.
    Generating 100 independent random query strings...
    Done.
    Computing the correct nearest neighbor distance for all 100 queries...
    Done.
    Generating HNSW...
    Done.
    Computing recall graph...
    thread 'main' panicked at 'source slice length (1) does not match destination slice length (2)', /Users/cinderella/.cargo/registry/src/github.com-1ecc6299db9ec823/hnsw-0.11.0/src/hnsw/hnsw_const.rs:308:14
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by lucas-cauhe 0
  • HNSW for biology

    Dear HNSW author,

    This is Jianshu, a bioinformatics Ph.D. student at Georgia Tech. I am writing to ask whether you would be interested in forming a new collaboration: specifically, applying HNSW to genome classification, that is, finding the closest genomes in a large genome database so that environmental microbiologists can tell what taxonomy a query genome belongs to. This would have a very big impact on the field and would definitely attract many citations. I want to rely completely on Rust for this project, without any other language, considering its advantages compared to C++ and Python. I am not an expert in Rust but have been using it for 2 years. The biology and taxonomy-related knowledge will be my strong part; I also know all the classification software in this field and have been using and modifying it for my master's and half of my Ph.D. I am confident that HNSW in Rust will greatly improve the speed of genome classifiers. Hopefully we can come up with a paper in the end. My email is [email protected]

    Let me know if you are interested and if anything I mentioned above is unclear.

    Many thanks,

    Jianshu

    opened by jianshu93 1
  • Add Knn impl for a wrapper on HNSW

    In space 0.13.0, the Knn trait was added. We want to implement this trait for HNSW, but it needs a wrapper that provides the ef value to the search method. This should be simple to do.

    opened by vadixidav 0
  • Turn HNSW into Map and Set variants

    HNSW is more difficult to use than it should be. HNSW should be changed to have HNSWMap and HNSWSet variants. The current behavior should be preserved in that features can be accessed by index and the index should be returned when adding something to either type of HNSW. However, the HNSW should also store an item alongside each feature. The set variant will simply set that item to (), incurring no overhead.

    opened by vadixidav 0
  • feature() doc typo

    In the doc it says under feature(): The item must be retrieved from HNSW::search_layer.

    This should probably just be removed from the doc, or it should be corrected to say "must not be" rather than "must be".

    opened by vadixidav 0