Overview

simsearch

Build Status crates.io docs.rs

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

Documentation

Usage

Add the following to your Cargo.toml:

[dependencies]
simsearch = "0.2"

Example

use simsearch::SimSearch;

let mut engine: SimSearch<u32> = SimSearch::new();

engine.insert(1, "Things Fall Apart");
engine.insert(2, "The Old Man and the Sea");
engine.insert(3, "James Joyce");

let results: Vec<u32> = engine.search("thngs");

assert_eq!(results, &[1]);

By default, Jaro-Winkler distance is used. An alternative Levenshtein distance, which is SIMD-accelerated but only works for ASCII byte strings, can be specified with custom SearchOptions:

use simsearch::{SimSearch, SearchOptions};

let options = SearchOptions::new().levenshtein(true);
let mut engine: SimSearch<u32> = SimSearch::new_with(options);
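
The Levenshtein option only changes how candidates are scored; inserting and searching work exactly as in the first example. A minimal sketch continuing from the engine above, using only calls already shown in this README:

engine.insert(1, "Things Fall Apart");
engine.insert(2, "The Old Man and the Sea");

let results: Vec<u32> = engine.search("old mann");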

You can also try the interactive demo:

$ cargo run --release --example books

Contribution

All kinds of contributions are welcome.

  • Issues. Feel free to open an issue when you find typos or bugs, or have any questions.
  • Pull requests. New features, better implementations, more tests, more documentation, and typo fixes are all welcome.

License

Licensed under MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

Comments
  • Incorporating Levenshtein distance as another option in addition to Jaro-Winkler

    Hey, thanks for your work on this library. I was wondering whether you would be interested in adding Levenshtein distance as an option? I have recently been working on accelerating bounded Levenshtein distance calculations with SIMD, which is around 20-30x faster than a normal scalar Levenshtein distance algorithm for medium-length strings, and I thought it would be beneficial to use it in your library. The library is on crates.io: https://crates.io/crates/triple_accel.

    However, there are some tradeoffs: my library cannot handle Unicode strings or null bytes due to the SIMD implementation. This could be addressed by letting the user explicitly opt in to the SIMD implementation. Of course, my library automatically falls back to a scalar version if SIMD is not supported as a CPU feature.

    If you are interested, we could discuss the implementation a little more. I can probably work on this in the future.
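
    For context, the scalar baseline being accelerated is the classic dynamic-programming edit distance; a minimal sketch of that baseline (purely illustrative, not the triple_accel implementation):

    // Classic O(n*m) scalar Levenshtein distance over byte strings,
    // shown only to illustrate what the SIMD version speeds up.
    fn levenshtein_scalar(a: &[u8], b: &[u8]) -> usize {
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        let mut curr = vec![0usize; b.len() + 1];
        for (i, &ca) in a.iter().enumerate() {
            curr[0] = i + 1;
            for (j, &cb) in b.iter().enumerate() {
                let cost = if ca == cb { 0 } else { 1 };
                curr[j + 1] = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
            }
            std::mem::swap(&mut prev, &mut curr);
        }
        prev[b.len()]
    }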

    opened by Daniel-Liu-c0deb0t 5
  • Added SIMD-accelerated Levenshtein distance support based on the `triple_accel` library

    I added support for Levenshtein distance using the triple_accel library. Levenshtein distance can optionally be enabled through SearchOptions. I also updated the readme, documentation, and benchmarks to reflect this change, and tried to follow the existing code style. The code passes all tests on my laptop; for reference, here is an overview of the benchmark results:

    add books: ~1.4 ms
    search_jaro_winkler: ~212 us
    search_levenshtein: ~131 us
    

    SIMD Levenshtein distance should be even faster with longer strings.

    opened by Daniel-Liu-c0deb0t 4
  • [feature req] tokenization of other languages

    simsearch-rs totally fails on Korean text. How involved would it be to have it tokenize hangeul?

    searching 던지(요) should match id 866 (name: -던지(요)), but gives a top result of id 1527 (name: 에게로 【너에게로 다가간다】).

    raw json

    use std::{error::Error, fs};
    
    use serde::{Deserialize, Serialize};
    use simsearch::SimSearch;
    
    #[derive(Debug, Deserialize, Serialize)]
    struct Entry {
        id: i64,
        name: String,
        typical_example: String,
        translation: String,
        category_eng: String,
        category_kor: String,
    }
    
    fn main() -> Result<(), Box<dyn Error + 'static>> {
        let mut engine: SimSearch<i64> = SimSearch::new();
    
        let data = fs::read_to_string("./output.json")?;
        let entries: Vec<Entry> = serde_json::from_str(&data)?;
    
        for entry in &entries {
            engine.insert(entry.id, &entry.name);
        }
    
        let results = engine.search("던지(요)");
        let top_match = results[1];
    
        let entry = entries.iter().filter(|e| e.id == top_match).collect::<Vec<_>>();
    
        Ok(())
    }
    
    opened by andrewzah 2
  • Use criterion for benches

    Hi Andy, and thanks for your crate!

    I have wanted to implement fuzzy search in Rust for some time, and thanks to your post on Reddit I discovered that the work is already done ;)

    I looked at your benches and noticed that you're not using criterion, which can draw fancy plots for you and compare against a previous run (very helpful when you're trying to improve performance), so here's a PR that migrates the benches. Apart from the migration itself, I also made the add 100 bench insert items into an empty engine on each benchmark iteration, which I believe makes the bench fairer.
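
    A minimal sketch of what such a bench can look like with criterion's iter_batched, so that a fresh engine is built for every iteration (the bench name and data here are illustrative, not the exact code from the PR):

    use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
    use simsearch::SimSearch;

    fn bench_add(c: &mut Criterion) {
        // Illustrative input data; not the crate's actual bench fixture.
        let titles: Vec<String> = (0..100).map(|i| format!("book title {}", i)).collect();
        c.bench_function("add 100", |b| {
            b.iter_batched(
                // Setup (not measured): a fresh, empty engine per iteration.
                SimSearch::<usize>::new,
                // Routine (measured): insert all items into the empty engine.
                |mut engine| {
                    for (id, title) in titles.iter().enumerate() {
                        engine.insert(id, title);
                    }
                    engine
                },
                BatchSize::SmallInput,
            )
        });
    }

    criterion_group!(benches, bench_add);
    criterion_main!(benches);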

    opened by mexus 2
  • fix #11 too slow insert

    Hi! Please check this PR, which fixes the slow insert. The solution is based on two hashmaps (sketched below):

    • Id to id_num
    • id_num to Id

    SlotMap or slab might be a better fit here, but hashmaps work fine too.
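
    A minimal sketch of the two-map idea (field and type names are illustrative, not the exact code from the PR):

    use std::collections::HashMap;
    use std::hash::Hash;

    // Sketch of the two-map bookkeeping. Id needs Eq + Hash for the forward map.
    struct IdMaps<Id: Eq + Hash + Clone> {
        id_to_num: HashMap<Id, usize>,
        num_to_id: HashMap<usize, Id>,
        next_num: usize,
    }

    impl<Id: Eq + Hash + Clone> IdMaps<Id> {
        fn insert_id(&mut self, id: Id) -> usize {
            // O(1) lookup instead of scanning every previously inserted key.
            if let Some(&num) = self.id_to_num.get(&id) {
                return num;
            }
            let num = self.next_num;
            self.next_num += 1;
            self.id_to_num.insert(id.clone(), num); // the extra clone mentioned below
            self.num_to_id.insert(num, id);
            num
        }
    }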

    Notes:

    • this is a breaking change, because Id now has Eq + Hash trait bounds
    • there is one extra clone of Id on each insert

    Try it with this example: https://github.com/estin/simsearch-parse-cities

    $ cargo run --release
    
    opened by estin 1
  • too slow insert on large set of data

    Hi!

    I tried to parse cities500.text (196,648 cities) from the free data at http://download.geonames.org/export/dump/ and build an engine from it, and found it to be a very slow process...

    https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L136

    Is it really necessary to delete the key on each insert?

    And delete itself is also slow: https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L262

    Why not use a HashMap to map id to id_num?

    When inserting many items into the engine, every insert searches for the id_num across the whole list of previously inserted keys.

    I can make a PR to fix this, but I'm not sure I understand this behavior correctly.

    Sorry for my poor English.

    opened by estin 1
  • Support option to use max score of tokens instead of sum

    I found that changing the scoring to use the max score across tokens improved the search output for me. In particular, it returns exact matches on a particular token first, which is what I want.

    I changed the following line

    https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L239

    To

        result_scores
            .entry(*id_num)
            .and_modify(|v| *v = v.max(score))
            .or_insert(score);
    

    Could an option for this be incorporated into SearchOptions? Maybe a field where you can specify:

    enum TokenScoreMethod {
        Sum,
        Max
    }
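
    A sketch of how the score accumulation could branch on such an option (the TokenScoreMethod plumbing is hypothetical; the Max arm mirrors the change above and the Sum arm mirrors the current summing behaviour):

    use std::collections::HashMap;

    // Hypothetical: TokenScoreMethod is not part of SearchOptions today.
    enum TokenScoreMethod {
        Sum,
        Max,
    }

    fn accumulate(scores: &mut HashMap<usize, f64>, id_num: usize, score: f64, method: &TokenScoreMethod) {
        match method {
            // Current behaviour: token scores are summed per id.
            TokenScoreMethod::Sum => {
                scores.entry(id_num).and_modify(|v| *v += score).or_insert(score);
            }
            // Proposed alternative: keep only the best token score, as in the change above.
            TokenScoreMethod::Max => {
                scores.entry(id_num).and_modify(|v| *v = v.max(score)).or_insert(score);
            }
        }
    }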
    
    opened by rossmacarthur 1
  • Use latest version of triple_accel: v0.3.4

    Running cargo update on a dependent library caused compile errors with the older triple_accel.

    All tests pass (with cargo test --release); it seems triple_accel didn't break compatibility.

    opened by Mr-Andersen 1
  • Performance improved 2x by reworking the ids data structure

    Rather than storing a vec of tuples, we push the IDs into one vector.

    In order to facilitate deletes, the vector stores options.

    The drawback is that a lot of deletes might create a sparse vector that still occupies memory.

    The advantage, though, is 2x faster searches on large-ish datasets (a few tens of thousands of items), because we eliminate a big iteration.

    Definitely seems worth it.

    In addition, when you don't do a lot of deletes, it's more memory efficient.
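
    A minimal sketch of one way this could look (illustrative, not the exact code from the PR): id_num is just an index into a single Vec, and a delete blanks the slot instead of shifting everything:

    // Sketch: id_num is simply the index into a single vector of optional IDs.
    struct Ids<Id> {
        slots: Vec<Option<Id>>,
    }

    impl<Id: PartialEq> Ids<Id> {
        fn insert(&mut self, id: Id) -> usize {
            self.slots.push(Some(id));
            self.slots.len() - 1
        }

        fn delete(&mut self, id: &Id) {
            // Blanks the slot instead of shifting; many deletes leave a sparse vector.
            for slot in self.slots.iter_mut() {
                if slot.as_ref() == Some(id) {
                    *slot = None;
                }
            }
        }

        fn get(&self, id_num: usize) -> Option<&Id> {
            // O(1) lookup by id_num, which is what removes the big iteration from search.
            self.slots.get(id_num).and_then(|s| s.as_ref())
        }
    }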

    opened by Ivshti 1
  • Return scores with search results

    First, thanks for the great library! I'm using it in a hobby password manager project of mine. It's the only Rust library I've found that actually works nicely for searching password items.

    The current search functions, search and search_tokens, only return the result item ids. This is fine for most situations, as the results are already sorted by the score.

    However, an ability to get the scores along with the ids would allow sorting items with equal scores in a custom way. For instance, in my password manager, I would want to sort items with equal scores alphabetically.


    As a side note, the biggest issue I have right now is that search result order is not stable. Searching with the same term on the same data set multiple times returns items with equal scores in different orders:

    use simsearch::SimSearch;
    
    fn main() {
        // Generate items
        let items: Vec<_> = (1..=50).map(|n| format!("Sample item {}", n)).collect();
    
        let mut ss = SimSearch::new();
        for i in &items {
            ss.insert(i, i);
        }
    
        for i in 0..3 {
            println!("Run #{i}");
            let res = ss.search("sample item");
            for id in &res[0..2] {
                println!("  {id}");
            }
            println!("---");
        }
    }
    

    Running this prints out:

    Run #0
      Sample item 28
      Sample item 21
    ---
    Run #1
      Sample item 1
      Sample item 29
    ---
    Run #2
      Sample item 37
      Sample item 38
    ---
    

    I think the reason for this instability is the use of HashMap internally. Perhaps it could be fixed by changing the hasher implementation, but ultimately I think returning the scores is a more versatile solution.
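
    For illustration, if results came back as (id, score) pairs, a caller could sort by score and break ties alphabetically like this (a sketch; simsearch does not return scores today, and the values below are made up):

    fn main() {
        // Pretend these pairs came back from a score-returning search.
        let mut results: Vec<(String, f64)> = vec![
            ("Sample item 28".to_string(), 0.9),
            ("Sample item 1".to_string(), 0.9),
            ("Sample item 3".to_string(), 0.7),
        ];

        // Sort by score descending, then break ties alphabetically by id.
        results.sort_by(|a, b| {
            b.1.partial_cmp(&a.1)
                .unwrap_or(std::cmp::Ordering::Equal)
                .then_with(|| a.0.cmp(&b.0))
        });

        for (id, score) in &results {
            println!("{id}: {score}");
        }
    }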

    opened by luryus 2
Owner
Andy Lok
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 679 Jan 1, 2023
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 0 Apr 25, 2022
stringsext - search for multi-byte encoded strings in binary data

title stringsext - search for multi-byte encoded strings in binary data stringsext is a Unicode enhancement of the GNU strings tool with additional fu

Jens Getreu 89 Dec 14, 2022
Finding all pairs of similar documents time- and memory-efficiently

Finding all pairs of similar documents This software provides time- and memory-efficient all pairs similarity searches in documents. Problem definitio

null 53 Jan 2, 2023
Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

Yuxiang Liu 5 Mar 9, 2022
🔍TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites.

tinysearch TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites. TinySearch is written in Rust, and then com

null 2.2k Dec 31, 2022
🔎 A simple in-memory search for collections and key-value stores.

Indicium Search 🔎 A simple in-memory search for collections (Vec, HashMap, BTreeMap, etc) and key-value stores. Features autocompletion. There are ma

Dylan Bowker 41 Oct 28, 2022
Searching for plain-text files for lines that match a given string. Built with Rust.

Getting Started This is a minimal grep command-line utility built on Rust. It provides searching for plain-text files for lines that match a given str

Harsh Karande 0 Dec 31, 2021
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Valerian Saliou 17.4k Jan 2, 2023
A lightweight full-text search library that provides full control over the scoring calculations

probly-search · A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations. This start initia

Quantleaf 20 Nov 26, 2022
Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

MeiliSearch Website | Roadmap | Blog | LinkedIn | Twitter | Documentation | FAQ ⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine M

MeiliSearch 31.6k Dec 31, 2022
Perlin: An Efficient and Ergonomic Document Search-Engine

Table of Contents 1. Perlin Perlin Perlin is a free and open-source document search engine library build on top of perlin-core. Since the first releas

CurrySoftware GmbH 70 Dec 9, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

tantivy 7.4k Dec 28, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

Quickwit OSS 7.5k Jan 9, 2023
Configurable quick search engine shortcuts for your terminal and browser.

Quicksearch Configurable quick search engine shortcuts for your terminal and browser. Installation Run cargo install quicksearch to install Configurat

Rahul Pai 2 Oct 14, 2022
High-performance log search engine.

NOTE: This project is under development, please do not depend on it yet as things may break. MinSQL MinSQL is a log search engine designed with simpli

High Performance, Kubernetes Native Object Storage 359 Nov 27, 2022
AI-powered search engine for Rust

txtai: AI-powered search engine for Rust txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform simi

NeuML 69 Jan 2, 2023
A full-text search engine in rust

Toshi A Full-Text Search Engine in Rust Please note that this is far from production ready, also Toshi is still under active development, I'm just slo

Toshi Search 3.8k Jan 7, 2023
Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs

localbang Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs What are "bangs"?? Bangs are a way to define where to search inside

Jakob Kruse 7 Nov 23, 2022