Overview

simsearch

Build Status crates.io docs.rs

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

Documentation

Usage

Add the following to your Cargo.toml:

[dependencies]
simsearch = "0.2"

Example

use simsearch::SimSearch;

let mut engine: SimSearch<u32> = SimSearch::new();

engine.insert(1, "Things Fall Apart");
engine.insert(2, "The Old Man and the Sea");
engine.insert(3, "James Joyce");

let results: Vec<u32> = engine.search("thngs");

assert_eq!(results, &[1]);

By default, Jaro-Winkler distance is used. An alternative Levenshtein distance, which is SIMD-accelerated but only works for ASCII byte strings, can be specified with custom SearchOptions:

use simsearch::{SimSearch, SearchOptions};

let options = SearchOptions::new().levenshtein(true);
let mut engine: SimSearch<u32> = SimSearch::new_with(options);
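
The Levenshtein option only changes how candidates are scored; inserting and searching work exactly as in the first example. A minimal sketch continuing from the engine above, using only calls already shown in this README:

engine.insert(1, "Things Fall Apart");
engine.insert(2, "The Old Man and the Sea");

let results: Vec<u32> = engine.search("old mann");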

You can also try the interactive demo:

$ cargo run --release --example books

Contribution

All kinds of contributions are welcome.

  • Issues. Feel free to open an issue when you find typos or bugs, or have any questions.
  • Pull requests. New features, better implementations, more tests, more documentation, and typo fixes are all welcome.

License

Licensed under MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

Comments
  • Incorporating Levenshtein distance as another option in addition to Jaro-Winkler

    Hey, thanks for your work on this library. I was wondering whether you would be interested in adding Levenshtein distance as an option? I have recently been working on accelerating bounded Levenshtein distance calculations with SIMD, which is around 20-30x faster than a normal scalar Levenshtein distance algorithm for medium-length strings, and I thought it would be beneficial to use it in your library. The library is on crates.io: https://crates.io/crates/triple_accel.

    However, there are some tradeoffs: my library cannot handle Unicode strings or null bytes due to the SIMD implementation. This could be addressed by letting the user explicitly opt in to the SIMD implementation. Of course, my library automatically falls back to a scalar version if SIMD is not supported as a CPU feature.

    If you are interested, we could discuss the implementation a little more. I can probably work on this in the future.
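
    For context, the scalar baseline being accelerated is the classic dynamic-programming edit distance; a minimal sketch of that baseline (purely illustrative, not the triple_accel implementation):

    // Classic O(n*m) scalar Levenshtein distance over byte strings,
    // shown only to illustrate what the SIMD version speeds up.
    fn levenshtein_scalar(a: &[u8], b: &[u8]) -> usize {
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        let mut curr = vec![0usize; b.len() + 1];
        for (i, &ca) in a.iter().enumerate() {
            curr[0] = i + 1;
            for (j, &cb) in b.iter().enumerate() {
                let cost = if ca == cb { 0 } else { 1 };
                curr[j + 1] = (prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1);
            }
            std::mem::swap(&mut prev, &mut curr);
        }
        prev[b.len()]
    }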

    opened by Daniel-Liu-c0deb0t 5
  • Added SIMD-accelerated Levenshtein distance support based on the `triple_accel` library

    I added support for Levenshtein distance using the triple_accel library. Levenshtein distance can optionally be enabled through SearchOptions. I also updated the readme, documentation, and benchmarks to reflect this change, and tried to follow the existing code style. The code passes all tests on my laptop; for reference, here is an overview of the benchmark results:

    add books: ~1.4 ms
    search_jaro_winkler: ~212 us
    search_levenshtein: ~131 us
    

    SIMD Levenshtein distance should be even faster with longer strings.

    opened by Daniel-Liu-c0deb0t 4
  • [feature req] tokenization of other languages

    simsearch-rs totally fails on Korean text. How involved would it be to have it tokenize hangeul?

    searching 던지(요) should match id 866 (name: -던지(요)), but gives a top result of id 1527 (name: 에게로 【너에게로 다가간다】).

    raw json

    use std::{error::Error, fs};
    
    use serde::{Deserialize, Serialize};
    use simsearch::SimSearch;
    
    #[derive(Debug, Deserialize, Serialize)]
    struct Entry {
        id: i64,
        name: String,
        typical_example: String,
        translation: String,
        category_eng: String,
        category_kor: String,
    }
    
    fn main() -> Result<(), Box<dyn Error + 'static>> {
        let mut engine: SimSearch<i64> = SimSearch::new();
    
        let data = fs::read_to_string("./output.json")?;
        let entries: Vec<Entry> = serde_json::from_str(&data)?;
    
        for entry in &entries {
            engine.insert(entry.id, &entry.name);
        }
    
        let results = engine.search("던지(요)");
        let top_match = results[1];
    
        let entry = entries.iter().filter(|e| e.id == top_match).collect::<Vec<_>>();
    
        Ok(())
    }
    
    opened by andrewzah 2
  • Use criterion for benches

    Hi Andy, and thanks for your crate!

    I have wanted to implement fuzzy search in Rust for some time, and thanks to your post on Reddit I discovered that the work is already done ;)

    I looked at your benches and noticed that you're not using criterion, which can draw fancy plots for you and compare against a previous run (very helpful when you're trying to improve performance), so here's a PR that migrates the benches. Apart from the migration itself, I also made the add 100 bench insert items into an empty engine on each benchmark iteration, which I believe makes the bench fairer.
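
    A minimal sketch of what such a bench can look like with criterion's iter_batched, so that a fresh engine is built for every iteration (the bench name and data here are illustrative, not the exact code from the PR):

    use criterion::{criterion_group, criterion_main, BatchSize, Criterion};
    use simsearch::SimSearch;

    fn bench_add(c: &mut Criterion) {
        // Illustrative input data; not the crate's actual bench fixture.
        let titles: Vec<String> = (0..100).map(|i| format!("book title {}", i)).collect();
        c.bench_function("add 100", |b| {
            b.iter_batched(
                // Setup (not measured): a fresh, empty engine per iteration.
                SimSearch::<usize>::new,
                // Routine (measured): insert all items into the empty engine.
                |mut engine| {
                    for (id, title) in titles.iter().enumerate() {
                        engine.insert(id, title);
                    }
                    engine
                },
                BatchSize::SmallInput,
            )
        });
    }

    criterion_group!(benches, bench_add);
    criterion_main!(benches);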

    opened by mexus 2
  • fix #11 too slow insert

    Hi! Please check this PR, which fixes the slow insert. The solution is based on two hashmaps (sketched below):

    • Id to id_num
    • id_num to Id

    SlotMap or slab might be a better fit here, but hashmaps work fine too.
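
    A minimal sketch of the two-map idea (field and type names are illustrative, not the exact code from the PR):

    use std::collections::HashMap;
    use std::hash::Hash;

    // Sketch of the two-map bookkeeping. Id needs Eq + Hash for the forward map.
    struct IdMaps<Id: Eq + Hash + Clone> {
        id_to_num: HashMap<Id, usize>,
        num_to_id: HashMap<usize, Id>,
        next_num: usize,
    }

    impl<Id: Eq + Hash + Clone> IdMaps<Id> {
        fn insert_id(&mut self, id: Id) -> usize {
            // O(1) lookup instead of scanning every previously inserted key.
            if let Some(&num) = self.id_to_num.get(&id) {
                return num;
            }
            let num = self.next_num;
            self.next_num += 1;
            self.id_to_num.insert(id.clone(), num); // the extra clone mentioned below
            self.num_to_id.insert(num, id);
            num
        }
    }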

    Notes:

    • this is a breaking change, because Id now has Eq + Hash trait bounds
    • there is one extra clone of Id on each insert

    Try it with this example: https://github.com/estin/simsearch-parse-cities

    $ cargo run --release
    
    opened by estin 1
  • too slow insert on large set of data

    Hi!

    I tried to parse cities500.text (196,648 cities) from the free data at http://download.geonames.org/export/dump/ and build an engine from it, and found it to be a very slow process...

    https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L136

    Is it really necessary to delete the key on each insert?

    And delete itself is also slow: https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L262

    Why not use a HashMap to map id to id_num?

    When inserting many items into the engine, every insert searches for the id_num across the whole list of previously inserted keys.

    I can make a PR to fix this, but I'm not sure I understand this behavior correctly.

    Sorry for my poor English.

    opened by estin 1
  • Support option to use max score of tokens instead of sum

    I found that changing the scoring to use the max score across tokens improved the search output for me. In particular, it returns exact matches on a particular token first, which is what I want.

    I changed the following line

    https://github.com/andylokandy/simsearch-rs/blob/cade2652f78957b8f587ef0d731489000dae7f12/src/lib.rs#L239

    To

        result_scores
            .entry(*id_num)
            .and_modify(|v| *v = v.max(score))
            .or_insert(score);
    

    Could an option for this be incorporated into SearchOptions? Maybe a field where you can specify:

    enum TokenScoreMethod {
        Sum,
        Max
    }
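
    A sketch of how the score accumulation could branch on such an option (the TokenScoreMethod plumbing is hypothetical; the Max arm mirrors the change above and the Sum arm mirrors the current summing behaviour):

    use std::collections::HashMap;

    // Hypothetical: TokenScoreMethod is not part of SearchOptions today.
    enum TokenScoreMethod {
        Sum,
        Max,
    }

    fn accumulate(scores: &mut HashMap<usize, f64>, id_num: usize, score: f64, method: &TokenScoreMethod) {
        match method {
            // Current behaviour: token scores are summed per id.
            TokenScoreMethod::Sum => {
                scores.entry(id_num).and_modify(|v| *v += score).or_insert(score);
            }
            // Proposed alternative: keep only the best token score, as in the change above.
            TokenScoreMethod::Max => {
                scores.entry(id_num).and_modify(|v| *v = v.max(score)).or_insert(score);
            }
        }
    }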
    
    opened by rossmacarthur 1
  • Use latest version of triple_accel: v0.3.4

    Running cargo update on a dependent library caused compile errors with the older triple_accel.

    All tests pass (with cargo test --release); it seems triple_accel didn't break compatibility.

    opened by Mr-Andersen 1
  • Performance improved 2x by reworking the ids data structure

    Rather than storing a vec of tuples, we push the IDs into one vector.

    In order to facilitate deletes, the vector stores options.

    The drawback is that a lot of deletes might create a sparse vector that still occupies memory.

    The advantage, though, is 2x faster searches on large-ish datasets (a few tens of thousands of items), because we eliminate a big iteration.

    Definitely seems worth it.

    In addition, when you don't do a lot of deletes, it's more memory efficient.
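
    A minimal sketch of one way this could look (illustrative, not the exact code from the PR): id_num is just an index into a single Vec, and a delete blanks the slot instead of shifting everything:

    // Sketch: id_num is simply the index into a single vector of optional IDs.
    struct Ids<Id> {
        slots: Vec<Option<Id>>,
    }

    impl<Id: PartialEq> Ids<Id> {
        fn insert(&mut self, id: Id) -> usize {
            self.slots.push(Some(id));
            self.slots.len() - 1
        }

        fn delete(&mut self, id: &Id) {
            // Blanks the slot instead of shifting; many deletes leave a sparse vector.
            for slot in self.slots.iter_mut() {
                if slot.as_ref() == Some(id) {
                    *slot = None;
                }
            }
        }

        fn get(&self, id_num: usize) -> Option<&Id> {
            // O(1) lookup by id_num, which is what removes the big iteration from search.
            self.slots.get(id_num).and_then(|s| s.as_ref())
        }
    }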

    opened by Ivshti 1
  • Return scores with search results

    First, thanks for the great library! I'm using it in a hobby password manager project of mine. It's the only Rust library I've found that actually works nicely for searching password items.

    The current search functions, search and search_tokens, only return the result item ids. This is fine for most situations, as the results are already sorted by the score.

    However, an ability to get the scores along with the ids would allow sorting items with equal scores in a custom way. For instance, in my password manager, I would want to sort items with equal scores alphabetically.


    As a side note, the biggest issue I have right now is that search result order is not stable. Searching with the same term on the same data set multiple times returns items with equal scores in different orders:

    use simsearch::SimSearch;
    
    fn main() {
        // Generate items
        let items: Vec<_> = (1..=50).map(|n| format!("Sample item {}", n)).collect();
    
        let mut ss = SimSearch::new();
        for i in &items {
            ss.insert(i, i);
        }
    
        for i in 0..3 {
            println!("Run #{i}");
            let res = ss.search("sample item");
            for id in &res[0..2] {
                println!("  {id}");
            }
            println!("---");
        }
    }
    

    Running this prints out:

    Run #0
      Sample item 28
      Sample item 21
    ---
    Run #1
      Sample item 1
      Sample item 29
    ---
    Run #2
      Sample item 37
      Sample item 38
    ---
    

    I think the reason for this instability is the use of HashMap internally. Perhaps it could be fixed by changing the hasher implementation, but ultimately I think returning the scores is a more versatile solution.
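
    For illustration, if results came back as (id, score) pairs, a caller could sort by score and break ties alphabetically like this (a sketch; simsearch does not return scores today, and the values below are made up):

    fn main() {
        // Pretend these pairs came back from a score-returning search.
        let mut results: Vec<(String, f64)> = vec![
            ("Sample item 28".to_string(), 0.9),
            ("Sample item 1".to_string(), 0.9),
            ("Sample item 3".to_string(), 0.7),
        ];

        // Sort by score descending, then break ties alphabetically by id.
        results.sort_by(|a, b| {
            b.1.partial_cmp(&a.1)
                .unwrap_or(std::cmp::Ordering::Equal)
                .then_with(|| a.0.cmp(&b.0))
        });

        for (id, score) in &results {
            println!("{id}: {score}");
        }
    }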

    opened by luryus 2
Owner
Andy Lok
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 679 Jan 1, 2023
⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

lnx 0 Apr 25, 2022
stringsext - search for multi-byte encoded strings in binary data

title stringsext - search for multi-byte encoded strings in binary data stringsext is a Unicode enhancement of the GNU strings tool with additional fu

Jens Getreu 89 Dec 14, 2022
Finding all pairs of similar documents time- and memory-efficiently

Finding all pairs of similar documents This software provides time- and memory-efficient all pairs similarity searches in documents. Problem definitio

null 53 Jan 2, 2023
Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

Yuxiang Liu 5 Mar 9, 2022
🔍TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites.

tinysearch TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites. TinySearch is written in Rust, and then com

null 2.2k Dec 31, 2022
🔎 A simple in-memory search for collections and key-value stores.

Indicium Search 🔎 A simple in-memory search for collections (Vec, HashMap, BTreeMap, etc) and key-value stores. Features autocompletion. There are ma

Dylan Bowker 41 Oct 28, 2022
Searching for plain-text files for lines that match a given string. Built with Rust.

Getting Started This is a minimal grep command-line utility built on Rust. It provides searching for plain-text files for lines that match a given str

Harsh Karande 0 Dec 31, 2021
🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Valerian Saliou 17.4k Jan 2, 2023
A lightweight full-text search library that provides full control over the scoring calculations

probly-search · A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations. This start initia

Quantleaf 20 Nov 26, 2022
Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

MeiliSearch Website | Roadmap | Blog | LinkedIn | Twitter | Documentation | FAQ ⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine M

MeiliSearch 31.6k Dec 31, 2022
Perlin: An Efficient and Ergonomic Document Search-Engine

Table of Contents 1. Perlin Perlin Perlin is a free and open-source document search engine library build on top of perlin-core. Since the first releas

CurrySoftware GmbH 70 Dec 9, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

tantivy 7.4k Dec 28, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

Quickwit OSS 7.5k Jan 9, 2023
Configurable quick search engine shortcuts for your terminal and browser.

Quicksearch Configurable quick search engine shortcuts for your terminal and browser. Installation Run cargo install quicksearch to install Configurat

Rahul Pai 2 Oct 14, 2022
High-performance log search engine.

NOTE: This project is under development, please do not depend on it yet as things may break. MinSQL MinSQL is a log search engine designed with simpli

High Performance, Kubernetes Native Object Storage 359 Nov 27, 2022
AI-powered search engine for Rust

txtai: AI-powered search engine for Rust txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform simi

NeuML 69 Jan 2, 2023
A full-text search engine in rust

Toshi A Full-Text Search Engine in Rust Please note that this is far from production ready, also Toshi is still under active development, I'm just slo

Toshi Search 3.8k Jan 7, 2023
Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs

localbang Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs What are "bangs"?? Bangs are a way to define where to search inside

Jakob Kruse 7 Nov 23, 2022