A lightweight full-text search library that provides full control over the scoring calculations

Overview

probly-search

A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations.

This project started as a port of the Node.js library NDX.

Demo

Recipe (title) search with 50k documents.

https://quantleaf.github.io/probly-search-demo/

Features

  • Three ways to do scoring

    • BM25 ranking function to rank matching documents. This is the same ranking function used by default in Lucene >= 6.0.0.
    • zero-to-one, a scoring function unique to this library that produces a normalized score bounded by 0 and 1. Well suited for matching titles/labels against queries.
    • Fully customizable scoring by implementing the ScoreCalculator trait.
  • Trie-based dynamic inverted index.

  • Full-text indexing and searching over multiple fields.

  • Per-field score boosting.

  • Configurable tokenizer and term filter.

  • Free text queries with query expansion.

  • Fast insertion; deletion is lazy (removed documents linger until the index is vacuumed).

Documentation

Documentation is under development. For now, read the source tests.

Example

Create an index for documents with two fields, add and query the documents, then remove one.

use std::collections::HashSet;
use probly_search::{
    index::{
        add_document_to_index, create_index, remove_document_from_index, vacuum_index, Index,
    },
    query::{
        query,
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};

// A custom document type to index
#[derive(Clone)]
struct Doc {
    id: usize,
    title: String,
    description: String,
}

// Field extractors, a whitespace tokenizer and a pass-through filter.
// (Check the crate's FieldAccessor, Tokenizer and Filter type aliases for
// the exact signatures; the ones below are an illustrative sketch.)
fn title_extract(d: &Doc) -> Option<&str> {
    Some(d.title.as_str())
}

fn description_extract(d: &Doc) -> Option<&str> {
    Some(d.description.as_str())
}

fn tokenizer(s: &str) -> Vec<&str> {
    s.split(' ').collect()
}

fn filter(s: &str) -> String {
    s.to_owned()
}

// Create an index with 2 fields
let mut index = create_index::<usize>(2);

// Create docs from a custom Doc struct
let doc_1 = Doc {
    id: 0,
    title: "abc".to_string(),
    description: "dfg".to_string(),
};

let doc_2 = Doc {
    id: 1,
    title: "dfgh".to_string(),
    description: "abcd".to_string(),
};

// Add documents to index
add_document_to_index(
    &mut index,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_1.id,
    doc_1.clone(),
);

add_document_to_index(
    &mut index,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_2.id,
    doc_2,
);

// Search, expected 2 results
let mut result = query(
    &mut index,
    &"abc",
    &mut bm25::new(),
    tokenizer,
    filter,
    &[1., 1.],
    None,
);
assert_eq!(result.len(), 2);
assert_eq!(
    result[0],
    QueryResult {
        key: 0,
        score: 0.6931471805599453
    }
);
assert_eq!(
    result[1],
    QueryResult {
        key: 1,
        score: 0.28104699650060755
    }
);

// Remove documents from index
let mut removed_docs = HashSet::new();
remove_document_from_index(&mut index, &mut removed_docs, doc_1.id);

// Vacuum to remove completely
vacuum_index(&mut index, &mut removed_docs);

// Search, expect 1 result
result = query(
    &mut index,
    &"abc",
    &mut bm25::new(),
    tokenizer,
    filter,
    &[1., 1.],
    Some(&removed_docs),
);
assert_eq!(result.len(), 1);
assert_eq!(
    result[0],
    QueryResult {
        key: 1,
        score: 0.1166450426074421
    }
);

For more query examples, see the source tests for the BM25 and zero-to-one implementations.
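
The example above imports zero_to_one but only queries with BM25. Below is a minimal sketch of a zero-to-one query that also boosts the title field over the description via the per-field boost slice; it assumes zero_to_one exposes a new() constructor mirroring bm25::new().

// Query with the zero-to-one scorer; boost the title field (2.0) over the
// description field (1.0). Sketch only: zero_to_one::new() is assumed to
// mirror bm25::new(), and the scores will differ from the BM25 values above.
let zto_result = query(
    &mut index,
    &"abc",
    &mut zero_to_one::new(),
    tokenizer,
    filter,
    &[2., 1.],
    Some(&removed_docs),
);

// zero-to-one scores are bounded by 0 and 1.
assert!(zto_result.iter().all(|r| r.score >= 0. && r.score <= 1.));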

License

MIT

Comments
  • Pass document by reference.

    I think it makes more sense to pass a reference to the document when adding to the index as it is only a reference that is passed to the extraction function for each field.

    Also I have a use case where I need to store the original documents and this prevents an unnecessary clone when adding to the index.

    Thanks 🙏

    opened by tmpfs 8
  • Question about removed_docs argument to query

    Thanks for the library! I spent quite a bit of time researching the available libraries for my project, and this one strikes the perfect balance for my needs; in particular, the support for WebAssembly is critical.

    It seems that if I vacuum the index then there is no need to pass removed_docs as the last argument to query?

    Am I right in thinking that the removed_docs argument exists to support the case where documents have been removed but not yet vacuumed from the index, and the query should ignore them?

    opened by tmpfs 2
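
    A sketch of the two call patterns, reusing the names from the README example above: the removed set is passed while deletions are still pending, and presumably becomes unnecessary once the index has been vacuumed, which is what the question asks.

    // Documents removed but not yet vacuumed: hide them from results.
    let pending = query(
        &mut index,
        &"abc",
        &mut bm25::new(),
        tokenizer,
        filter,
        &[1., 1.],
        Some(&removed_docs),
    );

    // After vacuum_index(&mut index, &mut removed_docs), the removed
    // documents should be gone from the index itself, so presumably the
    // query can be run without a removed set.
    let cleaned = query(
        &mut index,
        &"abc",
        &mut bm25::new(),
        tokenizer,
        filter,
        &[1., 1.],
        None,
    );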
  • Use methods rather than functions

    The API feels a bit strange calling add_document_to_index(), remove_document_from_index() and vacuum_index() as they all accept &mut index as the first parameter.

    Whilst we are making breaking changes I suggest using methods on Index:

    impl Index {
      fn add_document(&mut self, ..) {}
      fn remove_document(&mut self, ..) {}
      fn vacuum(&mut self, ..) {}
    }
    

    I think this is more idiomatic. What do you think?

    opened by tmpfs 1
  • Improve performance changes (draft)

    I was playing around with the library and made a few changes that I could possibly split into multiple PRs. I would like to hear your thoughts on them.

    Changes for improved performance

    • add_document uses Cow for hash keys.
    • hashbrown HashMap instead of std::collections::HashMap. Faster for small hash keys, though it does not provide the same level of HashDoS resistance.
    • The TermData query_terms property is replaced with query_terms_len to prevent unnecessary copies of all query terms during a query (not benchmarked yet). (Breaking change for the ScoreCalculator trait.)

    Bench results on my computer (add_100k_docs):

    Master: 301.19 ms
    This branch: 221.24 ms

    API change ideas

    • Filter is removed, since the Tokenizer can be seen as a preprocessing step for both indexing and querying and can do anything that the Filter does. One less argument to worry about. (Breaking change)
    opened by marcus-pousette 0
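
    A standalone sketch (plain std, not the crate internals) of why Cow keys help: a term that is already available as a borrowed &str can be used as a map key without allocating, and an owned String is only produced when a term has to outlive its source.

    use std::borrow::Cow;
    use std::collections::HashMap;

    let mut postings: HashMap<Cow<'static, str>, Vec<usize>> = HashMap::new();

    // Borrowed keys: no allocation, the terms point into the original text.
    let text: &'static str = "abc dfg abc";
    for term in text.split(' ') {
        postings.entry(Cow::Borrowed(term)).or_default().push(0);
    }

    // Owned key: allocate only because lowercasing produces a new String.
    let lowered = "ABCD".to_lowercase();
    postings.entry(Cow::Owned(lowered)).or_default().push(1);

    assert_eq!(postings[&Cow::Borrowed("abc")], vec![0, 0]);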
  • query method requires a mutable reference to Index

    The query API requires you to pass a mutable reference to the index. The reason for this is that vacuuming (complete deletion of removed documents), which is a mutation, might occur during the query. It would be more ideal if the query just took an immutable reference to the index, yet still, if possible, used the trie walk during the query as an opportunity to vacuum (or to find out where to vacuum later).

    Based on discussions in #15

    opened by marcus-pousette 0
  • Allow extraction from fields that may be collections.

    The current API only supports fields that extract &str directly from the document; however, there are times when we need to borrow from a collection in the document.

    My use case is that I have a HashSet<String> to represent tags associated with the document.

    With this PR my extract functions look like this:

    // Label
    fn label_extract(d: &Document) -> Vec<&str> {
        vec![d.2.label()]
    }
    
    // Tags
    fn tags_extract(d: &Document) -> Vec<&str> {
        d.2.tags().iter().map(|s| &s[..]).collect()
    }
    

    I expected the extra heap allocation for the Vec to have a minor impact on performance but running the benchmarks shows a small performance improvement 🤷

    opened by tmpfs 3
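
    A self-contained sketch of the same idea with a simple struct (the type and field names here are illustrative, not the PR's test types):

    use std::collections::HashSet;

    struct Note {
        label: String,
        tags: HashSet<String>,
    }

    // A single-value field still fits the Vec<&str> accessor shape.
    fn label_extract(n: &Note) -> Vec<&str> {
        vec![n.label.as_str()]
    }

    // A collection field: borrow every tag from the set.
    fn tags_extract(n: &Note) -> Vec<&str> {
        n.tags.iter().map(String::as_str).collect()
    }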
  • Field accessor trait?

    In the current solution you define extraction functions and pass them as an array to add_document:

    pub fn add_document<D> (
            &mut self,
            field_accessors: &[FieldAccessor<D>],
            tokenizer: Tokenizer,
            key: T,
            doc: &D,
        )
        ...
    

    A perhaps more idiomatic solution would be to create a trait, called something like Indexable:

    trait Indexable
    {
        fn key(&self) -> &str,
        fn extract_fields(&self) -> &[&str]
    } 
    

    Then the add_document method would look like:

    pub fn add_document<D: Indexable> (
            &mut self,
            doc: &D,
    )
    
    opened by marcus-pousette 3
  • Indexing numerical values?

    A common use case is to combine a text search with a timestamp sort. It would be interesting to see whether this library could support a numerical index in addition to a text index without too much rework, or breaking changes.

    The problematic part of building this support is that we would possibly need to write a completely new index implementation, or make a smart abstraction over some generic index. In addition, the query API would become more complicated, since we would need to distinguish the types of the fields we query and provide a different API for numbers ("lt", "lte", "gt", "gte" operators).

    A silly solution to this problem would be to convert numbers to byte arrays and treat each byte as a string token. This would let us do simple "lt"/"lte" queries in the same way as prefix searches (e.g. counting three leading zero bytes in [0, 0, 0, 123] quickly tells us this number is less than or equal to 255).

    A better solution would be to implement this thoroughly by building a dedicated index for numerical values.

    opened by marcus-pousette 1
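
    A sketch of the byte-array idea above, in plain Rust and independent of the crate: big-endian bytes preserve numeric ordering, so leading zero bytes bound the magnitude.

    let n: u32 = 123;
    let bytes = n.to_be_bytes(); // [0, 0, 0, 123]

    // Three leading zero bytes means the value fits in a single byte,
    // i.e. n <= 255, without even looking at the last byte.
    let leading_zero_bytes = bytes.iter().take_while(|&&b| b == 0).count();
    assert_eq!(leading_zero_bytes, 3);
    assert!(n <= 255);

    // More generally, big-endian byte strings compare the same way the
    // numbers do, which is what makes prefix-style "lt"/"lte" lookups work.
    assert!(42u32.to_be_bytes() < 255u32.to_be_bytes());
    assert!(255u32.to_be_bytes() < 256u32.to_be_bytes());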
  • Fuzzy matching?

    I have a use case where I expect to have a small number of documents (several hundred to several thousand) and I would want the search to do fuzzy matching on sub-strings within terms so the search would be as greedy as possible.

    This would affect performance so should be optional for this sort of "autocomplete" use case. Is this something that interests you?

    Happy to do the legwork under your guidance to land this 👍

    opened by tmpfs 16
  • zero-to-one scoring: Repeating terms same score

    https://quantleaf.github.io/probly-search-demo/

    Current behaviour:

    garlic garlic -> 0.5 score
    garlic garlic garlic -> 0.5 score

    Expected behaviour:

    garlic garlic -> 0.5 score
    garlic garlic garlic -> some score < 0.5

    opened by marcus-pousette 0
Releases

2.0.0-alpha-1

Owner

Quantleaf