A lightweight full-text search library that provides full control over the scoring calculations

Quantleaf

Last update: Nov 26, 2022

Related tags

Overview

probly-search ·

A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations.

This start initially as a port of the Node library NDX.

Demo

Recipe (title) search with 50k documents.

https://quantleaf.github.io/probly-search-demo/

Features

Three ways to do scoring
- BM25 ranking function to rank matching documents. The same ranking function that is used by default in Lucene >= 6.0.0.
- zero-to-one, a library unique scoring function that provides a normalized score that is bounded by 0 and 1. Perfect for matching titles/labels with queries.
- Ability to fully customize your own scoring function by implenting the ScoreCalculator trait.
Trie based dynamic Inverted Index.
Multiple fields full-text indexing and searching.
Per-field score boosting.
Configurable tokenizer and term filter.
Free text queries with query expansion.
Fast allocation, but latent deletion.

Documentation

Documentation is under development. For now read the source tests.

Example

Creating an index with a document that has 2 fields. Query documents, and remove a document.

use std::collections::HashSet;
use probly_search::{
    index::{add_document_to_index, create_index, remove_document_from_index, Index},
    query::{
        query,
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};


// Create index with 2 fields
let mut index = create_index::<usize>(2);

// Create docs from a custom Doc struct
let doc_1 = Doc {
    id: 0,
    title: "abc".to_string(),
    description: "dfg".to_string(),
};

let doc_2 = Doc {
    id: 1,
    title: "dfgh".to_string(),
    description: "abcd".to_string(),
};

// Add documents to index
add_document_to_index(
    &mut index,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_1.id,
    doc_1.clone(),
);

add_document_to_index(
    &mut index,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_2.id,
    doc_2,
);

// Search, expected 2 results
let mut result = query(
    &mut index,
    &"abc",
    &mut bm25::new(),
    tokenizer,
    filter,
    &[1., 1.],
    None,
);
assert_eq!(result.len(), 2);
assert_eq!(
    result[0],
    QueryResult {
        key: 0,
        score: 0.6931471805599453
    }
);
assert_eq!(
    result[1],
    QueryResult {
        key: 1,
        score: 0.28104699650060755
    }
);

// Remove documents from index
let mut removed_docs = HashSet::new();
remove_document_from_index(&mut index, &mut removed_docs, doc_1.id);

// Vacuum to remove completely
vacuum_index(&mut index, &mut removed_docs);

// Search, expect 1 result
result = query(
    &mut index,
    &"abc",
    &mut bm25::new(),
    tokenizer,
    filter,
    &[1., 1.],
    Some(&removed_docs),
);
assert_eq!(result.len(), 1);
assert_eq!(
    result[0],
    QueryResult {
        key: 1,
        score: 0.1166450426074421
    }
);

Go through source tests in for the BM25 implementation and zero-to-one implementation for more query examples.

License

MIT

Comments

Pass document by reference.

I think it makes more sense to pass a reference to the document when adding to the index as it is only a reference that is passed to the extraction function for each field.

Also I have a use case where I need to store the original documents and this prevents an unnecessary clone when adding to the index.

Thanks 🙏

opened by tmpfs 8
Question about removed_docs argument to query

Thanks for the library, I spent quite a bit of time researching the available libraries for my project and this one strikes the perfect balance for my needs, in particular the support for webassembly is critical!

It seems that if I vacuum the index then there is no need to pass removed_docs as the last argument to query?

Am i right in thinking that the removed_docs argument exists to support the case when documents have been removed but not yet vacuumed from the index and the query should ignore them?

opened by tmpfs 2
Use methods rather than functions
The API feels a bit strange calling add_document_to_index(), remove_document_from_index() and vacuum_index() as they all accept &mut index as the first parameter.

Whilst we are making breaking changes I suggest using methods on Index:

impl Index { fn add_document(&mut self, ..) {} fn remove_document(&mut self, ..) {} fn vacuum(&mut self, ..) {} }

I think this is more idiomatic. What do you think?
opened by tmpfs 1
Improve performance changes (draft)
I was playing our with the library so I created a few changes I could possibly split into multiple PRs. Would like to hear you thoughts about this

Changes for improved performance

add_document use Cow for hash keys.

hashbrown Hashmap instead of std::collections. Faster for small hash keys. Though does not provide the same level of HashDoS resistance

TermData query_terms property is replaced with query_terms_len to prevent unnecessary copies of all query terms during query (not benchmarked yet) (Breaking change for the ScoreCalculator trait)

Bench results on my computer add_100k_docs Master 301.19 ms

This branch 221.24 ms

API change ideas

Filter is removed, since the Tokenizer can just be seen as a preprocessing step to indexation and query and can do anything that the Filter does. One argument less to worry about. (Breaking change)
opened by marcus-pousette 0
query method requires a mutable reference to Index

The query API requires you to pass a mutable reference to the index. The reason for this is that vacuuming might occur (complete deletion of removed documents) which is a mutation. It would be more ideal if the the query would just take a reference to the index instead, yet still, if possible, use the trie index walk during query as an opportunity to vacuum (or as an opportunity in finding out where to vacuum later).

Based on discussions in #15

opened by marcus-pousette 0
Allow extraction from fields that may be collections.
The current API only supports fields that extract &str directly from the document however there are times when we need to borrow from a collection in the document.

My use case is that I have a HashSet<String> to represent tags associated with the document.

With this PR my extract functions look like this:

// Label fn label_extract(d: &Document) -> Vec<&str> { vec![d.2.label()] } // Tags fn tags_extract(d: &Document) -> Vec<&str> { d.2.tags().iter().map(|s| &s[..]).collect() }

I expected the extra heap allocation for the Vec to have a minor impact on performance but running the benchmarks shows a small performance improvement 🤷
opened by tmpfs 3

Field accessor trait?

In the current solution you define extraction functions and pass the as an array in add_document:

pub fn add_document<D> (
        &mut self,
        field_accessors: &[FieldAccessor<D>],
        tokenizer: Tokenizer,
        key: T,
        doc: &D,
    )
    ...

A maybe more idiomatic solution would be to create a Trait, called something like Indexable

trait Indexable
{
    fn key(&self) -> &str,
    fn extract_fields(&self) -> &[&str]
}

Then the add_document method will look like

pub fn add_document<D: Indexable> (
        &mut self,
        doc: &D,
)

opened by marcus-pousette 3

Indexing numerical values?

A common use case is to combine a text search with a timestamp sort. It would be interesting to see whether this library could support a numerical index in addition to a text index without too much rework, or breaking changes.

Problematic part of building this support is that we need possibly write a completely new index implementation, or make smart abstraction of some generic index. In addition the query API will become more complicated since we need to distinguish the types of the fields we query, and need to provide a different API for numbers ("lt", "lte", "ge", "geq" operators).

A silly solution to this problem would be to convert numbers to byte arrays, and see each byte as a string token i.e. This would let us do simple "lt", "lte" queries in the same way as doing prefix searches (i.e. counting three leading 0 zeros in [0,0,0,123] would let us quickly know this number is below or equal to 255).

A better solution would be to implement this thoroughly where we would build a dedicated index for numerical values.

opened by marcus-pousette 1
Fuzzy matching?

I have a use case where I expect to have a small number of documents (several hundred to several thousand) and I would want the search to do fuzzy matching on sub-strings within terms so the search would be as greedy as possible.

This would affect performance so should be optional for this sort of "autocomplete" use case. Is this something that interests you?

Happy to do the legwork under your guidance to land this 👍

opened by tmpfs 16
zero-to-one scoring: Repeating terms same score

https://quantleaf.github.io/probly-search-demo/

Current behaviour garlic garlic -> 0.5 score garlic garlic garlic -> 0.5 score

Expected behaviour

garlic garlic -> 0.5 score garlic garlic garlic -> Some score< 0.5

opened by marcus-pousette 0

Releases(2.0.0-alpha-1)

2.0.0-alpha-1(Aug 12, 2022)
Breaking change: add_document_to_index now accepts documents as references rather than values. The Tokenizer and Filter type now returns slices rather than owned strings

Source code(tar.gz)
Source code(zip)
1.2.4(Aug 7, 2021)
Improved zero-to-one scoring

Optimization in indexing

Source code(tar.gz)
Source code(zip)
1.2.3(Aug 2, 2021)
Bugfix: Zero-To-One scoring, prevent scoring surpassing 1

Source code(tar.gz)
Source code(zip)
1.1.3(Aug 2, 2021)
Performance improvement by representing index with arena instread of smart pointers

Source code(tar.gz)
Source code(zip)
1.1.2(Jun 25, 2021)
Performance improved by using Mutex instead of RwLock for tree nodes

Benchmarks

Source code(tar.gz)
Source code(zip)
1.1.1(Jun 11, 2021)
Bug fix, empty tokens from tokenizer causing panic

Source code(tar.gz)
Source code(zip)
1.1.0(Jun 10, 2021)
Thread safe with Arc/RwLock index

Source code(tar.gz)
Source code(zip)
1.0.0(Jun 9, 2021)

New scoring function "zero-to-one" A more generalized query function
Source code(tar.gz)
Source code(zip)

Owner

Quantleaf

GitHub

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

7.4k Dec 28, 2022

Tantivy is a full text search engine library written in Rust.

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

7.4k Dec 30, 2022

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

7.5k Jan 9, 2023

A full-text search and indexing server written in Rust.

Bayard Bayard is a full-text search and indexing server written in Rust built on top of Tantivy that implements Raft Consensus Algorithm and gRPC. Ach

1.8k Dec 26, 2022

A full-text search engine in rust

Toshi A Full-Text Search Engine in Rust Please note that this is far from production ready, also Toshi is still under active development, I'm just slo

3.8k Jan 7, 2023

Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

5 Mar 9, 2022

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

simsearch A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here). Documentation Usage Add the f

116 Dec 10, 2022

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

?? Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

17.4k Jan 2, 2023

Test Electronic Control Units around the world in a transparent network.

Eclipse openDuT Test Electronic Control Units around the world in a transparent network. Eclipse openDuT provides an open framework to automate the te

6 Dec 22, 2023

Image search example by approximate nearest-neighbor library In Rust

rust-ann-search-example Image search example by approximate nearest-neighbor library In Rust use - tensorflow 0.17.0 - pretrain ResNet50 - hora (Ru

8 Jan 3, 2022

Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

31.6k Dec 31, 2022

High-performance log search engine.

NOTE: This project is under development, please do not depend on it yet as things may break. MinSQL MinSQL is a log search engine designed with simpli

High Performance, Kubernetes Native Object Storage

359 Nov 27, 2022

Perlin: An Efficient and Ergonomic Document Search-Engine

Table of Contents 1. Perlin Perlin Perlin is a free and open-source document search engine library build on top of perlin-core. Since the first releas

70 Dec 9, 2022

AI-powered search engine for Rust

txtai: AI-powered search engine for Rust txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform simi

69 Jan 2, 2023

🔎 Impossibly fast web search, made for static sites.

Stork Impossibly fast web search, made for static sites. Stork is two things. First, it's an indexer: it indexes your loosely-structured content and c

2.5k Dec 27, 2022

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. ?? Standing On The Shoulders of Giants lnx

679 Jan 1, 2023

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. ?? Standing On The Shoulders of Giants lnx

0 Apr 25, 2022

Rapidly Search and Hunt through Windows Event Logs

Rapidly Search and Hunt through Windows Event Logs Chainsaw provides a powerful ‘first-response’ capability to quickly identify threats within Windows

1.8k Dec 31, 2022

Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs

localbang Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs What are "bangs"?? Bangs are a way to define where to search inside

7 Nov 23, 2022

A lightweight full-text search library that provides full control over the scoring calculations

Related tags

Overview

probly-search ·

Demo

Features

Documentation

Example

License

Comments

Changes for improved performance

API change ideas

Releases(2.0.0-alpha-1)

2.0.0-alpha-1(Aug 12, 2022)

1.2.4(Aug 7, 2021)

1.2.3(Aug 2, 2021)

1.1.3(Aug 2, 2021)

1.1.2(Jun 25, 2021)

1.1.1(Jun 11, 2021)

1.1.0(Jun 10, 2021)

1.0.0(Jun 9, 2021)

Owner

Quantleaf

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust.

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

A full-text search and indexing server written in Rust.

A full-text search engine in rust

Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Test Electronic Control Units around the world in a transparent network.

Image search example by approximate nearest-neighbor library In Rust

Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

High-performance log search engine.

Perlin: An Efficient and Ergonomic Document Search-Engine

AI-powered search engine for Rust

🔎 Impossibly fast web search, made for static sites.

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

Rapidly Search and Hunt through Windows Event Logs

Cross-platform, cross-browser, cross-search-engine duckduckgo-like bangs