Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

tantivy

Last update: Dec 28, 2022

Overview

Tantivy is a full text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

Benchmark

The following benchmark breaks down performance for different types of queries and collections.

Your mileage WILL vary depending on the nature of queries and their load.

Features

Full-text search
Configurable tokenizer (stemming available for 17 Latin languages), with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
Fast (check out the 🐎 ✨ benchmark ✨ 🐎 )
Tiny startup time (<10ms), perfect for command line tools
BM25 scoring (the same as Lucene)
Natural query language (e.g. (michael AND jackson) OR "king of pop")
Phrase query search (e.g. "michael jackson")
Incremental indexing
Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
Mmap directory
SIMD integer compression when the platform/CPU includes the SSE2 instruction set
Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
&[u8] fast fields
Text, i64, u64, f64, dates, and hierarchical facet fields
LZ4 compressed document store
Range queries
Faceted search
Configurable indexing (optional term frequency and position indexing)
Cheesy logo with a horse
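
To make the "BM25 scoring" item above more concrete, here is a minimal sketch of the classic Okapi BM25 formula for one term in one document. It is not taken from Tantivy's source: the function names are hypothetical, and the constants K1 = 1.2 and B = 0.75 are the commonly quoted defaults, assumed here rather than read from the library.

```rust
/// Inverse document frequency in the smoothed form popularised by Lucene.
/// `doc_count` is the number of documents, `doc_freq` the number that contain the term.
fn idf(doc_count: u64, doc_freq: u64) -> f32 {
    let n = doc_count as f32;
    let df = doc_freq as f32;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Classic BM25 contribution of a single term to a single document.
/// K1 and B are assumed defaults for this sketch, not values read from Tantivy.
fn bm25_term_score(term_freq: u32, field_len: u32, avg_field_len: f32, idf: f32) -> f32 {
    const K1: f32 = 1.2;
    const B: f32 = 0.75;
    let tf = term_freq as f32;
    // Length normalisation: longer-than-average fields are penalised.
    let norm = K1 * (1.0 - B + B * field_len as f32 / avg_field_len);
    idf * tf * (K1 + 1.0) / (tf + norm)
}
```

A document score for a multi-term query would be the sum of `bm25_term_score` over the query terms. Rare terms get a larger `idf`, repeated terms score higher with diminishing returns, and shorter fields score higher than longer ones for the same term frequency.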

Non-features

Distributed search is out of the scope of Tantivy. That being said, Tantivy is a library upon which one could build a distributed search. Serializable and mergeable collector state, for instance, is within the scope of Tantivy.
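
As an illustration of what "mergeable collector state" could mean, the sketch below keeps the best k hits per shard as plain data and merges two such states into one. The `TopK` type and its methods are hypothetical and are not Tantivy's collector API; scores are integers only to keep the ordering simple.

```rust
/// Hypothetical per-shard result: the best `k` (score, doc id) pairs, best first.
#[derive(Debug, Clone, PartialEq)]
struct TopK {
    k: usize,
    hits: Vec<(u32, u64)>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, hits: Vec::new() }
    }

    /// Records one hit and keeps only the best `k`.
    fn collect(&mut self, score: u32, doc: u64) {
        self.hits.push((score, doc));
        self.normalize();
    }

    /// Combines the state of two shards into a single top-k state.
    fn merge(mut self, other: TopK) -> TopK {
        self.hits.extend(other.hits);
        self.normalize();
        self
    }

    /// Sorts by descending score (ties: ascending doc id) and truncates to `k`.
    fn normalize(&mut self) {
        self.hits.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
        self.hits.truncate(self.k);
    }
}
```

Because the state is plain data, it could be serialized with any format and shipped between nodes before merging; that part is left out of the sketch.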

Getting started

Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.

Tantivy's simple search example
tantivy-cli and its tutorial - tantivy-cli is an actual command line interface that makes it easy for you to create a search engine, index documents, and search via the CLI or a small server with a REST API. It walks you through getting a wikipedia search engine up and running in a few minutes.
Reference doc for the last released version

How can I support this project?

There are many ways to support this project.

Use Tantivy and tell us about your experience on Gitter or by email ([email protected])
Report bugs
Write a blog post
Help with documentation by asking questions or submitting PRs
Contribute code (you can join our Gitter)
Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Clone and build locally

Tantivy compiles on stable Rust but requires Rust >= 1.27. To check out and build the project, run:

    git clone https://github.com/tantivy-search/tantivy.git
    cd tantivy
    cargo build

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the programme with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:

    find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:

    $gdb run --test-threads 1 --test $NAME_OF_TEST

An example

By default, rustc compiles everything in the examples/ directory in debug mode. This makes it easy for you to make examples to reproduce bugs:

    rust-gdb target/debug/examples/$EXAMPLE_NAME
    $ gdb run

Comments

Case of corrupted segment

Probably related to #897

Describe the bug: In continuation of https://gitter.im/tantivy-search/tantivy?at=5ff589c503529b296bd23728. First, I noticed memory bloat in my Tantivy application. After debugging, I found that the merging thread was failing every time it merged a broken store in a particular segment. I have sent you an example of the whole segment personally on Gitter.

Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/a4f33d3823f1bad3ff7a59877f1608615acabe6e

What happened: I used poor man's debugging and launched Tantivy with a patched function (only a print was added):

    fn write_storable_fields(&self, store_writer: &mut StoreWriter) -> crate::Result<()> {
        for reader in &self.readers {
            let store_reader = reader.get_store_reader()?;
            if reader.num_deleted_docs() > 0 {
                for doc_id in reader.doc_ids_alive() {
                    let doc = store_reader.get(doc_id);
                    if let Err(ref err) = doc {
                        println!("Error: {:?}\nSegment ID: {:?}\nDocID: {}", err, reader.segment_id(), doc_id);
                    }
                    store_writer.store(&doc?)?;
                }
            } else {
                store_writer.stack(&store_reader)?;
            }
        }
        Ok(())
    }

Stdout after failed merging:

Error: IOError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })
Segment ID: Seg("e6ece22e")
DocID: 53
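
For readers unfamiliar with the error above, the following sketch shows how a variable-length integer decoder can run out of input when a buffer is truncated. It uses the LEB128-style convention (continuation bit set on every byte except the last), which is an assumption for illustration and not necessarily Tantivy's on-disk layout; `write_vint` and `read_vint` are hypothetical names.

```rust
/// Encodes `value` with 7 payload bits per byte; the high bit marks "more bytes follow".
fn write_vint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Decodes one integer and returns it with the number of bytes consumed.
/// Fails when the buffer ends (or 10 bytes pass) before a terminating byte is seen.
fn read_vint(buf: &[u8]) -> Result<(u64, usize), String> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    Err("truncated or malformed VInt".to_string())
}
```

A store block that was cut short, or whose offsets point into the wrong place, would surface in a decoder like this as exactly this kind of "end of buffer" error.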

opened by ppodolsky 33

Make it possible to write the index from independent thread easily.
(This is a follow up from #549)

Ideally what I need is the ability to write to the index from independent threads, each having its own writer.

This problem is very common and hurts some Very Important Projects relying on tantivy (toshi, plume). (Invoking the names @fdb-hiroshima and @hntd187 for the discussion.)

Web servers are typically multithreaded, and requests may spawn the need to add or delete documents. Dealing with an Arc<RwLock<IndexWriter>> might feel dirty, and Rust beginners may not really understand the logic behind it.

On the other hand, the IndexWriter already relies on a channel to dispatch indexing operations to its own small thread pool. Stamping is also done using atomics. There is actually no real reason to prevent .add_document() and .delete_term() from happening on different threads.

The problem is

what should happen with .commit()

what should happen with .rollback()

Especially, would this ability to index from several threads confuse people about

what a commit is in tantivy? (.commit() ensures that all operations that happened before the .commit() are processed.)

how multithreaded indexing works? (Indexing from several threads will not help. Tantivy has its own multithreading system.)

Also,

should .commit() and .rollback() block other operations? (It is technically possible to have .commit() only block other .commit() operations.)

Should .commit() and .rollback() return futures?
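
The locking question discussed above can be sketched with a toy writer in which adding a document only needs shared access while committing needs exclusive access. `ToyWriter` and `index_concurrently` are hypothetical stand-ins built on the standard library; they do not reproduce Tantivy's IndexWriter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Toy stand-in for an index writer (not Tantivy's IndexWriter).
struct ToyWriter {
    stamper: AtomicU64,          // hands out operation stamps
    pending: Mutex<Vec<String>>, // operations not yet committed
    committed: Vec<String>,
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter {
            stamper: AtomicU64::new(0),
            pending: Mutex::new(Vec::new()),
            committed: Vec::new(),
        }
    }

    /// Takes `&self`, so a read lock is enough and many threads can add at once.
    fn add_document(&self, doc: &str) -> u64 {
        let stamp = self.stamper.fetch_add(1, Ordering::SeqCst);
        self.pending.lock().unwrap().push(doc.to_string());
        stamp
    }

    /// Takes `&mut self`, so it needs the write lock and excludes concurrent adds.
    fn commit(&mut self) -> usize {
        let pending = self.pending.get_mut().unwrap();
        self.committed.append(pending);
        self.committed.len()
    }
}

/// Adds documents from several threads, then commits once; returns the committed count.
fn index_concurrently(num_threads: usize, docs_per_thread: usize) -> usize {
    let writer = Arc::new(RwLock::new(ToyWriter::new()));
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                for i in 0..docs_per_thread {
                    let doc = format!("doc {}-{}", t, i);
                    writer.read().unwrap().add_document(&doc);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = writer.write().unwrap().commit();
    total
}
```

In this arrangement a commit blocks until all in-flight adds release their read locks, which is one possible answer to the "should .commit() block other operations?" question, not the only one.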
opened by fulmicoton 25
Block-Max WAND implementation proof of concept
This is a proof-of-concept implementation of BMW algorithm for union queries. There are still some parts missing in the algorithm itself, but also I haven't touched anything related to indexing yet.

Two things in particular I want to point out for now:

BMW requires sorting of posting lists; I wrapped them in Box thinking that it should be faster to move than lists themselves; not yet sure this is the best way, I haven't spent too much time thinking about it so far;

BMW needs an additional piece of feedback from the collector: the current threshold. Right now, I am passing a predicate, but it definitely has its disadvantages. One alternative is to have someone, say, the collector, set that between steps; again, I am not entirely sure what the right way to do this is.

Any comments at this stage would be appreciated.
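
The "threshold feedback" idea can be sketched as a top-k collector that exposes the score a new hit must beat, plus a scoring loop that skips whole blocks whose maximum score cannot beat it. This is a simplified illustration with hypothetical names and integer scores; it is not the Block-Max WAND algorithm itself, which additionally pivots across several posting lists.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Keeps the `k` best (score, doc) pairs and exposes the score to beat.
struct TopKCollector {
    k: usize,
    heap: BinaryHeap<Reverse<(u32, u32)>>, // min-heap on (score, doc)
}

impl TopKCollector {
    fn new(k: usize) -> Self {
        TopKCollector { k, heap: BinaryHeap::new() }
    }

    /// Lowest score in the current top k, or None while fewer than k hits are held.
    fn threshold(&self) -> Option<u32> {
        if self.heap.len() < self.k {
            None
        } else {
            self.heap.peek().map(|Reverse((score, _))| *score)
        }
    }

    /// Keeps the hit only if it strictly beats the current threshold.
    fn collect(&mut self, score: u32, doc: u32) {
        if let Some(t) = self.threshold() {
            if score <= t {
                return;
            }
        }
        self.heap.push(Reverse((score, doc)));
        if self.heap.len() > self.k {
            self.heap.pop();
        }
    }
}

/// Each block is (block_max_score, postings as (doc, score)).
/// Blocks that cannot beat the threshold are skipped; returns how many were skipped.
fn collect_with_pruning(
    blocks: &[(u32, Vec<(u32, u32)>)],
    collector: &mut TopKCollector,
) -> usize {
    let mut skipped = 0;
    for (block_max, postings) in blocks {
        if let Some(t) = collector.threshold() {
            if *block_max <= t {
                skipped += 1;
                continue;
            }
        }
        for &(doc, score) in postings {
            collector.collect(score, doc);
        }
    }
    skipped
}
```

Here the scorer pulls the threshold from the collector before each block, which corresponds to the "have the collector set it between steps" alternative mentioned above rather than the predicate approach.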
opened by elshize 22
garbage_collect_files works only on second try

Describe the bug: the code snippet I use is here

garbage_collect_files does not work the first time I use it, only the second time. "First time" means the first call of garbage_collect_files in one thread; "second" means a second call of garbage_collect_files in another thread.

Which version of tantivy are you using? 0.9.1

To Reproduce

The repo contains all the code I used; look at the drop_index() function in src/bin/server.rs

I use POST request to trigger garbage_collect_files like this: POST http://localhost:8080/drop

When I post the first time, all documents are deleted, but the ./index directory does not become empty; it still holds 36Mb of data. The original size of the index directory was 129Mb. When I post a second time, the directory size drops to 8,0K. Nothing else happens except the second call to garbage_collect_files.

I don't know, maybe this is expected behavior, but it is weird.
bug

opened by kkonevets 21

memory leak problem

Hi, I'm a Rust rookie. My program suffers from a memory leak: it can get killed by the kernel after running for a long time. It gets data from Redis and adds it to tantivy with Rust:

    fn main() {
        let mut index_writer = index.writer_with_num_threads(num_threads, buffer).unwrap();
        index_writer.set_merge_policy(mergePolicy);
        loop {
            let mut datas = RedisService::get_data();
            let len = datas.len();
            if len == 0 {
                //sleep
            }
            for data in datas {
                let mut doc = Document::new();
                let json_value = json::parse(&data).unwrap();
                //doc.add_text,add_u64;
                index_writer.add_document(doc);
            }
            index_writer.commit();
        }
    }

question

opened by vsop-479 20

[RFC] Python bindings

This patch-set implements python bindings for Tantivy.

The bindings are written using PyO3 and thus require rust nightly.

Some parts of the Tantivy API are omitted for now (like the SegmentReader) while some parts are made a little bit more pythonic using optional arguments instead of builder patterns.

The search method of the Searcher struct is incomplete and only allows a single TopCollector as its collector argument. Since the method is generic and python doesn't support generics, it has proven to be quite tricky to allow multiple differing collectors to be passed to the function in a dynamic way.

The first couple of patches add some functionality or expose some structs publicly in Tantivy that are required for the python bindings.

The last patch finally adds the python bindings as a sub-crate to Tantivy. To build the bindings, a setup.py file that requires setuptools-rust is provided; alternatively, pyo3-pack can be used. Both methods allow the bindings to be uploaded to pypi.

opened by poljar 20
Make it possible to stream the terms matching an Automaton
Closes #273

Ok, so I've started down the path, but I'm for sure lost.

Not sure how I would test with Automaton - still reading up on the fst code there

Not sure how to use the fst StreamBuilder, but it feels close on the code implementation, but I have compiler issues, which could just be about spreading the generic type love around more.
opened by drusellers 19
Replace `chrono` with `time`
Fixes #1304.

BREAKING API CHANGE: The type DateTime is no longer (re-)exported. Instead you need to use time::OffsetDateTime explicitly.

Remarks:

Please check the error handling and mappings. time returns results instead of panicking like chrono does. For the FastValue impl I had to use `.expect("valid UNIX timestamp")`.

I had to adjust some debug strings in tests that are used for matching test results. Copy&paste from the failed test outputs. Not very handy, please verify.

TODO:

[x] Add changelog entry
opened by uklotzde 17
Tantivy fails with `Failed purge deletes` error
I am getting a failure starting my application

Failed purge deletes: Error(PathDoesNotExist("/SOME_PATH/c475e13ef3ca45128b1f8d9ee42fe994.term")

This is on OS X. My computer crashed recently. What's a good way to move forward here? Happy to help writing some code if that helps :)
bug high priority
opened by winding-lines 16
- Knock-knock? - Who's there? - Broken segment!

Describe the bug: The same load profile as in #969: deletions, additions, and merges. Now it happens on querying after several hours of serving. I think the reason is basically the same. At startup and for several hours afterwards, all queries were OK, but after generations of merges searcher.doc started to throw a VInt decoding error.

Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/bf6e6e8a7cc1826212ba2500b08ecb53dfcdeda1

To Reproduce Sent broken segment to you in gitter.

opened by ppodolsky 15
The Snippet::fragments member is misleading and needs a rename
Is your feature request related to a problem? Please describe. I am creating an app where I would like to break different fragments onto their own lines.

What my app currently outputs:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html address space is context-switched by **Zircon**.Contrary to other OSes however, the IOMMU (Input-Output MMU), plays an important role on **Zircon**: it is

Notice how **Zircon**.Contrary is concatenated.

What I would like:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html address space is context-switched by **Zircon**. Contrary to other OSes however, the IOMMU(Input-Output MMU), plays an important role on **Zircon**: it is

Currently a Snippet stores fragments (a string that has all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make this output more readable/understandable. Currently there is no way to know where one fragment ends, and another begins.

Describe the solution you'd like Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection.

[Optional] describe alternatives you've considered Add a new member to Snippet similar to highlighted called fragment_sections, that is a Vec of start and stop points for the different fragments
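
The proposed shape could look roughly like the sketch below, where fragments are stored as a `Vec<String>` and each highlight carries a `fragment_number`. The field names follow the request above, but the types and the `to_lines` helper are hypothetical and not Tantivy's actual Snippet API. The sketch assumes highlights are sorted, non-overlapping, and expressed as byte offsets within their fragment.

```rust
/// One highlighted range, as byte offsets inside a specific fragment.
struct HighlightSection {
    fragment_number: usize,
    start: usize,
    stop: usize,
}

/// Hypothetical snippet with one String per fragment.
struct Snippet {
    fragments: Vec<String>,
    highlighted: Vec<HighlightSection>,
}

impl Snippet {
    /// Renders each fragment on its own line, wrapping highlights in `**`.
    fn to_lines(&self) -> Vec<String> {
        self.fragments
            .iter()
            .enumerate()
            .map(|(n, fragment)| {
                let mut line = String::new();
                let mut cursor = 0;
                // Only the highlights that belong to this fragment are applied.
                for section in self.highlighted.iter().filter(|s| s.fragment_number == n) {
                    line.push_str(&fragment[cursor..section.start]);
                    line.push_str("**");
                    line.push_str(&fragment[section.start..section.stop]);
                    line.push_str("**");
                    cursor = section.stop;
                }
                line.push_str(&fragment[cursor..]);
                line
            })
            .collect()
    }
}
```

With fragment boundaries kept explicit, the caller decides how to join fragments (newline, ellipsis, a single space), which avoids the concatenation problem shown in the output above.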
enhancement
opened by liamwarfield 15
refactor multivalue fastfield, refactor range query

Introduce MakeZero trait, remove make_zero from FastValue
Merge two multivalue fastfield implementations into one
Prepare range query on fastfield for different types

opened by PSeitz 1
How to impl a custom store?

I'm new to tantivy. As I understand it, StoreWriter currently stores the docs in a .store file. There seems to be no store trait defined, so how can I implement a custom store, e.g. one that stores the docs in a key/value store?

opened by gembin 0
Update prettytable-rs requirement from 0.9.0 to 0.10.0
Updates the requirements on prettytable-rs to permit the latest version.

Release notes

Sourced from prettytable-rs's releases.

v0.10.0

Fixed

Fix panic due to incorrect ANSI escape handling #137

Fix display of empty tables #127

Changed

Remove the unsafe code in Table::as_ref #146

Switch atty to is-terminal #151

Minimal Supported Rust Version bumped to 1.56

Thanks

@alexanderkjall and @5225225 fuzzer work and fixing panics

@david0u0 fixing #145 Undefined behavior (UB) on Table::as_ref

@nschoellhorn fixing empty table display

Changelog

Sourced from prettytable-rs's changelog.

0.10.0 (2022-12-27)

Fixed

Fix panic due to incorrect ANSI escape handling (#137)

Fix display of empty tables (#127)

Changed

Remove the unsafe code in Table::as_ref (#146)

Switch atty to is-terminal (#151)

Minimal Supported Rust Version bumped to 1.56

Thanks

@alexanderkjall and @5225225 fuzzer work and fixing panics

@david0u0 fixing (#145) Undefined behavior (UB) on Table::as_ref

Commits

4d66e6e Release 0.10.0 (#152)

e206f71 Switch atty to is-terminal (#151)

8abc3a3 Fix fmt (#149)

051bfb0 Fix display of empty tables (#127)

3d17077 Fix panic due to incorrect ANSI escape handling, add fuzzer (#137)

25cfe72 Remove the unsafe code in Table::as_ref (#146)

See full diff in compare view

dependencies rust
opened by dependabot[bot] 1

Releases(0.18.1)

0.18.1(Oct 20, 2022)
Hotfix on positions. #1629 (@fmassot, @fulmicoton, @PSeitz)

0.18(May 26, 2022)
For date values chrono has been replaced with time (@uklotzde) #1304

Add histogram aggregation (@PSeitz)

Add support for fastfield on text fields (@PSeitz)

Add terms aggregation (@PSeitz)

Add support for zstd compression (@kryesh)

0.17(Mar 9, 2022)
LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) #115

Adds a searcher Warmer API (@shikhar @fulmicoton)

Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211

Facets are necessarily indexed. Existing index with indexed facets should work out of the box. Index without facets that are marked with index: false should be broken (but they were already broken in a sense). (@fulmicoton) #1195 .

Bugfix that could in theory impact durability on some filesystems #1224

Schema now offers not indexing fieldnorms (@lpouget) #922

Reduce the number of fsync calls #1225

Fix opening bytes index with dynamic codec (@PSeitz) #1278

Added an aggregation collector compatible with Elasticsearch (@PSeitz)

Added a JSON schema type @fulmicoton #1251

Added support for slop in phrase queries @halvorboe #1068
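
As a rough illustration of the LogMergePolicy entry in this list, the sketch below selects segments whose share of deleted documents has reached a threshold. The `SegmentStats` struct, the function name, and the use of a plain ratio are assumptions for this example; they are not taken from Tantivy's merge policy code.

```rust
/// Minimal description of a segment for this sketch.
struct SegmentStats {
    max_doc: u32,
    num_deleted_docs: u32,
}

/// Returns the indices of segments whose ratio of deleted documents
/// is at or above `threshold` (for example, 0.3 means 30% deleted).
fn segments_to_merge_for_deletes(segments: &[SegmentStats], threshold: f32) -> Vec<usize> {
    segments
        .iter()
        .enumerate()
        .filter(|(_, s)| {
            // Empty segments have no meaningful ratio and are ignored here.
            s.max_doc > 0 && s.num_deleted_docs as f32 / s.max_doc as f32 >= threshold
        })
        .map(|(index, _)| index)
        .collect()
}
```

Merging such segments rewrites them without the deleted documents, which is how the space held by deletes is eventually reclaimed.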

0.16.1(Sep 10, 2021)

Major Bugfix on multivalued fastfield. #1151
0.15.3(Jun 30, 2021)
Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101

0.15.2(Jun 16, 2021)
Major bugfix. DocStore still panics when a deleted doc is at the beginning of a block. (@appaquet) #1088

0.15.1(Jun 14, 2021)
Major bugfix. DocStore panics when first block is deleted. (@appaquet) #1077

0.15(Jun 7, 2021)
API Changes. Using Range instead of (start, end) in the API and internals (FileSlice, OwnedBytes, Snippets, ...). This change is breaking but migration is trivial.

Added a Histogram collector. (@fulmicoton) #994

Added support for Option. (@fulmicoton)

DocAddress is now a struct (@scampi) #987

Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357

Date field support for range queries (@rihardsk) #516

Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009

Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@fulmicoton)

Simplified positions index format (@fulmicoton) #1022

Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030

Added support for more-like-this query in tantivy (@evanxg852000) #1011

Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios by utilizing the sorted data (top-n optimizations). (@PSeitz) #1026

Add iterator over documents in doc store (@PSeitz). #1044

Fix log merge policy (@PSeitz). #1043

Add detection to avoid small doc store blocks on merge (@PSeitz). #1054

Make doc store compression dynamic (@PSeitz). #1060

Switch to json for footer version handling (@PSeitz). #1060

Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469

Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070
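
The bitpacking entry in this list refers to storing each value of a block with only as many bits as the largest value needs. A minimal sketch of that idea follows; the function names are hypothetical, the block size is left to the caller, and the layout (little-endian bit order inside `u64` words) is an assumption of this example, not the bitpacker subcrate's format.

```rust
/// Number of bits needed to represent `max_value` (0 needs 0 bits).
fn num_bits(max_value: u64) -> u8 {
    (64 - max_value.leading_zeros()) as u8
}

/// Packs `values` using `bits` bits each. Every value must fit in `bits` bits.
fn pack(values: &[u64], bits: u8) -> Vec<u64> {
    let bits = bits as usize;
    if bits == 0 {
        return Vec::new();
    }
    let mut words = vec![0u64; (values.len() * bits + 63) / 64];
    for (i, &value) in values.iter().enumerate() {
        let bit_pos = i * bits;
        let (word, offset) = (bit_pos / 64, bit_pos % 64);
        words[word] |= value << offset;
        // The value straddles two words: store its upper bits in the next word.
        if offset + bits > 64 {
            words[word + 1] |= value >> (64 - offset);
        }
    }
    words
}

/// Reads back `len` values of `bits` bits each.
fn unpack(words: &[u64], bits: u8, len: usize) -> Vec<u64> {
    let bits = bits as usize;
    if bits == 0 {
        return vec![0; len];
    }
    let mask = if bits == 64 { u64::MAX } else { (1u64 << bits) - 1 };
    (0..len)
        .map(|i| {
            let bit_pos = i * bits;
            let (word, offset) = (bit_pos / 64, bit_pos % 64);
            let mut value = words[word] >> offset;
            if offset + bits > 64 {
                value |= words[word + 1] << (64 - offset);
            }
            value & mask
        })
        .collect()
}
```

For a block of 128 values that all fit in 10 bits, this stores 20 words instead of 128, which is the kind of saving that per-block bit widths are meant to provide.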

0.14(Feb 5, 2021)
Removed dependency on atomicwrites #833. (Implemented by @fulmicoton upon suggestion and research from @asafigan.)

Migrated tantivy error from the now deprecated failure crate to thiserror #760. (@hirevo)

API Change. Accessing the typed value off a Schema::Value now returns an Option instead of panicking if the type does not match.

Large API change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a FileSlice that can be reduced and eventually read into an OwnedBytes object. Long and blocking IO operations are still required, but they do not span the entire file.

Added support for Brotli compression in the DocStore. (@ppodolsky)

Added helper for building intersections and unions in BooleanQuery (@guilload)

Bugfix in Query::explain

Removed dependency on notify #924. Replaced with FileWatcher struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)

Added FilterCollector, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)

Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)

FilterCollector now supports all Fast Field value types (@barrotsteindev)

FastField are not all loaded when opening the segment reader. (@fulmicoton)

This version breaks compatibility and requires users to reindex everything.
0.13.3(Jan 13, 2021)

Minor Bugfix. Avoid relying on serde's reexport of PhantomData. (#975)
0.13.2(Oct 1, 2020)

HotFix. Acquiring a facet reader on a segment that does not contain any doc with this facet returns None. (#896)
0.13.1(Sep 19, 2020)

Made Query and Collector Send + Sync. Updated misc dependency versions.
0.13(Aug 19, 2020)
Tantivy 0.13 introduces a change in the index format that will require you to reindex your index (BlockWAND information is added in the skiplist). The index size increase is minor as this information is only added for full blocks. If you have a massive index for which reindexing is not an option, please contact me so that we can discuss possible solutions.

Bugfix in FuzzyTermQuery not matching terms by prefix when it should (@Peachball)

Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.

MMapDirectory::open does not return a Result anymore.

Change in the DocSet and Scorer API. (@fulmicoton). A freshly created DocSet points directly to its first doc. A sentinel value called TERMINATED marks the end of a DocSet. .advance() returns the new DocId. Scorer::skip(target) has been replaced by Scorer::seek(target) and returns the resulting DocId. As a result, iterating through a DocSet now looks as follows:

    let mut doc = docset.doc();
    while doc != TERMINATED {
        // ...
        doc = docset.advance();
    }

The change made it possible to greatly simplify a lot of the docset's code.
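
To show the iteration contract described above end to end, here is a self-contained toy version. The `DocSet` trait below is a simplified stand-in written for this example (including a default `seek`), `VecDocSet` is hypothetical, and the value chosen for `TERMINATED` is an assumption of the sketch rather than Tantivy's constant.

```rust
type DocId = u32;

/// Sentinel marking the end of a DocSet (value chosen for this sketch).
const TERMINATED: DocId = u32::MAX;

/// Simplified stand-in for the DocSet contract described above.
trait DocSet {
    /// Current doc; a fresh DocSet already points at its first doc.
    fn doc(&self) -> DocId;

    /// Moves to the next doc and returns it, or TERMINATED at the end.
    fn advance(&mut self) -> DocId;

    /// Moves to the first doc >= `target` and returns it.
    fn seek(&mut self, target: DocId) -> DocId {
        let mut doc = self.doc();
        while doc < target {
            doc = self.advance();
        }
        doc
    }
}

/// DocSet backed by a sorted vector of doc ids.
struct VecDocSet {
    docs: Vec<DocId>,
    cursor: usize,
}

impl VecDocSet {
    fn new(docs: Vec<DocId>) -> Self {
        VecDocSet { docs, cursor: 0 }
    }
}

impl DocSet for VecDocSet {
    fn doc(&self) -> DocId {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    fn advance(&mut self) -> DocId {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }
}

/// Drains the remaining docs using the loop shape from the release note.
fn collect_all(docset: &mut dyn DocSet) -> Vec<DocId> {
    let mut out = Vec::new();
    let mut doc = docset.doc();
    while doc != TERMINATED {
        out.push(doc);
        doc = docset.advance();
    }
    out
}
```

Because `doc()` is valid immediately after construction, the loop needs no special first step, and an empty DocSet simply starts at TERMINATED.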

Misc internal optimization and introduction of the Scorer::for_each_pruning function. (@fulmicoton)

Added an offset option to the Top(.*)Collectors. (@robyoung)

Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks to the PISA team for answering all my questions!)

0.12(Feb 19, 2020)
Removing static dispatch in tokenizers for simplicity. (#762)

Added backward iteration for TermDictionary stream. (@halvorboe)

Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)

Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)

Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)

Added support for field boosting. (#547, @fulmicoton)

0.11.3(Dec 20, 2019)
Fixed DateTime as a fast field (#735)

0.11.1(Dec 17, 2019)
Bug fix #729

0.11(Dec 15, 2019)
Added f64 field. Internally reuse u64 code the same way i64 does (@fdb-hiroshima)

Various bugfixes in the query parser.

Better handling of hyphens in query parser. (#609)

Better handling of whitespaces.

Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)

API change around Box<BoxableTokenizer>. See detail in #629

Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)

Add footer with some metadata to index files. #605 (@fdb-hiroshima)

Add a method to check the compatibility of the footer in the index with the running version of tantivy (@petr-tik)

TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)

Added handling of pre-tokenized text fields (#642), which will enable users to load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)

Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

How to update?

The index format is changed. You are required to reindex your data to use tantivy 0.11.

Box<dyn BoxableTokenizer> has been replaced by a BoxedTokenizer struct.

Regex are now compiled when the RegexQuery instance is built. As a result, it can now return an error and handling the Result is required.

tantivy::version() now returns a Version object. This object implements ToString()

0.10.3(Nov 10, 2019)
Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

0.10.2(Oct 1, 2019)

Hotfix for #656
0.10.1(Jul 30, 2019)
Closes #544. A few users experienced problems with the directory watching system. Avoid watching the mmap directory until someone effectively creates a reader that uses this functionality.

0.10.0(Jul 11, 2019)
Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.

Added an API to easily tweak or entirely replace the default score. See TopDocs::tweak_score and TopDocs::custom_score (@pmasurel)

Added an ASCII folding filter (@drusellers)

Bugfix in query.count in presence of deletes (@pmasurel)

Added .explain(...) in Query and Weight (@pmasurel)

Added an efficient way to delete_all_documents in IndexWriter (@petr-tik). All segments are simply removed.

Minor

Switched to Rust 2018 (@uvd)

Small simplification of the code. Calling .freq() or .doc() when .advance() has never been called on segment postings should panic from now on.

Tokens exceeding u16::max_value() - 4 chars are discarded silently instead of panicking.

Fast fields are now preloaded when the SegmentReader is created.

IndexMeta is now public. (@hntd187)

IndexWriter add_document, delete_term. IndexWriter is Sync, making it possible to use it with an Arc<RwLock<IndexWriter>>. add_document and delete_term now only require a read lock. (@pmasurel)

Introducing Opstamp as an expressive type alias for u64. (@petr-tik)

Stamper now relies on AtomicU64 on all platforms (@petr-tik)

Bugfix - Files get deleted slightly earlier

Compilation resources improved (@fdb-hiroshima)

How to update?

Your program should be usable as is.

Fast fields

Fast fields used to be accessed directly from the SegmentReader. The API changed: you are now required to acquire your fast field reader via segment_reader.fast_fields(), and use one of the typed methods:

.u64(), .i64() if your field is single-valued;

.u64s(), .i64s() if your field is multi-valued;

.bytes() if your field is a bytes fast field.

0.9.1(Mar 28, 2019)

Hotfix. All languages were using the English stemmer.
0.9(Mar 20, 2019)
0.9.0 index format is not compatible with the previous index format.

Bugfix

Some Mmap objects were being leaked, and would never get released. (@fulmicoton)

New Features

Added IndexReader. By default, index is reloaded automatically upon new commits (@fulmicoton)

Stemming in other languages is now possible (@pentlander)

Added grouped add and delete operations. They are guaranteed to happen together (i.e. they cannot be split by a commit). In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)

Added DateTime field (@barrotsteindev)

Misc improvements

Indexer memory footprint improved (VInt comp, inlining the first block). (@fulmicoton)

Removed most unsafe (@fulmicoton)

Segments with no docs are deleted earlier (@barrotsteindev)

Removed INT_STORED and INT_INDEXED. It is now possible to use STORED and INDEXED for int fields. (@fulmicoton)

0.8.2b(Feb 14, 2019)

0.8.2 fixes build for non x86_64 platforms. See #496 for details.
0.8.1(Jan 23, 2019)

Hotfix of #476.

Merge was reflecting deletes before commit was passed. Thanks @barrotsteindev for reporting the bug.
0.8.0(Dec 26, 2018)
API Breaking change in the collector API. (@jwolfe, @fulmicoton)

Multithreaded search (@jwolfe, @fulmicoton)

0.7.2(Dec 18, 2018)

Bugfix #457 Removing faulty debug_assert!.
0.7.1(Nov 2, 2018)
Bugfix: NGramTokenizer panics on non ascii chars

Added a space usage API

0.7.0(Sep 16, 2018)
Skip data for doc ids and positions (@fulmicoton), greatly improving performance

Tantivy errors now rely on the failure crate (@drusellers)

Added support for AND, OR, NOT syntax in addition to the +,- syntax

Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)

Added a TopFieldCollector (@pentlander)

0.6.1(Jul 10, 2018)
Bugfix #324. GC was removing files that were still in use.

Added support for parsing AllQuery and RangeQuery via QueryParser

AllQuery: *

RangeQuery:

Inclusive field:[startIncl to endIncl]

Exclusive field:{startExcl to endExcl}

Mixed field:[startIncl to endExcl} and vice versa

Unbounded field:[start to *], field:[* to end]
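
The bracket forms listed above map naturally onto inclusive, exclusive, and unbounded bounds. The sketch below parses only the bracketed part (without the `field:` prefix) into `std::ops::Bound` values, using the lowercase `to` separator exactly as written in this release note. `parse_range` is a hypothetical helper, not Tantivy's QueryParser.

```rust
use std::ops::Bound;

/// Parses "[a to b]", "{a to b}", mixed brackets, and "*" for an open end.
/// Returns None when the input does not have that shape.
fn parse_range(input: &str) -> Option<(Bound<String>, Bound<String>)> {
    let input = input.trim();
    let lower_inclusive = match input.chars().next()? {
        '[' => true,
        '{' => false,
        _ => return None,
    };
    let upper_inclusive = match input.chars().last()? {
        ']' => true,
        '}' => false,
        _ => return None,
    };
    // Both delimiters are single-byte ASCII, so these slice indices are valid.
    let inner = &input[1..input.len() - 1];
    let (start, end) = inner.split_once(" to ")?;
    let bound = |term: &str, inclusive: bool| {
        let term = term.trim();
        if term == "*" {
            Bound::Unbounded
        } else if inclusive {
            Bound::Included(term.to_string())
        } else {
            Bound::Excluded(term.to_string())
        }
    };
    Some((bound(start, lower_inclusive), bound(end, upper_inclusive)))
}
```

The resulting pair can be used directly with range-based lookups such as `BTreeMap::range`, which accept any `(Bound, Bound)` pair.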


Tantivy is a full text search engine library written in Rust.

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

7.4k Dec 30, 2022

A lightweight full-text search library that provides full control over the scoring calculations

probly-search · A full-text search library, optimized for insertion speed, that provides full control over the scoring calculations. This start initia

20 Nov 26, 2022

⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable deployment of the tantivy search engine you never knew you wanted. Standing on the shoulders of giants.

✨ Feature Rich | ⚡ Insanely Fast An ultra-fast, adaptable deployment of the tantivy search engine via REST. Standing On The Shoulders of Giants lnx

679 Jan 1, 2023

A full-text search engine in rust

🔍TinySearch is a lightweight, fast, full-text search engine. It is designed for static websites.

A full-text search and indexing server written in Rust.

Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

ik-analyzer for rust; chinese tokenizer for tantivy

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

Perlin: An Efficient and Ergonomic Document Search-Engine

Configurable quick search engine shortcuts for your terminal and browser.

AI-powered search engine for Rust

A Rust API search engine

Python bindings for Milli, the embeddable Rust-based search engine powering Meilisearch

High-performance log search engine.