Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Quickwit OSS

Last update: Jan 9, 2023

Overview

Tantivy is a full-text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our search engine built on top of Tantivy.

Benchmark

The following benchmark breaks down performance for different types of queries/collections.

Your mileage WILL vary depending on the nature of queries and their load.

Features

Full-text search
Configurable tokenizer (stemming available for 17 Latin languages), with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
Fast (check out the 🐎 ✨ benchmark ✨ 🐎 )
Tiny startup time (<10ms), perfect for command-line tools
BM25 scoring (the same as Lucene)
Natural query language (e.g. (michael AND jackson) OR "king of pop")
Phrase queries search (e.g. "michael jackson")
Incremental indexing
Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
Mmap directory
SIMD integer compression when the platform/CPU includes the SSE2 instruction set
Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
&[u8] fast fields
Text, i64, u64, f64, dates, and hierarchical facet fields
LZ4 compressed document store
Range queries
Faceted search
Configurable indexing (optional term frequency and position indexing)
JSON Field
Aggregation Collector: range buckets, average, and stats metrics
LogMergePolicy with deletes
Searcher Warmer API
Cheesy logo with a horse
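
To make the BM25 scoring feature listed above more concrete, here is a minimal sketch of the textbook BM25 contribution of a single term, using the Lucene-style IDF. `bm25_term_score` is a hypothetical helper and not part of Tantivy's API; the constants k1 = 1.2 and b = 0.75 are common defaults assumed for this sketch.

```rust
/// Textbook BM25 contribution of one query term to one document's score.
/// `K1` and `B` are assumed defaults, not values read from Tantivy.
fn bm25_term_score(
    term_freq: f32,      // occurrences of the term in the document
    doc_len: f32,        // number of tokens in the document field
    avg_doc_len: f32,    // average field length across the index
    num_docs: f32,       // total number of documents
    docs_with_term: f32, // number of documents containing the term
) -> f32 {
    const K1: f32 = 1.2;
    const B: f32 = 0.75;
    // Lucene-style IDF: stays positive even for very common terms.
    let idf = (1.0 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln();
    // Term-frequency saturation with document-length normalization.
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    idf * term_freq * (K1 + 1.0) / (term_freq + norm)
}
```

Rare terms score higher than common ones, repeated occurrences help with diminishing returns, and longer documents are penalized relative to the average length.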

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.

Tantivy's simple search example
tantivy-cli and its tutorial - tantivy-cli is an actual command-line interface that makes it easy for you to create a search engine, index documents, and search via the CLI or a small server with a REST API. It walks you through getting a Wikipedia search engine up and running in a few minutes.
Reference doc for the last released version

How can I support this project?

There are many ways to support this project.

Use Tantivy and tell us about your experience on Discord or by email ([email protected])
Report bugs
Write a blog post
Help with documentation by asking questions or submitting PRs
Contribute code (you can join our Discord server)
Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Clone and build locally

Tantivy compiles on stable Rust but requires Rust >= 1.27. To check out and build the project, you can simply run:

    git clone https://github.com/quickwit-oss/tantivy.git
    cd tantivy
    cargo build

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the programme with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:

    find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:

$gdb run --test-threads 1 --test $NAME_OF_TEST

An example

By default, rustc compiles everything in the examples/ directory in debug mode. This makes it easy for you to make examples to reproduce bugs:

rust-gdb target/debug/examples/$EXAMPLE_NAME
$ gdb run

FAQ

Can I use Tantivy in other languages?

Python → tantivy-py
Ruby → tantiny

You can also find other bindings on GitHub but they may be less maintained.

What are some examples of Tantivy use?

seshat: A matrix message database/indexer
tantiny: Tiny full-text search for Ruby
lnx: adaptable, typo tolerant search engine with a REST API
and more!

On average, how much faster is Tantivy compared to Lucene?

According to our search latency benchmark, Tantivy is approximately 2x faster than Lucene.

Comments

Case of corrupted segment

Probably related to #897

Describe the bug In continuation of https://gitter.im/tantivy-search/tantivy?at=5ff589c503529b296bd23728 First, I found memory bloating in a Tantivy application. After debugging, I found that the merging thread had been failing every time while merging a broken store in a particular segment. I sent you an example of the whole segment personally on Gitter.

Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/a4f33d3823f1bad3ff7a59877f1608615acabe6e

What happened I used poor man's debugging and launched Tantivy with a patched function (only a print was added):

    fn write_storable_fields(&self, store_writer: &mut StoreWriter) -> crate::Result<()> {
        for reader in &self.readers {
            let store_reader = reader.get_store_reader()?;
            if reader.num_deleted_docs() > 0 {
                for doc_id in reader.doc_ids_alive() {
                    let doc = store_reader.get(doc_id);
                    if let Err(ref err) = doc {
                        println!("Error: {:?}\nSegment ID: {:?}\nDocID: {}", err, reader.segment_id(), doc_id);
                    }
                    store_writer.store(&doc?)?;
                }
            } else {
                store_writer.stack(&store_reader)?;
            }
        }
        Ok(())
    }

Stdout after failed merging:

Error: IOError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })
Segment ID: Seg("e6ece22e")
DocID: 53

opened by ppodolsky 33

Make it possible to write the index from independent thread easily.
(This is a follow up from #549)

Ideally what I need is the ability to write to index from independent threads each having own writer.

This problem is very common and hurts some Very Important Project relying on tantivy (toshi, plume). (Invoking the name @fdb-hiroshima and @hntd187 for the discussion).

Web servers are typically multithreaded, and requests may need to add or delete documents. Dealing with an Arc<RwLock<IndexWriter>> might feel dirty, and Rust beginners may not really understand the logic behind it.

On the other hand, the IndexWriter already relies on a channel to dispatch indexing operations to its own small thread pool. Stamping is also done using atomics. There is actually no real reason to prevent .add_document() and .delete_term() from being called from different threads.

The problem is

what should happen with .commit()

what should happen with .rollback()

Especially would this ability to index from several threads confuse people on

what a commit is in tantivy? (.commit() ensures that all operations that happened before the .commit() are processed)

how multithreaded indexing works? (Indexing from several threads will not help. Tantivy has its own multithreading system.)

Also,

should .commit() and .rollback() block other operations? (It is technically possible to have .commit() only block other .commit() operations.)

Should .commit() and .rollback() return futures?
opened by fulmicoton 25
Block-Max WAND implementation proof of concept
This is a proof-of-concept implementation of BMW algorithm for union queries. There are still some parts missing in the algorithm itself, but also I haven't touched anything related to indexing yet.

Two things in particular I want to point out for now:

BMW requires sorting of posting lists; I wrapped them in Box thinking that it should be faster to move than lists themselves; not yet sure this is the best way, I haven't spent too much time thinking about it so far;

BMW needs that additional piece of feedback from the collector: the current threshold; right now, I am passing a predicate but it definitely has its disadvantages. one alternative is to have someone, say, the collector, set that between steps; again, not entirely sure what the way to do this is.

Any comments at this stage would be appreciated.
opened by elshize 22
garbage_collect_files works only on second try

Describe the bug the snippet code I use is here

garbage_collect_files does not work the first time I use it, only the second time. First time means first call of garbage_collect_files in one thread, second means second call of garbage_collect_files in another thread

Which version of tantivy are you using? 0.9.1

To Reproduce

repo where is all the code I used, look at drop_index() function in src/bin/server.rs

I use POST request to trigger garbage_collect_files like this: POST http://localhost:8080/drop

When I post the first time, all documents are deleted but the ./index directory does not become empty; it still has 36Mb of data in it. The original size of the index directory was 129Mb. When I post a second time, the directory size drops to 8,0K. Nothing happens in between other than the second call to garbage_collect_files.

I don't know, maybe it is an expected behavior, but is weird.
bug

opened by kkonevets 21

memory leak problem

Hi, I'm a Rust rookie. My program suffers from a memory leak: it can get killed by the kernel after running for a long time. It gets data from Redis and adds it to tantivy with Rust:

fn main() {
    let mut index_writer = index.writer_with_num_threads(num_threads, buffer).unwrap();
    index_writer.set_merge_policy(mergePolicy);
    loop {
        let mut datas = RedisService::get_data();
        let len = datas.len();
        if len == 0 {
            //sleep
        }
        for data in datas {
            let mut doc = Document::new();
            let json_value = json::parse(&data).unwrap();
            //doc.add_text,add_u64;
            index_writer.add_document(doc);
        }
        index_writer.commit();
    }
}

question

opened by vsop-479 20

[RFC] Python bindings

This patch-set implements python bindings for Tantivy.

The bindings are written using PyO3 and thus require rust nightly.

Some parts of the Tantivy API are omitted for now (like the SegmentReader) while some parts are made a little bit more pythonic using optional arguments instead of builder patterns.

The search method of the Searcher struct is incomplete and only allows a single TopCollector as its collector argument. Since the method is generic and python doesn't support generics, it has proven to be quite tricky to allow multiple differing collectors to be passed to the function in a dynamic way.

The first couple of patches add some functionality or expose some structs publicly in Tantivy that are required for the python bindings.

The last patch finally adds the python bindings as a sub-crate to Tantivy. To build the bindings, a setup.py file that requires setuptools-rust is provided; alternatively, pyo3-pack can be used. Both methods allow the bindings to be uploaded to PyPI.

opened by poljar 20
Make it possible to stream the terms matching an Automaton
Closes #273

Ok, so I've started down the path, but I'm for sure lost.

Not sure how I would test with Automaton - still reading up on the fst code there

Not sure how to use the fst StreamBuilder, but it feels close on the code implementation, but I have compiler issues, which could just be about spreading the generic type love around more.
opened by drusellers 19
Replace `chrono` with `time`
Fixes #1304.

BREAKING API CHANGE: The type DateTime is no longer (re-)exported. Instead you need to use time::OffsetDateTime explicitly.

Remarks:

Please check the error handling and mappings. time returns results instead of panicking like chrono does. For the FastValue impl I had to use `.expect("valid UNIX timestamp")`.

I had to adjust some debug strings in tests that are used for matching test results. Copy&paste from the failed test outputs. Not very handy, please verify.

TODO:

[x] Add changelog entry
opened by uklotzde 17
Tantivy fails with `Failed purge deletes` error
I am getting a failure starting my application

Failed purge deletes: Error(PathDoesNotExist("/SOME_PATH/c475e13ef3ca45128b1f8d9ee42fe994.term")

This is on OS X. My computer crashed recently. What's a good way to move forward here? Happy to help writing some code if that helps :)
bug high priority
opened by winding-lines 16
- Knock-knock? - Who's there? - Broken segment!

Describe the bug The same load profile as in #969: deletions, additions, and merges. Now it happens on querying after several hours of serving. I think the reason is basically the same. At startup and for several hours afterwards, all queries were OK, but after generations of merges searcher.doc started to throw a VInt decoding error.

Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/bf6e6e8a7cc1826212ba2500b08ecb53dfcdeda1

To Reproduce Sent broken segment to you in gitter.

opened by ppodolsky 15
The Snippet::fragments member is misleading and needs a rename
Is your feature request related to a problem? Please describe. I am creating an app where I would like to break different fragments onto their own lines.

What my app currently outputs:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html address space is context-switched by **Zircon**.Contrary to other OSes however, the IOMMU (Input-Output MMU), plays an important role on **Zircon**: it is

Notice how **Zircon**.Contrary is concatenated.

What I would like:

[ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html address space is context-switched by **Zircon**. Contrary to other OSes however, the IOMMU(Input-Output MMU), plays an important role on **Zircon**: it is

Currently a Snippet stores fragments (a string that has all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make this output more readable/understandable. Currently there is no way to know where one fragment ends, and another begins.

Describe the solution you'd like Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection.

[Optional] describe alternatives you've considered Add a new member to Snippet similar to highlighted called fragment_sections, that is a Vec of start and stop points for the different fragments
enhancement
opened by liamwarfield 15
Add regex tokenizer

This adds a regex tokenizer which tokenizes the text by using a regex pattern to split. This is my first attempt and works for my usecase, but it's not ideal from a code and a configuration perspective.

Closes https://github.com/quickwit-oss/tantivy/issues/1670

opened by mkleen 3
Implement RegexTokenizer

Implementation of a RegexTokenizer for #1670.

To make this a decently flexible solution I implemented support for capture groups as well as simple find matches: The RegexTokenStream will check on construction if there are any capture groups (besides the entire match) and use either captures_read_at for capture support or find_at for simple matches. Tests and examples included.

Closes https://github.com/quickwit-oss/tantivy/issues/1670

opened by Gearme 2

`get_docids_for_value_range` is broken w/certain fast fields that use a GCD inverse

Describe the bug

When inserting fast fields in a certain order, segments will occasionally end up with a column that matches extra documents when calling get_docids_for_value_range using nonexistent values.

The testcase at the bottom of this report illustrates a failing case, though you need to run it repeatedly to cause it to fail.

Expected behaviour: passing fastfield values/ranges to get_docids_for_value_range that cannot possibly exist in a given segment reader should return no matching documents.

Which version of tantivy are you using?

0.19.0

To Reproduce

This code will reproduce the bug, but it won't happen every time. Running it repeatedly will eventually give you a SegmentReader with the min/max values [6341073085727221 3373930417471920086], and that SegmentReader will always match document zero when querying for the invalid value 7999999999999999.


    #[test]
    fn test_gcd_bug() {
        let dir = RamDirectory::create();
        let mut schema_builder = Schema::builder();
        let url_norm_hash_field = schema_builder.add_i64_field("url_norm_hash", FAST | INDEXED);  
        let schema = schema_builder.build();
        let index = Index::create(dir, schema, IndexSettings::default()).unwrap();
        let mut writer = index.writer(50_000_000).unwrap();
        for i in [6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64] {
            writer.add_document(doc! {
                url_norm_hash_field => i,
            }).unwrap();
        }
        writer.commit();
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        for segment in searcher.segment_readers() {
            let field = segment.fast_fields().i64(url_norm_hash_field).unwrap();
            let (min, max) = (field.min_value(), field.max_value());
            println!("{} {}", min, max);
            if 7999999999999999_i64 >= min && 7999999999999999_i64 <= max {
                let mut vec = vec![];
                field.get_docids_for_value_range(7999999999999999_i64..=7999999999999999_i64, 0..u32::MAX, &mut vec);
                println!("{:?}", vec);
            }
        }
    }
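
One way a lookup like this could go wrong can be sketched without Tantivy at all. The code below is a hypothetical model of GCD-based column compression; it is not Tantivy's implementation and not a confirmed diagnosis of this report. It assumes values are stored as `(value - min) / gcd` over unsigned integers, so a query value that is not of the form `min + k * gcd` has no stored representation. The names `GcdColumn`, `to_compressed_naive`, and `to_compressed_checked` are made up for illustration.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Hypothetical model of a GCD-compressed column: each value is stored as
/// `(value - min) / gcd`.
struct GcdColumn {
    min: u64,
    gcd: u64,
}

impl GcdColumn {
    fn from_values(values: &[u64]) -> GcdColumn {
        let min = *values.iter().min().expect("at least one value");
        let gcd = values.iter().fold(0, |acc, &v| gcd(acc, v - min));
        // The fold yields 0 when all values are equal; use 1 to avoid dividing by zero.
        GcdColumn { min, gcd: gcd.max(1) }
    }

    /// Truncating division: a value lying between two stored values is
    /// silently rounded down onto a stored one. Assumes `value >= min`.
    fn to_compressed_naive(&self, value: u64) -> u64 {
        (value - self.min) / self.gcd
    }

    /// Returns `None` when `value` cannot be represented in the column.
    fn to_compressed_checked(&self, value: u64) -> Option<u64> {
        let delta = value.checked_sub(self.min)?;
        if delta % self.gcd == 0 { Some(delta / self.gcd) } else { None }
    }
}
```

With only the two distinct values from the report, the GCD in this model equals their difference, so the naive mapping sends 7999999999999999 to the same compressed value as the minimum, while the checked mapping rejects it.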

bug

opened by mmastrac 3

Panic when opening old indexes
Describe the bug

Tantivy panics when reading an index of old version with no DOC_STORE_VERSION field in the footer:

thread 'MainLoop' panicked at 'actual doc store version: 2222222222, expected: 1', tantivy-0.19.0\src\store\footer.rs:30:13
opened by gyk 0
Allow single-threaded addition of documents to existing index without spawning new threads
Is your feature request related to a problem? Please describe. Continuing my work on getting tantivy to work server-side with WASM (related issues #1751 #541 ), I would like to index dynamically added documents. In essence I have a large set of documents I can pre-index in a build phase, however each user also has some documents that are loaded dynamically from a database.

I would imagine that normally I could do something like this:

    let index = Index::open(ram_directory).unwrap();
    let mut index_writer = index.writer(3_000_000).unwrap();
    index_writer.add_document(my_document).unwrap();
    index_writer.commit().unwrap();

And then search the index. I realize that this might leak documents from one tenant to another, but as this index is rebuilt in-memory on each request and dropped after, this isn't a large concern.

However, as WASM is single-threaded I'm unable to actually get this to work as it seems all the IndexWriters require a thread pool of some sort.

I have tried both index.writer() and index.writer_with_num_threads with both 1 and 0 threads. I've even delved into the undocumented SingleSegmentIndexWriter. Even though it seems to be meant only for creating an index with a single segment and not for adding to an existing index, I figured I would give it a go.

However trying to instantiate it gives me the following IoError

    IoError(
        Error {
            kind: Unsupported,
            message: "operation not supported on this platform",
        },
    )

Which I think might be threadpool-related, but I am unable to get a stacktrace to confirm.

Describe the solution you'd like I think in essence I'm asking if there's any way to accomplish what I want to do.

Is there another way to use tantivy to index dynamic documents without using an IndexWriter and that machinery?

If not, is there any single-threaded IndexWriter or similar I've been unable to find? Will you consider adding one?

Alternatively, is there any way or guidance for me to implement my own single-threaded IndexWriter? From a first glance it seems like a lot of the code that deals with writing is (understandably) internal to tantivy and not exposed.
opened by GeeWee 4

Releases(0.18.1)

0.18.1(Oct 20, 2022)
Hotfix on positions. #1629 (@fmassot, @fulmicoton, @PSeitz)

0.18(May 26, 2022)
For date values chrono has been replaced with time (@uklotzde) #1304

Add histogram aggregation (@PSeitz)

Add support for fastfield on text fields (@PSeitz)

Add terms aggregation (@PSeitz)

Add support for zstd compression (@kryesh)

0.17(Mar 9, 2022)
LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) #115

Adds a searcher Warmer API (@shikhar @fulmicoton)

Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211

Facets are necessarily indexed. Existing index with indexed facets should work out of the box. Index without facets that are marked with index: false should be broken (but they were already broken in a sense). (@fulmicoton) #1195 .

Bugfix that could in theory impact durability on some filesystems #1224

Schema now offers not indexing fieldnorms (@lpouget) #922

Reduce the number of fsync calls #1225

Fix opening bytes index with dynamic codec (@PSeitz) #1278

Added an aggregation collector compatible with Elasticsearch (@PSeitz)

Added a JSON schema type @fulmicoton #1251

Added support for slop in phrase queries @halvorboe #1068

0.16.1(Sep 10, 2021)

Major Bugfix on multivalued fastfield. #1151
0.15.3(Jun 30, 2021)
Major bugfix. Deleting documents was broken when the index was sorted by a field. (@appaquet, @fulmicoton) #1101

0.15.2(Jun 16, 2021)
Major bugfix. DocStore still panics when a deleted doc is at the beginning of a block. (@appaquet) #1088

0.15.1(Jun 14, 2021)
Major bugfix. DocStore panics when first block is deleted. (@appaquet) #1077

0.15(Jun 7, 2021)
API Changes. Using Range instead of (start, end) in the API and internals (FileSlice, OwnedBytes, Snippets, ...) This change is breaking but migration is trivial.

Added an Histogram collector. (@fulmicoton) #994

Added support for Option. (@fulmicoton)

DocAddress is now a struct (@scampi) #987

Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357

Date field support for range queries (@rihardsk) #516

Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009

Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@fulmicoton)

Simplified positions index format (@fulmicoton) #1022

Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030

Added support for more-like-this query in tantivy (@evanxg852000) #1011

Added support for sorting an index, e.g presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios, by utilizing the sorted data (Top-n optimizations)(@PSeitz). #1026

Add iterator over documents in doc store (@PSeitz). #1044

Fix log merge policy (@PSeitz). #1043

Add detection to avoid small doc store blocks on merge (@PSeitz). #1054

Make doc store compression dynamic (@PSeitz). #1060

Switch to json for footer version handling (@PSeitz). #1060

Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469

Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070

0.14(Feb 5, 2021)
Removed dependency on atomicwrites #833. (Implemented by @fulmicoton upon suggestion and research from @asafigan.)

Migrated tantivy error from the now deprecated failure crate to thiserror #760. (@hirevo)

API Change. Accessing the typed value off a Schema::Value now returns an Option instead of panicking if the type does not match.

Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a FileSlice that can be reduced and eventually read into an OwnedBytes object. Long and blocking I/O operations are still required, but they do not span the entire file.

Added support for Brotli compression in the DocStore. (@ppodolsky)

Added helper for building intersections and unions in BooleanQuery (@guilload)

Bugfix in Query::explain

Removed dependency on notify #924. Replaced with FileWatcher struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)

Added FilterCollector, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)

Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)

FilterCollector now supports all Fast Field value types (@barrotsteindev)

FastField are not all loaded when opening the segment reader. (@fulmicoton)

This version breaks compatibility and requires users to reindex everything.
0.13.3(Jan 13, 2021)

Minor Bugfix. Avoid relying on serde's reexport of PhantomData. (#975)
0.13.2(Oct 1, 2020)

HotFix. Acquiring a facet reader on a segment that does not contain any doc with this facet returns None. (#896)
0.13.1(Sep 19, 2020)

Made Query and Collector Send + Sync. Updated misc dependency versions.
0.13(Aug 19, 2020)
Tantivy 0.13 introduces a change in the index format that will require you to reindex your index (BlockWAND information is added in the skiplist). The index size increase is minor as this information is only added for full blocks. If you have a massive index for which reindexing is not an option, please contact me so that we can discuss possible solutions.

Bugfix in FuzzyTermQuery not matching terms by prefix when it should (@Peachball)

Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.

MMapDirectory::open does not return a Result anymore.

Change in the DocSet and Scorer API. (@fulmicoton). A freshly created DocSet points directly to its first doc. A sentinel value called TERMINATED marks the end of a DocSet. .advance() returns the new DocId. Scorer::skip(target) has been replaced by Scorer::seek(target) and returns the resulting DocId. As a result, iterating through a DocSet now looks as follows:

    let mut doc = docset.doc();
    while doc != TERMINATED {
        // ...
        doc = docset.advance();
    }

The change made it possible to greatly simplify a lot of the docset's code.
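
To make this iteration contract concrete, here is a minimal, self-contained sketch of a doc set backed by a sorted vector. `VecDocSet` and `collect_all` are hypothetical names written for illustration, and the value chosen for `TERMINATED` is an assumption of this sketch; only the method names `doc`, `advance`, and `seek` and the name of the sentinel come from the notes above.

```rust
type DocId = u32;

/// Sentinel marking the end of a doc set (value assumed for this sketch).
const TERMINATED: DocId = u32::MAX;

/// Hypothetical doc set over a sorted list of doc ids.
struct VecDocSet {
    docs: Vec<DocId>,
    cursor: usize,
}

impl VecDocSet {
    /// A freshly created doc set already points at its first doc.
    fn new(docs: Vec<DocId>) -> VecDocSet {
        VecDocSet { docs, cursor: 0 }
    }

    /// Current doc, or `TERMINATED` once the set is exhausted.
    fn doc(&self) -> DocId {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    /// Moves to the next doc and returns it.
    fn advance(&mut self) -> DocId {
        if self.cursor < self.docs.len() {
            self.cursor += 1;
        }
        self.doc()
    }

    /// Moves to the first doc greater than or equal to `target` and returns it.
    fn seek(&mut self, target: DocId) -> DocId {
        while self.doc() < target {
            self.advance();
        }
        self.doc()
    }
}

/// Collects all remaining docs using the loop shape from the release notes.
fn collect_all(docset: &mut VecDocSet) -> Vec<DocId> {
    let mut out = Vec::new();
    let mut doc = docset.doc();
    while doc != TERMINATED {
        out.push(doc);
        doc = docset.advance();
    }
    out
}
```

Because the sentinel compares greater than every real doc id, `seek` needs no separate end-of-set check.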

Misc internal optimization and introduction of the Scorer::for_each_pruning function. (@fulmicoton)

Added an offset option to the Top(.*)Collectors. (@robyoung)

Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks to the PISA team for answering all my questions!)

0.12(Feb 19, 2020)
Removing static dispatch in tokenizers for simplicity. (#762)

Added backward iteration for TermDictionary stream. (@halvorboe)

Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)

Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)

Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)

Added support for field boosting. (#547, @fulmicoton)

0.11.3(Dec 20, 2019)
Fixed DateTime as a fast field (#735)

0.11.1(Dec 17, 2019)
Bug fix #729

0.11(Dec 15, 2019)
Added f64 field. Internally reuse u64 code the same way i64 does (@fdb-hiroshima)

Various bugfixes in the query parser.

Better handling of hyphens in query parser. (#609)

Better handling of whitespaces.

Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)

API change around Box<BoxableTokenizer>. See detail in #629

Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)

Add footer with some metadata to index files. #605 (@fdb-hiroshima)

Add a method to check the compatibility of the footer in the index with the running version of tantivy (@petr-tik)

TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)

Added handling of pre-tokenized text fields (#642), which will enable users to load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)

Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

How to update?

The index format is changed. You are required to reindex your data to use tantivy 0.11.

Box<dyn BoxableTokenizer> has been replaced by a BoxedTokenizer struct.

Regex are now compiled when the RegexQuery instance is built. As a result, it can now return an error and handling the Result is required.

tantivy::version() now returns a Version object. This object implements ToString()

0.10.3(Nov 10, 2019)
Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

0.10.2(Oct 1, 2019)

Hotfix for #656
0.10.1(Jul 30, 2019)
Closes #544. A few users experienced problems with the directory watching system. Avoid watching the mmap directory until someone effectively creates a reader that uses this functionality.

0.10.0(Jul 11, 2019)
Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.

Added an API to easily tweak or entirely replace the default score. See TopDocs::tweak_score and TopScore::custom_score (@pmasurel)

Added an ASCII folding filter (@drusellers)

Bugfix in query.count in presence of deletes (@pmasurel)

Added .explain(...) in Query and Weight (@pmasurel)

Added an efficient way to delete_all_documents in IndexWriter (@petr-tik). All segments are simply removed.

Minor

Switched to Rust 2018 (@uvd)

Small simplification of the code. Calling .freq() or .doc() when .advance() has never been called on segment postings should panic from now on.

Tokens exceeding u16::max_value() - 4 chars are discarded silently instead of panicking.

Fast fields are now preloaded when the SegmentReader is created.

IndexMeta is now public. (@hntd187)

IndexWriter add_document, delete_term. IndexWriter is Sync, making it possible to use it with an Arc<RwLock<IndexWriter>>. add_document and delete_term now only require a read lock. (@pmasurel)

Introducing Opstamp as an expressive type alias for u64. (@petr-tik)

Stamper now relies on AtomicU64 on all platforms (@petr-tik)

Bugfix - Files get deleted slightly earlier

Compilation resources improved (@fdb-hiroshima)
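
The `Arc<RwLock<IndexWriter>>` pattern mentioned in the IndexWriter entry above can be sketched with the standard library alone. `MockWriter` below is a hypothetical stand-in, not Tantivy's `IndexWriter`; it only mirrors the idea that adding a document takes `&self` (so a read lock suffices) while committing takes `&mut self` (so it needs the write lock). The function `index_from_threads` is likewise made up for this illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Hypothetical stand-in for an index writer.
struct MockWriter {
    stamper: AtomicU64,          // hands out operation stamps
    pending: Mutex<Vec<String>>, // stands in for the internal indexing queue
    committed: usize,            // number of documents committed so far
}

impl MockWriter {
    fn new() -> MockWriter {
        MockWriter {
            stamper: AtomicU64::new(0),
            pending: Mutex::new(Vec::new()),
            committed: 0,
        }
    }

    /// Takes `&self`: many threads may call this under a shared read lock.
    fn add_document(&self, doc: String) -> u64 {
        self.pending.lock().unwrap().push(doc);
        self.stamper.fetch_add(1, Ordering::SeqCst)
    }

    /// Takes `&mut self`: requires exclusive access through the write lock.
    fn commit(&mut self) -> usize {
        let mut pending = self.pending.lock().unwrap();
        self.committed += pending.len();
        pending.clear();
        self.committed
    }
}

/// Spawns `num_threads` threads that each add one document, then commits.
fn index_from_threads(num_threads: usize) -> usize {
    let writer = Arc::new(RwLock::new(MockWriter::new()));
    let handles: Vec<_> = (0..num_threads)
        .map(|i| {
            let writer = Arc::clone(&writer);
            thread::spawn(move || {
                // A read lock is enough because add_document takes &self.
                writer.read().unwrap().add_document(format!("doc {}", i));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Committing needs the write lock.
    let committed = writer.write().unwrap().commit();
    committed
}
```

In this sketch the read lock never blocks other writers of documents; only `commit` excludes them for the duration of the call.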

How to update?

Your program should be usable as is.

Fast fields

Fast fields used to be accessed directly from the SegmentReader. The API changed, you are now required to acquire your fast field reader via the segment_reader.fast_fields(), and use one of the typed method:

.u64(), .i64() if your field is single-valued ;

.u64s(), .i64s() if your field is multi-valued ;

.bytes() if your field is bytes fast field.

0.9.1(Mar 28, 2019)

Hotfix. All languages were using the English stemmer.
0.9(Mar 20, 2019)
0.9.0 index format is not compatible with the previous index format.

Bugfix

Some Mmap objects were being leaked, and would never get released. (@fulmicoton)

New Features

Added IndexReader. By default, index is reloaded automatically upon new commits (@fulmicoton)

Stemming in other languages possible (@pentlander)

Added grouped add and delete operations. They are guaranteed to happen together (i.e. they cannot be split by a commit). In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)

Added DateTime field (@barrotsteindev)

Misc improvements

Indexer memory footprint improved (VInt comp, inlining the first block). (@fulmicoton)

Removed most unsafe (@fulmicoton)

Segments with no docs are deleted earlier (@barrotsteindev)

Removed INT_STORED and INT_INDEXED. It is now possible to use STORED and INDEXED for int fields. (@fulmicoton)

0.8.2b(Feb 14, 2019)

0.8.2 fixes build for non x86_64 platforms. See #496 for details.
0.8.1(Jan 23, 2019)

Hotfix of #476.

Merge was reflecting deletes before commit was passed. Thanks @barrotsteindev for reporting the bug.
0.8.0(Dec 26, 2018)
API Breaking change in the collector API. (@jwolfe, @fulmicoton)

Multithreaded search (@jwolfe, @fulmicoton)

0.7.2(Dec 18, 2018)

Bugfix #457 Removing faulty debug_assert!.
0.7.1(Nov 2, 2018)
Bugfix: NGramTokenizer panics on non ascii chars

Added a space usage API

0.7.0(Sep 16, 2018)
Skip data for doc ids and positions (@fulmicoton), greatly improving performance

Tantivy errors now rely on the failure crate (@drusellers)

Added support for AND, OR, NOT syntax in addition to the +,- syntax

Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)

Added a TopFieldCollector (@pentlander)

0.6.1(Jul 10, 2018)
Bugfix #324. GC was removing files that were still useful

Added support for parsing AllQuery and RangeQuery via QueryParser

AllQuery: *

RangeQuery:

Inclusive field:[startIncl to endIncl]

Exclusive field:{startExcl to endExcl}

Mixed field:[startIncl to endExcl} and vice versa

Unbounded field:[start to *], field:[* to end]
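
The bracket conventions listed above map naturally onto `std::ops::Bound`. The parser below is a rough, hypothetical sketch of that mapping for the part after `field:`; it is not Tantivy's QueryParser, and it assumes the two bounds are separated by a lowercase ` to `, as written in these notes.

```rust
use std::ops::Bound;

/// Turns one side of a range into a bound; `*` means unbounded.
fn make_bound(value: &str, inclusive: bool) -> Bound<String> {
    match (value, inclusive) {
        ("*", _) => Bound::Unbounded,
        (v, true) => Bound::Included(v.to_string()),
        (v, false) => Bound::Excluded(v.to_string()),
    }
}

/// Parses e.g. "[a to b}" into a pair of bounds.
/// Returns `None` when the brackets or the separator are missing.
fn parse_range(input: &str) -> Option<(Bound<String>, Bound<String>)> {
    let input = input.trim();
    // Opening bracket: `[` is inclusive, `{` is exclusive.
    let (lower_inclusive, rest) = if let Some(r) = input.strip_prefix('[') {
        (true, r)
    } else if let Some(r) = input.strip_prefix('{') {
        (false, r)
    } else {
        return None;
    };
    // Closing bracket: `]` is inclusive, `}` is exclusive.
    let (upper_inclusive, body) = if let Some(b) = rest.strip_suffix(']') {
        (true, b)
    } else if let Some(b) = rest.strip_suffix('}') {
        (false, b)
    } else {
        return None;
    };
    let (start, end) = body.split_once(" to ")?;
    Some((
        make_bound(start.trim(), lower_inclusive),
        make_bound(end.trim(), upper_inclusive),
    ))
}
```

Mixed ranges fall out for free because each side is decided by its own bracket.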
