Tantivy

Tantivy is a full-text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our search engine built on top of Tantivy.

Benchmark

The following benchmark breaks down performance for different types of queries/collections.

Your mileage WILL vary depending on the nature of queries and their load.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages), with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
  • Fast (check out the 🐎 benchmark 🐎 )
  • Tiny startup time (<10ms), perfect for command-line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop"; see the sketch after this list)
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, and hierarchical facet fields
  • LZ4 compressed document store
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • JSON Field
  • Aggregation Collector: range buckets, average, and stats metrics
  • LogMergePolicy with deletes
  • Searcher Warmer API
  • Cheesy logo with a horse
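
A minimal sketch of the natural query language and phrase queries in action, written against the tantivy 0.19-era API (exact signatures vary between releases; for instance, add_document only returns a Result in recent versions):

    use tantivy::collector::TopDocs;
    use tantivy::query::QueryParser;
    use tantivy::schema::{Schema, STORED, TEXT};
    use tantivy::{doc, Index};

    fn main() -> tantivy::Result<()> {
        // Define a schema with one indexed, stored text field.
        let mut schema_builder = Schema::builder();
        let title = schema_builder.add_text_field("title", TEXT | STORED);
        let schema = schema_builder.build();

        // Create an in-memory index and add a couple of documents.
        let index = Index::create_in_ram(schema.clone());
        let mut index_writer = index.writer(50_000_000)?;
        index_writer.add_document(doc!(title => "Michael Jackson, the king of pop"))?;
        index_writer.add_document(doc!(title => "Michael Faraday"))?;
        index_writer.commit()?;

        // Parse a query in the natural query language and collect the top 10 hits.
        let reader = index.reader()?;
        let searcher = reader.searcher();
        let query_parser = QueryParser::for_index(&index, vec![title]);
        let query = query_parser.parse_query(r#"(michael AND jackson) OR "king of pop""#)?;
        for (_score, doc_address) in searcher.search(&query, &TopDocs::with_limit(10))? {
            let retrieved = searcher.doc(doc_address)?;
            println!("{}", schema.to_json(&retrieved));
        }
        Ok(())
    }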

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust (>= 1.27) and supports Linux, macOS, and Windows.
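
To use Tantivy in your own project, add the crate as a dependency. On recent toolchains (cargo add ships with Cargo 1.62+) the quickest way is:

    cargo add tantivy

Otherwise, add a tantivy entry to the [dependencies] section of your Cargo.toml by hand.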

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Discord or by email ([email protected])
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Discord server)
  • Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Clone and build locally

Tantivy compiles on stable Rust but requires Rust >= 1.27. To check out and run tests, you can simply run:

    git clone https://github.com/quickwit-oss/tantivy.git
    cd tantivy
    cargo build

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the program with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:

    find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:

    (gdb) run --test-threads 1 --test $NAME_OF_TEST

An example

By default, rustc compiles everything in the examples/ directory in debug mode. This makes it easy for you to make examples to reproduce bugs:

    rust-gdb target/debug/examples/$EXAMPLE_NAME
    (gdb) run

Companies Using Tantivy

Nuclia · Humanfirst.ai · Element.io

FAQ

Can I use Tantivy in other languages?

Tantivy has official Python bindings (tantivy-py). You can also find other bindings on GitHub, but they may be less maintained.

What are some examples of Tantivy use?

  • seshat: A matrix message database/indexer
  • tantiny: Tiny full-text search for Ruby
  • lnx: adaptable, typo tolerant search engine with a REST API
  • and more!

On average, how much faster is Tantivy compared to Lucene?

Comments
  • Case of corrupted segment

    Probably related to #897

    Describe the bug: In continuation of https://gitter.im/tantivy-search/tantivy?at=5ff589c503529b296bd23728. First, I noticed memory bloat in the Tantivy application. After debugging, I found that the merging thread had been failing every time while merging a broken store in a particular segment. I sent an example of the whole segment to you personally in Gitter.

    Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/a4f33d3823f1bad3ff7a59877f1608615acabe6e

    What happened: I used poor man's debugging and launched Tantivy with a patched function (added a print only):

        fn write_storable_fields(&self, store_writer: &mut StoreWriter) -> crate::Result<()> {
            for reader in &self.readers {
                let store_reader = reader.get_store_reader()?;
                if reader.num_deleted_docs() > 0 {
                    for doc_id in reader.doc_ids_alive() {
                        let doc = store_reader.get(doc_id);
                        if let Err(ref err) = doc {
                            println!("Error: {:?}\nSegment ID: {:?}\nDocID: {}", err, reader.segment_id(), doc_id);
                        }
                        store_writer.store(&doc?)?;
                    }
                } else {
                    store_writer.stack(&store_reader)?;
                }
            }
            Ok(())
        }
    

    Stdout after failed merging:

    Error: IOError(Custom { kind: InvalidData, error: "Reach end of buffer while reading VInt" })
    Segment ID: Seg("e6ece22e")
    DocID: 53
    
    opened by ppodolsky 33
  • Make it possible to write the index from independent threads easily.

    (This is a follow up from #549)

    Ideally, what I need is the ability to write to the index from independent threads, each having its own writer.

    This problem is very common and hurts some Very Important Projects relying on tantivy (toshi, plume). (Invoking the names @fdb-hiroshima and @hntd187 for the discussion.)

    Web servers are typically multithreaded, and requests may spawn the need to add or delete documents. Dealing with an Arc<RwLock<IndexWriter>> might feel dirty, and Rust beginners may not really understand the logic behind it.

    On the other hand, the IndexWriter already relies on a channel to dispatch indexing operations to its own small thread pool. Stamping is also done using atomics. There is actually no real reason to prevent .add_document() and .delete_term() from happening on different threads; a sketch of the resulting pattern follows below.
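
    A minimal sketch of the pattern under discussion, as it eventually shipped (see the 0.10 release notes below: IndexWriter became Sync and add_document only requires a read lock). The "body" field and the buffer size are assumptions for illustration:

        use std::sync::{Arc, RwLock};
        use std::thread;

        use tantivy::{doc, Index};

        fn index_from_many_threads(index: &Index) -> tantivy::Result<()> {
            let body = index.schema().get_field("body").unwrap();
            let writer = Arc::new(RwLock::new(index.writer(50_000_000)?));
            let handles: Vec<_> = (0..4)
                .map(|_| {
                    let writer = Arc::clone(&writer);
                    thread::spawn(move || {
                        // add_document takes &self, so a read lock suffices.
                        let _ = writer.read().unwrap().add_document(doc!(body => "some text"));
                    })
                })
                .collect();
            for handle in handles {
                handle.join().unwrap();
            }
            // commit() needs &mut self, hence the write lock.
            writer.write().unwrap().commit()?;
            Ok(())
        }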

    The problem is

    • what should happen with .commit()
    • what should happen with .rollback()

    Especially would this ability to index from several threads confuse people on

    • what a commit is in tantivy (.commit() ensures that all operations that happened before the .commit() are processed)
    • how multithreaded indexing works (indexing from several threads will not help; Tantivy has its own multithreading system)

    Also,

    • should .commit() and .rollback() block other operations? (It is technically possible to have .commit() only block other .commit() operations.)
    • Should .commit() and .rollback() return futures?
    opened by fulmicoton 25
  • Block-Max WAND implementation proof of concept

    This is a proof-of-concept implementation of BMW algorithm for union queries. There are still some parts missing in the algorithm itself, but also I haven't touched anything related to indexing yet.

    Two things in particular I want to point out for now:

    • BMW requires sorting of posting lists; I wrapped them in Box, thinking it should be faster to move the boxes than the lists themselves; I am not yet sure this is the best way, as I haven't spent much time thinking about it so far;
    • BMW needs an additional piece of feedback from the collector: the current threshold. Right now I am passing a predicate, but it definitely has its disadvantages; one alternative is to have someone, say the collector, set that between steps. Again, I am not entirely sure what the right way to do this is (a sketch of the idea follows below).

    Any comments at this stage would be appreciated.
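
    A hypothetical sketch of the "threshold feedback" idea (the names below do not exist in tantivy; they only illustrate the shape of the API being discussed):

        /// Hypothetical trait: the collector exposes the score of its current
        /// k-th best hit so the scorer can skip blocks that cannot beat it.
        trait ThresholdProvider {
            fn current_threshold(&self) -> f32;
        }

        /// Hypothetical check inside a block-max union: before decoding a block,
        /// compare its precomputed max score against the collector's threshold.
        fn should_skip_block(block_max_score: f32, collector: &impl ThresholdProvider) -> bool {
            block_max_score <= collector.current_threshold()
        }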

    opened by elshize 22
  • garbage_collect_files works only on second try

    Describe the bug: the snippet of code I use is here.

    garbage_collect_files does not work the first time I use it, only the second time. "First time" means the first call of garbage_collect_files in one thread; "second" means the second call of garbage_collect_files in another thread.

    Which version of tantivy are you using? 0.9.1

    To Reproduce

    A repo with all the code I used; look at the drop_index() function in src/bin/server.rs.

    I use POST request to trigger garbage_collect_files like this: POST http://localhost:8080/drop

    When I post the first time, all documents are deleted but the ./index directory does not become empty; it still holds 36MB of data (the original size of the index directory was 129MB). When I post the second time, the directory size drops to 8.0K. Nothing happens in between except the second call to garbage_collect_files.

    I don't know, maybe it is expected behavior, but it is weird.

    bug 
    opened by kkonevets 21
  • memory leak problem

    Hi, I'm a Rust rookie. My program suffers from a memory leak; it gets killed by the kernel after running for a long time. It gets data from Redis and adds it to Tantivy with Rust:

    fn main() {
        // `index`, `num_threads`, `buffer` and `merge_policy` are set up elsewhere.
        let mut index_writer = index.writer_with_num_threads(num_threads, buffer).unwrap();
        index_writer.set_merge_policy(merge_policy);
        loop {
            let datas = RedisService::get_data();
            if datas.is_empty() {
                // sleep
            }
            for data in datas {
                let mut doc = Document::new();
                let json_value = json::parse(&data).unwrap();
                // doc.add_text(...), doc.add_u64(...);
                index_writer.add_document(doc);
            }
            index_writer.commit().unwrap();
        }
    }
    
    
    question 
    opened by vsop-479 20
  • [RFC] Python bindings

    This patch-set implements python bindings for Tantivy.

    The bindings are written using PyO3 and thus require Rust nightly.

    Some parts of the Tantivy API are omitted for now (like the SegmentReader) while some parts are made a little bit more pythonic using optional arguments instead of builder patterns.

    The search method of the Searcher struct is incomplete and only allows a single TopCollector as its collector argument. Since the method is generic and python doesn't support generics, it has proven to be quite tricky to allow multiple differing collectors to be passed to the function in a dynamic way.

    The first couple of patches add some functionality or expose some structs publicly in Tantivy that are required for the python bindings.

    The last patch finally adds the Python bindings as a sub-crate to Tantivy. To build the bindings, a setup.py file is provided that requires setuptools-rust; alternatively, pyo3-pack can be used. Both methods allow the bindings to be uploaded to PyPI.

    opened by poljar 20
  • Make it possible to stream the terms matching an Automaton

    Closes #273

    Ok, so I've started down the path, but I'm for sure lost.

    1. Not sure how I would test with an Automaton - still reading up on the fst code there
    2. Not sure how to use the fst StreamBuilder; it feels close on the code implementation side, but I have compiler issues, which could just be about spreading the generic type love around more.
    opened by drusellers 19
  • Replace `chrono` with `time`

    Fixes #1304.

    BREAKING API CHANGE: The type DateTime is no longer (re-)exported. Instead you need to use time::OffsetDateTime explicitly.

    Remarks:

    • Please check the error handling and mappings. time returns results instead of panicking like chrono does. For the FastValue impl I had to use .expect("valid UNIX timestamp").
    • I had to adjust some debug strings in tests that are used for matching test results, copy&pasted from the failed test outputs. Not very handy, please verify.

    TODO:

    • [x] Add changelog entry
    opened by uklotzde 17
  • Tantivy fails with `Failed purge deletes` error

    I am getting a failure starting my application

    Failed purge deletes: Error(PathDoesNotExist("/SOME_PATH/c475e13ef3ca45128b1f8d9ee42fe994.term")
    

    This is on OS X. My computer crashed recently. What's a good way to move forward here? Happy to help writing some code if that helps :)

    bug high priority 
    opened by winding-lines 16
  • - Knock-knock? - Who's there? - Broken segment!

    Describe the bug: The same load profile as in #969 - deletions, additions, and merges. Now it happens on querying after several hours of serving. I think the reason is basically the same. At startup and for several hours afterwards, all queries were OK, but after generations of merges, searcher.doc started to throw a VInt decoding error.

    Which version of tantivy are you using? https://github.com/tantivy-search/tantivy/commit/bf6e6e8a7cc1826212ba2500b08ecb53dfcdeda1

    To Reproduce: Sent the broken segment to you in Gitter.

    opened by ppodolsky 15
  • The Snippet::fragments member is misleading and needs a rename

    Is your feature request related to a problem? Please describe. I am creating an app where I would like to break different fragments onto their own lines.

    What my app currently outputs:

    [ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
    address space is context-switched by **Zircon**.Contrary to other OSes however, the IOMMU
    (Input-Output MMU), plays an important role on **Zircon**: it is
    

    Notice how **Zircon**.Contrary is concatenated.

    What I would like:

    [ueuccfcgbx] Fucshia Arch: https://blog.quarkslab.com/playing-around-with-the-fuchsia-operating-system.html
    address space is context-switched by **Zircon**. 
    Contrary to other OSes however, the IOMMU(Input-Output MMU), plays an important role on **Zircon**: it is
    

    Currently, a Snippet stores fragments (a string that holds all of the text of the snippet) and a vector of HighlightSections (what parts of fragments should be highlighted). I would like to break each fragment onto its own line to make the output more readable/understandable. Currently, there is no way to know where one fragment ends and another begins.

    Describe the solution you'd like: Change the type of Snippet::fragments to Vec<String> and add a fragment_number member to HighlightSection (sketched below).

    [Optional] describe alternatives you've considered: Add a new member to Snippet, similar to highlighted, called fragment_sections: a Vec of start and stop points for the different fragments.
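
    A hypothetical sketch of the proposed shapes (these are not the actual tantivy types; fields beyond the ones discussed are omitted):

        // Each highlighted span records which fragment it belongs to.
        struct HighlightSection {
            start: usize,
            stop: usize,
            fragment_number: usize, // proposed new member
        }

        // Fragments become separate strings instead of one concatenated string.
        struct Snippet {
            fragments: Vec<String>, // was: String
            highlighted: Vec<HighlightSection>,
        }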

    enhancement 
    opened by liamwarfield 15
  • Add regex tokenizer

    This adds a regex tokenizer, which tokenizes text by using a regex pattern to split. This is my first attempt; it works for my use case, but it's not ideal from a code and configuration perspective.

    Closes https://github.com/quickwit-oss/tantivy/issues/1670

    opened by mkleen 3
  • Implement RegexTokenizer

    Implementation of a RegexTokenizer for #1670.

    To make this a decently flexible solution, I implemented support for capture groups as well as simple find matches: the RegexTokenStream checks on construction whether there are any capture groups (besides the entire match) and uses either captures_read_at for capture support or find_at for simple matches. Tests and examples are included.

    Closes https://github.com/quickwit-oss/tantivy/issues/1670

    opened by Gearme 2
  • `get_docids_for_value_range` is broken w/certain fast fields that use a GCD inverse

    Describe the bug

    When inserting fast fields in a certain order, segments will occasionally end up with a column that matches extra documents when calling get_docids_for_value_range using nonexistent values.

    The testcase at the bottom of this report illustrates a failing case, though you need to run it repeatedly to cause it to fail.

    Expected behaviour: passing fastfield values/ranges to get_docids_for_value_range that cannot possibly exist in a given segment reader should return no matching documents.

    Which version of tantivy are you using?

    0.19.0

    To Reproduce

    This code will reproduce the bug, but it won't happen every time. Running it repeatedly will eventually give you a SegmentReader with the min/max values [6341073085727221 3373930417471920086], and that SegmentReader will always match document zero when querying the invalid value 7999999999999999.

    
    // Imports added so the test is self-contained (tantivy 0.19 API).
    use tantivy::directory::RamDirectory;
    use tantivy::schema::{Schema, FAST, INDEXED};
    use tantivy::{doc, Index, IndexSettings};

    #[test]
    fn test_gcd_bug() {
        let dir = RamDirectory::create();
        let mut schema_builder = Schema::builder();
        let url_norm_hash_field = schema_builder.add_i64_field("url_norm_hash", FAST | INDEXED);
        let schema = schema_builder.build();
        let index = Index::create(dir, schema, IndexSettings::default()).unwrap();
        let mut writer = index.writer(50_000_000).unwrap();
        for i in [6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 6341073085727221_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64, 3373930417471920086_i64] {
            writer.add_document(doc! {
                url_norm_hash_field => i,
            }).unwrap();
        }
        writer.commit().unwrap();
        let reader = index.reader().unwrap();
        let searcher = reader.searcher();
        for segment in searcher.segment_readers() {
            let field = segment.fast_fields().i64(url_norm_hash_field).unwrap();
            let (min, max) = (field.min_value(), field.max_value());
            println!("{} {}", min, max);
            if 7999999999999999_i64 >= min && 7999999999999999_i64 <= max {
                let mut vec = vec![];
                field.get_docids_for_value_range(7999999999999999_i64..=7999999999999999_i64, 0..u32::MAX, &mut vec);
                println!("{:?}", vec);
            }
        }
    }
    
    bug 
    opened by mmastrac 3
  • Panic when opening old indexes

    Describe the bug

    Tantivy panics when reading an index of an old version with no DOC_STORE_VERSION field in the footer:

    thread 'MainLoop' panicked at 'actual doc store version: 2222222222, expected: 1', tantivy-0.19.0\src\store\footer.rs:30:13
    
    opened by gyk 0
  • Allow single-threaded addition of documents to existing index without spawning new threads

    Is your feature request related to a problem? Please describe. Continuing my work on getting tantivy to work server-side with WASM (related issues #1751 #541), I would like to index dynamically added documents. In essence, I have a large set of documents I can pre-index in a build phase, but each user also has some documents that are loaded dynamically from a database.

    I would imagine that normally I could do something like this:

    let index = Index::open(ram_directory).unwrap();
    let mut index_writer = index.writer(3_000_000).unwrap();
    index_writer.add_document(my_document).unwrap();
    index_writer.commit().unwrap();
    

    And then search the index. I realize that this might leak documents from one tenant to another, but as this index is rebuilt in-memory on each request and dropped after, this isn't a large concern.

    However, as WASM is single-threaded, I'm unable to actually get this to work, as it seems all the IndexWriters require a thread pool of some sort.

    I have tried both index.writer() and index.writer_with_num_threads with both 1 and 0 threads. I've even delved into the undocumented SingleSegmentIndexWriter. Even though it seems to suggest it's only for creating an index with a single segment and not adding to an existing index, I figured I would give it a go.

    However trying to instantiate it gives me the following IoError

    IoError(
        Error {
            kind: Unsupported,
            message: "operation not supported on this platform",
        },
    )
    

    Which I think might be threadpool-related, but I am unable to get a stacktrace to confirm.

    Describe the solution you'd like: In essence, I'm asking whether there's any way to accomplish what I want to do.

    • Is there another way to use tantivy to index dynamic documents without using an IndexWriter and that machinery?
    • If not, is there any single-threaded IndexWriter or similar that I've been unable to find? Would you consider adding one?
    • Alternatively, is there any way or guidance for me to implement my own single-threaded IndexWriter? At first glance, it seems like a lot of the code that deals with writing is (understandably) internal to tantivy and not exposed.
    opened by GeeWee 4
Releases
  • 0.18.1(Oct 20, 2022)

  • 0.18(May 26, 2022)

    • For date values chrono has been replaced with time (@uklotzde) #1304
    • Add histogram aggregation (@PSeitz)
    • Add support for fastfield on text fields (@PSeitz)
    • Add terms aggregation (@PSeitz)
    • Add support for zstd compression (@kryesh)
  • 0.17(Mar 9, 2022)

    • LogMergePolicy now triggers merges if the ratio of deleted documents reaches a threshold (@shikhar @fulmicoton) #115
    • Adds a searcher Warmer API (@shikhar @fulmicoton)
    • Change to non-strict schema. Ignore fields in data which are not defined in schema. Previously this returned an error. #1211
    • Facets are necessarily indexed. Existing indexes with indexed facets should work out of the box. Indexes whose facets are marked with index: false should be broken (but they were already broken in a sense). (@fulmicoton) #1195
    • Bugfix that could in theory impact durability on some filesystems #1224
    • Schema now offers the option not to index fieldnorms (@lpouget) #922
    • Reduce the number of fsync calls #1225
    • Fix opening bytes index with dynamic codec (@PSeitz) #1278
    • Added an aggregation collector compatible with Elasticsearch (@PSeitz)
    • Added a JSON schema type @fulmicoton #1251
    • Added support for slop in phrase queries @halvorboe #1068
  • 0.16.1(Sep 10, 2021)

  • 0.15.3(Jun 30, 2021)

  • 0.15.2(Jun 16, 2021)

  • 0.15.1(Jun 14, 2021)

  • 0.15(Jun 7, 2021)

    • API Changes. Using Range instead of (start, end) in the API and internals (FileSlice, OwnedBytes, Snippets, ...) This change is breaking but migration is trivial.
    • Added an Histogram collector. (@fulmicoton) #994
    • Added support for Option. (@fulmicoton)
    • DocAddress is now a struct (@scampi) #987
    • Bugfix consistent tie break handling in facet's topk (@hardikpnsp) #357
    • Date field support for range queries (@rihardsk) #516
    • Added lz4-flex as the default compression scheme in tantivy (@PSeitz) #1009
    • Renamed a lot of symbols to avoid all uppercasing on acronyms, as per new clippy recommendation. For instance, RAMDirectory -> RamDirectory. (@fulmicoton)
    • Simplified positions index format (@fulmicoton) #1022
    • Moved bitpacking to bitpacker subcrate and add BlockedBitpacker, which bitpacks blocks of 128 elements (@PSeitz) #1030
    • Added support for more-like-this query in tantivy (@evanxg852000) #1011
    • Added support for sorting an index, e.g. presorting documents in an index by a timestamp field. This can heavily improve performance for certain scenarios by utilizing the sorted data (top-n optimizations). (@PSeitz) #1026
    • Add iterator over documents in doc store (@PSeitz). #1044
    • Fix log merge policy (@PSeitz). #1043
    • Add detection to avoid small doc store blocks on merge (@PSeitz). #1054
    • Make doc store compression dynamic (@PSeitz). #1060
    • Switch to json for footer version handling (@PSeitz). #1060
    • Updated TermMerger implementation to rely on the union feature of the FST (@scampi) #469
    • Add boolean marking whether position is required in the query_terms API call (@fulmicoton). #1070
  • 0.14(Feb 5, 2021)

    • Removed the dependency on atomicwrites #833. Implemented by @fulmicoton upon suggestion and research from @asafigan.
    • Migrated tantivy error from the now deprecated failure crate to thiserror #760. (@hirevo)
    • API Change. Accessing the typed value of a Schema::Value now returns an Option instead of panicking if the type does not match.
    • Large API Change in the Directory API. Tantivy used to assume that all files could be somehow memory mapped. After this change, Directory returns a FileSlice that can be reduced and eventually read into an OwnedBytes object. Long and blocking IO operations are still required, but they do not span the entire file.
    • Added support for Brotli compression in the DocStore. (@ppodolsky)
    • Added helper for building intersections and unions in BooleanQuery (@guilload)
    • Bugfix in Query::explain
    • Removed dependency on notify #924. Replaced with FileWatcher struct that polls meta file every 500ms in background thread. (@halvorboe @guilload)
    • Added FilterCollector, which wraps another collector and filters docs using a predicate over a fast field (@barrotsteindev)
    • Simplified the encoding of the skip reader struct. BlockWAND max tf is now encoded over a single byte. (@fulmicoton)
    • FilterCollector now supports all Fast Field value types (@barrotsteindev)
    • Fast fields are no longer all loaded when opening the segment reader. (@fulmicoton)

    This version breaks compatibility and requires users to reindex everything.

  • 0.13.3(Jan 13, 2021)

  • 0.13.2(Oct 1, 2020)

  • 0.13.1(Sep 19, 2020)

  • 0.13(Aug 19, 2020)

    Tantivy 0.13 introduces a change in the index format that will require you to reindex your data (BlockWAND information is added in the skip list). The index size increase is minor, as this information is only added for full blocks. If you have a massive index for which reindexing is not an option, please contact me so that we can discuss possible solutions.

    • Bugfix in FuzzyTermQuery not matching terms by prefix when it should (@Peachball)
    • Relaxed constraints on the custom/tweak score functions. At the segment level, they can be mut, and they are not required to be Sync + Send.
    • MMapDirectory::open does not return a Result anymore.
    • Change in the DocSet and Scorer API. (@fulmicoton) A freshly created DocSet points directly at its first doc. A sentinel value called TERMINATED marks the end of a DocSet. .advance() returns the new DocId. Scorer::skip(target) has been replaced by Scorer::seek(target), which returns the resulting DocId. As a result, iterating through a DocSet now looks as follows:
    let mut doc = docset.doc();
    while doc != TERMINATED {
       // ...
       doc = docset.advance();
    }
    

    The change made it possible to greatly simplify a lot of the docset's code.

    • Misc internal optimization and introduction of the Scorer::for_each_pruning function. (@fulmicoton)
    • Added an offset option to the Top(.*)Collectors. (@robyoung)
    • Added Block WAND. Performance on TOP-K on term-unions should be greatly increased. (@fulmicoton, and special thanks to the PISA team for answering all my questions!)
  • 0.12(Feb 19, 2020)

    • Removing static dispatch in tokenizers for simplicity. (#762)
    • Added backward iteration for TermDictionary stream. (@halvorboe)
    • Fixed a performance issue when searching for the posting lists of a missing term (@audunhalland)
    • Added a configurable maximum number of docs (10M by default) for a segment to be considered for merge (@hntd187, landed by @halvorboe #713)
    • Important Bugfix #777, causing tantivy to retain memory mapping. (diagnosed by @poljar)
    • Added support for field boosting. (#547, @fulmicoton)
  • 0.11.3(Dec 20, 2019)

  • 0.11.1(Dec 17, 2019)

  • 0.11(Dec 15, 2019)

    • Added f64 field. Internally reuses the u64 code the same way i64 does (@fdb-hiroshima)
    • Various bugfixes in the query parser.
      • Better handling of hyphens in query parser. (#609)
      • Better handling of whitespaces.
    • Closes #498 - add support for Elastic-style unbounded range queries for alphanumeric types eg. "title:>hello", "weight:>=70.5", "height:<200" (@petr-tik)
    • API change around Box<BoxableTokenizer>. See detail in #629
    • Avoid rebuilding Regex automaton whenever a regex query is reused. #639 (@brainlock)
    • Add footer with some metadata to index files. #605 (@fdb-hiroshima)
    • Add a method to check the compatibility of the footer in the index with the running version of tantivy (@petr-tik)
    • TopDocs collector: ensure stable sorting on equal score. #671 (@brainlock)
    • Added handling of pre-tokenized text fields (#642), which will enable users to load tokens created outside tantivy. See usage in examples/pre_tokenized_text. (@kkoziara)
    • Fix crash when committing multiple times with deleted documents. #681 (@brainlock)

    How to update?

    • The index format is changed. You are required to reindex your data to use tantivy 0.11.
    • Box<dyn BoxableTokenizer> has been replaced by a BoxedTokenizer struct.
    • Regexes are now compiled when the RegexQuery instance is built. As a result, construction can now return an error, and handling the Result is required (see the sketch below).
    • tantivy::version() now returns a Version object. This object implements ToString()
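
    A minimal sketch of the now-fallible RegexQuery construction (the field is assumed to be an indexed text field; the pattern is illustrative):

        use tantivy::query::RegexQuery;
        use tantivy::schema::Field;

        fn build_regex_query(field: Field) -> tantivy::Result<RegexQuery> {
            // The regex is compiled here, so construction can fail.
            let query = RegexQuery::from_pattern("mich.*", field)?;
            Ok(query)
        }
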
  • 0.10.3(Nov 10, 2019)

  • 0.10.2(Oct 1, 2019)

  • 0.10.1(Jul 30, 2019)

    • Closes #544. A few users experienced problems with the directory watching system. Avoid watching the mmap directory until someone effectively creates a reader that uses this functionality.
  • 0.10.0(Jul 11, 2019)

    Tantivy 0.10.0 index format is compatible with the index format in 0.9.0.

    • Added an API to easily tweak or entirely replace the default score. See TopDocs::tweak_score and TopDocs::custom_score (@pmasurel)
    • Added an ASCII folding filter (@drusellers)
    • Bugfix in query.count in presence of deletes (@pmasurel)
    • Added .explain(...) in Query and Weight (@pmasurel)
    • Added an efficient way to delete_all_documents in IndexWriter (@petr-tik). All segments are simply removed.

    Minor

    • Switched to Rust 2018 (@uvd)
    • Small simplification of the code. Calling .freq() or .doc() when .advance() has never been called on segment postings should panic from now on.
    • Tokens exceeding u16::max_value() - 4 chars are discarded silently instead of panicking.
    • Fast fields are now preloaded when the SegmentReader is created.
    • IndexMeta is now public. (@hntd187)
    • IndexWriter add_document, delete_term: IndexWriter is now Sync, making it possible to use it with an Arc<RwLock<IndexWriter>>. add_document and delete_term only require a read lock. (@pmasurel)
    • Introducing Opstamp as an expressive type alias for u64. (@petr-tik)
    • Stamper now relies on AtomicU64 on all platforms (@petr-tik)
    • Bugfix - Files get deleted slightly earlier
    • Compilation resources improved (@fdb-hiroshima)

    How to update?

    Your program should be usable as is.

    Fast fields

    Fast fields used to be accessed directly from the SegmentReader. The API changed: you are now required to acquire your fast field reader via segment_reader.fast_fields() and use one of the typed methods (a sketch follows the list):

    • .u64(), .i64() if your field is single-valued;
    • .u64s(), .i64s() if your field is multi-valued;
    • .bytes() if your field is a bytes fast field.
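
    A sketch of the new access pattern (reader and field names are illustrative; exact return types vary between 0.10.x releases):

        use tantivy::schema::Field;
        use tantivy::SegmentReader;

        fn first_doc_value(segment_reader: &SegmentReader, popularity: Field) -> u64 {
            // Acquire the typed reader through fast_fields(), then read by DocId.
            let fast_fields = segment_reader.fast_fields();
            let reader = fast_fields.u64(popularity).unwrap();
            reader.get(0)
        }
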
  • 0.9.1(Mar 28, 2019)

  • 0.9(Mar 20, 2019)

    0.9.0 index format is not compatible with the previous index format.

    Bugfix

    Some Mmap objects were being leaked, and would never get released. (@fulmicoton)

    New Features

    • Added IndexReader. By default, the index is reloaded automatically upon new commits (@fulmicoton)
    • Stemming in other languages is now possible (@pentlander)
    • Added grouped add and delete operations. They are guaranteed to happen together (i.e. they cannot be split by a commit). In addition, adds are guaranteed to happen on the same segment. (@elbow-jason)
    • Added DateTime field (@barrotsteindev)

    Misc improvements

    • Indexer memory footprint improved (VInt compression, inlining the first block). (@fulmicoton)
    • Removed most unsafe (@fulmicoton)
    • Segments with no docs are deleted earlier (@barrotsteindev)
    • Removed INT_STORED and INT_INDEXED. It is now possible to use STORED and INDEXED for int fields. (@fulmicoton)
  • 0.8.2b(Feb 14, 2019)

  • 0.8.1(Jan 23, 2019)

  • 0.8.0(Dec 26, 2018)

  • 0.7.2(Dec 18, 2018)

  • 0.7.1(Nov 2, 2018)

  • 0.7.0(Sep 16, 2018)

    • Skip data for doc ids and positions (@fulmicoton), greatly improving performance
    • Tantivy error now rely on the failure crate (@drusellers)
    • Added support for AND, OR, NOT syntax in addition to the +,- syntax
    • Added a snippet generator with highlight (@vigneshsarma, @fulmicoton)
    • Added a TopFieldCollector (@pentlander)
  • 0.6.1(Jul 10, 2018)

    • Bugfix #324. GC was removing files that were still useful.
    • Added support for parsing AllQuery and RangeQuery via QueryParser
      • AllQuery: *
      • RangeQuery:
        • Inclusive field:[startIncl to endIncl]
        • Exclusive field:{startExcl to endExcl}
        • Mixed field:[startIncl to endExcl} and vice versa
        • Unbounded field:[start to *], field:[* to end]