fst

Represent large sets and maps compactly with finite state transducers.

This crate provides a fast implementation of ordered sets and maps using finite state machines. In particular, it makes use of finite state transducers to map keys to values as the machine is executed. Using finite state machines as data structures enables us to store keys in a compact format that is also easily searchable. For example, this crate leverages memory maps to make range queries very fast.

Check out my blog post Index 1,600,000,000 Keys with Automata and Rust for extensive background, examples and experiments.


Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/fst

The regex-automata crate provides implementations of the fst::Automaton trait when its transducer feature is enabled. This permits using DFAs compiled by regex-automata to search the finite state transducers produced by this crate.

Installation

Simply add a corresponding entry to your Cargo.toml dependency list:

[dependencies]
fst = "0.4"

Example

This example demonstrates building a set in memory and executing a fuzzy query against it. Note that it requires fst's levenshtein feature to be enabled in your Cargo.toml.
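In Cargo.toml, that feature is enabled like so:

```toml
[dependencies]
fst = { version = "0.4", features = ["levenshtein"] }
```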

use fst::{IntoStreamer, Set};
use fst::automaton::Levenshtein;

fn main() -> Result<(), Box<dyn std::error::Error>> {
  // A convenient way to create sets in memory.
  let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"];
  let set = Set::from_iter(keys)?;

  // Build our fuzzy query.
  let lev = Levenshtein::new("foo", 1)?;

  // Apply our fuzzy query to the set we built.
  let stream = set.search(lev).into_stream();

  let keys = stream.into_strs()?;
  assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
  Ok(())
}

Check out the documentation for a lot more examples!

Cargo features

  • levenshtein - Disabled by default. This adds the Levenshtein automaton to the automaton sub-module. This includes an additional dependency on utf8-ranges.
Comments
  • Thread panics on read operations of FST set file

    I'm seeing a very small percentage of "corrupt" FST set files that are triggering panics in Rust (leading to a Python interpreter segfault). The errors look like:

    thread '<unnamed>' panicked at 'index out of bounds: the len is 17498006 but the index is 15336395951936096993', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17
    
    thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17
    
    thread '<unnamed>' panicked at 'index out of bounds: the len is 16285338 but the index is 3532794445444415790', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                      
    

This occurs on approximately 13 out of 4635 files, ranging in size from 20MB to over 100MB. I have not been able to narrow things down further, but wanted to ask: what might cause this?

    I'm shelling out to the fst-bin crate to combine multiple input files into larger files, then doing set operations on the merged output. The fst binary was built on Rust nightly; I'm not sure of the exact version at the moment.

    opened by davidblewett 22
  • Update memmap dependency to fix API incompatibility

The problem is that fst exposes the memmap API through its own API (e.g. impl From<Mmap>). memmap 0.5 is in theory a breaking API change, so Rust doesn't allow other projects depending on fst to update to memmap 0.5.

The proper fix (in case the memmap API still has to be exposed) would probably be to re-export the memmap API from fst (basically push it up the chain, so everybody that uses fst directly or indirectly has to stick to the same memmap version :( ), but I didn't want to make that decision for the project.

I think this also needs a version bump to 0.3.0 (the fst version).

    opened by lilianmoraru 17
  • [Question] Query panics on very large dataset

    I try to run a StartsWith query against a large generated FST and it panics with the following message:

    thread 'main' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', /home/jakob/.cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.4.7/src/raw/node.rs:302:17

    The len matches the filesize of the fst.

    The code for the query is:

    use fst::{
        automaton::{AlwaysMatch, Str},
        Automaton, IntoStreamer, Map, Streamer,
    };
    use memmap::Mmap;
    use std::fs::File;
    
    pub fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mmap = unsafe { Mmap::map(&File::open("pwned-passwords.fst")?)? };
        let map = Map::new(mmap)?;
    
        for arg in std::env::args() {
            let query = Str::new(&arg).starts_with();
            let mut results = map.search(query).into_stream();
            while let Some((key, value)) = results.next() {
                println!(
                    "{}: {:>20}",
                    unsafe { std::str::from_utf8_unchecked(key) },
                    value
                );
            }
        }
    
        Ok(())
    }
    

    So it is basically the sample code on a different data set, and not much more.

    There were no errors reported while generating the fst. To reproduce it, one can follow the procedure below:

    1. download the v8 Version of Pwned-Passwords SHA1 (17GB compressed)
    2. Unpack this file and reorder it (unfortunately the correctly ordered file does not come with the values) (about 34GiB)
    3. Generate an fst from the reordered file (the fst is 30.1 GiB, which feels a little high)

    Oh btw. I have no clue about finite automata and FSTs, but found the blogpost fascinating, so I wanted to play with them a little.

    bug 
    opened by jakob-ledermann 12
  • Request: looser lifetime for loading from byte slice

    Currently, the primary way to load an fst object (Map/Set/Raw) from a byte slice is to use from_static_slice. While that solves the use case of loading data that lives for the lifetime of the program, it appears to be overly-restrictive for other use cases.

    For example, I have fst data that I've written into a network buffer:

    • I do not want to give ownership of that buffer to fst, so using from_bytes won't work.
    • The fst data is not 'static, so from_static_slice won't work.
    • The fst data is a subslice of a greater slice (the network buffer), so to make it work with from_shared_bytes I would need to change my internal APIs to use Arc<Vec<u8>> instead of &[u8] (which may add bugs and performance regressions).

    While I could try to use the existing mmap API, I believe I would need to create a file-like object that wraps my byte slice, and then memory map it, which is unergonomic and error-prone.

    My recommendation is to add a from_slice(bytes: &[u8]) constructor to Map, Set, and Raw. My understanding is that this is unfortunately not trivial, because the FstData type does not take a lifetime parameter, which would be required while it holds a reference to the byte slice.

    Nevertheless, is there interest in adding this feature?

    Thanks!

    enhancement 
    opened by rw 10
  • Two lines of input don't work

    The commandline

    fst map two.csv two.fst

    does work, but the following command:

    fst range -o two.fst

    prints only one single line and the command

    fst dot two.fst | dot -Tpng > two.png

    produces a graph with three nodes - ok. But only a single final node, a wrong output on label 'a' (1) and no output on label 'b'.

    opened by ElBartel 10
  • Update the levenshtein construction algorithm

    I implemented the method used in Fast String Correction with Levenshtein-Automata. Currently the project is available here. https://github.com/tantivy-search/levenshtein-automata

    Building a new automaton requires precomputing a data structure for each type of Levenshtein automaton considered. This precomputation is rather slow, but it is typically amortized after building one automaton. If this is a problem, it is also possible to serialize it and ship it with the library. That's the approach taken by Lucene.

    The benefits compared with the current implementation of fst-levenshtein are as follows:

    • building a DFA is much faster. For instance, for a distance of 2 and the string "Levenshtein", the current algorithm takes 2.5ms while the new implementation takes 0.09ms. Note that 2.5ms is clearly fast enough for most applications, and fst intersection will dominate anyway.

    • the produced DFA is smaller, so we can hope for fewer cache misses. (!?) Microbenchmarking this honestly is obviously very difficult... Going through the DFA above over the string "Levenshtein" went from 72 ns to 8 ns. The difference might also be due to the interface being easier for the compiler to deal with (states are u32s that I pass by value). For a Levenshtein distance of 2 without transposition, the number of states in the DFA (before UTF-8 conversion) is bounded by 30 × (query len + 1)

    • the DFA outputs the actual distance rather than a boolean. This is useful for scoring.

    • Optionally, transposition can be given a cost of 1 (Damerau-Levenshtein).

    • It addresses #38

    The cons as I see them are:

    • It is yet an extra crate
    • Code quality is not as good as yours and bugs might be lurking somewhere.

    Should we switch to this implementation, and how do we proceed?

    If you decide to go with this implementation, the way I see it is: I should publish the project as a separate crate. It is unfortunate to have two repos for Levenshtein automata, but currently fst_levenshtein cannot be used without depending on fst, which is not ideal either. fst_levenshtein could depend on this library and wrap its DFA.

    An alternative solution would be to integrate my code into the fst_levenshtein crate...

    opened by fulmicoton 10
  • Querying a 63GB file took a long time

    I'm using this library for querying a 63GB FST of about 7 billion strings of length 9. With more than enough memory available to completely load/cache the FST, it still took 16 minutes to query only about 15 thousand strings. Then I first copied the files to a tmpfs before doing the exact same thing, and I was able to query more than 300 million strings in only 36 minutes.

    Do you have any idea what might be the problem in the first case or how I would go about bisecting the issue?

    opened by Procrat 10
  • provide reverse mapping from value to key in some cases

    It is sometimes useful to have a String <-> StringOrdinal mapping, where StringOrdinal is the string's position within the set of strings, assuming the strings have been sorted.

    The fst crate makes a very good String -> StringOrdinal mapping. Wouldn't it be possible to get the StringOrdinal -> String direction by binary searching for the correct transition at each node?

    enhancement 
    opened by fulmicoton 9
  • Add a StreamWithState streamer

    This is an attempt to be able to retrieve the Automaton State from a streamer, this streamer will be called StreamWithState.

    Currently the map::Stream struct implements the Streamer trait with an associated Streamer::Item that is (&[u8], u64), in other words the matching string associated with its value.

    The problem is that it is not possible to retrieve the state of the automaton with the current map::Stream type. It could be useful when map::Stream is used with a Levenshtein automaton: the state makes it possible to know the distance of the matching string returned.

    So I propose to add a new struct named StreamWithState that implements the Streamer trait, with the associated type Streamer::Item being (&[u8], u64, A::State) where A: Automaton. This way the user will be able to retrieve the state of the Automaton and therefore the Levenshtein distance of the matching string returned (by querying the automaton).

    The StreamWithState is accessible via the with_state self-consuming method on map::StreamBuilder.

    All of this resolves #60.

    What I have done here is probably a little bit confusing due to the way GitHub shows the diffs. For me, map::StreamWithState is a superset of raw::Stream. It should be possible to implement map::Stream using the other one and just ignore the Automaton::State, so I created a special StreamWithState::next method that accepts a function which will "transform" the A::State.

    opened by Kerollmops 8
  • Fst made of fixed length byte strings: bounded hamming distance query

    I have this problem: 3B k/v pairs where the keys are fixed-length byte strings of either 32, 64, or 128 bits (in a given fst, all keys have the same length though).

    I need to find all values with a key that is within a bounded (small) bit hamming distance of existing keys.

    I was thinking of a simple hamming distance query approach based on "Detecting Near-Duplicates for Web Crawling" at http://gurmeet.net/computer-science/research-papers/

    So say I want a hamming distance of 3 and my keys are 32 bits. At query time, I could build a query Levenshtein fst that would allow any key with up to 3 bytes changed (not added/removed) to still match, e.g. this would match all the keys that are within hamming distance 3 or less.

    This should work, but is this a proper use of the Levenshtein automaton? E.g. would there be a better construction for hamming distance than Levenshtein?

    Since this would be raw bytes, this might not work well with the Unicode support of Levenshtein?

    Would the fact that the first three bytes can be anything, with the last byte still matching, force a full sequential scan of the fst?

    @fulmicoton ping too, as you seem to grok Levenshtein automata quite well

    opened by pombredanne 8
  • key positions in fst

    Is it possible to:

    • return the key position from contains? That is, instead of bool, return Option<usize>, where None means not found and Some(pos) is the position of the key in the lexicographic order of the keys represented by the automaton
    • return a key by its position. That is, implement ops::Index for Set, such that myset[pos] returns the key at pos.

    If the above is possible then more possibilities open up:

    • map from bytes to bytes: "set" fst + int -> byte map (possibly another "set" fst)
    • map from bytes to ints, where the ints are monotonically increasing when the bytes are in lexicographic order (this is very common when building search indices): "set" fst + succinct integer encoding

    wontfix 
    opened by alkis 8
  • Unused traits in code examples

    The code examples on this page import the traits Streamer and IntoStreamer. But apparently, these traits aren't being used after they have been imported.

    I suggest doing one of the following:

    1. remove the imports of Streamer and IntoStreamer traits from the code examples
    2. explain why those traits are necessary and how they are being implicitly used (if that's the case) in the code example
    opened by amab8901 0
  • [Help] Does the FST file format work cross-platform?

    First, thank you for this marvelous library. I'm experimenting with fst for my first ever Rust project, and for my use case, fst provides 10x the compression of gzip with a minimal lookup penalty at runtime.

    My question is: does the .fst file created by MapBuilder work across platforms? Ideally I could generate this file just once on my Macbook then distribute it across all of the platforms I want to support: Linux x64, Wasm, ...

    I checked the docs, but I couldn't find a clear answer to this question. But then again, I'm a Rust newbie who's fairly new to systems programming, so perhaps there's an obvious answer I'm just not aware of.

    Thank you for your time!

    (PS: I'm not sure what the right forum is for help questions, so forgive me if I'm in the wrong place.)

    question 
    opened by akprasad 1
  • Improving stream operation performance

    Would it be possible to improve the performance of stream operations in a threaded manner? My use case is creating and comparing FSTs from ~300B strings. The initial merging operations you've provided work well until the final FST merge. Creating the last stream via union takes hours, with only one core utilized. I imagine the same occurs with difference as well.

    I have not familiarized myself with the internals yet, but maybe it is possible (in this case) to partition the keys based on some prefix and batch the work with some synchronization such that the receiver can build the final FST in order? This assumes there is an easy way to partition the prefixes. Maybe identifying eligible partitions could be done traversing and comparing the FST?

    opened by process0 1
  • [fst-bin] Add a commandline option '--delimiter' to `map`.

    This commandline option allows specifying the delimiter between the keys and the values in the input file to the map subcommand.

    The provided delimiter should be a single-byte UTF-8 character. If the value consists of multiple bytes, only the first one is used. This parameter is passed to csv::ReaderBuilder::delimiter, which only allows a single byte as the delimiter.

    Motivation

    This is motivated by a colon used as the separator between the keys and values, fundamentally resulting in issue #146, as I needed to write my own generator binary (where I missed the call to finish). With this option I would not have needed a custom implementation, and could have used the official binary to compare the behaviour.

    opened by jakob-ledermann 0
  • [Help] Possible to build a FST at compile time?

    I have a problem which I think an FST might solve:

    1. 40 million keys, no values
    2. All strings
    3. We know every string at compile time and can guarantee no new strings will be added.
    4. Around 40-50k reads per program run.
    5. We want this to be as fast as possible at runtime, and can sacrifice memory for it.

    Is it possible to build an FST at compile time to achieve (5)?

    opened by bee-san 1
Owner
Andrew Gallant