fst

Represent large sets and maps compactly with finite state transducers.

This crate provides a fast implementation of ordered sets and maps using finite state machines. In particular, it makes use of finite state transducers to map keys to values as the machine is executed. Using finite state machines as data structures enables us to store keys in a compact format that is also easily searchable. For example, this crate leverages memory maps to make range queries very fast.

Check out my blog post Index 1,600,000,000 Keys with Automata and Rust for extensive background, examples and experiments.


Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/fst

The regex-automata crate provides implementations of the fst::Automaton trait when its transducer feature is enabled. This permits using DFAs compiled by regex-automata to search the finite state transducers produced by this crate.

Installation

Simply add a corresponding entry to your Cargo.toml dependency list:

[dependencies]
fst = "0.4"

Example

This example demonstrates building a set in memory and executing a fuzzy query against it. Note that it requires fst's levenshtein feature to be enabled in your Cargo.toml.
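In Cargo.toml, that feature is enabled like so:

```toml
[dependencies]
fst = { version = "0.4", features = ["levenshtein"] }
```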

use fst::{IntoStreamer, Set};
use fst::automaton::Levenshtein;

fn main() -> Result<(), Box<dyn std::error::Error>> {
  // A convenient way to create sets in memory.
  let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"];
  let set = Set::from_iter(keys)?;

  // Build our fuzzy query.
  let lev = Levenshtein::new("foo", 1)?;

  // Apply our fuzzy query to the set we built.
  let stream = set.search(lev).into_stream();

  let keys = stream.into_strs()?;
  assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);
  Ok(())
}

Check out the documentation for a lot more examples!

Cargo features

  • levenshtein - Disabled by default. This adds the Levenshtein automaton to the automaton sub-module. This includes an additional dependency on utf8-ranges.
Comments
  • Thread panics on read operations of FST set file

    I'm seeing a very small percentage of "corrupt" FST set files that are triggering panics in Rust (leading to a Python interpreter segfault). The errors look like:

    thread '<unnamed>' panicked at 'index out of bounds: the len is 17498006 but the index is 15336395951936096993', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17
    
    thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17
    
    thread '<unnamed>' panicked at 'index out of bounds: the len is 16285338 but the index is 3532794445444415790', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                      
    

This occurs on approximately 13 out of 4635 files, ranging in size from 20MB to over 100MB. I have not been able to narrow things down further, but wanted to ask: what might cause this?

    I'm shelling out to the fst-bin crate to combine multiple input files into larger files, then doing set operations on the merged output. The fst binary was built on Rust nightly; I'm not sure of the exact version at the moment.

    opened by davidblewett 22
  • Update memmap dependency to fix API incompatibility

The problem is that fst exposes the memmap API through its own API (e.g. impl From<Mmap>). memmap 0.5 is in theory a breaking API change, so Rust doesn't allow other projects depending on fst to update to memmap 0.5.

The proper fix (in case the memmap API still has to be exposed) would probably be to re-export the memmap API from fst (basically push it up the chain, so everybody that uses fst directly or indirectly has to stick to the same memmap version :( ), but I didn't want to make that decision for the project.

I think this also needs a version bump to 0.3.0 (the fst version).

    opened by lilianmoraru 17
  • [Question] Query panics on very large dataset

    I try to run a StartsWith query against a large generated FST and it panics with the following message:

    thread 'main' panicked at 'index out of bounds: the len is 32379027443 but the index is 17000494749432868067', /home/jakob/.cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.4.7/src/raw/node.rs:302:17

    The len matches the filesize of the fst.

    The code for the query is:

    use fst::{
        automaton::{AlwaysMatch, Str},
        Automaton, IntoStreamer, Map, Streamer,
    };
    use memmap::Mmap;
    use std::fs::File;
    
    pub fn main() -> Result<(), Box<dyn std::error::Error>> {
        let mmap = unsafe { Mmap::map(&File::open("pwned-passwords.fst")?)? };
        let map = Map::new(mmap)?;
    
        for arg in std::env::args() {
            let query = Str::new(&arg).starts_with();
            let mut results = map.search(query).into_stream();
            while let Some((key, value)) = results.next() {
                println!(
                    "{}: {:>20}",
                    unsafe { std::str::from_utf8_unchecked(key) },
                    value
                );
            }
        }
    
        Ok(())
    }
    

    So it is basically the sample code on a different data set, and not much more.

    There were no errors reported while generating the fst. To reproduce it, one can follow the procedure below:

    1. download the v8 Version of Pwned-Passwords SHA1 (17GB compressed)
    2. Unpack this file and reorder it (unfortunately the correctly ordered file does not come with the values) (about 34GiB)
    3. Generate an fst from the reordered file (the fst is 30.1 GiB, which feels a little high)

    Oh btw. I have no clue about finite automata and FSTs, but found the blogpost fascinating, so I wanted to play with them a little.

    bug 
    opened by jakob-ledermann 12
  • Request: looser lifetime for loading from byte slice

    Currently, the primary way to load an fst object (Map/Set/Raw) from a byte slice is to use from_static_slice. While that solves the use case of loading data that lives for the lifetime of the program, it appears to be overly-restrictive for other use cases.

    For example, I have fst data that I've written into a network buffer:

    • I do not want to give ownership of that buffer to fst, so using from_bytes won't work.
    • The fst data is not 'static, so from_static_slice won't work.
    • The fst data is a subslice of a greater slice (the network buffer), so to make it work with from_shared_bytes I would need to change my internal APIs to use Arc<Vec<u8>> instead of &[u8] (which may add bugs and performance regressions).

    While I could try to use the existing mmap API, I believe I would need to create a file-like object that wraps my byte slice, and then memory map it, which is unergonomic and error-prone.

    My recommendation is to add a from_slice(bytes: &[u8]) constructor to Map, Set, and Raw. My understanding is that this is unfortunately not trivial, because the FstData type does not take a lifetime parameter, which would be required while it holds a reference to the byte slice.

    Nevertheless, is there interest in adding this feature?

    Thanks!

    enhancement 
    opened by rw 10
  • Two lines of input don't work

    The commandline

    fst map two.csv two.fst

    does work, but the following command:

    fst range -o two.fst

    prints only one single line and the command

    fst dot two.fst | dot -Tpng > two.png

    produces a graph with three nodes - ok. But only a single final node, a wrong output on label 'a' (1) and no output on label 'b'.

    opened by ElBartel 10
  • Update the levenshtein construction algorithm

    I implemented the method used in Fast String Correction with Levenshtein-Automata. Currently the project is available here. https://github.com/tantivy-search/levenshtein-automata

    Building a new automaton requires precomputing a data structure for each type of Levenshtein automaton considered. This precomputation is rather slow, but it is typically amortized after building one automaton. If this is a problem, it is also possible to serialize it and ship it with the library. That's the approach taken by Lucene.

    The benefits compared with the current implementation of fst-levenshtein are as follows:

    • building a DFA is much faster. For instance, for a distance of 2 and the string "Levenshtein", the current algorithm takes 2.5ms while the new implementation takes 0.09ms. Note that 2.5ms is clearly fast enough for most applications, and fst intersection will dominate anyway.

    • the produced DFA is smaller, so we can hope for fewer cache misses. (!?) Microbenchmarking this honestly is obviously very difficult... Going through the DFA above over the string "Levenshtein" went from 72 ns to 8 ns. The difference might also be due to the interface being easier for the compiler to deal with (states are u32s that I pass by value). For a Levenshtein distance of 2 without transposition, the number of states in the DFA (before UTF-8 conversion) is bounded by 30 × (query len + 1)

    • the DFA outputs the actual distance rather than a boolean. This is useful for scoring.

    • Optionally, transposition can be given a cost of 1 (Damerau-Levenshtein).

    • It addresses #38

    The cons as I see them are:

    • It is yet an extra crate
    • Code quality is not as good as yours and bugs might be lurking somewhere.

    Should we switch to this implementation, and how do we proceed?

    If you decide to go with this implementation, the way I see it is: I should publish the project as a separate crate. It is unfortunate to have two repos for Levenshtein automata, but currently fst_levenshtein cannot be used without depending on fst, which is not ideal either. fst_levenshtein could depend on this library and wrap its DFA.

    An alternative solution would be to integrate my code into the fst_levenshtein crate...

    opened by fulmicoton 10
  • Querying a 63GB file took a long time

    I'm using this library for querying a 63GB FST of about 7 billion strings of length 9. With more than enough memory available to completely load/cache the FST, it still took 16 minutes to query only about 15 thousand strings. Then I first copied the files to a tmpfs before doing the exact same thing, and I was able to query more than 300 million strings in only 36 minutes.

    Do you have any idea what might be the problem in the first case or how I would go about bisecting the issue?

    opened by Procrat 10
  • provide reverse mapping from value to key in some cases

    It is sometimes useful to have a String <-> StringOrdinal mapping, where StringOrdinal is the string's position within the set of strings, assuming the strings have been sorted.

    The fst crate makes a very good String -> StringOrdinal mapping. Wouldn't it be possible to get the StringOrdinal -> String direction by binary searching for the correct transition at each node?

    enhancement 
    opened by fulmicoton 9
  • Add a StreamWithState streamer

    This is an attempt to be able to retrieve the Automaton State from a streamer, this streamer will be called StreamWithState.

    Currently the map::Stream struct implements the Streamer trait with an associated Streamer::Item that is (&[u8], u64), in other words the matching string associated with its value.

    The problem is that it is not possible to retrieve the state of the automaton with the current map::Stream type. It could be useful when map::Stream is used with a Levenshtein automaton: the state makes it possible to know the distance of the matching string returned.

    So I propose to add a new struct named StreamWithState that implements the Streamer trait, with the associated type Streamer::Item being (&[u8], u64, A::State) where A: Automaton. This way the user will be able to retrieve the state of the Automaton and therefore the Levenshtein distance of the matching string returned (by querying the automaton).

    The StreamWithState is accessible via the with_state self-consuming method on map::StreamBuilder.

    All of this resolves #60.

    What I have done here is probably a little bit confusing due to the way GitHub shows the diffs. For me, map::StreamWithState is a superset of raw::Stream. It should be possible to implement map::Stream using the other one and just ignore the Automaton::State, so I created a special StreamWithState::next method that accepts a function which will "transform" the A::State.

    opened by Kerollmops 8
  • Fst made of fixed length byte strings: bounded hamming distance query

    I have this problem: 3B k/v pairs where the keys are fixed-length byte strings of either 32, 64, or 128 bits (in a given fst, all keys have the same length though).

    I need to find all values with a key that is within a bounded (small) bit hamming distance of existing keys.

    I was thinking of a simple hamming distance query approach based on "Detecting Near-Duplicates for Web Crawling" at http://gurmeet.net/computer-science/research-papers/

    So say I want a hamming distance of 3 and my keys are 32 bits. At query time, I could build a query Levenshtein fst that would allow any key with up to 3 bytes changed (not added/removed) to still match, e.g. this would match all the keys that are within hamming distance 3 or less.

    This should work, but is this a proper use of the Levenshtein automaton? E.g. would there be a better construction for hamming distance than Levenshtein?

    Since this would be raw bytes, this might not work well with the Unicode support of Levenshtein?

    Would the fact that the first three bytes can be anything, with the last byte still matching, force a full sequential scan of the fst?

    @fulmicoton ping too, as you seem to grok Levenshtein automata quite well

    opened by pombredanne 8
  • key positions in fst

    Is it possible to:

    • return the key position from contains? That is, instead of bool, return Option<usize>, where None means not found and Some(pos) is the position of the key in the lexicographic order of the keys represented by the automaton
    • return a key by its position. That is, implement ops::Index for Set, such that myset[pos] returns the key at pos.

    If the above is possible then more possibilities open up:

    • map from bytes to bytes: "set" fst + int -> byte map (possibly another "set" fst)
    • map from bytes to ints, where the ints are monotonically increasing when the bytes are in lexicographic order (this is very common when building search indices): "set" fst + succinct integer encoding

    wontfix 
    opened by alkis 8
  • Unused traits in code examples

    The code examples on this page import the traits Streamer and IntoStreamer. But apparently, these traits aren't being used after they have been imported.

    I suggest doing one of the following:

    1. remove the imports of Streamer and IntoStreamer traits from the code examples
    2. explain why those traits are necessary and how they are being implicitly used (if that's the case) in the code example
    opened by amab8901 0
  • [Help] Does the FST file format work cross-platform?

    First, thank you for this marvelous library. I'm experimenting with fst for my first ever Rust project, and for my use case, fst provides 10x the compression of gzip with a minimal lookup penalty at runtime.

    My question is: does the .fst file created by MapBuilder work across platforms? Ideally I could generate this file just once on my Macbook then distribute it across all of the platforms I want to support: Linux x64, Wasm, ...

    I checked the docs, but I couldn't find a clear answer to this question. But then again, I'm a Rust newbie who's fairly new to systems programming, so perhaps there's an obvious answer I'm just not aware of.

    Thank you for your time!

    (PS: I'm not sure what the right forum is for help questions, so forgive me if I'm in the wrong place.)

    question 
    opened by akprasad 1
  • Improving stream operation performance

    Would it be possible to improve the performance of stream operations in a threaded manner? My use case is creating and comparing FSTs from ~300B strings. The initial merging operations you've provided work well until the final FST merge. Creating the last stream via union takes hours, with only one core utilized. I imagine the same occurs with difference as well.

    I have not familiarized myself with the internals yet, but maybe it is possible (in this case) to partition the keys based on some prefix and batch the work with some synchronization such that the receiver can build the final FST in order? This assumes there is an easy way to partition the prefixes. Maybe identifying eligible partitions could be done traversing and comparing the FST?

    opened by process0 1
  • [fst-bin] Add a commandline option '--delimiter' to `map`.

    This commandline option allows specifying the delimiter between the keys and the values in the input file to the map subcommand.

    The provided delimiter should be a single-byte UTF-8 character. If the value consists of multiple bytes, only the first one is used. This parameter is passed to csv::ReaderBuilder::delimiter, which only allows a single byte as the delimiter.

    Motivation

    This is motivated by a colon used as the separator between the keys and values, fundamentally resulting in issue #146, as I needed to write my own generator binary (where I missed the call to finish). With this option I would not have needed a custom implementation, and could have used the official binary to compare the behaviour.

    opened by jakob-ledermann 0
  • [Help] Possible to build a FST at compile time?

    I have a problem which I think an FST might solve:

    1. 40 million keys, no values
    2. All strings
    3. We know every string at compile time and can guarantee no new strings will be added.
    4. Around 40-50k reads per program run.
    5. We want this to be as fast as possible at runtime, and can sacrifice memory for it.

    Is it possible to build an FST at compile time to achieve (5)?

    opened by bee-san 1
Owner
Andrew Gallant