Fuzzy Index for Python, written in Rust. Works like error-tolerant dict, keyed by a human input.

Related tags

Command-line fuzzdex
Overview

FuzzDex

FuzzDex is a fast Python library, written in Rust. It implements an in-memory fuzzy index that works like an error-tolerant dictionary keyed by a human input.

Algorithm

You load into fuzzdex series of short phrases - like street names consisting of one or multiple words, with an numerical index that identifies this street names in related dictionary of cities.

Then, you can query the index using a must-token (currently only one, but could be altered to use more) and additional should-tokens to read a total of limit of possibly matching phrases.

Must-token is trigramized (warszawa -> war ars rsz sza zaw awa) and all phrases containing given trigrams are initially read from the index. Trigrams have scores, the more common they are, the less they increase the phrase score. Trigrams of should-tokens additionally alter the score (positively when they match), but don't add additional phrases from index. Phrases are then sorted by score.

Top phrases are filtered to contain an optional constraint and the must-token with a maximal editing distance (Levenshtein) until limit of phrases is gathered.

Internally, the results of a must-token search are LRU cached as in practise it's pretty often repeated. Should-tokens vary and they are always recalculated.

Usecases

It was designed to match parts of a user supplied physical addresses to a data extracted from the OpenStreet map - in order to find streets and cities.

Address is first tokenized and then it's parts are matched against fuzzy dictionary of cities and streets. Additional constraints can limit the matched streets only to given city - or finding cities that have a given street.

Data is first searched for using trigrams (warszawa -> war ars rsz sza zaw awa), and then additionally filtered using maximal Levenshtein editing distance.

Original solution used fuzzy query of the Elasticsearch database, which worked - but was 21x slower in our tests.

Example

import fuzzdex
# Create two fuzzy indices with cities and streets.
cities = fuzzdex.FuzzDex()
# Warsaw has streets: Czerniakowska, Nowy Świat and Wawelska
cities.add_phrase("Warsaw", 1, constraints={1, 2, 3})
# Wrocław only Czerniawska
cities.add_phrase("Wrocław", 2, constraints={4})

streets = fuzzdex.FuzzDex()
# Streets with reversed constraints and own indices:
streets.add_phrase("Czerniakowska", 1, constraints={1})
streets.add_phrase("Nowy Świat", 2, constraints={1})
streets.add_phrase("Wawelska", 3, constraints={1})

streets.add_phrase("Czerniawska", 4, constraints={2})

# This recalculates trigram scores and makes index immutable:
cities.finish()
streets.finish()

# warszawa matches warsaw at editing distance 2.
cities.search("warszawa", [], max_distance=2, limit=60)
#    [{'origin': 'Warsaw', 'index': 1, 'token': 'warsaw',
#      'distance': 2, 'score': 200000.0, 'should_score': 0.0}]

# `świat` adds additional should score to the result and places it higher
# in case the limit is set:
streets.search("nowy", ["świat"], max_distance=2, constraint=1)
#    [{'origin': 'Nowy Świat', 'index': 2, 'token': 'nowy',
#      'distance': 0, 'score': 5.999, 'should_score': 7.4999}]

# Won't match with constraint 2.
streets.search("nowy", ["świat"], constraint=2)
#    []

# Quering for `czerniawska` will return `czerniakowska` (no constraints),
# but with a lower score and higher distance:
In [22]: streets.search("czerniawska", [], max_distance=2)
Out[22]:
#  [{'origin': 'Czerniawska', 'index': 4, 'token': 'czerniawska',
#   'distance': 0, 'score': 9.49995231628418, 'should_score': 0.0},
#  {'origin': 'Czerniakowska', 'index': 1, 'token': 'czerniakowska',
#   'distance': 2, 'score': 6.4999680519104, 'should_score': 0.0}]

Installation, development

You can install fuzzdex from PyPI when using one of the architectures it's published for (x86_64, few Python versions).

pip3 install fuzzdex

Or use maturin to build it locally:

pipenv install --dev
pipenv shell
maturin develop -r
pytest

You can also use cargo and copy or link the .so file directly (rename libfuzzdex.so to fuzzdex.so):

cargo build --release
ln -s target/release/libfuzzdex.so fuzzdex.so

build.sh has commands for building manylinux packages for PyPI.

You might also like...
Codemod - Codemod is a tool/library to assist you with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention

Codemod - Codemod is a tool/library to assist you with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention. Codemod was developed at Facebook and released as open source.

A simple, human-friendly, embeddable scripting language

Mica Language reference · Rust API A simple, human-friendly scripting language, developed one feature at a time. Human-friendly syntax inspired by Rub

Human numeric sorting program — does what `sort -h` is supposed to do!

hns — Human Numeric Sort v0.1.0 (⏫︎2022-09-20) © 2022 Fredrick R. Brennan and hns Authors Apache 2.0 licensed, see LICENSE. man page Packages hns_0.1.

Display file sizes in human-readable units

hsize Display file sizes in human-readable units $ hsize 1000 1000000 5000000 1.00 KB 1.00 MB 5.00 MB $ hsize -p 5 1048576 12345678 1.04858 MB 12.345

Experimental extension that brings OpenAI API to your PostgreSQL to run queries in human language.

Postgres ChatGPT Experimental PostgreSQL extension that enables the use of OpenAI GPT API inside PostgreSQL, allowing for queries to be written usi

Skim - Fuzzy Finder in rust!
Skim - Fuzzy Finder in rust!

Life is short, skim! Half of our life is spent on navigation: files, lines, commands… You need skim! It is a general fuzzy finder that saves you time.

Fast TLSH-compatible Fuzzy Hashing Library in pure Rust

fast-tlsh: Fast TLSH-compatible Fuzzy Hashing Library in pure Rust TLSH stands for Trendmicro Locality Sensitive Hash. TLSH can be used to detect simi

Save cli commands and fuzzy find them later
Save cli commands and fuzzy find them later

crow - cli command memorizer What is crow? | Installation | Usage | FAQ What is crow? crow (command row) is a CLI tool to help you memorize CLI comman

Fzf - A command-line fuzzy finder
Fzf - A command-line fuzzy finder

fzf is a general-purpose command-line fuzzy finder. It's an interactive Unix filter for command-line that can be used with any list; files, command hi

Comments
  • Error: index out of bounds

    Error: index out of bounds

    I was happy to find fuzzdex, as I try to identify word lemmas in older text with lots of spelling variation. Unfortunately, I got stuck. My code seems to work, but index searches only work about 1 out 10 times. The remaining attempts fail, with error messages like this;

    pyo3_runtime.PanicException: index out of bounds: the len is 1 but the index is 1

    The trouble-making line: print(former1.search("abbeddom", [], max_distance=3, limit=10))

    Any help is much appreciated.

    Thomas

    opened by thomtro 7
Owner
Tomasz bla Fortuna
I do stuff. Mostly commercially recently, but still occasionally I waste time on small personal projects which made me who I am today.
Tomasz bla Fortuna
Fuzzy a general fuzzy finder that saves you time in rust!

Life is short, skim! Half of our life is spent on navigation: files, lines, commands… You need skim! It is a general fuzzy finder that saves you time.

Jinzhou Zhang 3.7k Jan 8, 2023
Bam Error Stats Tool (best): analysis of error types in aligned reads.

best Bam Error Stats Tool (best): analysis of error types in aligned reads. best is used to assess the quality of reads after aligning them to a refer

Google 54 Jan 3, 2023
This library provides a convenient derive macro for the standard library's std::error::Error trait.

derive(Error) This library provides a convenient derive macro for the standard library's std::error::Error trait. [dependencies] therror = "1.0" Compi

Sebastian Thiel 5 Oct 23, 2023
A multi-page fuzzy launcher for your terminal, written in Rust.

fr33zmenu A multi-page fuzzy launcher for your terminal, written in Rust. Supports theming and multiple keybind schemes, including basic vim keybinds.

null 3 Dec 15, 2022
Local-first high performance codebase index engine designed for AI

CodeIndex CodeIndex is a local-first high performance codebase index engine designed for AI. It helps your LLM understand the structure and semantics

Jipiti AI 9 Aug 30, 2023
A library that creates a terminal-like window with feature-packed drawing of text and easy input handling. MIRROR.

BearLibTerminal provides a pseudoterminal window with a grid of character cells and a simple yet powerful API for flexible textual output and uncompli

Tommy Ettinger 43 Oct 31, 2022
A Windows virtual display driver written in Rust (works with VR, etc)

Virtual Display Driver This is a Windows driver made in Rust which creates a virtual desktop. It has many uses, such as: A private virtual desktop for

Cherry 28 Sep 19, 2023
Rust Imaging Library's Python binding: A performant and high-level image processing library for Python written in Rust

ril-py Rust Imaging Library for Python: Python bindings for ril, a performant and high-level image processing library written in Rust. What's this? Th

Cryptex 13 Dec 6, 2022
This is choose, a human-friendly and fast alternative to cut and (sometimes) awk

Choose This is choose, a human-friendly and fast alternative to cut and (sometimes) awk Features terse field selection syntax similar to Python's list

Ryan Geary 1.4k Jan 7, 2023
Grep with human-friendly search output

hgrep: Human-friendly GREP hgrep is a grep tool to search files with given pattern and print the matched code snippets with human-friendly syntax high

Linda_pp 345 Jan 4, 2023