A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust.

Overview

rBloom

PyPI build license

A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust. It's designed to be as pythonic as possible, mimicking the built-in set type where it can. While it's a new kid on the block (this project was started in 2023), it's also currently the fastest kid on the block by a long shot (see the section Benchmarks). Releases are published on PyPI.

Quickstart

This library defines only one class, which can be used as follows:

>>> from rbloom import Bloom
>>> bf = Bloom(200, 0.01)  # 200 items max, false positive rate of 1%
>>> bf.add("hello")
>>> "hello" in bf
True
>>> "world" in bf
False
>>> bf.update(["hello", "world"])  # "hello" and "world" now in bf
>>> other_bf = Bloom(200, 0.01)

### add some items to other_bf

>>> third_bf = bf | other_bf  # third_bf now contains all items in
                              # bf and other_bf
>>> third_bf = bf.copy()
... third_bf.update(other_bf)  # same as above

Unlike BF's in real life, you can have as many of these as you want, and they all work well together!

For the full API, see the section Documentation.

Installation

On almost all platforms, simply run:

pip install rbloom

If you're on an uncommon platform, this may cause pip to build the library from source, which requires the Rust toolchain. You can also build rbloom by cloning this repository and running maturin:

maturin build --release

This will create a wheel in the target/wheels/ directory, which can subsequently also be passed to pip.

Why rBloom?

Why should you use this library instead of one of the other Bloom filter libraries on PyPI?

  • Simple: Almost all important methods work exactly like their counterparts in the built-in set type.
  • Fast: rbloom is implemented in Rust, which makes it blazingly fast. See section Benchmarks for more information.
  • Lightweight: rbloom has no dependencies of its own.
  • Maintainable: The entire library fits comfortably in a few hundred lines of code, and it's written in idiomatic Rust. Even if I were to stop maintaining rbloom (which I don't intend to), it would be trivially easy for you to fork it and keep it working for you.

I started rbloom because I was looking for a simple Bloom filter dependency for a project, but the pure Python implementations were too slow. The only maintained fast alternative I could find, pybloomfiltermmap3 (which is written in C and is a great library), failed to work on recent versions of Python (see below), so I felt very uncomfortable using it as a dependency. I also felt like the thousands of lines of code in that library were a bit hard to handle should it stop being maintained (which is what happened to the original pybloomfiltermmap). However, please note that pybloomfiltermmap3 implements persistent filters, while rbloom currently does not, so if that's something you require, you should definitely give that library a try.

Benchmarks

I implemented the following simple benchmark in the respective API of each library:

bf = Bloom(10_000_000, 0.01)

for i in range(10_000_000):
    bf.add(i + 0.5)  # floats because ints are hashed as themselves

for i in range(10_000_000):
    assert i + 0.5 in bf

This resulted in the following runtimes:

Library Time Notes
rBloom 5.956s works out-of-the-box
pybloomfiltermmap3 11.280s surprisingly hard to get working [1]
pybloom3 75.871s works out-of-the-box
Flor 128.837s doesn't work on arbitrary objects [2]
bloom-filter2 325.044s doesn't work on arbitrary objects [2]

[1] It refused to install on Python 3.11 and kept segfaulting on 3.10, so I installed 3.7 on my machine for this benchmark.
[2] I tested both converting to bytes and pickling, and chose the faster time.

The benchmark was run on a 2019 Dell XPS 15 7590 with an Intel Core i5-9300H. It was run 5 times for each library, and the average time was used.

Also note that rbloom is compiled against a stable ABI for portability, and that you can get a small but measurable speedup by removing the "abi3-py37" flag from Cargo.toml and building it yourself.

Documentation

This library defines only one class, the signature of which should be thought of as:

class Bloom:

    # expected_items: max number of items to be added to the filter
    # false_positive_rate: max false positive rate of the filter
    # hash_func: optional argument, see section "Cryptographic security"
    def __init__(self, expected_items: int, false_positive_rate: float,
                 hash_func=__builtins__.hash)

    @property
    def size_in_bits(self) -> int  # number of buckets in the filter

    @property
    def hash_func(self) -> Callable[[Any], int]  # retrieve the hash_func
                                                 # given to __init__

    @property
    def approx_items(self) -> float  # estimated number of items in
                                     # the filter

    #                all subsequent methods are
    # -------- equivalent to the corresponding methods ---------
    #                 of the built-in set type

    def add(self, object)

    def __contains__(self, object) -> bool    # object in self

    def __bool__(self) -> bool                # False if empty

    def __repr__(self) -> str                 # Basic info

    def __or__(self, other: Bloom) -> Bloom   # self | other

    def __ior__(self, other: Bloom)           # self |= other

    def __and__(self, other: Bloom) -> Bloom  # self & other

    def __iand__(self, other: Bloom)          # self &= other

    def update(self, *others: Union[Iterable, Bloom])

    def intersection_update(self, *others: Union[Iterable, Bloom])

    def clear(self)                           # remove all items

    def copy(self) -> Bloom

To prevent death and destruction, the bitwise set operations only work on filters where all parameters are equal (including the hash functions being the exact same object). Because this is a Bloom filter, the __contains__ and approx_items methods are probabilistic.

Cryptographic security

Python's built-in hash function is designed to be fast, not maximally collision-resistant, so if your program depends on the false positive rate being perfectly correct, you may want to supply your own hash function. This is especially the case when working with very large filters (more than a few tens of millions of items) or when false positives are very costly and could be exploited by an adversary. Just make sure that your hash function returns an integer between -2^127 and 2^127 - 1. Feel free to use the following example in your own code:

from rbloom import Bloom
from hashlib import sha256
from pickle import dumps

def hash_func(obj):
    h = sha256(dumps(obj)).digest()
    return int.from_bytes(h[:16], "big") - 2**127

bf = Bloom(100_000_000, 0.01, hash_func=hash_func)

When you throw away Python's built-in hash function and start hashing serialized representations of objects, however, you open up a breach into the scary realm of the unpythonic:

  • Numbers like 2, 2.0 and 2 + 0j will suddenly no longer be equal.
  • Instances of classes with custom hashing logic (e.g. to stop caches inside instances from affecting their hashes) will suddenly display undefined behavior.
  • Objects that can't be serialized simply won't be hashable at all.

Making you supply your own hash function in this case is a deliberate design decision intended to show you what you're doing and prevent you from shooting yourself in the foot.


This implementation of a Bloom filter doesn't use multiple hash functions, but instead works by redistributing the entropy of a single hash over multiple integers by using the single hash as the seed of a simple linear congruential generator (LCG). Those integers are then used as indexes for the bit array that makes up the filter. The constant used for the LCG is one proposed by (L'Ecuyer, 1999).

You might also like...
zigfi is an open-source stocks, commodities and cryptocurrencies price monitoring CLI app, written fully in Rust, where you can organize assets you're watching easily into watchlists for easy access on your terminal.
zigfi is an open-source stocks, commodities and cryptocurrencies price monitoring CLI app, written fully in Rust, where you can organize assets you're watching easily into watchlists for easy access on your terminal.

zigfi zigfi is an open-source stocks, commodities and cryptocurrencies price monitoring CLI app, written fully in Rust, where you can organize assets

A fully modular window manager, extremely extensibile and easily approachable.

AquariWM is a fully modular window manager, allowing extreme extensibility while remaining easily approachable. Installation AquariWM is currently in

Fully-typed, async, reusable state management and synchronization for Dioxus 🧬

dioxus-query 🦀 ⚡ Fully-typed, async, reusable state management and synchronization for Dioxus 🧬. Inspired by TanStack Query. ⚠️ Work in progress ⚠️

Create, reorder, group, and focus workspaces easily in i3. Fully configurable with enhanced polybar modules.
Create, reorder, group, and focus workspaces easily in i3. Fully configurable with enhanced polybar modules.

Create, reorder, group, and focus workspaces fast and easily in i3. Features Focus Mode: Eliminate Distractions Enable Focus Mode: Use groups and focu

A parser combinator that is fully statically dispatched and supports both sync/async.

XParse A parser combinator that is fully statically dispatched and supports both sync & async parsing. How to use An example of a very simple JSON par

A super simple but lightweight logging library that tries to capture the most important (status) information.

Hackerlog A super simple but lightweight logging library that tries to capture the most important (status) information. The following is supported: Lo

A lightweight and super fast cli todo program written in rust under 200 sloc
A lightweight and super fast cli todo program written in rust under 200 sloc

todo A lightweight and super fast cli todo program written in rust under 200 sloc installation AUR package: todo-bin use cargo build --release to comp

A fast and lightweight HTTP server implementation in Rust.

server_nano A tiny, fast, and friendly web server written in rust and inspired by express. It uses may to coroutines Usage First, add this to your Car

Cucumber testing framework for Rust. Fully native, no external test runners or dependencies.
Cucumber testing framework for Rust. Fully native, no external test runners or dependencies.

Cucumber testing framework for Rust An implementation of the Cucumber testing framework for Rust. Fully native, no external test runners or dependenci

Comments
  • Misc updates

    Misc updates

    • Update pyo3 to 0.18
    • Make use of the ability to take a Python<'_> first argument, which py03 will automatically pass in
    • Pull out some variables to simplify approx_items calculation
    • Add py03 signatures to methods (these will appear in the help in python)
    • Use PyObject.clone_ref(), which is cheaper as long as we already hold a Python<'_> marker
    • Reduce cloning for or/and implementations, by implementing operators on references as well
    • Reuse a single temp bitset in intersection_update when passed mulitple non-bloom items
    • Add a simple repr implementation (maybe should include more info?)
    • Added an implementation of __bool__, so if my_bloom: works like other python collections (false if empty, otherwise true)
    • Some refactoring of internal BitLine struct to use a type defintion for its "words", and changed to use a vec of usize rather than bytes
    • Simplify checking if hash functions are the same by matching on a tuple
    opened by Dr-Emann 1
  • Implement save/load

    Implement save/load

    Hello, I made a small benchmark and rBloom is about 2x faster than pybloomfiltermmap3, wow!

    Would it be possible to save filter bits to a file and reload it? Access to bits from Python and a new arg in Bloom.new() can be sufficient.

    Thanks! Denis

    opened by denis-angilella 5
Owner
Kenan Hanke
An aspiring computational neuroscientist with a background in software engineering and human medicine
Kenan Hanke
Rust Imaging Library's Python binding: A performant and high-level image processing library for Python written in Rust

ril-py Rust Imaging Library for Python: Python bindings for ril, a performant and high-level image processing library written in Rust. What's this? Th

Cryptex 13 Dec 6, 2022
A lightweight and high-performance order-book designed to process level 2 and trades data. Available in Rust and Python

ninjabook A lightweight and high-performance order-book implemented in Rust, designed to process level 2 and trades data. Available in Python and Rust

Ninja Quant 134 Jul 22, 2024
Yet Another Kalman Filter Implementation. As well as Lie Theory (Lie group and algebra) on SE(3). [no_std] is supported by default.

yakf - Yet Another Kalman Filter Yet Another Kalman Filter Implementation, as well as, Lie Theory (Lie group, algebra, vector) on SO(3), SE(3), SO(2),

null 7 Dec 1, 2022
Rust crate for interacting with the Windows Packet Filter driver.

NDISAPI-RS NDISAPI-RS is a Rust crate for interacting with the Windows Packet Filter driver. It provides an easy-to-use, safe, and efficient interface

Vadim Smirnov 6 Jun 15, 2023
Alexander Mongus is a state-of-the-art filter to sneak amogus characters in pictures

A. Mongus Go to: http://www.lortex.org/amogu/ ??? This is a client-side, Webassembly-based filter to hide amongus characters in your images. Example:

Lucas Pluvinage 3 Apr 16, 2022
Filter, Sort & Delete Duplicate Files Recursively

Deduplicator Find, Sort, Filter & Delete duplicate files Usage Usage: deduplicator [OPTIONS] [scan_dir_path] Arguments: [scan_dir_path] Run Dedupl

Sreedev Kodichath 108 Jan 27, 2023
A tool to filter sites in a FASTA-format whole-genome pseudo-alignment

Core-SNP-filter This is a tool to filter sites (i.e. columns) in a FASTA-format whole-genome pseudo-alignment based on: Whether the site contains vari

Ryan Wick 15 Apr 2, 2023
PyO3 bindings and Python interface to skani, a method for fast fast genomic identity calculation using sparse chaining.

?? ⛓️ ?? Pyskani PyO3 bindings and Python interface to skani, a method for fast fast genomic identity calculation using sparse chaining. ??️ Overview

Martin Larralde 13 Mar 21, 2023
Simple console input macros with the goal of being implemented in the standard library.

Simple console input macros with the goal of being implemented in the standard library.

undersquire 2 Feb 10, 2022
A fast python geohash library created by wrapping rust.

Pygeohash-Fast A Fast geohasher for python. Created by wrapping the rust geohash crate with pyo3. Huge shout out to the georust community :) Currently

Zach Paden 3 Aug 18, 2022