Finding all pairs of similar documents time- and memory-efficiently

Overview

Finding all pairs of similar documents

Crates.io Documentation Build Status

This software provides time- and memory-efficient all pairs similarity searches in documents.

Problem definition

  • Input
    • List of documents $D = (d_1, d_2, \dots, d_n)$
    • Distance function $\delta: D \times D \rightarrow [0,1]$
    • Radius threshold $r \in [0,1]$
  • Output
    • All pairs of similar document ids $R = \{ (i,j): i < j, \delta(d_i, d_j) \leq r \}$

Features

Easy to use

This software supports all essential steps of document similarity search, from feature extraction to output of similar pairs. Therefore, you can immediately try the fast all pairs similarity search using your document files.

Flexible tokenization

You can specify any delimiter when splitting words in tokenization for feature extraction. This can be useful in languages where multiple definitions of words exist, such as Japanese or Chinese.

Time and memory efficiency

The time and memory complexities are linear over the numbers of input documents and output results on the basis of the ideas behind the locality sensitive hashing (LSH) and sketch sorting approach.

Tunable search performance

LSH allows tuning of performance in accuracy, time, and memory, through a manual parameter specifying search dimensions. You can flexibly perform searches depending on your dataset and machine environment.

  • Specifying lower dimensions allows for faster and rougher searches with less memory usage.
  • Specifying higher dimensions allows for more accurate searches with more memory usage.

Pure Rust

This software is implemented in Rust, achieving safe and fast performance.

Running example

Here, we describe the basic usage of this software through an example of running the CLI tool.

First of all, install rustc and cargo following the official instructions since this software is implemented in Rust.

1. Data preparation

You have to prepare a text file containing documents line by line (NOT including empty lines).

To produce an example file used throughout this description, you can use scripts/load_nltk_dataset.py that downloads the Reuters Corpus provided by NLTK. Run the following command.

$ ./scripts/load_nltk_dataset.py reuters

reuters.txt will be output.

$ head reuters.txt
hre properties & lt ; hre > 1st qtr jan 31 net shr 38 cts vs 47 cts net 2 , 253 , 664 vs 2 , 806 , 820 gross income 5 , 173 , 318 vs 5 , 873 , 904 note : net includes gains on sale of real estate of 126 , 117 dlrs vs 29 , 812 dlrs .
the firm , however , is supplying temporary financing , and sources close to the transaction disputed the claim that the firm will not end up paying for its equity position . 
conoco , which has completed geological prospecting for the tunisian government , has transferred one third of its option rights in the region to ina , it said .
" willis faber ' s stake in morgan grenfell has been a very successful investment ," it said .
china reports 700 mln dlr two - month trade deficit china ' s trade deficit totalled 700 mln dlrs in the first two months of this year , according to figures released by the state statistics bureau .
the treasury said baker and stoltenberg " are consulting with their g - 7 colleagues and are confident that this will enable them to foster exchange rate stability around current levels ."
u . s . tariffs are due to take effect on april 17 .
some dealers said there were growing signs the united states wanted the dollar to fall further .
since last august smart has been leading talks to open up japan to purchases of more u . s .- made automotive parts .
the resulting association will operate under the name of charter and will be based in bristol .

Fully-duplicate documents in reuters.txt are removed because they are noisy in evaluation of similarity searches. To do this, the output lines are shuffled, and your file will not be the identical to the example.

2. Finding all pairs of similar documents

The workspace find-simdoc-cli provides CLI tools for fast all pairs similarity searches in documents. The approach consists of three steps:

  1. Extract features from documents
    • Set representation of character or word ngrams
    • Tfidf-weighted vector representation of character or word ngrams
  2. Convert the features into binary sketches through locality sensitive hashing (LSH)
  3. Search for similar sketches in the Hamming space using a modified variant of the sketch sorting approach

2.1 Jaccard space

The executable jaccard provides a similarity search in the Jaccard space. You can check the arguments with the following command.

$ cargo run --release -p find-simdoc-cli --bin jaccard -- --help

Run the following command if you want to search for reuters.txt with

  • search radius 0.1,
  • tokens of character 5-grams, and
  • 15*64=960 dimensions in the Hamming space.
$ cargo run --release -p find-simdoc-cli --bin jaccard -- -i reuters.txt -r 0.1 -w 5 -c 15 > result-jaccard.csv

Argument -c indicates the number of dimensions in the Hamming space, a trade-off parameter between approximation accuracy and search speed. The larger this value, the higher the accuracy, but the longer the search takes. This section describes how to examine the approximation accuracy for the number of dimensions.

Pairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.

$ head result-jaccard.csv
i,j,dist
191,29637,0.07291666666666667
199,38690,0.0375
274,10048,0.07083333333333333
294,27675,0.04791666666666667
311,13812,0.04583333333333333
361,50938,0.08958333333333333
469,6360,0.035416666666666666
546,10804,0.0875
690,28281,0.0875

2.2 Cosine space

The executable cosine provides a similarity search in the Cosine space. You can check the arguments with the following command.

$ cargo run --release -p find-simdoc-cli --bin cosine -- --help

Run the following command if you want to search for reuters.txt with

  • search radius 0.1,
  • tokens of word 3-grams,
  • word delimiter " " (i.e., a space),
  • 10*64=640 dimensions in the Hamming space, and
  • weighting using the standard TF and the smoothed IDF.
$ cargo run --release -p find-simdoc-cli --bin cosine -- -i reuters.txt -r 0.1 -d " " -w 3 -c 10 -T standard -I smooth > result-cosine.csv

Pairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.

$ head result-cosine.csv
i,j,dist
542,49001,0.084375
964,24198,0.09375
1872,3024,0.0859375
1872,6823,0.090625
1872,8462,0.0953125
1872,11402,0.090625
1872,18511,0.0859375
1872,41491,0.0875
1872,48344,0.0859375

3. Printing similar documents

The executable dump prints similar documents from an output CSV file.

If you want to print similar documents in reuters.txt with the result result-jaccard.csv, run the following command.

$ cargo run --release -p find-simdoc-cli --bin dump -- -i reuters.txt -s result-jaccard.csv
[i=191,j=29637,dist=0.07291666666666667]
pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock to create two classes of shares with different voting rights .
pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock in order to create two classes of shares with different votinmg rights .
[i=199,j=38690,dist=0.0375]
government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment programme and greater emphasis on the preservation of peru ' s export potential .
government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment program and greater emphasis on the preservation of peru ' s export potential .
[i=274,j=10048,dist=0.07083333333333333]
the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves on to the senate .
the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves onto the senate .
[i=294,j=27675,dist=0.04791666666666667]
the company said the start - up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .
the company said the start up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .
[i=311,j=13812,dist=0.04583333333333333]
he said in an interview with reuter that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .
he said in an interview with reuters that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .
[i=361,j=50938,dist=0.08958333333333333]
hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 305 , 000 head versus 307 , 000 week ago and 311 , 000 a year ago .
hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 308 , 000 head versus 305 , 000 week ago and 308 , 000 a year ago .
[i=469,j=6360,dist=0.035416666666666666]
the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one - third of total exports , or about 1 . 5 billion dlrs .
the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one third of total exports , or about 1 . 5 billion dlrs .
...

4. Testing the accuracy of 1-bit minwise hashing

LSH is an approximate solution, and you may want to know the accuracy. The executable minhash_acc allows you to examine

  • the mean absolute error that is the averaged gap between the normalized Hamming distance and the actual Jaccard distance; and
  • the number of true results, precisions, recalls, and F1-scores for search radii {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.

To use this executable, we recommend extracting a small subset from your dataset because it exactly computes distances for all possible pairs (although the computation is accelerated with parallelization).

$ head -5000 reuters.txt > reuters.5k.txt

You can test the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with -c) with the following command. The arguments for feature extraction are the same as those of jaccard.

$ cargo run --release -p find-simdoc-cli --bin minhash_acc -- -i reuters.5k.txt -w 5 > acc.csv

Approximation accuracy of 1-bit minwise hashing

LSH is an approximate solution, and the number of dimensions in the Hamming space (indicated with the command line argument -c) is related to the approximation accuracy. As a hint for choosing a parameter of -c, we show experimental results obtained from reuters.txt of 51,535 documents when setting -w 5.

Mean absolute error (MAE)

The following figure shows MAEs while varying the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with -c).

As expected, the larger the number, the higher the accuracy. For example, when the number of dimensions is 1024 (setting the argument -c 16), we achieve the MAE around 2.5%.

Recall

Of the precision, recall, and F1 score, the most interesting would be the recall. This is because false positives can be filtered out in post processing.

The following figure shows recalls in search with radii 0.05, 0.1, and 0.2 (indicated with the argument -r).

For radii 0.1 and 0.2, over 90% recalls are achieved in most cases. For smaller radius 0.05, 75-90% recalls are obtained because the MAE becomes larger relative to the radius.

By the way, the numbers of true results are 50, 201, and 626 for radii 0.05, 0.1, and 0.2, respecitvely.

F1 score

The following figure shows F1 scores in search with radii 0.05, 0.1, and 0.2 (indicated with the argument -r).

  • For radius 0.05, over 90% scores are achieved from 3520 dimensions (setting -c 55).
  • For radius 0.1, over 90% scores are achieved from 704 dimensions (setting -c 11).
  • For radius 0.2, over 90% scores are achieved from 448 dimensions (setting -c 7).

Disclaimer

This software is developed by LegalForce, Inc., but not an officially supported LegalForce product.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

You might also like...
An example of web application by using Rust and Axum with Clean Architecture.

stock-metrics Stock price and stats viewer. Getting Started Middleware Launch the middleware by executing docker compose: cd local-middleware docker c

2 and 3-dimensional collision detection library in Rust.

2D Documentation | 3D Documentation | User Guide | Forum ⚠️ **This crate is now passively-maintained. It is being superseded by the Parry project.** ⚠

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

Configurable quick search engine shortcuts for your terminal and browser.

Quicksearch Configurable quick search engine shortcuts for your terminal and browser. Installation Run cargo install quicksearch to install Configurat

💰 Midas is a free and open source Moving Average Trading backtest simulator.
💰 Midas is a free and open source Moving Average Trading backtest simulator.

Midas is a free and open source Moving Average Trading backtest simulator Bilibili Video: https://www.bilibili.com/video/BV11o4y1B7fL ⚠️ Warning Inves

A Solr 8+ Client for Rust and Python

Solrstice: A Solr 8+ Client for Rust and Python Solrstice is a SolrCloud aware client library written in rust. It also provides a wrapper to python. U

Schema2000 is a tool that parses exsiting JSON documents and tries to derive a JSON schema from these documents.

Schema 2000 Schema2000 is a tool that parses exsiting JSON documents and tries to derive a JSON schema from these documents. Currently, Schema2000 is

A simple library to get all pairs from any Dex and sync reserves.

pair_sync A simple library to get all pairs from any supported Dex and sync reserves. Crates.io Documentation in progress Filename: examples/sync-pair

C library for finding nearest (most similar) element in a set

VP-tree nearest neighbor search A relatively simple and readable Rust implementation of Vantage Point tree search algorithm. The VP tree algorithm doe

With Dejavu, you can have a perfect memory by capturing and organizing your visual recordings efficiently.

Dejavu The content in README.md is assisted by ChatGPT. Overview Dejavu is an open-source, cross-platform tool designed to help you record and search

Rust and TypeScript example code for finding all members from a collection id.

get-collection-rs Use the Crawler cargo build --release ./target/release/get-collection rpc_node collection_id Example: ./target/release/get-col

Efficiently in-memory log manager

ram-journal Efficiently in-memory log manager Ram journal is a system that considerably reduces disk read and write operations by keeping logs from th

Merge together and efficiently time-sort compressed .pcap files stored in AWS S3 object storage (or locally) to stdout for pipelined processing.

Merge together and efficiently time-sort compressed .pcap files stored in AWS S3 object storage (or locally) to stdout for pipelined processing. High performance and parallel implementation for 10 Gbps playback throughput with large numbers of files (~4k).

Doubly-linked list that stores key-node pairs.

key-node-list Doubly-linked list that stores key-node pairs. KeyNodeList is a doubly-linked list, it uses a hash map to maintain correspondence betwee

hashmap macro for creating hashmap from provided key/value pairs

HashMap Macro Creates a HashMap from provided key/value pairs. Usage use std::collections::HashMap; use hashmap_macro::hashmap; let m: HashMap&str,

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

simsearch A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here). Documentation Usage Add the f

SegVec data structure for rust. Similar to Vec, but allocates memory in chunks of increasing size.

segvec This crate provides the SegVec data structure. It is similar to Vec, but allocates memory in chunks of increasing size, referred to as "segment

Cross-platform GPU-accelerated viewer for the Mandelbrot set and similar (escape-time) fractals

fractal_viewer A cross-platform, GPU-accelerated viewer for the Mandelbrot Set and related fractals. Try it online! Usage Scroll wheel to zoom, click

`fugit` provides a comprehensive library of `Duration` and `Instant` for the handling of time in embedded systems, doing all it can at compile time.

fugit fugit provides a comprehensive library of Duration and Instant for the handling of time in embedded systems, doing all it can at compile time. T

Comments
  • Fix English

    Fix English

    There was no English usage of "all-pair similarity search", but there is "All pairs similarity search". This PR fixed the usage (and nested listing)

    opened by kampersanda 0
  • Add parallelization and external buffering in minhash_acc for large data

    Add parallelization and external buffering in minhash_acc for large data

    In the previous version, minhash_acc runs in single processing and maintains data in the main memory. However, this executable exactly computes distances for all possible pairs and stores all the results, making evaluation on large datasets impossible.

    This PR used rayon and wrote large data in a tmp file. Also, this PR updated the experimental results using a larger dataset.

    opened by kampersanda 0
  • Move lsh module into find-simdoc

    Move lsh module into find-simdoc

    Before publishing find-simdoc in crateio, it is necessary to publish the dependencies, all-pairs-hamming and lsh. I will publish all-pairs-hamming, but lsh is not worth publishing as independent library. So, this PR moved the module in lsh into find-simdoc. (and some toml files were formatted)

    opened by kampersanda 0
  • Add progress printing

    Add progress printing

    This PR added mutex progress reporters in parallel processing because such reporters will be helpful to get psychological safety when handling massive data.

    opened by kampersanda 0
Releases(v0.1.1)
Search through millions of documents in milliseconds ⚡️

a concurrent indexer combined with fast and relevant search algorithms Introduction This repository contains the core engine used in MeiliSearch. It c

MeiliSearch 433 Dec 20, 2022
🔎 A simple in-memory search for collections and key-value stores.

Indicium Search ?? A simple in-memory search for collections (Vec, HashMap, BTreeMap, etc) and key-value stores. Features autocompletion. There are ma

Dylan Bowker 41 Oct 28, 2022
Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

shogun_search Learning the principle of search engine. This is the first time I've written Rust. A search engine written in Rust. Current Features: Bu

Yuxiang Liu 5 Mar 9, 2022
weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.

weggli Introduction weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify int

Google Project Zero 2k Jan 5, 2023
Represent large sets and maps compactly with finite state transducers.

fst This crate provides a fast implementation of ordered sets and maps using finite state machines. In particular, it makes use of finite state transd

Andrew Gallant 1.5k Jan 5, 2023
Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

MeiliSearch Website | Roadmap | Blog | LinkedIn | Twitter | Documentation | FAQ ⚡ Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine ?? M

MeiliSearch 31.6k Dec 31, 2022
Perlin: An Efficient and Ergonomic Document Search-Engine

Table of Contents 1. Perlin Perlin Perlin is a free and open-source document search engine library build on top of perlin-core. Since the first releas

CurrySoftware GmbH 70 Dec 9, 2022
Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

tantivy 7.4k Dec 28, 2022
A full-text search and indexing server written in Rust.

Bayard Bayard is a full-text search and indexing server written in Rust built on top of Tantivy that implements Raft Consensus Algorithm and gRPC. Ach

Bayard Search 1.8k Dec 26, 2022
Rapidly Search and Hunt through Windows Event Logs

Rapidly Search and Hunt through Windows Event Logs Chainsaw provides a powerful ‘first-response’ capability to quickly identify threats within Windows

F-Secure Countercept 1.8k Dec 31, 2022