Finding all pairs of similar documents

This software provides time- and memory-efficient all pairs similarity searches in documents.

Problem definition

Input
- List of documents $D = (d_1, d_2, \dots, d_n)$
- Distance function $\delta: D \times D \rightarrow [0,1]$
- Radius threshold $r \in [0,1]$
Output
- All pairs of similar document ids $R = \{ (i,j): i < j, \delta(d_i, d_j) \leq r \}$

Features

Easy to use

This software supports all essential steps of document similarity search, from feature extraction to output of similar pairs. Therefore, you can immediately try the fast all pairs similarity search using your document files.

Flexible tokenization

You can specify any delimiter when splitting words in tokenization for feature extraction. This can be useful in languages where multiple definitions of words exist, such as Japanese or Chinese.

Time and memory efficiency

The time and memory complexities are linear over the numbers of input documents and output results on the basis of the ideas behind the locality sensitive hashing (LSH) and sketch sorting approach.

Tunable search performance

LSH allows tuning of performance in accuracy, time, and memory, through a manual parameter specifying search dimensions. You can flexibly perform searches depending on your dataset and machine environment.

Specifying lower dimensions allows for faster and rougher searches with less memory usage.
Specifying higher dimensions allows for more accurate searches with more memory usage.

Pure Rust

This software is implemented in Rust, achieving safe and fast performance.

Running example

Here, we describe the basic usage of this software through an example of running the CLI tool.

First of all, install rustc and cargo following the official instructions since this software is implemented in Rust.

1. Data preparation

You have to prepare a text file containing documents line by line (NOT including empty lines).

To produce an example file used throughout this description, you can use scripts/load_nltk_dataset.py that downloads the Reuters Corpus provided by NLTK. Run the following command.

$ ./scripts/load_nltk_dataset.py reuters

reuters.txt will be output.

$ head reuters.txt
hre properties & lt ; hre > 1st qtr jan 31 net shr 38 cts vs 47 cts net 2 , 253 , 664 vs 2 , 806 , 820 gross income 5 , 173 , 318 vs 5 , 873 , 904 note : net includes gains on sale of real estate of 126 , 117 dlrs vs 29 , 812 dlrs .
the firm , however , is supplying temporary financing , and sources close to the transaction disputed the claim that the firm will not end up paying for its equity position . 
conoco , which has completed geological prospecting for the tunisian government , has transferred one third of its option rights in the region to ina , it said .
" willis faber ' s stake in morgan grenfell has been a very successful investment ," it said .
china reports 700 mln dlr two - month trade deficit china ' s trade deficit totalled 700 mln dlrs in the first two months of this year , according to figures released by the state statistics bureau .
the treasury said baker and stoltenberg " are consulting with their g - 7 colleagues and are confident that this will enable them to foster exchange rate stability around current levels ."
u . s . tariffs are due to take effect on april 17 .
some dealers said there were growing signs the united states wanted the dollar to fall further .
since last august smart has been leading talks to open up japan to purchases of more u . s .- made automotive parts .
the resulting association will operate under the name of charter and will be based in bristol .

Fully-duplicate documents in reuters.txt are removed because they are noisy in evaluation of similarity searches. To do this, the output lines are shuffled, and your file will not be the identical to the example.

2. Finding all pairs of similar documents

The workspace find-simdoc-cli provides CLI tools for fast all pairs similarity searches in documents. The approach consists of three steps:

Extract features from documents
- Set representation of character or word ngrams
- Tfidf-weighted vector representation of character or word ngrams
Convert the features into binary sketches through locality sensitive hashing (LSH)
- 1-bit minwise hashing for the Jaccard similarity
- Simplified simhash for the Cosine similarity
Search for similar sketches in the Hamming space using a modified variant of the sketch sorting approach

2.1 Jaccard space

The executable jaccard provides a similarity search in the Jaccard space. You can check the arguments with the following command.

$ cargo run --release -p find-simdoc-cli --bin jaccard -- --help

Run the following command if you want to search for reuters.txt with

search radius 0.1,
tokens of character 5-grams, and
15*64=960 dimensions in the Hamming space.

$ cargo run --release -p find-simdoc-cli --bin jaccard -- -i reuters.txt -r 0.1 -w 5 -c 15 > result-jaccard.csv

Argument -c indicates the number of dimensions in the Hamming space, a trade-off parameter between approximation accuracy and search speed. The larger this value, the higher the accuracy, but the longer the search takes. This section describes how to examine the approximation accuracy for the number of dimensions.

Pairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.

$ head result-jaccard.csv
i,j,dist
191,29637,0.07291666666666667
199,38690,0.0375
274,10048,0.07083333333333333
294,27675,0.04791666666666667
311,13812,0.04583333333333333
361,50938,0.08958333333333333
469,6360,0.035416666666666666
546,10804,0.0875
690,28281,0.0875

2.2 Cosine space

The executable cosine provides a similarity search in the Cosine space. You can check the arguments with the following command.

$ cargo run --release -p find-simdoc-cli --bin cosine -- --help

Run the following command if you want to search for reuters.txt with

search radius 0.1,
tokens of word 3-grams,
word delimiter " " (i.e., a space),
10*64=640 dimensions in the Hamming space, and
weighting using the standard TF and the smoothed IDF.

$ cargo run --release -p find-simdoc-cli --bin cosine -- -i reuters.txt -r 0.1 -d " " -w 3 -c 10 -T standard -I smooth > result-cosine.csv

Pairs of similar documents (indicated by zero-origin line numbers) and their distances are reported.

$ head result-cosine.csv
i,j,dist
542,49001,0.084375
964,24198,0.09375
1872,3024,0.0859375
1872,6823,0.090625
1872,8462,0.0953125
1872,11402,0.090625
1872,18511,0.0859375
1872,41491,0.0875
1872,48344,0.0859375

3. Printing similar documents

The executable dump prints similar documents from an output CSV file.

If you want to print similar documents in reuters.txt with the result result-jaccard.csv, run the following command.

$ cargo run --release -p find-simdoc-cli --bin dump -- -i reuters.txt -s result-jaccard.csv
[i=191,j=29637,dist=0.07291666666666667]
pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock to create two classes of shares with different voting rights .
pending its deliberations , harper and row ' s board has postponed indefinitely a special meeting of stockholders that had been scheduled for april 2 to discuss a proposal to recapitalize the company ' s stock in order to create two classes of shares with different votinmg rights .
[i=199,j=38690,dist=0.0375]
government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment programme and greater emphasis on the preservation of peru ' s export potential .
government officials had no immediate comment on the report , which advised a reduction in the overall size of the public investment program and greater emphasis on the preservation of peru ' s export potential .
[i=274,j=10048,dist=0.07083333333333333]
the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves on to the senate .
the measure was adopted as part of a wide - ranging trade bill that will be considered by the full house in april before it moves onto the senate .
[i=294,j=27675,dist=0.04791666666666667]
the company said the start - up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .
the company said the start up was necessitated by continuing strong demand for aluminum and dwindling worldwide inventories , and that the metal is needed to supply reynolds ' various fabricating businesses .
[i=311,j=13812,dist=0.04583333333333333]
he said in an interview with reuter that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .
he said in an interview with reuters that after a few years it was likely south korea would drop barriers to foreign goods and move toward a more balanced trade position .
[i=361,j=50938,dist=0.08958333333333333]
hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 305 , 000 head versus 307 , 000 week ago and 311 , 000 a year ago .
hog and cattle slaughter guesstimates chicago mercantile exchange floor traders and commission house representatives are guesstimating today ' s hog slaughter at about 295 , 000 to 308 , 000 head versus 305 , 000 week ago and 308 , 000 a year ago .
[i=469,j=6360,dist=0.035416666666666666]
the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one - third of total exports , or about 1 . 5 billion dlrs .
the national planning department forecast that in 1987 coffee , colombia ' s traditional major export , will account for only one third of total exports , or about 1 . 5 billion dlrs .
...

4. Testing the accuracy of 1-bit minwise hashing

LSH is an approximate solution, and you may want to know the accuracy. The executable minhash_acc allows you to examine

the mean absolute error that is the averaged gap between the normalized Hamming distance and the actual Jaccard distance; and
the number of true results, precisions, recalls, and F1-scores for search radii {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.

To use this executable, we recommend extracting a small subset from your dataset because it exactly computes distances for all possible pairs (although the computation is accelerated with parallelization).

$ head -5000 reuters.txt > reuters.5k.txt

You can test the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with -c) with the following command. The arguments for feature extraction are the same as those of jaccard.

$ cargo run --release -p find-simdoc-cli --bin minhash_acc -- -i reuters.5k.txt -w 5 > acc.csv

Approximation accuracy of 1-bit minwise hashing

LSH is an approximate solution, and the number of dimensions in the Hamming space (indicated with the command line argument -c) is related to the approximation accuracy. As a hint for choosing a parameter of -c, we show experimental results obtained from reuters.txt of 51,535 documents when setting -w 5.

Mean absolute error (MAE)

The following figure shows MAEs while varying the number of Hamming dimensions from 64 to 6400 (i.e., the number of chunks from 1 to 100 indicated with -c).

As expected, the larger the number, the higher the accuracy. For example, when the number of dimensions is 1024 (setting the argument -c 16), we achieve the MAE around 2.5%.

Recall

Of the precision, recall, and F1 score, the most interesting would be the recall. This is because false positives can be filtered out in post processing.

The following figure shows recalls in search with radii 0.05, 0.1, and 0.2 (indicated with the argument -r).

For radii 0.1 and 0.2, over 90% recalls are achieved in most cases. For smaller radius 0.05, 75-90% recalls are obtained because the MAE becomes larger relative to the radius.

By the way, the numbers of true results are 50, 201, and 626 for radii 0.05, 0.1, and 0.2, respecitvely.

F1 score

The following figure shows F1 scores in search with radii 0.05, 0.1, and 0.2 (indicated with the argument -r).

For radius 0.05, over 90% scores are achieved from 3520 dimensions (setting -c 55).
For radius 0.1, over 90% scores are achieved from 704 dimensions (setting -c 11).
For radius 0.2, over 90% scores are achieved from 448 dimensions (setting -c 7).

Disclaimer

This software is developed by LegalForce, Inc., but not an officially supported LegalForce product.

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Rapidly Search and Hunt through Windows Event Logs

Rapidly Search and Hunt through Windows Event Logs Chainsaw provides a powerful ‘first-response’ capability to quickly identify threats within Windows

1.8k Dec 31, 2022

An example of web application by using Rust and Axum with Clean Architecture.

stock-metrics Stock price and stats viewer. Getting Started Middleware Launch the middleware by executing docker compose: cd local-middleware docker c

62 Dec 10, 2022

2 and 3-dimensional collision detection library in Rust.

2D Documentation | 3D Documentation | User Guide | Forum ⚠️ **This crate is now passively-maintained. It is being superseded by the Parry project.** ⚠

914 Dec 24, 2022

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Tantivy is a full-text search engine library written in Rust. It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is no

7.5k Jan 9, 2023

Configurable quick search engine shortcuts for your terminal and browser.

Quicksearch Configurable quick search engine shortcuts for your terminal and browser. Installation Run cargo install quicksearch to install Configurat

2 Oct 14, 2022

💰 Midas is a free and open source Moving Average Trading backtest simulator.

Midas is a free and open source Moving Average Trading backtest simulator Bilibili Video: https://www.bilibili.com/video/BV11o4y1B7fL ⚠️ Warning Inves

6 Mar 17, 2023

A Solr 8+ Client for Rust and Python

Solrstice: A Solr 8+ Client for Rust and Python Solrstice is a SolrCloud aware client library written in rust. It also provides a wrapper to python. U

4 Aug 26, 2023

Schema2000 is a tool that parses exsiting JSON documents and tries to derive a JSON schema from these documents.

Schema 2000 Schema2000 is a tool that parses exsiting JSON documents and tries to derive a JSON schema from these documents. Currently, Schema2000 is

12 Dec 6, 2022

C library for finding nearest (most similar) element in a set

VP-tree nearest neighbor search A relatively simple and readable Rust implementation of Vantage Point tree search algorithm. The VP tree algorithm doe

28 Aug 20, 2022

A simple library to get all pairs from any Dex and sync reserves.

pair_sync A simple library to get all pairs from any supported Dex and sync reserves. Crates.io Documentation in progress Filename: examples/sync-pair

116 Dec 30, 2022

Pathfinding library for calculating all node pairs' shortest paths in an unweighted undirected graph.

bit_gossip bit_gossip, named after its implementation technique, is a simple pathfinding library for calculating all node pairs' shortest paths in an

49 Sep 4, 2024

With Dejavu, you can have a perfect memory by capturing and organizing your visual recordings efficiently.

Dejavu The content in README.md is assisted by ChatGPT. Overview Dejavu is an open-source, cross-platform tool designed to help you record and search

127 Jul 31, 2023

Rust and TypeScript example code for finding all members from a collection id.

get-collection-rs Use the Crawler cargo build --release ./target/release/get-collection rpc_node collection_id Example: ./target/release/get-col

16 Dec 22, 2022

Efficiently in-memory log manager

ram-journal Efficiently in-memory log manager Ram journal is a system that considerably reduces disk read and write operations by keeping logs from th

2 Oct 24, 2022

Merge together and efficiently time-sort compressed .pcap files stored in AWS S3 object storage (or locally) to stdout for pipelined processing.

Merge together and efficiently time-sort compressed .pcap files stored in AWS S3 object storage (or locally) to stdout for pipelined processing. High performance and parallel implementation for 10 Gbps playback throughput with large numbers of files (~4k).

4 Aug 19, 2022

Fix English

There was no English usage of "all-pair similarity search", but there is "All pairs similarity search". This PR fixed the usage (and nested listing)

opened by kampersanda 0
Add parallelization and external buffering in minhash_acc for large data

In the previous version, minhash_acc runs in single processing and maintains data in the main memory. However, this executable exactly computes distances for all possible pairs and stores all the results, making evaluation on large datasets impossible.

This PR used rayon and wrote large data in a tmp file. Also, this PR updated the experimental results using a larger dataset.

opened by kampersanda 0
Move lsh module into find-simdoc

Before publishing find-simdoc in crateio, it is necessary to publish the dependencies, all-pairs-hamming and lsh. I will publish all-pairs-hamming, but lsh is not worth publishing as independent library. So, this PR moved the module in lsh into find-simdoc. (and some toml files were formatted)

opened by kampersanda 0
Add progress printing

This PR added mutex progress reporters in parallel processing because such reporters will be helpful to get psychological safety when handling massive data.

opened by kampersanda 0

Finding all pairs of similar documents time- and memory-efficiently

Related tags

Overview

Finding all pairs of similar documents

Problem definition

Features

Easy to use

Flexible tokenization

Time and memory efficiency

Tunable search performance

Pure Rust

Running example

1. Data preparation

2. Finding all pairs of similar documents

2.1 Jaccard space

2.2 Cosine space

3. Printing similar documents

4. Testing the accuracy of 1-bit minwise hashing

Approximation accuracy of 1-bit minwise hashing

Mean absolute error (MAE)

Recall

F1 score

Disclaimer

License

Contribution

You might also like...

Rapidly Search and Hunt through Windows Event Logs

An example of web application by using Rust and Axum with Clean Architecture.

2 and 3-dimensional collision detection library in Rust.

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Configurable quick search engine shortcuts for your terminal and browser.

💰 Midas is a free and open source Moving Average Trading backtest simulator.

A Solr 8+ Client for Rust and Python

Schema2000 is a tool that parses exsiting JSON documents and tries to derive a JSON schema from these documents.

C library for finding nearest (most similar) element in a set

A simple library to get all pairs from any Dex and sync reserves.

Pathfinding library for calculating all node pairs' shortest paths in an unweighted undirected graph.

With Dejavu, you can have a perfect memory by capturing and organizing your visual recordings efficiently.

Rust and TypeScript example code for finding all members from a collection id.

Efficiently in-memory log manager

Merge together and efficiently time-sort compressed .pcap files stored in AWS S3 object storage (or locally) to stdout for pipelined processing.

Doubly-linked list that stores key-node pairs.

hashmap macro for creating hashmap from provided key/value pairs

A simple and lightweight fuzzy search engine that works in memory, searching for similar strings (a pun here).

SegVec data structure for rust. Similar to Vec, but allocates memory in chunks of increasing size.

Comments

Fix English

Add parallelization and external buffering in minhash_acc for large data

Move lsh module into find-simdoc

Add progress printing

Releases(v0.1.1)

v0.1.1(Sep 26, 2022)

v0.1.0(Sep 16, 2022)

Owner

EasyAlgolia is a Rust crate designed for utilizing the Algolia admin client. It simplifies the process of updating and inserting documents into Algolia's search index.

Search through millions of documents in milliseconds ⚡️

🔎 A simple in-memory search for collections and key-value stores.

Shogun search - Learning the principle of search engine. This is the first time I've written Rust.

weggli is a fast and robust semantic search tool for C and C++ codebases. It is designed to help security researchers identify interesting functionality in large codebases.

Represent large sets and maps compactly with finite state transducers.

Lightning Fast, Ultra Relevant, and Typo-Tolerant Search Engine

Perlin: An Efficient and Ergonomic Document Search-Engine

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

A full-text search and indexing server written in Rust.