jam-rs
Just another minhash (jam) implementation. A high-performance minhash variant to screen extremely large (metagenomic) datasets in a very short timeframe. Implements parts of the ScaledMinHash / FracMinHash algorithm described in sourmash.
Unlike traditional implementations such as sourmash or mash, this version specialises in estimating the containment of small sequences in large sets. It is intended to screen terabytes of data in a few seconds to minutes.
Comparison
- xxhash3 or ahash-fallback (for kmer < 31) instead of murmurhash3
- No Jaccard similarity, since this is meaningless when comparing small embedded sequences against large sets
- (coming soon) optimisations for specificity and sensitivity (and speed) specifically for search of small sequences in assembled metagenomes
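To illustrate why containment is preferred over Jaccard similarity here, the following is a minimal sketch over plain sets of hash values. The function names `containment` and `jaccard` are hypothetical and are not taken from the jam-rs code base; the actual implementation may differ.

```rust
use std::collections::HashSet;

/// Fraction of the query's hashes that also occur in the reference set.
/// (Hypothetical helper, not the jam-rs API.)
fn containment(query: &HashSet<u64>, reference: &HashSet<u64>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    let shared = query.iter().filter(|h| reference.contains(*h)).count();
    shared as f64 / query.len() as f64
}

/// Jaccard similarity: shared hashes divided by the size of the union.
/// (Hypothetical helper, not the jam-rs API.)
fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    let shared = a.intersection(b).count();
    let union = a.len() + b.len() - shared;
    if union == 0 {
        return 0.0;
    }
    shared as f64 / union as f64
}
```

For a query of 10 hashes that is fully embedded in a reference of 10,000 hashes, `containment` returns 1.0 while `jaccard` returns 0.001, so the Jaccard value mostly reflects the size difference rather than the match.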
Scaling methods
Multiple different scaling methods:
- FracMinHash (fscale): Restricts the hash-space to a maximum of scale * u64::MAX
- KmerCountScaling (kscale): Restrict the overall maximum number of hashes to a factor of scale
- MinMaxAbsoluteScaling (nscale): Use a minimum or maximum number of hashes per sequence record
If KmerCountScaling and MinMaxAbsoluteScaling are used together, the minimum number of hashes (per sequence record) will be guaranteed. FracMinHash and KmerCountScaling produce similar results; the first is mainly provided for sourmash compatibility.
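The first two scaling methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the jam-rs implementation: the function names are hypothetical, and it is assumed that KmerCountScaling keeps the smallest hashes (the usual bottom-k convention), which the description above does not confirm.

```rust
/// Largest hash value accepted for a given scale (scale * u64::MAX).
/// (Hypothetical helper, not the jam-rs API.)
fn frac_threshold(scale: f64) -> u64 {
    // Clamp so out-of-range inputs cannot produce a nonsensical threshold.
    (scale.clamp(0.0, 1.0) * u64::MAX as f64) as u64
}

/// FracMinHash-style filter: keep every hash at or below the threshold.
fn frac_min_hash(hashes: &[u64], scale: f64) -> Vec<u64> {
    let max_hash = frac_threshold(scale);
    hashes.iter().copied().filter(|&h| h <= max_hash).collect()
}

/// KmerCountScaling-style filter: keep at most ceil(scale * n) hashes.
/// Assumption: the smallest distinct hashes are the ones retained.
fn kmer_count_scaling(hashes: &[u64], scale: f64) -> Vec<u64> {
    let keep = (hashes.len() as f64 * scale.clamp(0.0, 1.0)).ceil() as usize;
    let mut sorted = hashes.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted.truncate(keep);
    sorted
}
```

With uniformly distributed hashes, both filters retain roughly a `scale` fraction of the input, which is consistent with the note that they produce similar results. The difference is that the FracMinHash threshold is independent of the input size, while the count-based variant depends on how many k-mers were seen.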
Usage
$ jam
Just another minhasher, obviously blazingly fast
Usage: jam [OPTIONS] <COMMAND>
Commands:
sketch Sketches one or more files and writes the result to an output file
merge Merge multiple input sketches into a single sketch
dist Calculate distance of a (small) sketch against one or more sketches as database
help Print this message or the help of the given subcommand(s)
Options:
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help (see more with '--help')
-V, --version Print version
Sketching
The easiest way to sketch files is to use the jam sketch command. It accepts one or more input files (fastx / fastx.gz) or a .list file containing a full list of input files, and sketches all inputs into the specified output sketch file.
$ jam sketch
Sketches one or more files and writes the result to an output file
Usage: jam sketch [OPTIONS] --input <INPUT> --output <OUTPUT>
Options:
-i, --input <INPUT> Input file, directory or file with list of files to be hashed
-o, --output <OUTPUT> Output file
-k, --kmer-size <KMER_SIZE> kmer size all sketches to be compared must have the same size [default: 21]
-s, --scale <SCALE> The estimated scaling factor to apply [default: 0.001]
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
Dist
Calculate the distance of one or more inputs against a large set of database sketches. Optionally specify a minimum cutoff in percent of matching kmers. The output file is optional; if it is not specified, the result is printed to stdout.
$ jam dist
Calculate distance of a (small) sketch against one or more sketches as database. Requires all sketches to have the same kmer size
Usage: jam dist [OPTIONS] --input <INPUT> --database <DATABASE>
Options:
-i, --input <INPUT> Input sketch or raw file
-d, --database <DATABASE> Database sketch(es)
-o, --output <OUTPUT> Output to file instead of stdout
-c, --cutoff <CUTOFF> Cut-off value for similarity [default: 0.0]
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
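Conceptually, screening a query against a database with a cutoff could look like the sketch below. It is only an illustration: the names `percent_matching` and `screen` are hypothetical, sketches are modelled as plain hash sets, and it is assumed that the cutoff is expressed on a 0 to 100 percent scale, which the help text does not state explicitly.

```rust
use std::collections::HashSet;

/// Percentage (0 to 100) of query hashes found in a database sketch.
/// (Hypothetical helper, not the jam-rs API.)
fn percent_matching(query: &HashSet<u64>, db: &HashSet<u64>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    let shared = query.iter().filter(|h| db.contains(*h)).count();
    100.0 * shared as f64 / query.len() as f64
}

/// Returns (name, percent) for every database sketch at or above `cutoff`.
/// Assumption: `cutoff` is given in percent.
fn screen<'a>(
    query: &HashSet<u64>,
    database: &'a [(String, HashSet<u64>)],
    cutoff: f64,
) -> Vec<(&'a str, f64)> {
    database
        .iter()
        .map(|(name, sketch)| (name.as_str(), percent_matching(query, sketch)))
        .filter(|(_, pct)| *pct >= cutoff)
        .collect()
}
```

With the default cutoff of 0.0 every database sketch is reported, since no percentage can fall below zero.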
Merge
Merge multiple sketches into one large one.
$ jam merge
Merge multiple input sketches into a single sketch
Usage: jam merge [OPTIONS] --output <OUTPUT> [INPUTS]...
Arguments:
[INPUTS]... One or more input sketches
Options:
-o, --output <OUTPUT> Output file
-t, --threads <THREADS> Number of threads to use [default: 1]
-f, --force Overwrite output files
-h, --help Print help
License
This project is licensed under the MIT license. See the LICENSE file for more info.
Disclaimer
jam-rs is still in early active development and not ready for production use. Use at your own risk. Once a stable version is released, additional information and installation guidelines will be added.
Credits
This tool is heavily inspired by finch-rs/License and sourmash/License. Check them out if you need a more mature ecosystem with well-tested hash functions and more features.