k-mer counter in Rust using the rust-bio and rayon crates


krust is a command-line k-mer counter written in Rust. It outputs canonical k-mers and their frequencies across the records in a fasta file.

Run krust on the test data in the krust GitHub repo, searching for k-mers of length 5, like this:
$ cargo run --release 5 cerevisae.pan.fa > output.tsv
or, searching for k-mers of length 21:
$ cargo run --release 21 cerevisae.pan.fa > output.tsv

krust prints to stdout, writing, on alternate lines:
{canonical k-mer}
krust uses rust-bio, rayon, and dashmap.
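
For illustration, here is a minimal std-only sketch of canonical k-mer counting. It is not krust's actual implementation (which uses rust-bio, rayon, and dashmap). It assumes the usual definition of a canonical k-mer as the lexicographically smaller of a k-mer and its reverse complement, and it assumes that windows containing bytes other than A, C, G, or T are skipped. The function names are hypothetical.

```rust
use std::collections::HashMap;

/// Reverse complement of a DNA sequence; returns None on a non-ACGT byte.
fn reverse_complement(seq: &[u8]) -> Option<Vec<u8>> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None,
        })
        .collect()
}

/// Count canonical k-mers in one sequence.
/// Assumption: windows containing a non-ACGT byte are skipped.
fn count_canonical_kmers(seq: &[u8], k: usize) -> HashMap<Vec<u8>, u32> {
    let mut counts = HashMap::new();
    if k == 0 {
        // slice::windows panics on a window size of zero.
        return counts;
    }
    for window in seq.windows(k) {
        if let Some(rc) = reverse_complement(window) {
            // The canonical form is the lexicographically smaller strand.
            let canonical = if rc.as_slice() < window { rc } else { window.to_vec() };
            *counts.entry(canonical).or_insert(0) += 1;
        }
    }
    counts
}
```

Calling count_canonical_kmers(b"ACGTT", 2) counts AC twice, because GT and AC are reverse complements of each other and collapse onto the same canonical k-mer.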

A function like fn single_sequence_canonical_kmers(filepath: String, k: usize) {}
would return k-mer counts for individual sequences in a fasta file.
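
One way such a function could look, as a rough std-only sketch rather than krust's own code. To keep it self-contained, this version takes any buffered reader instead of a filepath, parses fasta by hand instead of using rust-bio, upper-cases the sequence, and skips k-mers with non-ACGT bytes. All of those choices, the return type, and the helper canonical are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

type Counts = HashMap<Vec<u8>, u32>;

/// Canonical form of a k-mer: the smaller of it and its reverse complement.
/// Returns None if the k-mer contains a byte other than A, C, G, or T.
fn canonical(kmer: &[u8]) -> Option<Vec<u8>> {
    let rc: Option<Vec<u8>> = kmer
        .iter()
        .rev()
        .map(|&b| match b {
            b'A' => Some(b'T'),
            b'C' => Some(b'G'),
            b'G' => Some(b'C'),
            b'T' => Some(b'A'),
            _ => None,
        })
        .collect();
    rc.map(|rc| if rc.as_slice() < kmer { rc } else { kmer.to_vec() })
}

/// Reads fasta text and returns one (record id, k-mer counts) pair per record.
fn single_sequence_canonical_kmers<R: BufRead>(
    reader: R,
    k: usize,
) -> io::Result<Vec<(String, Counts)>> {
    // First pass: collect (id, sequence) pairs, joining wrapped sequence lines.
    let mut records: Vec<(String, Vec<u8>)> = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let line = line.trim_end();
        if let Some(header) = line.strip_prefix('>') {
            // The id is the first whitespace-delimited token of the header.
            let id = header.split_whitespace().next().unwrap_or("").to_string();
            records.push((id, Vec::new()));
        } else if let Some((_, seq)) = records.last_mut() {
            // Assumption: lower-case bases are treated like upper-case ones.
            seq.extend(line.bytes().map(|b| b.to_ascii_uppercase()));
        }
    }
    // Second pass: count canonical k-mers separately for each record.
    Ok(records
        .into_iter()
        .map(|(id, seq)| {
            let mut counts = Counts::new();
            if k > 0 {
                for kmer in seq.windows(k).filter_map(|w| canonical(w)) {
                    *counts.entry(kmer).or_insert(0) += 1;
                }
            }
            (id, counts)
        })
        .collect())
}
```

With a file on disk, the reader could be a std::io::BufReader wrapped around std::fs::File::open(filepath)?.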

  • needletail rather than bio to accelerate fasta parsing

    Hello Team,

    It seems needletail is much faster than bio for fasta file parsing. For larger fasta files, parsing can also be parallelized. Is this doable?



    opened by jianshu93 3
  • Explore using the bytes crate

    The biggest feature it adds over Vec<u8> is shallow cloning. In other words, calling clone() on a Bytes instance does not copy the underlying data. Instead, a Bytes instance is a reference-counted handle to some underlying data. The Bytes type is roughly an Arc<Vec<u8>> but with some added capabilities.

    opened by suchapalaver 1
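
The shallow-cloning behaviour described in this issue can be sketched with the standard library alone. The example below uses Arc<[u8]> as a stand-in for Bytes and does not use the bytes crate itself; the function name share is hypothetical.

```rust
use std::sync::Arc;

/// Returns a second handle to the same sequence buffer without copying it.
fn share(seq: &Arc<[u8]>) -> Arc<[u8]> {
    // Arc::clone only increments a reference count; the bytes are not duplicated.
    Arc::clone(seq)
}
```

Both handles point at the same allocation, which Arc::ptr_eq can confirm. This is the property that would let several worker threads read one sequence buffer without each holding its own copy.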
  • speed up by changing the utf8 processing, reverse-comp, and storage

    Change the UTF-8 processing of the k-mers. The k-mer iterator itself should check that it yields valid k-mers while iterating. Also, instead of storing the reverse complement in heap-allocated strings, you can make a lazy reverse-complemented object. Alternatively, store the k-mers in u64: one of the reasons for using k-mers in the first place is that they can be packed into machine integers for speed.

    opened by suchapalaver 0
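
A rough sketch of the u64 alternative from this issue, assuming a 2-bit encoding (A=0, C=1, G=2, T=3) and k no larger than 32. The encoding and the function names are choices made for this example, not something krust defines.

```rust
/// Packs a k-mer (k <= 32) into a u64 using 2 bits per base: A=0, C=1, G=2, T=3.
/// Returns None for an invalid base or a k-mer that does not fit in 64 bits.
fn pack(kmer: &[u8]) -> Option<u64> {
    if kmer.len() > 32 {
        return None;
    }
    let mut packed = 0u64;
    for &base in kmer {
        let code = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None,
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}

/// Reverse complement of a packed k-mer, computed without any heap allocation.
fn reverse_complement(packed: u64, k: usize) -> u64 {
    let mut rest = packed;
    let mut rc = 0u64;
    for _ in 0..k {
        // With this encoding the complement of a base code c is 3 - c.
        rc = (rc << 2) | (3 - (rest & 3));
        rest >>= 2;
    }
    rc
}

/// Canonical packed k-mer: the numerically smaller of the two strands.
fn canonical(packed: u64, k: usize) -> u64 {
    packed.min(reverse_complement(packed, k))
}
```

Because the first base occupies the most significant bits, numeric order on packed k-mers of equal length matches lexicographic order on the bases, so taking the minimum picks the same strand a string comparison would. Validity checking also happens once, inside pack, instead of in a separate UTF-8 step.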
  • It would be nice to have a better error message when you call it from the command line with the wrong arguments

    Something that explains which arguments to pass in. When the output directory exists, you just write "File exists", which is very confusing if you have an unrelated file called "output".

    opened by suchapalaver 0
  • avoid panicking at all in your library code

    If anyone wants to import your function they won't be happy with something that crashes the whole application when it fails. You can panic in the executable portion of the program though.

    opened by suchapalaver 0
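
A small sketch of what this issue suggests: the library function reports failures through Result, and only the executable decides whether to exit. The KmerError type and the validate function are hypothetical names invented for the example.

```rust
use std::fmt;

/// Hypothetical error type for the library portion of a k-mer counter.
#[derive(Debug, PartialEq)]
enum KmerError {
    ZeroLength,
    InvalidBase { base: char, position: usize },
}

impl fmt::Display for KmerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            KmerError::ZeroLength => write!(f, "k-mer length must be greater than zero"),
            KmerError::InvalidBase { base, position } => {
                write!(f, "invalid base '{}' at position {}", base, position)
            }
        }
    }
}

impl std::error::Error for KmerError {}

/// Library-style function: reports problems through Result instead of panicking.
fn validate(seq: &[u8], k: usize) -> Result<(), KmerError> {
    if k == 0 {
        return Err(KmerError::ZeroLength);
    }
    // Find the first byte that is not one of A, C, G, or T.
    match seq.iter().position(|b| !b"ACGT".contains(b)) {
        Some(position) => Err(KmerError::InvalidBase {
            base: seq[position] as char,
            position,
        }),
        None => Ok(()),
    }
}
```

In the binary, main can match on the returned error, print it with eprintln!, and call std::process::exit(1). A caller importing the library is free to recover instead.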
  • change the Config struct member kmer_len to be a usize

    Rather than do let kmer_len = config.kmer_len.parse::<usize>().unwrap();, I would instead change the Config struct member kmer_len to be a usize, and perform parsing while constructing Config - Config::new already returns Result.

    opened by suchapalaver 0
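
A sketch of how parsing during construction might look. The issue only confirms that Config has a kmer_len member and that Config::new returns Result; the filepath field, the String error type, the argument layout, and the usage text are assumptions made for this example.

```rust
/// Hypothetical Config that parses the k-mer length once, at construction.
#[derive(Debug, PartialEq)]
struct Config {
    kmer_len: usize,
    filepath: String,
}

impl Config {
    /// Builds a Config from command-line arguments (args[0] is the program name).
    fn new(args: &[String]) -> Result<Config, String> {
        if args.len() != 3 {
            // Assumed usage text; it tells the caller which arguments to pass.
            return Err("usage: krust <kmer_len> <fasta_file>".to_string());
        }
        let kmer_len = args[1]
            .parse::<usize>()
            .map_err(|e| format!("invalid k-mer length '{}': {}", args[1], e))?;
        if kmer_len == 0 {
            return Err("k-mer length must be greater than zero".to_string());
        }
        Ok(Config {
            kmer_len,
            filepath: args[2].clone(),
        })
    }
}
```

Code that uses the Config can then read config.kmer_len directly as a usize, with no unwrap at the point of use.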
  • speed up using hashmaps

    Writing a line per k-mer is inefficient and rarely needed. It would be much better to return a vector of k-mer hashmaps. Alternatively, make a hashmap of n -> m pairs, where n is the number of times some k-mer has been seen and m is the number of distinct k-mers seen n times.

    opened by suchapalaver 0
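
The n -> m map from this last issue could be built from an existing k-mer count table roughly as follows. The function name count_histogram and the concrete key and value types are assumptions for the example.

```rust
use std::collections::HashMap;

/// Builds an n -> m map: m distinct k-mers were each seen exactly n times.
fn count_histogram(counts: &HashMap<Vec<u8>, u32>) -> HashMap<u32, usize> {
    let mut histogram = HashMap::new();
    for &n in counts.values() {
        // Each distinct k-mer contributes one to the bucket for its frequency.
        *histogram.entry(n).or_insert(0) += 1;
    }
    histogram
}
```

The result has one entry per distinct frequency rather than one per k-mer, so it is typically far smaller than the full count table.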
