Shared k-mer content between two genomes

Overview

skc

skc is a simple tool for finding shared k-mer content between two genomes.

Installation

Prebuilt binary

curl -sSL skc.mbh.sh | sh
# or with wget
wget -nv -O - skc.mbh.sh | sh

You can also pass options to the install script:

$ curl -sSL skc.mbh.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of skc, if skc is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/skc/releases]

        -h, --help
                Display this help message

Cargo

cargo install skc

Conda

conda install skc

Local

cargo build --release
./target/release/skc --help

Usage

Check for shared 16-mers between the HIV-1 genome and the Mycobacterium tuberculosis genome.

$ skc -k 16 NC_001802.1.fa NC_000962.3.fa
[2023-06-20T01:46:36Z INFO ] 9079 unique k-mers in target
[2023-06-20T01:46:38Z INFO ] 2 shared k-mers between target and query
>4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106
TGCAGAACATCCAGGG
>4237062597 tcount=1 qcount=1 tpos=NC_001802.1:8415 qpos=NC_000962.3:629482
CCAGCAGCAGATAGGG

The output shows two shared 16-mers between the genomes. By default, shared k-mers are written to stdout in FASTA format; use the -o option to write them to a file.
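Conceptually, the search above can be pictured as indexing every k-mer in the target and then scanning the query for matches. The following is a minimal Python sketch of that idea, not skc's actual implementation: the function names are hypothetical, sequences are plain strings rather than FASTA files, and k-mers are kept as strings instead of packed integers.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer in seq to a list of its 1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)
    return positions


def shared_kmers(target, query, k):
    """Return {kmer: (target positions, query positions)} for k-mers found in both."""
    # Index the target once; only this index is held in memory.
    target_index = kmer_positions(target, k)
    shared = {}
    # Stream over the query and record every k-mer that is also in the target.
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        if kmer in target_index:
            entry = shared.setdefault(kmer, (target_index[kmer], []))
            entry[1].append(i + 1)
    return shared


if __name__ == "__main__":
    # Toy sequences, purely for illustration.
    for kmer, (tpos, qpos) in shared_kmers("ACGTACGT", "TTACGTA", k=4).items():
        print(kmer, f"tcount={len(tpos)}", f"qcount={len(qpos)}", tpos, qpos)
```

Because only the target is indexed, the memory cost in this sketch scales with the number of distinct k-mers in the first sequence.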

FASTA description

Example: >4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106

The ID (4233642782) is the 64-bit integer representation of the k-mer's value in bit-space (see Daniel Liu's brilliant cute-nucleotides repository for more information). tcount and qcount are the number of times the k-mer occurs in the target and query genomes, respectively. tpos and qpos are the (1-based) starting position(s) of the k-mer within the target and query contigs; these are comma-separated if the k-mer occurs multiple times.
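For downstream scripting, a description line can be split into its fields with a few lines of Python. This is a sketch rather than part of skc: `parse_description` is a hypothetical helper, and it assumes that when a k-mer occurs multiple times each comma-separated entry keeps the contig:position form shown in the example.

```python
def parse_description(header):
    """Parse one skc FASTA description line into a dictionary."""
    kmer_id, *fields = header.lstrip(">").split()
    record = {"id": int(kmer_id)}
    for field in fields:
        key, value = field.split("=", 1)
        if key in ("tcount", "qcount"):
            record[key] = int(value)
        elif key in ("tpos", "qpos"):
            # Assumption: every entry looks like "contig:position".
            record[key] = [
                (contig, int(pos))
                for contig, pos in (item.rsplit(":", 1) for item in value.split(","))
            ]
        else:
            # Keep any unrecognised field as a raw string.
            record[key] = value
    return record


if __name__ == "__main__":
    line = ">4233642782 tcount=1 qcount=1 tpos=NC_001802.1:739 qpos=NC_000962.3:4008106"
    print(parse_description(line))
```

Splitting on the last colon (`rsplit`) keeps contig names intact even if they contain a colon themselves.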

Usage help

$ skc --help
Shared k-mer content between two genomes

Usage: skc [OPTIONS] <TARGET> <QUERY>

Arguments:
  <TARGET>
          Target sequence

          Can be compressed with gzip, bzip2, xz, or zstd

  <QUERY>
          Query sequence

          Can be compressed with gzip, bzip2, xz, or zstd

Options:
  -k, --kmer <KMER>
          Size of k-mers (max. 32)

          [default: 21]

  -o, --output <OUTPUT>
          Output filepath(s); stdout if not present

  -O, --output-type <u|b|g|l|z>
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; z: Zstd

          Output compression format is automatically guessed from the filename extension. This option is used to override that

          [default: u]

  -l, --compress-level <INT>
          Compression level to use if compressing output

          [default: 6]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Caveats

  • Make the first genome passed (<TARGET>) the smaller genome. This reduces memory usage, as all unique k-mers (well, their u64 values) for this genome are held in memory.
  • We do not use canonical k-mers, so a k-mer and its reverse complement are treated as distinct k-mers.
  • 32 is the largest k-mer size that can be used. This is basically a (lazy) implementation decision, but it also helps to keep the memory footprint as low as possible. If you want larger k-mer sizes, I would suggest checking out some of the alternate tools listed below.
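The limit of 32 follows from packing each base into 2 bits of a 64-bit integer. The Python sketch below illustrates this. The code table (A=0, C=1, T=2, G=3) and the bit order (first base in the lowest two bits) are inferred from the two IDs in the example output in the Usage section, which they reproduce; whether skc or cute-nucleotides computes the value this way internally is an assumption, and the function names are hypothetical.

```python
# 2-bit code per base. Mapping and bit order are inferred from the example
# IDs in the Usage section, not taken from skc's source code.
CODES = {"A": 0, "C": 1, "T": 2, "G": 3}
BASES = "ACTG"  # index into this string = 2-bit code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode(kmer):
    """Pack a k-mer into an integer, first base in the lowest two bits."""
    if len(kmer) > 32:
        raise ValueError("a 64-bit integer holds at most 32 bases at 2 bits each")
    value = 0
    for i, base in enumerate(kmer.upper()):
        value |= CODES[base] << (2 * i)
    return value


def decode(value, k):
    """Unpack an integer produced by encode() back into a k-mer of length k."""
    return "".join(BASES[(value >> (2 * i)) & 3] for i in range(k))


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase or lowercase ACGT string."""
    return kmer.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    kmer = "TGCAGAACATCCAGGG"
    print(encode(kmer))                      # 4233642782, as in the example output
    print(encode(reverse_complement(kmer)))  # a different value: no canonical form
```

A canonical scheme would store, for example, the smaller of the two encoded values for a k-mer and its reverse complement; in this sketch, as in the caveat above, the two stay separate.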

Alternate tools

skc does not claim to be the fastest or most memory-efficient tool to find shared k-mer content. I wrote it because some alternate tools were hard to install, clunky or verbose, or made it laborious to get shared k-mers out of the results (e.g. they can only search one k-mer at a time, or require running many different subcommands). Here is a (non-exhaustive) list of other tools that can be used to get shared k-mer content:

  • unikmer - this was brought to my attention after I wrote skc. Had I known about it beforehand, I probably wouldn't have written skc. So I would recommend unikmer for almost all use cases - Wei Shen writes awesome tools
  • Jellyfish
  • REINDEER
  • kmer-db
  • GGCAT
  • KAT

Acknowledgements

Daniel Liu's brilliant cute-nucleotides repository is used to (rapidly) convert k-mers into 64-bit integers.

Owner
Michael Hall
Postdoc @ University of Melbourne with @lachlancoin. Bioinformatics | TB | Nanopore | Microbial Genomics | Software Dev.