Compute a pairwise SNP distance matrix from one or two alignment(s)

Overview

codecov Rust CI License: MIT github release version

Compute a pairwise SNP distance matrix from one or two alignment(s)

Table of Contents

Motivation

A key point of difference for psdm is the pairwise SNP distance between two alignment files. This is particularly beneficial if you have computed a SNP distance matrix for many samples already and want to update the distances with some new samples - without rerunning the analysis for all samples in the original file.

Another potential use is having two alignment files for the same samples but with sequences generated by different techniques. For example, if you have produced consensus sequences from SNP calls from Illumina and Nanopore and want to see how similar the Illumina-to-Nanopore (inter-technology) distances are - compared to the intra-technology distances.

Despite these motivations, psdm can still be used to compute a "traditional" pairwise SNP distance matrix for a single FASTA alignment file.

Install

cargo

Crates.io

Prerequisite: rust toolchain (min. v1.55.0)

$ cargo install psdm

conda

Conda (channel only) bioconda version

Prerequisite: conda (and bioconda channel correctly set up)

$ conda install psdm

Precompiled binaries

tl;dr: Run the following snippet to download the latest binary for your system to the current directory and show the help menu.

version="0.1.0"
OS=$(uname -s)                                                                                                       
if [ "$OS" = "Linux" ]; then                                                                                         
    triple="x86_64-unknown-linux-musl"                                                                              
elif [ "$OS" = "Darwin" ]; then                                                                                        
    triple="x86_64-apple-darwin"                                                         
else                                                      
    echo "ERROR: $OS not a recognised operating system"
fi              
if [ -n "$triple" ]; then   
    URL="https://github.com/mbhall88/psdm/releases/download/${version}/psdm-${version}-${triple}.tar.gz"
    wget "$URL" -O - | tar -xzf -
    ./psdm --help             
fi

These binaries do not require that you have the rust toolchain installed.

Currently, there are two pre-compiled binaries available:

  • Linux kernel x86_64-unknown-linux-musl (works on most Linux distributions I tested)
  • OSX kernel x86_64-apple-darwin (works for any post-2007 Mac)

An example of downloading one of these binaries using wget

$ version="0.1.0"
$ URL="https://github.com/mbhall88/psdm/releases/download/${version}/psdm-${version}-x86_64-unknown-linux-musl.tar.gz"
$ wget "$URL" -O - | tar -xzf -
$ ./psdm --help

If these binaries do not work on your system please raise an issue and I will potentially add some additional target triples.

homebrew

Prerequisite: homebrew

The homebrew installation is done via the homebrew-bio tap.

$ brew install brewsci/bio/psdm

Container

Docker images are hosted at quay.io.

singularity

Prerequisite: singularity

$ URI="docker://quay.io/mbhall88/psdm"
$ singularity exec "$URI" psdm --help

The above will use the latest version. If you want to specify a version then use a tag (or commit) like so.

$ VERSION="0.1.0"
$ URI="docker://quay.io/mbhall88/psdm:${VERSION}"

docker

Docker Repository on Quay

Prerequisite: docker

$ docker pull quay.io/mbhall88/psdm
$ docker run quay.io/mbhall88/psdm psdm --help

You can find all the available tags on the quay.io repository.

Local

Prerequisite: rust toolchain

$ git clone https://github.com/mbhall88/psdm.git
$ cd psdm
$ cargo build --release
$ target/release/psdm --help
# if you want to check everything is working ok
$ cargo test

Usage

Quick

Single alignment file

aln1.fa

>s1
ABCDEFGH
>s2
aBN-XFnH
>s0
AbCdEfG-
$ psdm aln1.fa
,s1,s2,s0
s1,0,3,3
s2,3,0,5
s0,3,5,0

Two alignment files

aln2.fa.gz

>s2
xXNNfoo=
>s5 description
AB-DEFGG
$ psdm aln1.fa aln2.fa.gz
,s1,s2,s0
s2,6,6,5
s5,1,4,3

The column names represent the first alignment file provided.

Full

I'd like the sequences to be sorted by identifier in the output

$ psdm -s aln1.fa
,s0,s1,s2
s0,0,3,5
s1,3,0,3
s2,5,3,0

I want a tab-delimited (TSV) matrix instead of a comma-delimited (CSV) one

$ psdm -d "\t" aln1.fa
        s1      s2      s0
s1      0       3       3
s2      3       0       5
s0      3       5       0

Ignore the case of nucleotides - i.e., acgt is the same as ACGT

$ psdm -i aln1.fa
,s1,s2,s0
s1,0,1,0
s2,1,0,1
s0,0,1,0

By default, psdm ignores N's and gaps (-). However, maybe you also want to ignore X's

$ psdm -e NX- aln1.fa
,s1,s2,s0
s1,0,2,3
s2,2,0,4
s0,3,4,0

Or maybe you don't want to ignore anything

$ psdm -e "" aln1.fa
,s1,s2,s0
s1,0,5,4
s2,5,0,8
s0,4,8,0

I'm impatient, use all the threads I have!

$ psdm -t 0 aln1.fa

Give me long-form output instead of a matrix

$ psdm -l aln1.fa
s1,s1,0
s1,s2,3
s1,s0,3
s2,s1,3
s2,s2,0
s2,s0,5
s0,s1,3
s0,s2,5
s0,s0,0

I'd like to know the progress of the pairwise comparisons

$ psdm -P big.aln.fa
[2599/11476 (23%) comparisons] █████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ [ETA 00:00:28]

Write the matrix to a file please

$ psdm -o dists.csv aln1.fa
Delimiting character for the output table [default: ,] -e, --ignored-chars String of characters to ignore - e.g., `-e N-` -> dist(A, N) = 0 and dist(A, -) = 0 Note, this option is applied *after* `--ignore-case` - i.e., if using `--ignore-case`, only the uppercase form of a character is needed. To not ignore any characters, use `-e ''` or `-e ""` [default: N-] -o, --output Output file name [default: stdout] -t, --threads Number of threads to use. Setting to 0 will use all available [default: 1] ARGS: ... FASTA alignment file(s) to compute the pairwise distance for. Providing two files will compute the distances for all sequences in one file against all sequences from the other file - i.e., not between sequences in the same file. The first file will be the column names, while the second is the row names. The alignment file(s) can be compressed. ">
$ psdm --help
psdm 0.1.0
Compute a pairwise SNP distance matrix from one or two alignment(s)

USAGE:
    psdm [FLAGS] [OPTIONS] 
       
        ...

FLAGS:
    -h, --help
            Prints help information

    -i, --ignore-case
            Ignore case - i.e., dist(a, A) = 0

    -l, --long
            Output as long-form ("melted") table

            By default the output is a N x N or N x M table where N is the number of sequences in the first alignment
            and M is the number of sequences in the (optional) second alignment.
    -q, --quiet
            No logging (except progress info if `-P` is given)

    -P, --progress
            Show a progress bar

    -s, --sort
            Sort the alignment(s) by ID

    -V, --version
            Prints version information


OPTIONS:
    -d, --delim 
        
         
            Delimiting character for the output table [default: ,]

    -e, --ignored-chars 
         
          
            String of characters to ignore - e.g., `-e N-` -> dist(A, N) = 0 and dist(A, -) = 0

            Note, this option is applied *after* `--ignore-case` - i.e., if using `--ignore-case`, only the uppercase
            form of a character is needed. To not ignore any characters, use `-e ''` or `-e ""` [default: N-]
    -o, --output 
          
            Output file name [default: stdout]

    -t, --threads 
           
             Number of threads to use. Setting to 0 will use all available [default: 1] ARGS: 
            
             ... FASTA alignment file(s) to compute the pairwise distance for. Providing two files will compute the distances for all sequences in one file against all sequences from the other file - i.e., not between sequences in the same file. The first file will be the column names, while the second is the row names. The alignment file(s) can be compressed. 
            
           
         
        
       

Benchmark

We benchmark against snp-dists (v0.8.2). snp-dists is a brilliant tool for computing SNP distance matrices and is the inspiration for psdm.

As snp-dists doesn't allow inter-alignment comparisons, we will benchmark the intra-alignment mode. We use an alignment file of 151 sequences with a length of 4411532bp.

The benchmark times were recorded with hyperfine with 10 runs for each tool/threads combination.

Benchmark plot

tool threads mean SD min median max
psdm 1 95.4 1.0 94.2 95.3 97.9
snp-dists 1 231.0 4.8 224.2 231.8 240.4
psdm 2 54.3 1.5 53.2 53.6 57.5
snp-dists 2 120.3 0.8 118.9 120.3 121.3
psdm 4 33.0 0.8 32.5 32.7 35.0
snp-dists 4 61.8 0.2 61.5 61.8 62.2
psdm 8 22.3 0.1 22.1 22.3 22.4
snp-dists 8 33.1 0.3 32.8 33.1 33.7
psdm 16 17.6 0.2 17.4 17.5 18.0
snp-dists 16 20.7 0.2 20.4 20.7 21.1

Contributing

Contributions are always welcome. For changes to be accepted, they must pass the CI and coverage checks. These include:

  • Code is formatted (cargo fmt).
  • There are no compiler errors/warnings. cargo clippy --all-features --all-targets -- -D warnings
  • Test code coverage has not reduced.

Please also add a succinct description of the contribution in the CHANGELOG.

You might also like...
Robust and Fast tokenizations alignment library for Rust and Python
Robust and Fast tokenizations alignment library for Rust and Python

Robust and Fast tokenizations alignment library for Rust and Python

A tool to filter sites in a FASTA-format whole-genome pseudo-alignment

Core-SNP-filter This is a tool to filter sites (i.e. columns) in a FASTA-format whole-genome pseudo-alignment based on: Whether the site contains vari

A "Navie" Implementation of the Wavefront Algorithm For Sequence Alignment with Gap-Affine Scoring

A "Naive" Implementation of the Wavefront Algorithm for Sequence Alignment with Gap-Affine Scoring This repository contains some simple code that I wr

Cargo features alignment tool.

Cargo Featalign Cargo features alignment tool. Introduction The original version of this project can be found at subalfred check features. Upon furthe

A simple menu to keep all your most used one-liners and scripts in one place
A simple menu to keep all your most used one-liners and scripts in one place

Dama Desktop Agnostic Menu Aggregate This program aims to be a hackable, easy to use menu that can be paired to lightweight window managers in order t

Rust Programming Fundamentals - one course to rule them all, one course to find them...

Ultimate Rust Crash Course This is the companion repository for the Ultimate Rust Crash Course published online, presented live at O'Reilly virtual ev

A super simple /sbin/init for Linux which allows running one and only one program

Summary High-performance /sbin/init program for Linux This is designed to do literally nothing but accept binaries over the network and run them as a

Instance Distance is a fast pure-Rust implementation of the Hierarchical Navigable Small Worlds paper

Fast approximate nearest neighbor searching in Rust, based on HNSW index

Signed distance field font and image command line tool based on OpenCL.

SDFTool Signed distance field font and image command line tool based on OpenCL. Build Windows Run cargo build --release in Visual Studio developer x64

A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library.
A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library.

SDF Viewer (demo below) A fast and cross-platform Signed Distance Function (SDF) viewer, easily integrated with your SDF library. A Signed Distance Fu

Multi-channel signed distance field (MSDF) generator for fonts implemented in pure Rust.

msdfont WIP - school started so less updates from now on :(( Multi-channel signed distance field (MSDF) generator for fonts implemented in pure Rust.

A crate to help you copy things into raw buffers without invoking spooky action at a distance (undefined behavior).

🗜 presser Utilities to help make copying data around into raw, possibly-uninitialized buffers easier and safer. presser can help you when copying dat

Signed distance functions + Rust (CPU & GPU) = ❤️❤️
Signed distance functions + Rust (CPU & GPU) = ❤️❤️

sdf-playground Signed distance functions + Rust (CPU & GPU) = ❤️❤️ Platforms: Windows, Mac & Linux. About sdf-playground is a demo showcasing how you

Adjustment Identification Distance: A gadjid for Causal Structure Learning

Adjustment Identification Distance: A 𝚐𝚊𝚍𝚓𝚒𝚍 for Causal Structure Learning This is an early release of 𝚐𝚊𝚍𝚓𝚒𝚍 🐥 and feedback is very welc

Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Distributed compute platform implemented in Rust, and powered by Apache Arrow.

Ballista: Distributed Compute Platform Overview Ballista is a distributed compute platform primarily implemented in Rust, powered by Apache Arrow. It

Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Distributed compute platform implemented in Rust, and powered by Apache Arrow.

Ballista: Distributed Compute Platform Overview Ballista is a distributed compute platform primarily implemented in Rust, powered by Apache Arrow. It

Sample code for compute shader 101 training

Sample code for Compute Shader 101 This repo contains sample code to help you get started writing applications using compute shaders.

Interface definitions for the Compute@Edge platform in witx.

🔗 compute-at-edge-abi This repository contains witx definitions for the Compute@Edge platform ABI. About witx The witx file format is an experimental

Viceroy provides local testing for developers working with Compute@Edge.
Viceroy provides local testing for developers working with Compute@Edge.

Viceroy provides local testing for developers working with Compute@Edge. It allows you to run services written against the Compute@Edge APIs on your local development machine, and allows you to configure testing backends for your service to communicate with.

Comments
  • Make case insensitive the default

    Make case insensitive the default

    We've been freaking out about some enormous SNP distances in covid genomes, but it turned out we hadn't used the ignore-case option. Up to you, but it might be a good idea to default this to yes?

    enhancement 
    opened by iqbal-lab 2
Releases(0.2.0)
Owner
Michael Hall
Bioinformatics PhD Student at EMBL-EBI and University of Cambridge in @iqbal-lab Interested in Microbial Genomics.
Michael Hall
A Trig-less Line of Sight Algorithm in Two Dimensions

In many examples of 2D line-of-sight algorithms, expensive operations like trigonometry are used. Additionally, some methods have intentional inaccuracies in them for the sake of simplicity. Here, we give an algorithm which does not fudge the numbers, and uses only basic arithmetic: addition, subtraction, multiplication, and division. This is not intended to replace the existing algorithms, or even be more efficient in practice.

null 22 Dec 22, 2022
Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance calculations and string search.

triple_accel Rust edit distance routines accelerated using SIMD. Supports fast Hamming, Levenshtein, restricted Damerau-Levenshtein, etc. distance cal

Daniel Liu 75 Jan 8, 2023
Python package to compute levensthein distance in rust

Contents Introduction Installation Usage License Introduction Rust implementation of levensthein distance (https://en.wikipedia.org/wiki/Levenshtein_d

Thibault Blanc 2 Feb 21, 2022
📏 ― Uses the Jaro similarity metric to measure the distance between two strings

distance distance: Uses the Jaro similarity metric to measure the distance between two strings FYI, this was just to test Neon, I do not recommend usi

Demigender 6 Dec 7, 2021
fast rust implementation of online nonnegative matrix factorization as laid out in the paper "detect and track latent factors with online nonnegative matrix factorization"

ONMF status: early work in progress. still figuring this out. code still somewhat messy. api still in flux. fast rust implementation of online nonnega

null 2 Apr 10, 2020
This code generate coinjoin transaction whici has two inputs and two outputs

How to create coinjoin transaction This code generate coinjoin transaction whici has two inputs and two outputs. First, we create two trasactions. The

kanna 2 Jun 25, 2022
Compare binary files using alignment algorithms

biodiff Compare binary files using alignment algorithms. What is this This is a tool for binary diffing. The tool is able to show two binary files sid

null 483 Dec 22, 2022
Simple terminal alignment viewer

Alen Simple terminal sequence alignment viewer. What is Alen? It's a command-like program to view DNA or protein alignments in FASTA formats. Alen is

Jakob Nybo Nissen 51 Dec 19, 2022
Rust crate `needleman_wunsch` of the `fasebare` package: reading FASTA sequences, Needleman-Wunsch alignment

fasebare Rust crate needleman_wunsch of the fasebare package: reading FASTA sequences, Needleman-Wunsch alignment. Synopsis The crate needleman_wunsch

Laurent Bloch 2 Nov 19, 2021
Robust and Fast tokenizations alignment library for Rust and Python

Robust and Fast tokenizations alignment library for Rust and Python Demo: demo Rust document: docs.rs Blog post: How to calculate the alignment betwee

Explosion 157 Dec 28, 2022