Adds SEQ and QUAL to secondary alignments from SAM/BAM/CRAM

Overview

secondary_rewriter

Some aligners such as minimap2 do not write the SEQ and QUAL fields to secondary alignments (which are sometimes called multi-mappers, see Multiple mapping in SAMv1.pdf https://samtools.github.io/hts-specs/SAMv1.pdf) making it hard to analyze them (for example, SNPs will not be visible in a genome browser for secondary alignments and variant calling would not work on them). This program adds SEQ/QUAL to secondary alignments, referring to the primary alignment to get the SEQ and QUAL.

Minimap2 reference lh3/minimap2#458 lh3/minimap2#687

Install

First install rust, probably with rustup https://rustup.rs/

Then

cargo install secondary_rewriter

Usage

This small shell script automates the multi-step pipeline (supports BAM or CRAM)


#!/bin/bash

# write_secondaries.sh
# usage
# ./write_secondaries.sh <input.cram> <ref.fa> <output.cram> <nthreads default 4> <memory gigabytes, per-thread, default 1G>
# e.g.
# ./write_secondaries.sh input.cram ref.fa output.cram 16 2G

THR=${4:-4}
MEM=${5:-1G}


samtools view -@$THR -h $1 -T $2 -f256 > sec.txt
samtools view -@$THR -h $1 -T $2 -F256 | secondary_rewriter --generate-primary-loc-tag --secondaries sec.txt | samtools sort --reference $2 -@$THR -m $MEM - -o $3

Two-pass strategy

The two-pass strategy works as follows

  1. First pass: output ALL secondary alignments (reads with flag 256) to a external file
  2. Second pass: read secondary alignments from external file into memory, and then scan original SAM/BAM/CRAM to add SEQ and QUAL fields on the primary alignments to the secondary alignments, and pipe to samtools sort (needed because all the secondary reads will be out of order, added right after the primary alignment where the SEQ/QUAL is found)

This process avoids loading the entire SAM/BAM/CRAM into memory, but does require the samtools sort which is a bit expensive. It does load all the secondary alignments (pre-them-having SEQ/QUAL fields which is generally on the order of a couple gigabytes instead of hundred(s) of gigabytes) into memory though.

Result

Your secondary reads will now display with SNPs and such in a genome browser. Having SEQ is also important for variant calling.

Screenshots from both IGV and JBrowse 2 (just to show it's not browser specific) showing the same file before and after calling with secondary_rewriter on a region of the genome with many secondary alignments in a centromeric region (ultra long read hs37d5 from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/

Screenshot of before/after running secondary_rewriter in JBrowse 2

Screenshot of same data, before/after, in IGV

Help


secondary_rewriter 0.1.9
Adds SEQ and QUAL fields to secondary alignments from the primary alignment

USAGE:
    secondary_rewriter [OPTIONS]

OPTIONS:
    -g, --generate-primary-loc-tag     Boolean flag on whether to produce a tag like pl:Z:chr1:1000
                                       on the secondary alignments that says where the primary
                                       alignment is
    -h, --help                         Print help information
    -s, --secondaries <SECONDARIES>    Path to file of secondary reads (generated by e.g. samtools
                                       view -f256)
    -V, --version                      Print version information

--generate-primary-loc-tag creates a tag on the secondary reads like pl:Z:chr1:1000 to say where the primary read is

Runtime

The speed of this program is mostly limited by samtools view/samtools sort efficiency, so if you give samtools tons of threads and memory your performance will improve.

On a t2.2xlarge AWS instance it took 189 minutes (~3 hours) with 8 threads and 1GB per-thread sorting memory to run secondary_rewriter on a 218 gigabyte BAM file (ultra long read hs37d5 from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V2.3.4_2019-06-26/)

Note also that the output data file is larger, in this example the result was 267Gb BAM vs the original 218Gb BAM.

Possible consideration

  • There are reasons that minimap2 may not output these fields (size of output being cited by the author), but it is perfectly possible to add the SEQ and QUAL back. This PR to minimap2 natively outputs the SEQ and QUAL https://github.com/lh3/minimap2/pull/687/files but it has been stated that minimap2 "will not" output these.

  • This program does not handle hard clipping in the CIGAR (H operator) but I haven't seen minimap2 output yet. If you see this let me know and a fix can probably be made.

  • Finally, you may also want to think about the implications of how to treat secondary alignments in your pipeline, for while this program can help in this particular circumstance, it may be unclear what the implications of these secondary/multi-mapping alignments are.

You might also like...
Bolt is a desktop application that is designed to make the process of developing and testing APIs easier and more efficient.

Bolt ⚡ Bolt is a desktop application that is designed to make the process of developing and testing APIs easier and more efficient. Quick start 👩‍💻

A tool of generating and viewing dice roll success distributions.

AZDice A GUI tool for generating and visualising dice roll probability distributions. Aims Intended to help people trying to get game balance just rig

Check Have I Been Pwned and see if it's time for you to change passwords.

checkpwn Check Have I Been Pwned and see if it's time for you to change passwords. Getting started Install: cargo install checkpwn Update: cargo inst

Utilities and tools based around Amazon S3 to provide convenience APIs in a CLI

s3-utils Utilities and tools based around Amazon S3 to provide convenience APIs in a CLI. This tool contains a small set of command line utilities for

A low-ish level tool for easily writing and hosting WASM based plugins.

A low-ish level tool for easily writing and hosting WASM based plugins. The goal of wasm_plugin is to make communicating across the host-plugin bounda

Czkawka is a simple, fast and easy to use app to remove unnecessary files from your computer.
Czkawka is a simple, fast and easy to use app to remove unnecessary files from your computer.

Multi functional app to find duplicates, empty folders, similar images etc.

An n-tuple pendulum simulator in Rust + WebAssembly (and JavaScript)

An n-tuple pendulum simulator in Rust + WebAssembly (and JavaScript) Remaking this n-tuple pendulum simulator moving the math to Rust 🦀 and WebAssemb

BSV stdlib written in Rust and runs in WASM environments

BSV.WASM A Rust/WASM Library to interact with Bitcoin SV Installation NodeJS: npm i bsv-wasm --save Web: npm i bsv-wasm-web --save Rust: https://crate

Rust experiments involving Haskell-esque do notation, state, failure and Nom parsers!

Introduction As a long time Haskell developer recently delving into Rust, something I've really missed is monads and do notation. One thing monadic do

Owner
Colin Diesh
GMOD/jbrowse developer
Colin Diesh
Fusion is a cross-platform App Dev ToolKit build on Rust . Fusion lets you create Beautiful and Fast apps for mobile and desktop platform.

Fusion is a cross-platform App Dev ToolKit build on Rust . Fusion lets you create Beautiful and Fast apps for mobile and desktop platform.

Fusion 1 Oct 19, 2021
List of Persian Colors and hex colors for CSS, SCSS, PHP, JS, Python, and Ruby.

Persian Colors (Iranian colors) List of Persian Colors and hex colors for CSS, SCSS, PHP, C++, QML, JS, Python, Ruby and CSharp. Persian colors Name H

Max Base 12 Sep 3, 2022
Northstar is a horizontally scalable and multi-tenant Kubernetes cluster provisioner and orchestrator

Northstar Northstar is a horizontally scalable and multi-tenant Kubernetes cluster provisioner and orchestrator. Explore the docs » View Demo · Report

Lucas Clerisse 1 Jan 22, 2022
Time related types (and conversions) for scientific and astronomical usage.

astrotime Time related types (and conversions) for scientific and astronomical usage. This library is lightweight and high performance. Features The f

Michael Dilger 3 Aug 22, 2022
A Diablo II library for core and simple client functionality, written in Rust for performance, safety and re-usability

A Diablo II library for core and simple client functionality, written in Rust for performance, safety and re-usability

null 4 Nov 30, 2022
UnTeX is both a library and an executable that allows you to manipulate and understand TeX files.

UnTeX UnTeX is both a library and an executable that allows you to manipulate and understand TeX files. Usage Executable If you wish to use the execut

Jérome Eertmans 1 Apr 5, 2022
A convenient tracing config and init lib, with symlinking and local timezone.

clia-tracing-config A convenient tracing config and init lib, with symlinking and local timezone. Use these formats default, and can be configured: pr

Cris Liao 5 Jan 3, 2023
Code examples, data structures, and links from my book, Rust Atomics and Locks.

This repository contains the code examples, data structures, and links from Rust Atomics and Locks. The examples from chapters 1, 2, 3, and 8 can be f

Mara Bos 338 Jan 6, 2023
📮 load, write, and copy remote and local assets

axoasset This library offers read, write, and copy functions, for local and remote assets given a string that contains a relative or absolute local pa

axo 7 Jan 25, 2023
A tool for investigating file system and folder contents and their changes.

Sniff A tool for investigating file systems and folder contents and their changes. Sniff can create snapshots of file systems and folders, storing has

Niclas Schwarzlose 6 Jan 14, 2023