Minimal virus genome coverage assessment for metagenomic diagnostics

Eike Steinig

Last update: Oct 17, 2022

Related tags

Utilities rust coverage bioinformatics metagenomics diagnostics viral-metagenomics minimap2

vircov

Overview

v0.5.0

Purpose
Implementation
Installation
Usage
Tests
Concept
Clinical examples
Performance
Etymology
Contributors

Purpose

Viral metagenomic diagnostics from low-abundance clinical samples can be challenging in the absence of sufficient genome coverage. Vircov extracts distinct non-overlapping regions from a reference alignment and generates some helpful statistics. It can be used to flag potential hits without inspection of coverage plots in automated pipelines and reports.

Implementation

Vircov is written in Rust and works with alignments in the standard PAF or SAM/BAM/CRAM (next release) formats. It is extremely fast and can process alignments against thousands of viral reference genomes in seconds. Basic input filters can be selected to remove spurious alignments and text-style coverage plots can be printed to the terminal for visual confirmation.
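
To illustrate what a basic input filter on PAF alignments could look like, here is a minimal sketch. It relies only on the twelve mandatory, tab-separated PAF columns; the `PafLine` type, its field names, the `passes` helper and the threshold values are hypothetical and are not taken from the vircov source.

```rust
/// Illustrative subset of the twelve mandatory PAF columns.
/// Type and field names are hypothetical, not vircov's own.
#[derive(Debug)]
struct PafLine {
    query_name: String,
    query_length: u64,
    query_start: u64,
    query_end: u64,
    target_name: String,
    target_start: u64,
    target_end: u64,
    mapq: u8,
}

impl PafLine {
    /// Parse one PAF line; returns None if a mandatory column is missing or malformed.
    fn from_line(line: &str) -> Option<Self> {
        let f: Vec<&str> = line.trim_end().split('\t').collect();
        if f.len() < 12 {
            return None;
        }
        Some(Self {
            query_name: f[0].to_string(),
            query_length: f[1].parse().ok()?,
            query_start: f[2].parse().ok()?,
            query_end: f[3].parse().ok()?,
            target_name: f[5].to_string(),
            target_start: f[7].parse().ok()?,
            target_end: f[8].parse().ok()?,
            mapq: f[11].parse().ok()?,
        })
    }

    /// Fraction of the query sequence covered by this alignment.
    fn query_coverage(&self) -> f64 {
        if self.query_length == 0 {
            return 0.0;
        }
        self.query_end.saturating_sub(self.query_start) as f64 / self.query_length as f64
    }

    /// Hypothetical filter: keep alignments above both thresholds.
    fn passes(&self, min_query_coverage: f64, min_mapq: u8) -> bool {
        self.query_coverage() >= min_query_coverage && self.mapq >= min_mapq
    }
}

fn main() {
    // Made-up alignment of a 150 bp read against a made-up reference "ref1".
    let line = "read1\t150\t10\t140\t+\tref1\t5000\t1000\t1130\t125\t130\t60";
    let record = PafLine::from_line(line).expect("valid PAF line");
    println!(
        "{} -> {} [{}, {}) query coverage {:.2}, keep: {}",
        record.query_name,
        record.target_name,
        record.target_start,
        record.target_end,
        record.query_coverage(),
        record.passes(0.5, 30)
    );
}
```

Only the target name and the target start and end need to be retained per alignment for the downstream region computation, which is in line with peak memory being driven by the stored interval records.
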

Vircov is written for implementation in (accredited) metagenomics pipelines for human patients enrolled in the META-GP network (Australia). As such, it attempts to be production-grade code with high test coverage, continuous integration, and versioned releases with precompiled binaries for Linux and MacOS.

Version 0.6.0 will be distributed on BioConda and Cargo and will have precompiled binaries available.

Installation

git clone https://github.com/esteinig/vircov 
cd vircov && cargo build --release

Usage

vircov tests/cases/test_full_ok.paf --fasta tests/cases/test_ok.fasta --table

Tests

git clone https://github.com/esteinig/vircov 
cd vircov && cargo test && cargo tarpaulin

Concept

Definitive viral diagnosis from metagenomic clinical samples can be extremely challenging due to low sequencing depth, large amounts of host reads and low infectious titres, especially in blood or CSF. One way to distinguish a positive viral diagnosis is to look at alignment coverage against one or multiple reference sequences. When only a few reads map to the reference and genome coverage is low, positive infections often display multiple distinct alignment regions, as opposed to reads mapping to a single region or a few regions on the reference.

De Vries et al. (2021) summarize this concept succinctly in a figure (adapted):

[Figure: adapted from De Vries et al. (2021)]

Positive calls in these cases can be made from coverage plots showing the distinct alignment regions and a threshold on the number of regions is chosen by the authors (> 3). However, coverage plots require visual assessment and may not be suitable for flagging potential hits in automated pipelines or summary reports.

Vircov attempts to make visual inspection and automated flagging easier by counting the distinct (non-overlapping) coverage regions in an alignment and reporting some helpful statistics to support an educated call.
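
The idea of counting distinct coverage regions can be sketched as a standard interval merge. This is an illustration only, not vircov's actual implementation: the function names, the half-open interval convention and the choice to merge directly adjacent intervals are assumptions, and the coordinates are made up.

```rust
/// Half-open alignment interval on the reference: [start, end).
type Interval = (u64, u64);

/// Merge overlapping (or directly adjacent) intervals into distinct regions.
/// Treating adjacent intervals as one region is an assumption of this sketch.
fn merge_regions(mut intervals: Vec<Interval>) -> Vec<Interval> {
    intervals.sort_unstable();
    let mut regions: Vec<Interval> = Vec::new();
    for (start, end) in intervals {
        if let Some(last) = regions.last_mut() {
            if start <= last.1 {
                // Extends the current region instead of opening a new one.
                last.1 = last.1.max(end);
                continue;
            }
        }
        regions.push((start, end));
    }
    regions
}

/// Total number of reference bases covered by the merged regions.
fn covered_bases(regions: &[Interval]) -> u64 {
    regions.iter().map(|(start, end)| end - start).sum()
}

fn main() {
    // Made-up alignment intervals against a made-up 10 kb reference.
    let alignments = vec![
        (100, 250),
        (200, 400),
        (1_000, 1_150),
        (5_000, 5_100),
        (5_050, 5_200),
    ];
    let reference_length = 10_000u64;

    let regions = merge_regions(alignments);
    let coverage = covered_bases(&regions) as f64 / reference_length as f64;

    println!("distinct regions: {}", regions.len());
    println!("covered bases: {}", covered_bases(&regions));
    println!("coverage: {:.4}", coverage);
}
```

In this example five alignments collapse into three distinct regions covering 650 bases. A region count like this is the quantity that a threshold such as the one used by De Vries et al. (> 3) would be applied to.
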

Clinical examples

Performance

Alignments were conducted with minimap2 -c -x sr (PAF) and minimap2 -ax sr (SAM/BAM, all sequences) or minimap2 --sam-hits-only -ax sr (SAM/BAM, only aligned sequences). Peak memory is mainly determined by the aligned interval records that are stored for the overlap computations. It may vary depending on how many alignments remain after filtering, the number of aligned reads, the output format and the size of the reference database.

Sample 1: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 1.7 million alignments
Sample 2: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 63 million alignments

.paf

Sample 1 (212 MB): 0.56 seconds, peak memory: 12 MB
Sample 2 (8.9 GB): 55.4 seconds, peak memory: 2.9 GB

Etymology

Not a very creative abbreviation of "virus coverage" but the little spectacles in the logo are a reference to Rudolf Virchow who described such trivial concepts as cells, cancer and pathology. His surname is pronounced like vircov if you mumble the terminal v.

Contributors

Prof. Deborah Williamson and Prof. Lachlan Coin (principal investigators for META-GP)
Dr. Leon Caly (samples and sequencing for testing on clinical data)

Comments

Coverage terminal plot

An approximate coverage distribution plot to print in the terminal.

Possibly something like:

5'------------------------------------[---]----------3'

or with colors?
new feature

opened by esteinig 3
Replace pafrecord from_str with rust-csv deserialize [#8]
tiny degradation in parsing speed on large test file, speedup on small test file

in favor of safer record parsing using rust-csv and serde deserialization

simplifies PafRecord custom string parsing

full test coverage
opened by esteinig 1
Reporting threshold

Implement a coverage region threshold to filter outputs for reporting (as in the de Vries paper). Particularly useful for large databases - tested this on Virosaurus (70k genomes) and larger alignments / deeper sequencing runs.
new feature

opened by esteinig 1
Enhanced output
[x] interval coordinates with number of alignments

[x] number of unique reads in the alignments

[x] implement verbosity command line option

Implemented in last column (otherwise empty) format: start:stop:count

Intervals now contain the read identifier as val - these are then reduced to unique read identities. For paired end data (mapping to + for forward sequence and - for reverse sequence) this means that the final read count is for unique read pairs (and single mapped reads of a pair) if mate pairs have the same identifier.

[x] check when read identifiers contain /1 or /2 paired end designations

In case of different read identifier names in forward and reverse ({id}/1 or {id}/2, for example) the number of reads output represents the total count of forward and reverse reads.
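
A rough sketch of the behaviour described above, assuming each interval is tagged with its read identifier. The `Region` type, the function names and the example data are hypothetical; only the start:stop:count column format and the handling of /1 and /2 identifiers follow the description in this issue.

```rust
use std::collections::HashSet;

/// A merged coverage region with the number of alignments it contains.
struct Region {
    start: u64,
    end: u64,
    alignments: usize,
}

/// Merge tagged intervals (start, end, read identifier) into regions and
/// count the unique read identifiers across all alignments.
fn summarize(mut tagged: Vec<(u64, u64, &str)>) -> (Vec<Region>, usize) {
    tagged.sort_unstable();
    let mut regions: Vec<Region> = Vec::new();
    let mut reads: HashSet<&str> = HashSet::new();
    for (start, end, read_id) in tagged {
        // Identifiers are compared verbatim, so "id/1" and "id/2" stay distinct.
        reads.insert(read_id);
        if let Some(last) = regions.last_mut() {
            if start <= last.end {
                last.end = last.end.max(end);
                last.alignments += 1;
                continue;
            }
        }
        regions.push(Region { start, end, alignments: 1 });
    }
    (regions, reads.len())
}

/// Format a region as start:stop:count.
fn region_column(region: &Region) -> String {
    format!("{}:{}:{}", region.start, region.end, region.alignments)
}

fn main() {
    // Made-up data: readA mates share one identifier, readB mates do not.
    let tagged = vec![
        (100, 250, "readA"),
        (200, 400, "readA"),
        (1_000, 1_150, "readB/1"),
        (1_100, 1_300, "readB/2"),
        (5_000, 5_100, "readC"),
    ];
    let (regions, unique_reads) = summarize(tagged);
    let columns: Vec<String> = regions.iter().map(region_column).collect();
    println!("{}", columns.join(" "));
    println!("unique reads: {}", unique_reads);
}
```

Here the two readA mates count once because they share an identifier, while readB/1 and readB/2 count separately, giving four unique reads across three regions.
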
new feature
opened by esteinig 1
Segmented genomes

Add option to parse header fields for segment qualifiers with custom formats to include and select best segments from alignments. At the moment hard-coded to include all reference sequences that have the segment=N/A field in the header (as in Virosaurus) and are ignored during grouped single selection, outputting all segments into the selected genome sequence files.

Also add option to select the best segment if multiple segments (same identifier) are grouped.
enhancement

opened by esteinig 0
Strobealign BAM --min-cov

For some arcane reason, when using BAM files output by Strobealign v0.7.1 the --min-cov filter removes all alignments (noted in the MetaGP remap loop process). Need to investigate the SAM/BAM query coverage calculation and compare with PAF in Strobealign.
bug

opened by esteinig 0

Output column names

Looks very interesting. Did I miss the output column names somewhere ?

gi_1070105496_ref_NC_031108.1__Propionibacterium_phage_PFR2,_complete_genome           2   2    2    132   0   0.0000
gi_1070125494_ref_NC_031129.1__Salmonella_phage_SJ46,_complete_genome                  2   2    2    128   0   0.0000
gi_110645916_ref_NC_001401.2__Adeno-associated_virus_-_2,_complete_genome              1   2    2    67    0   0.0000

Thanks

documentation question

opened by colindaven 7

Reference sequence description in table output

This may be useful when the reference sequence identifiers are not verbose, e.g. in the Virosaurus database, where the sequence header description specifies the name of the virus.
enhancement new feature

opened by esteinig 4
Performance optimisation [PAF]

Need to implement an iterator-based PAF parser and link it to interval extraction directly in the PafRecord structs. Large PAF files could probably be parsed faster this way.
enhancement

opened by esteinig 9

Releases(0.5.0)

0.5.0(Mar 23, 2022)
Command line:

--paf | --bam input with "-" for reading from stdin

changed long name of --cov-reg to --regions

Main:

added SAM/BAM/CRAM support [#3]

rewrote interval parsing for PAF format [#8]

fixed bug in filtering coverage plot outputs [#9]

added table output confirmation test [#5]

added basic BAM reader tests, including query alignment length from CIGAR [#5]

reimplemented custom PAF parser due to variable CIGAR tags [#8]

Other:

replaced noodles fasta parsing with rust-bio

removed csv crate

Test coverage:

couldn't figure out one line for file name match statement [#14]

slight regression in coverage from reader functions

0.4.0(Mar 18, 2022)
Operational, added features / command line options:

input alignment now required arg: vircov test.paf [previously: --paf option]

filter results output by

--seq-len: minimum reference sequence length

--cov-reg minimum number of detected coverage regions

long help menu with --help

pretty table output with --table

100% test coverage

continuous integration for Linux and MacOS

0.3.0(Mar 16, 2022)
coverage plot implemented [#2]

added tests (~40% coverage)

0.2.0(Mar 15, 2022)
output enhanced with useful summary statistics [#1]

0.1.0(Mar 12, 2022)

Working prototype

Owner

Eike Steinig

Bioinformatics | Infectious Diseases | Nanopore | Software Development

