Minimal virus genome coverage assessment for metagenomic diagnostics

Overview

vircov

v0.5.0

Purpose

Viral metagenomic diagnostics from low-abundance clinical samples can be challenging in the absence of sufficient genome coverage. Vircov extracts distinct, non-overlapping coverage regions from a reference alignment and reports summary statistics for them. It can be used to flag potential hits in automated pipelines and reports without inspecting coverage plots.

Implementation

Vircov is written in Rust and works with alignments in the standard PAF or SAM/BAM/CRAM (next release) formats. It is extremely fast and can process alignments against thousands of viral reference genomes in seconds. Basic input filters can be selected to remove spurious alignments and text-style coverage plots can be printed to the terminal for visual confirmation.
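
As a rough illustration of what reading a PAF alignment and applying a basic input filter involves, here is a minimal sketch. It is not Vircov's actual parser: the `PafLine` struct, the `parse_paf_line` and `keep` functions, and the two filter parameters are hypothetical names chosen for this example. Only the column positions are taken from the PAF format itself (12 mandatory tab-separated columns, 0-based coordinates with an exclusive end).

```rust
/// Hypothetical record with the PAF columns relevant to coverage assessment.
/// This is an illustration, not Vircov's own record type.
#[derive(Debug, PartialEq)]
pub struct PafLine {
    pub query_name: String,  // column 1
    pub target_name: String, // column 6
    pub target_start: u64,   // column 8, 0-based inclusive
    pub target_end: u64,     // column 9, 0-based exclusive
    pub block_len: u64,      // column 11, alignment block length
    pub mapq: u8,            // column 12, mapping quality (0-255)
}

/// Parse one tab-separated PAF line.
/// Returns None if a mandatory column is missing or not numeric.
pub fn parse_paf_line(line: &str) -> Option<PafLine> {
    let cols: Vec<&str> = line.trim_end().split('\t').collect();
    if cols.len() < 12 {
        return None;
    }
    Some(PafLine {
        query_name: cols[0].to_string(),
        target_name: cols[5].to_string(),
        target_start: cols[7].parse().ok()?,
        target_end: cols[8].parse().ok()?,
        block_len: cols[10].parse().ok()?,
        mapq: cols[11].parse().ok()?,
    })
}

/// Illustrative filter (assumed thresholds, not Vircov's options):
/// keep alignments that are long enough and confidently mapped.
pub fn keep(rec: &PafLine, min_len: u64, min_mapq: u8) -> bool {
    rec.block_len >= min_len && rec.mapq >= min_mapq
}
```

Each retained record contributes one interval (`target_start`, `target_end`) on its reference sequence (`target_name`), which is the input needed for counting coverage regions.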

Vircov is written for implementation in (accredited) metagenomics pipelines for human patients enrolled in the META-GP network (Australia). As such, it attempts to be production-grade code with high test coverage, continuous integration, and versioned releases with precompiled binaries for Linux and macOS.

Version 0.6.0 will be distributed on BioConda and Cargo and will have precompiled binaries available.

Installation

git clone https://github.com/esteinig/vircov 
cd vircov && cargo build --release

Usage

vircov tests/cases/test_full_ok.paf --fasta tests/cases/test_ok.fasta --table

Tests

git clone https://github.com/esteinig/vircov 
cd vircov && cargo test && cargo tarpaulin 

Concept

Definitive viral diagnosis from metagenomic clinical samples can be extremely challenging due to low sequencing depth, large amounts of host reads and low infectious titres, especially in blood or CSF. One way to distinguish a positive viral diagnosis is to look at alignment coverage against one or multiple reference sequences. When only a few reads map to the reference and genome coverage is low, positive infections often display multiple distinct alignment regions, as opposed to reads mapping to a single region or a few regions on the reference.

De Vries et al. (2021) summarize this concept succinctly in this figure (adapted):

[Figure: adapted from De Vries et al. (2021)]

In these cases, positive calls can be made from coverage plots showing the distinct alignment regions; the authors chose a threshold on the number of regions (> 3). However, coverage plots require visual assessment and may not be suitable for flagging potential hits in automated pipelines or summary reports.

Vircov attempts to make visual inspection and automated flagging easier by counting the distinct (non-overlapping) coverage regions in an alignment and reporting statistics that help to make an educated call.
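
The idea of counting distinct coverage regions can be sketched as a standard interval merge. The code below is a minimal illustration under stated assumptions and not Vircov's implementation: the function names `coverage_regions`, `covered_fraction` and `flag_hit` are hypothetical, intervals are assumed to be half-open `(start, end)` pairs on a single reference sequence, and intervals that merely touch are assumed to belong to the same region.

```rust
/// Merge half-open alignment intervals (start, end) on one reference
/// into distinct, non-overlapping coverage regions.
/// Assumption: intervals that touch (start == previous end) are merged.
pub fn coverage_regions(mut intervals: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    // Sorting by start coordinate lets a single pass detect all overlaps.
    intervals.sort_unstable();
    let mut regions: Vec<(u64, u64)> = Vec::new();
    for (start, end) in intervals {
        if let Some(last) = regions.last_mut() {
            if start <= last.1 {
                // Overlaps the current region: extend it if needed.
                last.1 = last.1.max(end);
                continue;
            }
        }
        // No overlap: this interval opens a new region.
        regions.push((start, end));
    }
    regions
}

/// Fraction of the reference covered by the merged regions.
pub fn covered_fraction(regions: &[(u64, u64)], ref_len: u64) -> f64 {
    if ref_len == 0 {
        return 0.0;
    }
    let covered: u64 = regions.iter().map(|(s, e)| e.saturating_sub(*s)).sum();
    covered as f64 / ref_len as f64
}

/// Flag a reference as a potential hit when the number of regions
/// exceeds a threshold (the text above cites "> 3" from De Vries et al.).
pub fn flag_hit(regions: &[(u64, u64)], threshold: usize) -> bool {
    regions.len() > threshold
}
```

For example, the intervals (100, 250), (200, 300), (1000, 1100), (4000, 4100) and (4050, 4200) collapse into three regions. Sorting first keeps the merge at O(n log n), and only the merged regions need to be retained per reference.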

Clinical examples

Performance

Alignments were conducted with minimap2 -c -x sr (PAF) and minimap2 -ax sr (SAM/BAM, all sequences) or minimap2 --sam-hits-only -ax sr (SAM/BAM, only aligned sequences). Peak memory is mainly determined by the aligned interval records that are stored for the overlap computations. It may vary depending on how many alignments remain after filtering, the number of aligned reads, the output format and the size of the reference database.

  • Sample 1: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 1.7 million alignments
  • Sample 2: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 63 million alignments

.paf

  • Sample 1 (212 MB): 0.56 seconds, peak memory: 12 MB
  • Sample 2 (8.9 GB): 55.4 seconds, peak memory: 2.9 GB

Etymology

Not a very creative abbreviation of "virus coverage" but the little spectacles in the logo are a reference to Rudolf Virchow who described such trivial concepts as cells, cancer and pathology. His surname is pronounced like vircov if you mumble the terminal v.

Contributors

  • Prof. Deborah Williamson and Prof. Lachlan Coin (principal investigators for META-GP)
  • Dr. Leon Caly (samples and sequencing for testing on clinical data)

Comments

  • Coverage terminal plot

    An approximate coverage distribution plot to print in the terminal.

    Possibly something like:

    5'------------------------------------[---]----------3'

    or with colors?

    new feature 
    opened by esteinig 3
  • Replace pafrecord from_str with rust-csv deserialize [#8]

    • tiny degradation in parsing speed on large test file, speedup on small test file
    • in favor of safer record parsing using rust-csv and serde deserialization
    • simplifies PafRecord custom string parsing
    • full test coverage
    opened by esteinig 1
  • Reporting threshold

    Implement a coverage region threshold to filter outputs for reporting (as in the de Vries paper). Particularly useful for large databases - tested this on Virosaurus (70k genomes) and larger alignments / deeper sequencing runs.

    new feature 
    opened by esteinig 1
  • Enhanced output

    • [x] interval coordinates with number of alignments
    • [x] number of unique reads in the alignments
    • [x] implement verbosity command line option

    Implemented in last column (otherwise empty) format: start:stop:count

    Intervals now contain the read identifier as val - these are then reduced to unique read identities. For paired end data (mapping to + for forward sequence and - for reverse sequence) this means that the final read count is for unique read pairs (and single mapped reads of a pair) if mate pairs have the same identifier.

    • [x] check when read identifiers contain /1 or /2 paired end designations

    In case of different read identifier names in forward and reverse ({id}/1 or {id}/2, for example) the number of reads output represents the total count of forward and reverse reads.

    new feature 
    opened by esteinig 1
  • Segmented genomes

    Add option to parse header fields for segment qualifiers with custom formats to include and select best segments from alignments. At the moment hard-coded to include all reference sequences that have the segment=N/A field in the header (as in Virosaurus) and are ignored during grouped single selection, outputting all segments into the selected genome sequence files.

    Also add option to select the best segment if multiple segments (same identifier) are grouped.

    enhancement 
    opened by esteinig 0
  • Strobealign BAM --min-cov

    For some arcane reason, when using BAM files output by Strobealign v0.7.1 the --min-cov filter removes all alignments (noted in the MetaGP remap loop process). Need to investigate the SAM/BAM query coverage calculation and compare with PAF in Strobealign.

    bug 
    opened by esteinig 0
  • Output column names

    Looks very interesting. Did I miss the output column names somewhere ?

    gi_1070105496_ref_NC_031108.1__Propionibacterium_phage_PFR2,_complete_genome           2   2    2    132   0   0.0000
    gi_1070125494_ref_NC_031129.1__Salmonella_phage_SJ46,_complete_genome                  2   2    2    128   0   0.0000
    gi_110645916_ref_NC_001401.2__Adeno-associated_virus_-_2,_complete_genome              1   2    2    67    0   0.0000
    
    

    Thanks

    documentation question 
    opened by colindaven 7
  • Reference sequence description in table output

    This may be useful when the reference sequence identifiers are not verbose, e.g. in the Virosaurus database, where the sequence header description specifies the name of the virus.

    enhancement new feature 
    opened by esteinig 4
  • Performance optimisation [PAF]

    Need to implement an iterator based PAF parser and link it to interval extraction directly in the PafRecord structs - large PAF files could probably be parsed faster this way.

    enhancement 
    opened by esteinig 9
Releases(0.5.0)
  • 0.5.0(Mar 23, 2022)

    Command line:

    • --paf | --bam input with "-" for reading from stdin
    • changed long name of --cov-reg to --regions

    Main:

    • added SAM/BAM/CRAM support [#3]
    • rewrote interval parsing for PAF format [#8]
    • fixed bug in filtering coverage plot outputs [#9]
    • added table output confirmation test [#5]
    • added basic BAM reader tests, including query alignment length from CIGAR [#5]
    • reimplemented custom PAF parser due to variable CIGAR tags [#8]

    Other:

    • replaced noodles fasta parsing with rust-bio
    • removed csv crate

    Test coverage:

    • couldn't figure out one line for file name match statement [#14]
    • slight regression in coverage from reader functions
  • 0.4.0(Mar 18, 2022)

    Operational, added features / command line options:

    • input alignment now required arg: vircov test.paf [previously: --paf option]
    • filter results output by
      • --seq-len: minimum reference sequence length
      • --cov-reg minimum number of detected coverage regions
    • long help menu with --help
    • pretty table output with --table
    • 100% test coverage
    • continuous integration for Linux and MacOS
  • 0.3.0(Mar 16, 2022)

  • 0.2.0(Mar 15, 2022)

  • 0.1.0(Mar 12, 2022)

Owner
Eike Steinig
Bioinformatics | Infectious Diseases | Nanopore | Software Development