vircov
Minimal virus genome coverage assessment for metagenomic diagnostics
Overview
v0.5.0
- Purpose
- Implementation
- Installation
- Usage
- Tests
- Concept
- Clinical examples
- Performance
- Etymology
- Contributors
Purpose
Viral metagenomic diagnostics from low-abundance clinical samples can be challenging in the absence of sufficient genome coverage. Vircov extracts distinct non-overlapping regions from a reference alignment and generates some helpful statistics. It can be used to flag potential hits without inspection of coverage plots in automated pipelines and reports.
Implementation
Vircov is written in Rust and works with alignments in the standard PAF or SAM/BAM/CRAM (next release) formats. It is extremely fast and can process alignments against thousands of viral reference genomes in seconds. Basic input filters can be selected to remove spurious alignments and text-style coverage plots can be printed to the terminal for visual confirmation.
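To illustrate the kind of input filtering described above, here is a minimal sketch of reading one PAF line and applying two basic filters. The column positions follow the 12 mandatory PAF columns; the struct, the function names and the two filter parameters (`min_len`, `min_mapq`) are hypothetical and are not taken from Vircov's source or command-line options.

```rust
/// Subset of a PAF record needed to assess coverage on the reference (target).
/// Hypothetical type for illustration, not Vircov's internal representation.
#[derive(Debug, PartialEq)]
pub struct PafRecord {
    pub query_name: String,
    pub target_name: String,
    pub target_length: u64,
    pub target_start: u64, // 0-based, inclusive
    pub target_end: u64,   // 0-based, exclusive
    pub mapq: u8,          // 0-255, where 255 means "missing"
}

/// Parse one tab-separated PAF line; returns None if a mandatory
/// column is missing or a numeric column cannot be parsed.
pub fn parse_paf_line(line: &str) -> Option<PafRecord> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() < 12 {
        return None;
    }
    Some(PafRecord {
        query_name: fields[0].to_string(),
        target_name: fields[5].to_string(),
        target_length: fields[6].parse().ok()?,
        target_start: fields[7].parse().ok()?,
        target_end: fields[8].parse().ok()?,
        mapq: fields[11].parse().ok()?,
    })
}

/// Keep an alignment only if it spans at least `min_len` reference bases
/// and has a mapping quality of at least `min_mapq` (assumed filters).
pub fn passes_filters(rec: &PafRecord, min_len: u64, min_mapq: u8) -> bool {
    let aligned = rec.target_end.saturating_sub(rec.target_start);
    aligned >= min_len && rec.mapq >= min_mapq
}
```

In this sketch a record that fails to parse is skipped rather than treated as an error, which may or may not be appropriate for a production pipeline.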
Vircov is written for implementation in (accredited) metagenomics pipelines for human patients enrolled in the META-GP network (Australia). As such, it attempts to be production-grade code with high test coverage, continuous integration, and versioned releases with precompiled binaries for Linux and macOS.
Version 0.6.0 will be distributed on BioConda and Cargo and will have precompiled binaries available.
Installation
git clone https://github.com/esteinig/vircov
cd vircov && cargo build --release
Usage
vircov tests/cases/test_full_ok.paf --fasta tests/cases/test_ok.fasta --table
Tests
git clone https://github.com/esteinig/vircov
cd vircov && cargo test && cargo tarpaulin
Concept
Definitive viral diagnosis from metagenomic clinical samples can be extremely challenging due to low sequencing depth, large amounts of host reads and low infectious titres, especially in blood or CSF. One way to support a positive viral diagnosis is to look at alignment coverage against one or multiple reference sequences. When only a few reads map to the reference and genome coverage is low, positive infections often display multiple distinct alignment regions, as opposed to reads mapping to a single region or a few regions on the reference.
De Vries et al. (2021) summarize this concept succinctly in this figure (adapted):
In these cases, positive calls can be made from coverage plots that show the distinct alignment regions; the authors choose a threshold on the number of regions (> 3). However, coverage plots require visual assessment and may not be suitable for flagging potential hits in automated pipelines or summary reports.
Vircov attempts to make visual inspection and automated flagging easier by counting the distinct (non-overlapping) coverage regions in an alignment and reporting some helpful statistics to make an educated call.
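The idea of counting distinct coverage regions can be sketched as a standard sort-and-merge over aligned intervals. This is an illustration of the concept only, not Vircov's actual implementation: the type, the function names, the half-open coordinate convention and the choice to keep directly adjacent intervals as separate regions are all assumptions. The region threshold is left as a parameter; the value of 3 in the usage note comes from the threshold cited above.

```rust
/// Half-open interval [start, end) on a reference sequence.
/// Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Merge overlapping intervals into distinct, non-overlapping regions.
/// Intervals that merely touch (end == next start) are kept separate here;
/// that is an assumption of this sketch.
pub fn merge_regions(mut intervals: Vec<Interval>) -> Vec<Interval> {
    intervals.sort_by_key(|iv| (iv.start, iv.end));
    let mut regions: Vec<Interval> = Vec::new();
    for iv in intervals {
        if let Some(last) = regions.last_mut() {
            if iv.start < last.end {
                // Overlaps the current region: extend it if needed.
                last.end = last.end.max(iv.end);
                continue;
            }
        }
        // No overlap with the current region: start a new one.
        regions.push(iv);
    }
    regions
}

/// Fraction of the reference covered by the merged regions.
pub fn covered_fraction(regions: &[Interval], reference_length: u64) -> f64 {
    if reference_length == 0 {
        return 0.0;
    }
    let covered: u64 = regions
        .iter()
        .map(|r| r.end.saturating_sub(r.start))
        .sum();
    covered as f64 / reference_length as f64
}

/// Flag a reference as a potential hit when it has more than
/// `min_regions` distinct coverage regions.
pub fn flag_hit(regions: &[Interval], min_regions: usize) -> bool {
    regions.len() > min_regions
}
```

For example, five alignments at [0, 100), [50, 150), [100, 120), [500, 650) and [900, 1000) on a 1000 bp reference collapse into three regions covering 40% of the reference, which `flag_hit(&regions, 3)` would not flag.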
Clinical examples
Performance
Alignments were conducted with minimap2 -c -x sr (PAF) and minimap2 -ax sr (SAM/BAM, all sequences) or minimap2 --sam-hits-only -ax sr (SAM/BAM, only aligned sequences). Peak memory is mainly determined by the aligned interval records that are stored for the overlap computations. It may vary depending on how many alignments remain after filtering, the number of aligned reads, the output format and the size of the reference database.
- Sample 1: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 1.7 million alignments
- Sample 2: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 63 million alignments
.paf
- Sample 1 (212 MB): 0.56 seconds, peak memory: 12 MB
- Sample 2 (8.9 GB): 55.4 seconds, peak memory: 2.9 GB
Etymology
Not a very creative abbreviation of "virus coverage", but the little spectacles in the logo are a reference to Rudolf Virchow, who described such trivial concepts as cells, cancer and pathology. His surname is pronounced like vircov if you mumble the terminal v.
Contributors
- Prof. Deborah Williamson and Prof. Lachlan Coin (principal investigators for META-GP)
- Dr. Leon Caly (samples and sequencing for testing on clinical data)