VEP-like tool for sequence ontology and HGVS annotation of VCF files

Related tags

Command-line mehari

Overview

Mehari

Mehari is a software package for annotating VCF files with variant effect/consequence. The program uses hgvs-rs for projecting genomic variants to transcripts and proteins and thus has high prediction quality.

Other popular tools offering variant effect/consequence prediction include:

Mehari offers predictions that aim to mirror VariantValidator, the gold standard for HGVS variant descriptions. Further, it is written in the Rust programming language and can be used as a library for users' Rust software.

Supported Sequence Variant Frequency Databases

Mehari can import public sequence variant frequency databases. The supported set slightly differs between import for GRCh37 and GRCh38.

GRCh37

gnomAD r2.1.1 Exomes gnomad.exomes.r2.1.1.sites.vcf.bgz
gnomAD r2.1.1 Genomes gnomad.genomes.r2.1.1.sites.vcf.bgz
gnomAD v3.1 mtDNA gnomad.genomes.v3.1.sites.chrM.vcf.bgz
HelixMTdb HelixMTdb_20200327.tsv

GRCh38

gnomAD r2.1.1 lift-over Exomes gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
gnomAD v3.1 Genomes gnomad.genomes.v3.1.2.sites.$CHROM.vcf.bgz
gnomAD v3.1 mtDNA gnomad.genomes.v3.1.sites.chrM.vcf.bgz
HelixMTdb HelixMTdb_20200327.tsv

Internal Notes

rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch38 --path-helix-mtdb ~/Downloads/HelixMTdb_20200327.vcf.gz --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf  --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf

rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch37 --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf  --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2

prepare()
{
    in=$1
    out=$2

    zcat $in \
    | head -n 5000 \
    | grep ^# \
    > $out

    zcat $in \
    | grep -v ^# \
    | head -n 3 \
    >> $out
}

base=/data/sshfs/data/gpfs-1/groups/cubi/work/projects/2021-07-20_varfish-db-downloader-holtgrewe/varfish-db-downloader/

mkdir -p tests/data/db/create/seqvar_freqs/{12,xy}-{37,38}

## 37 exomes

prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf

## 37 genomes

prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf

## 38 exomes

prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf

## 38 genomes

prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf

Building tx database

cd hgvs-rs-data

seqrepo --root-directory seqrepo-data/master init

mkdir -p mirror/ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot
cd !$
wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.files.installed
parallel -j 16 'wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/{}' ::: $(cut -f 2 human.files.installed | grep fna)
cd -

mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna
cd !$
wget https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/ncrna
cd !$
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.ncrna.fa.gz
cd -

seqrepo --root-directory seqrepo-data/master load -n NCBI $(find mirror/ftp.ncbi.nih.gov -name '*.fna.gz' | sort)
seqrepo --root-directory seqrepo-data/master load -n ENSEMBL $(find mirror/ftp.ensembl.org -name '*.fa.gz' | sort)

cd ../mehari

cargo run --release -- \
    -v \
    db create txs \
        --path-out /tmp/txs-out.bin \
        --path-cdot-json ../cdot-0.2.12.ensembl.grch37_grch38.json.gz \
        --path-cdot-json ../cdot-0.2.12.refseq.grch37_grch38.json.gz \
        --path-seqrepo-instance ../hgvs-rs-data/seqrepo-data/master/master

Development Setup

You will need a recent version of flatbuffers, e.g.:

# bash utils/install-flatbuffers.sh
# export PATH=$PATH:$HOME/.local/share/flatbuffers/bin

Comments

Implement frequency database building for sequence variants
We need to annotate frequencies from frequency databases. We need to store them in fast key/value access mode. We should use rocksdb for such databases.

[ ] implement a CLI for importing the databases

GRCh37 & GRCh38

[ ] gnomAD exomes

[ ] gnomAD genomes

[ ] gnomAD mtDB

[ ] helixmtdb

The following are considered outdated by now (2023/03/03)

mtdb

mitomap

IGSR

GRCh38 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/

GRCh37 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

GRCh37 only

ExAC
opened by holtgrewe 2
Implement transcript-based annotation of sequence variants

We should implement the 2018 version of the annotation standard and document the annotated fields.

https://pcingola.github.io/SnpEff/adds/VCFannotationformat_v1.0.pdf

opened by holtgrewe 0
Document end-to-end invocation
In particular:

[x] how to build frequency database

[x] how to build transcript database

[x] how to create the combined database

[x] how to invoke for annotation of VCF files
opened by holtgrewe 0
Implement command for annotating sequence variants with frequency
Once #2 has been implemented, we need to create a command for annotating sequence variants from a VCF file. Writing the results to VCF is fine for now.

#2
opened by holtgrewe 0
Implement transcript database building
We need a fast database for transcripts and sequences. We have comprehensive transcript database solutions via hgvs. We have made good experience with flatbuffers for fast access.

[ ] create flatbuffers based data structures for transcript information and sequence

[ ] create a CLI for importing transcripts via uta and cdot
opened by holtgrewe 0
Evalute bincode as flatbuffers alternative

While flatbuffers allows to load data only partially, this does not really work when having to put it into an Rc<...> for passing it to the hgvs library.

By this benchmark we should evaluate the bincode crate as well.

opened by holtgrewe 1
Allow for bit packed DNA encoding

The create bio-seq will come in handy.

Also, we might need to encode N gaps somehow for reference sequence. This can be done by storing the sequence as A and encoding the gap positions separately.

opened by holtgrewe 0

VEP-like tool for sequence ontology and HGVS annotation of VCF files

Related tags

Overview

Mehari

Supported Sequence Variant Frequency Databases

Internal Notes

Development Setup

Comments

Implement frequency database building for sequence variants

Implement transcript-based annotation of sequence variants

Document end-to-end invocation

Implement command for annotating sequence variants with frequency

Implement transcript database building

Evalute bincode as flatbuffers alternative

Allow for bit packed DNA encoding

Owner

Berlin Institute of Health

Parallel iteration of FASTA/FASTQ files, for when sequence order doesn't matter but speed does

A tool that allow you to run SQL-like query on local files instead of database files using the GitQL SDK.

Native cross-platform full feature terminal-based sequence editor for git interactive rebase.

ClangQL is a tool that allow you to run SQL-like query on C/C++ Code instead of database files using the GitQL SDK

A bit like tee, a bit like script, but all with a fake tty. Lets you remote control and watch a process

Rust File Management CLI is a command-line tool written in Rust that provides essential file management functionalities. Whether you're working with files or directories, this tool simplifies common file operations with ease.

An open source, programmed in rust, privacy focused tool for reading programming resources (like stackoverflow) fast, efficient and asynchronous from the terminal.

rpsc is a *nix command line tool to quickly search for file systems items matching varied criterions like permissions, extended attributes and much more.

🍅 A command-line tool to get and set values in toml files while preserving comments and formatting

This tool was developed as part of a course on forensic analysis and cybersecurity. It is intended to be used as a training resource to help students understand the structure and content of job files in Windows environments.

A simple command line tool for creating font palettes for engines like libtcod

GREP like cli tool written in rust.

CLI Tool for tagging and organizing files by tags.

CLI tool to bake your fresh and hot MD files

miniserve - a CLI tool to serve files and dirs over HTTP

Voila is a domain-specific language launched through CLI tool for operating with files and directories in massive amounts in a fast & reliable way.

RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions

rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G)

Sniffer - a tool to quickly inspect csv and flat-file files for basic information