VEP-like tool for sequence ontology and HGVS annotation of VCF files

Related tags

Command-line mehari
Overview

CI codecov

Mehari

Mehari is a software package for annotating VCF files with variant effect/consequence. The program uses hgvs-rs for projecting genomic variants to transcripts and proteins and thus has high prediction quality.

Other popular tools offering variant effect/consequence prediction include:

Mehari offers predictions that aim to mirror VariantValidator, the gold standard for HGVS variant descriptions. Further, it is written in the Rust programming language and can be used as a library for users' Rust software.

Supported Sequence Variant Frequency Databases

Mehari can import public sequence variant frequency databases. The supported set slightly differs between import for GRCh37 and GRCh38.

GRCh37

GRCh38

Internal Notes

rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch38 --path-helix-mtdb ~/Downloads/HelixMTdb_20200327.vcf.gz --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf  --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf

rm -rf /tmp/out ; cargo run -- db create seqvar-freqs --path-output-db /tmp/out --genome-release grch37 --path-gnomad-mtdna ~/Downloads/gnomad.genomes.v3.1.sites.chrM.vcf.bgz --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf --path-gnomad-genomes-xy tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf  --path-gnomad-exomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf --path-gnomad-genomes-auto tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2
prepare()
{
    in=$1
    out=$2

    zcat $in \
    | head -n 5000 \
    | grep ^# \
    > $out

    zcat $in \
    | grep -v ^# \
    | head -n 3 \
    >> $out
}

base=/data/sshfs/data/gpfs-1/groups/cubi/work/projects/2021-07-20_varfish-db-downloader-holtgrewe/varfish-db-downloader/

mkdir -p tests/data/db/create/seqvar_freqs/{12,xy}-{37,38}

## 37 exomes

prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh37/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.exomes.r2.1.1.sites.chrY.vcf

## 37 genomes

prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-37/gnomad.genomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh37/gnomAD_genomes/r2.1.1/download/gnomad.genomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-37/gnomad.genomes.r2.1.1.sites.chrX.vcf

## 38 exomes

prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.exomes.r2.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_exomes/r2.1.1/download/gnomad.exomes.r2.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.exomes.r2.1.1.sites.chrY.vcf

## 38 genomes

prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr1.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr1.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chr2.vcf.bgz \
    tests/data/db/create/seqvar_freqs/12-38/gnomad.genomes.r3.1.1.sites.chr2.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrX.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrX.vcf
prepare \
    $base/GRCh38/gnomAD_genomes/r3.1.1/download/gnomad.genomes.r3.1.1.sites.chrY.vcf.bgz \
    tests/data/db/create/seqvar_freqs/xy-38/gnomad.genomes.r3.1.1.sites.chrY.vcf

Building tx database

cd hgvs-rs-data

seqrepo --root-directory seqrepo-data/master init

mkdir -p mirror/ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot
cd !$
wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.files.installed
parallel -j 16 'wget https://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/{}' ::: $(cut -f 2 human.files.installed | grep fna)
cd -

mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna
cd !$
wget https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/ncrna
cd !$
wget https://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.cdna.all.fa.gz
cd -
mkdir -p mirror/ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/
cd !$
wget https://ftp.ensembl.org/pub/grch37/release-108/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.ncrna.fa.gz
cd -

seqrepo --root-directory seqrepo-data/master load -n NCBI $(find mirror/ftp.ncbi.nih.gov -name '*.fna.gz' | sort)
seqrepo --root-directory seqrepo-data/master load -n ENSEMBL $(find mirror/ftp.ensembl.org -name '*.fa.gz' | sort)

cd ../mehari

cargo run --release -- \
    -v \
    db create txs \
        --path-out /tmp/txs-out.bin \
        --path-cdot-json ../cdot-0.2.12.ensembl.grch37_grch38.json.gz \
        --path-cdot-json ../cdot-0.2.12.refseq.grch37_grch38.json.gz \
        --path-seqrepo-instance ../hgvs-rs-data/seqrepo-data/master/master

Development Setup

You will need a recent version of flatbuffers, e.g.:

# bash utils/install-flatbuffers.sh
# export PATH=$PATH:$HOME/.local/share/flatbuffers/bin
Comments
  • Implement frequency database building for sequence variants

    Implement frequency database building for sequence variants

    We need to annotate frequencies from frequency databases. We need to store them in fast key/value access mode. We should use rocksdb for such databases.

    • [ ] implement a CLI for importing the databases

    GRCh37 & GRCh38

    • [ ] gnomAD exomes
    • [ ] gnomAD genomes
    • [ ] gnomAD mtDB
    • [ ] helixmtdb

    The following are considered outdated by now (2023/03/03)

    • mtdb
    • mitomap
    • IGSR
      • GRCh38 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/
      • GRCh37 http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

    GRCh37 only

    • ExAC
    opened by holtgrewe 2
  • Implement transcript-based annotation of sequence variants

    Implement transcript-based annotation of sequence variants

    We should implement the 2018 version of the annotation standard and document the annotated fields.

    https://pcingola.github.io/SnpEff/adds/VCFannotationformat_v1.0.pdf

    opened by holtgrewe 0
  • Document end-to-end invocation

    Document end-to-end invocation

    In particular:

    • [x] how to build frequency database
    • [x] how to build transcript database
    • [x] how to create the combined database
    • [x] how to invoke for annotation of VCF files
    opened by holtgrewe 0
  • Implement command for annotating sequence variants with frequency

    Implement command for annotating sequence variants with frequency

    Once #2 has been implemented, we need to create a command for annotating sequence variants from a VCF file. Writing the results to VCF is fine for now.

    • #2
    opened by holtgrewe 0
  • Implement transcript database building

    Implement transcript database building

    We need a fast database for transcripts and sequences. We have comprehensive transcript database solutions via hgvs. We have made good experience with flatbuffers for fast access.

    • [ ] create flatbuffers based data structures for transcript information and sequence
    • [ ] create a CLI for importing transcripts via uta and cdot
    opened by holtgrewe 0
  • Evalute bincode as flatbuffers alternative

    Evalute bincode as flatbuffers alternative

    While flatbuffers allows to load data only partially, this does not really work when having to put it into an Rc<...> for passing it to the hgvs library.

    By this benchmark we should evaluate the bincode crate as well.

    opened by holtgrewe 1
  • Allow for bit packed DNA encoding

    Allow for bit packed DNA encoding

    The create bio-seq will come in handy.

    Also, we might need to encode N gaps somehow for reference sequence. This can be done by storing the sequence as A and encoding the gap positions separately.

    opened by holtgrewe 0
Owner
Berlin Institute of Health
BIH Core Unit Bioinformatics & BIH HPC IT
Berlin Institute of Health
Parallel iteration of FASTA/FASTQ files, for when sequence order doesn't matter but speed does

Rust-parallelfastx A truly parallel parser for FASTA/FASTQ files. Principle The input file is memory-mapped then virtually split into N chunks. Each c

Rayan Chikhi 8 Oct 24, 2022
A tool that allow you to run SQL-like query on local files instead of database files using the GitQL SDK.

FileQL - File Query Language FileQL is a tool that allow you to run SQL-like query on local files instead of database files using the GitQL SDK. Sampl

Amr Hesham 39 Mar 12, 2024
Native cross-platform full feature terminal-based sequence editor for git interactive rebase.

Native cross-platform full feature terminal-based sequence editor for git interactive rebase.

Tim Oram 1.2k Jan 2, 2023
A bit like tee, a bit like script, but all with a fake tty. Lets you remote control and watch a process

teetty teetty is a wrapper binary to execute a command in a pty while providing remote control facilities. This allows logging the stdout of a process

Armin Ronacher 259 Jan 3, 2023
An open source, programmed in rust, privacy focused tool for reading programming resources (like stackoverflow) fast, efficient and asynchronous from the terminal.

Falion An open source, programmed in rust, privacy focused tool for reading programming resources (like StackOverFlow) fast, efficient and asynchronou

Obscurely 17 Dec 20, 2022
rpsc is a *nix command line tool to quickly search for file systems items matching varied criterions like permissions, extended attributes and much more.

rpsc rpsc is a *nix command line tool to quickly search for file systems items matching varied criterions like permissions, extended attributes and mu

null 3 Dec 15, 2022
🍅 A command-line tool to get and set values in toml files while preserving comments and formatting

tomato Get, set, and delete values in TOML files while preserving comments and formatting. That's it. That's the feature set. I wrote tomato to satisf

C J Silverio 15 Dec 23, 2022
A simple command line tool for creating font palettes for engines like libtcod

palscii A simple command line tool for creating font palettes for engines like libtcod. Usage This can also be viewed by running palscii --help. palsc

Steve Troetti 2 May 2, 2022
GREP like cli tool written in rust.

Show [ grep,tail,cat ] like cli tool written in rust. Only one release as of now which does very basic function,code has been refactored where other f

Siri 4 Jul 24, 2023
CLI Tool for tagging and organizing files by tags.

wutag ?? ??️ CLI tool for tagging and organizing files by tags. Install If you use arch Linux and have AUR repositories set up you can use your favour

Wojciech Kępka 32 Dec 6, 2022
CLI tool to bake your fresh and hot MD files

At least once in your Rust dev lifetime you wanted to make sure all code examples in your markdown files are up-to-date, correct and code is formated, but you couldn't make that done with already existing tools - fear not!

Patryk Budzyński 39 May 8, 2021
miniserve - a CLI tool to serve files and dirs over HTTP

?? For when you really just want to serve some files over HTTP right now!

Sven-Hendrik Haase 4.1k Jan 6, 2023
Voila is a domain-specific language launched through CLI tool for operating with files and directories in massive amounts in a fast & reliable way.

Voila is a domain-specific language designed for doing complex operations to folders & files. It is based on a CLI tool, although you can write your V

Guillem Jara 86 Dec 12, 2022
RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions

RnR is a command-line tool to securely rename multiple files and directories that supports regular expressions. Features Batch rename files and direct

Ismael González Valverde 219 Dec 31, 2022
rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G)

csv, excel toolkit written in Rust rsv is a command line tool to deal with small and big CSV, TXT, EXCEL files (especially >10G). rsv has following fe

Zhuang Dai 39 Jan 30, 2023
Sniffer - a tool to quickly inspect csv and flat-file files for basic information

sniffer sniffer is a tool to quickly inspect csv and flat-file files for basic information. Need to see how many rows are in a csv file? Want to see t

Daniel B 10 Apr 4, 2023
A tool for adding new lines to files, skipping duplicates and write in Rust!

anew A tool for adding new lines to files written in Rust. The tool aids in appending lines from stdin to a file, but only if they don't already appea

z3r0yu 4 Nov 16, 2023
A cli tool to download specific GitHub directories or files

cloneit A cli tool to download specific GitHub directories or files. Installation From git git clone https://github.com/alok8bb/cloneit cd cloneit car

Alok 54 Dec 20, 2022
Command line tool to extract various data from Blender .blend files

blendtool Command line tool to extract various data from Blender .blend files. Currently supports dumping Eevee irradiance volumes to .dds, new featur

null 2 Sep 26, 2021