VCF/BCF [un]compressed [un]indexed

Brent Pedersen

Last update: Sep 24, 2023

Overview

This is a small library for efficiently reading VCF and BCF files, whether they are compressed or uncompressed, indexed or not. Parsing is handled by noodles; this library unifies the handling on top of it.

Even when a file is not indexed, the library supports "skipping" by iterating over records until the requested region is reached. This is useful for bedder-rs and may be useful elsewhere.

The skipping will be most efficient when the files are compressed and indexed.
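The idea of skipping without an index can be sketched with the standard library alone. This is a minimal illustration, not this library's code: the `Record` type and the `skip_to` function are hypothetical stand-ins, and the input is assumed to be sorted by chromosome index and then by position.

```rust
use std::iter::Peekable;

/// Hypothetical stand-in for a VCF/BCF record: a chromosome index and a position.
/// The real library works with records parsed by noodles.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub chrom: usize,
    pub pos: u64,
}

/// Advance `records` until the next record is at or past (`chrom`, `start`).
/// Assumes the input is sorted by chromosome index, then by position.
/// Returns the number of records that were skipped.
pub fn skip_to<I>(records: &mut Peekable<I>, chrom: usize, start: u64) -> usize
where
    I: Iterator<Item = Record>,
{
    let mut skipped = 0;
    while let Some(rec) = records.peek() {
        // Tuples compare lexicographically: chromosome first, then position.
        if (rec.chrom, rec.pos) >= (chrom, start) {
            break;
        }
        records.next();
        skipped += 1;
    }
    skipped
}
```

Because the iterator is only peeked, the first record at or past the target stays available to the caller. The cost is linear in the number of records before the region, which is why an index helps.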

NOTES

Implementation

As implemented, only compressed, indexed VCF or BCF files get the fastest path, which uses the index to query a region directly. All other inputs are skipped by iterating over the file until the given region is reached. This assumes that any indexed file also supports Seek.
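One way to keep both paths on stable Rust is to use two reader types behind a shared trait, with the Seek bound only on the indexed one. The sketch below is an assumption-laden illustration, not this library's design: `SkipTo`, `LinearReader`, `IndexedReader`, and the toy index of (chromosome, byte offset) pairs are all hypothetical, and it operates on uncompressed tab-separated lines rather than real VCF/BCF.

```rust
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

/// Hypothetical shared interface for both kinds of reader.
pub trait SkipTo {
    /// Position the reader at the first line of `chrom`; Ok(true) if found.
    fn skip_to(&mut self, chrom: &str) -> io::Result<bool>;
    /// Next line without its line ending, or None at end of input.
    fn next_line(&mut self) -> io::Result<Option<String>>;
}

fn read_line<R: BufRead>(reader: &mut R) -> io::Result<Option<String>> {
    let mut line = String::new();
    if reader.read_line(&mut line)? == 0 {
        return Ok(None);
    }
    let trimmed = line.trim_end_matches(|c| c == '\n' || c == '\r');
    Ok(Some(trimmed.to_string()))
}

/// True if the first tab-separated field of `line` equals `chrom`.
fn is_on(line: &str, chrom: &str) -> bool {
    line.split('\t').next() == Some(chrom)
}

/// No index: only `Read` is required, and skipping scans line by line.
pub struct LinearReader<R: Read> {
    reader: BufReader<R>,
    pending: Option<String>, // line consumed by a skip but not yet returned
}

impl<R: Read> LinearReader<R> {
    pub fn new(inner: R) -> Self {
        Self { reader: BufReader::new(inner), pending: None }
    }
}

impl<R: Read> SkipTo for LinearReader<R> {
    fn skip_to(&mut self, chrom: &str) -> io::Result<bool> {
        while let Some(line) = self.next_line()? {
            if is_on(&line, chrom) {
                self.pending = Some(line); // hand it back on the next read
                return Ok(true);
            }
        }
        Ok(false)
    }

    fn next_line(&mut self) -> io::Result<Option<String>> {
        match self.pending.take() {
            Some(line) => Ok(Some(line)),
            None => read_line(&mut self.reader),
        }
    }
}

/// Toy index: (chromosome name, byte offset of its first line).
/// Real tabix/CSI indexes are far richer; this only shows the Seek bound.
pub struct IndexedReader<R: Read + Seek> {
    reader: BufReader<R>,
    index: Vec<(String, u64)>,
}

impl<R: Read + Seek> IndexedReader<R> {
    pub fn new(inner: R, index: Vec<(String, u64)>) -> Self {
        Self { reader: BufReader::new(inner), index }
    }
}

impl<R: Read + Seek> SkipTo for IndexedReader<R> {
    fn skip_to(&mut self, chrom: &str) -> io::Result<bool> {
        let offset = self
            .index
            .iter()
            .find(|(c, _)| c.as_str() == chrom)
            .map(|(_, o)| *o);
        match offset {
            Some(o) => {
                // Jump straight to the region instead of scanning.
                self.reader.seek(SeekFrom::Start(o))?;
                Ok(true)
            }
            None => Ok(false),
        }
    }

    fn next_line(&mut self) -> io::Result<Option<String>> {
        read_line(&mut self.reader)
    }
}
```

With this layout the caller picks the reader type explicitly when opening a file, so no specialization is needed; the trade-off is that the choice is made by the caller rather than inferred from the trait bounds.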

Specialization (abandoned)

Specialization is not used, so that the library does not depend on nightly Rust.

AFAICT, one way to do this is with specialization.

If the file is not both indexed and seekable (that is, implementing Seek), then we just iterate over the records to skip to a given region. If it is seekable and has an index, then we use the query functionality.

The Rust documentation indicates that:

Only specialization using the min_specialization feature should be used. The full specialization feature is known to be unsound.

But I can't get this to compile with min_specialization. It also seems that even min_specialization will require nightly for the foreseeable future.

Have a look here

Perhaps there's a completely different and better way to do this. Let me know.
