Fastest GTF/GFF-to-BED converter chilling around

Overview

gxf2bed

The fastest G{F,T}F-to-BED converter around the block!

translates

chr27 gxf2bed gene 17266470 17285418 . + . gene_id "ENSG00000151743";

chr27 gxf2bed transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";

chr27 gxf2bed exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";

...

into

chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

before your eyes blink!
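The coordinate arithmetic behind that example can be sketched in a few lines of Rust. This is a minimal illustration, not gxf2bed's actual code: the function name `exons_to_bed12` is hypothetical, the score (`1000`) and color (`0,0,200`) are copied from the output above, and the second exon's coordinates (17281145-17281218) are not in the elided GTF but are inferred from the block size and block start in the BED line.

```rust
/// Build one BED12 line from exon intervals given in GTF coordinates
/// (1-based, inclusive, with start <= end). Returns `None` when no exons
/// are supplied or a start of 0 is found. Hypothetical helper, for illustration.
fn exons_to_bed12(chrom: &str, name: &str, strand: char, exons: &[(u64, u64)]) -> Option<String> {
    let mut exons = exons.to_vec();
    exons.sort_unstable();

    // BED starts are 0-based, so subtract 1. BED ends are exclusive, so the
    // 1-based inclusive GTF end carries over unchanged.
    let start = exons.first()?.0.checked_sub(1)?;
    let end = exons.iter().map(|&(_, e)| e).max()?;

    let mut sizes = String::new();
    let mut starts = String::new();
    for &(s, e) in &exons {
        // Block size is the exon length; block start is relative to `start`.
        sizes.push_str(&format!("{},", e - s + 1));
        starts.push_str(&format!("{},", s - 1 - start));
    }

    // thickStart/thickEnd simply repeat start/end, as in the example output.
    Some(format!(
        "{chrom}\t{start}\t{end}\t{name}\t1000\t{strand}\t{start}\t{end}\t0,0,200\t{}\t{sizes}\t{starts}",
        exons.len()
    ))
}

fn main() {
    // Exon 1 comes from the GTF above; exon 2 is inferred from the BED line.
    let exons = [(17266470, 17266572), (17281145, 17281218)];
    let line = exons_to_bed12("chr27", "ENST00000541931.8", '+', &exons).unwrap();
    println!("{line}");
}
```

The only non-obvious step is the off-by-one: GTF/GFF intervals are 1-based and closed, BED intervals are 0-based and half-open, so the start shifts by one and the end does not.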

Converts

  • Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 2.99 seconds.
  • Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 1.91 seconds.
  • Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 0.95 seconds.
  • Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.07 seconds.

What's new in v0.2.1

  • Fixes rare failures on malformed GTFs that end with extra '\n' characters
  • Adds test modules for record parsing and attribute parsing
  • Adds a Nextflow module. Thanks to @edmundmiller!
  • Updates run times. gxf2bed is now faster (~0.7s +/- 0.2s) thanks to a change in the hashing algorithm (now using hashbrown)!
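To make the first item concrete, here is one way a line parser can tolerate trailing blank lines instead of failing on them. It is a sketch under assumptions, not the parser shipped in gxf2bed: the `Record` struct and `parse_line` function are hypothetical names, and only a few of the nine columns are kept.

```rust
/// A few of the nine tab-separated GTF/GFF columns, borrowed from the input.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
struct Record<'a> {
    chrom: &'a str,
    feature: &'a str,
    start: u64,
    end: u64,
    strand: &'a str,
    attributes: &'a str,
}

/// Parse one line. Blank lines and `#` comments yield `Ok(None)` so that
/// stray newlines at the end of a file are skipped rather than reported.
fn parse_line<'a>(line: &'a str) -> Result<Option<Record<'a>>, String> {
    let line = line.trim_end_matches(|c| c == '\n' || c == '\r');
    if line.trim().is_empty() || line.starts_with('#') {
        return Ok(None);
    }

    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 9 {
        return Err(format!("expected 9 columns, found {}", cols.len()));
    }

    let start = cols[3].parse::<u64>().map_err(|e| format!("bad start: {e}"))?;
    let end = cols[4].parse::<u64>().map_err(|e| format!("bad end: {e}"))?;

    Ok(Some(Record {
        chrom: cols[0],
        feature: cols[2],
        start,
        end,
        strand: cols[6],
        attributes: cols[8],
    }))
}

fn main() {
    // A tiny in-memory "file" that ends with extra newlines.
    let text = "chr27\tgxf2bed\texon\t17266470\t17266572\t.\t+\t.\tgene_id \"ENSG00000151743\";\n\n\n";
    for line in text.split('\n') {
        match parse_line(line) {
            Ok(Some(rec)) => println!("{rec:?}"),
            Ok(None) => {} // blank line or comment
            Err(e) => eprintln!("skipping malformed line: {e}"),
        }
    }
}
```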

Usage

Usage: gxf2bed[EXE] --input/-i <GTF/GFF> --output/-o <BED> [--parent/-p <PARENT>] [--child/-c <CHILD>] [--feature/-f <FEATURE>]
 
Arguments:
    --input/-i <GTF/GFF>: a .gtf/.gff file
    --output/-o <BED>: path to output .bed file
    --parent/-p <PARENT>: parent node [default: "transcript"]
    --child/-c <CHILD>: child node [default: "exon"]
    --feature/-f <FEATURE>: feature to extract from the attribute line [default: "transcript_id"]

Options:
    --help: print help
    --version: print version
    --threads/-t: number of threads (default: max ncpus)
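As an illustration of what `--feature` selects, the sketch below pulls one key out of the attribute column. It is an assumption-laden example rather than gxf2bed's implementation: `get_attribute` is a hypothetical name, the GFF3 line in `main` is made up for the demo, and values that contain a `;` inside quotes are not handled. It returns a `Result` so that a missing key can be reported instead of causing a panic.

```rust
/// Look up `key` in a GTF (`key "value";`) or GFF3 (`key=value;`) attribute
/// string. Hypothetical helper, for illustration only.
fn get_attribute<'a>(attributes: &'a str, key: &str) -> Result<&'a str, String> {
    for field in attributes.split(';') {
        let field = field.trim();
        if field.is_empty() {
            continue;
        }
        // Split at the first space (GTF) or '=' (GFF3), whichever comes first.
        let (k, v) = field
            .split_once(|c| c == ' ' || c == '=')
            .ok_or_else(|| format!("malformed attribute: {field}"))?;
        if k == key {
            // GTF values are quoted; GFF3 values are not.
            return Ok(v.trim().trim_matches('"'));
        }
    }
    Err(format!("attribute '{key}' not found"))
}

fn main() {
    let gtf = "gene_id \"ENSG00000151743\"; transcript_id \"ENST00000541931.8\";";
    // Made-up GFF3-style attributes for the same transcript.
    let gff3 = "ID=ENST00000541931.8;Parent=ENSG00000151743";

    println!("{:?}", get_attribute(gtf, "transcript_id"));
    println!("{:?}", get_attribute(gff3, "ID"));
    println!("{:?}", get_attribute(gtf, "exon_id"));
}
```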

Installation

to install gxf2bed on your system follow these steps:

  1. get rust: curl https://sh.rustup.rs -sSf | sh on unix, or go here for other options
  2. run cargo install gxf2bed (make sure ~/.cargo/bin is in your $PATH before running it)
  3. use gxf2bed with the required arguments
  4. enjoy!

Build

to build gxf2bed from this repo, do:

  1. get rust (as described above)
  2. run git clone https://github.com/alejandrogzi/gxf2bed.git && cd gxf2bed
  3. run cargo run --release -- -i <GTF/GFF> -o <BED>

Container image

to build the development container image:

  1. run git clone https://github.com/alejandrogzi/gxf2bed.git && cd gxf2bed
  2. initialize docker with start docker or systemctl start docker
  3. build the image with docker image build --tag gxf2bed .
  4. run docker run --rm -v "[dir_where_your_gtf_is]:/dir" gxf2bed -i /dir/<GTF/GFF> -o /dir/<BED>

Conda

to use gxf2bed through Conda just:

  1. conda install gxf2bed -c bioconda or conda create -n gxf2bed -c bioconda gxf2bed

Benchmark + FAQ

If you google "gtf to bed" or "gff to bed" you'll find some posts about recommended tools. Some of these tools are already deprecated or, because of poor maintenance, too difficult to run (even on Linux). This project was conceived to provide an easy way to convert GTF/GFF files to BED and to complete a set of high-performance converters between gene model formats in Rust (bed2gtf, bed2gff and now gxf2bed).

To test the efficiency of gxf2bed, I took 4 GTF/GFF-to-BED converters and ran them on the current GRCh38 GENCODE annotation (v44), which has ~250,000 transcripts [the values shown here are the mean of 5 consecutive runs]. Briefly, the competitors are:

  • UCSC's utils: gtfToGenePred | genePredToBed & gffToGenePred | genePredToBed (taken from here)
  • Signal & Brown: gtf2bed.py & gff32gtf.py && gtf2bed.py (taken from here)
  • ea-utils: gtf2bed.pl & gff2gtf.pl | gtf2bed.pl (taken from here)
  • BEDOPS: gtf2bed & gff2bed (taken from here)

Besides being easy to run and handling both GTF and GFF in a single converter, gxf2bed showed a significant difference in computation time against the other tools for each of the two formats. Even across different numbers of threads, times were practically the same (3.5s +/- 0.4s) and the differences were maintained. It is important to note that GFF3 files took a lot more time than their GTF counterparts. This could be due to the need to chain two different programs: 1) GFF-to-GTF and 2) GTF-to-BED.

Taken together, gxf2bed offers an easier and faster way to convert GTF/GFF files to BED files. Compared with other tools, it could make converting GTF files at least 2-3x faster, and converting GFF3 files 5x faster!
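For readers who want to reproduce figures of the "mean +/- spread over 5 runs" kind, a small timing harness can look like the sketch below. It is not the script used for this benchmark: `time_runs` and `mean_sd` are hypothetical helpers, the workload in `main` is a stand-in, and the spread is computed as a sample standard deviation, which is an assumption about what "+/-" means here.

```rust
use std::time::Instant;

/// Run `job` `runs` times and return each wall-clock duration in seconds.
fn time_runs<F: FnMut()>(runs: usize, mut job: F) -> Vec<f64> {
    (0..runs)
        .map(|_| {
            let t0 = Instant::now();
            job();
            t0.elapsed().as_secs_f64()
        })
        .collect()
}

/// Mean and sample standard deviation (n - 1 denominator).
/// Returns (0.0, 0.0) for an empty slice and a deviation of 0.0 for one value.
fn mean_sd(xs: &[f64]) -> (f64, f64) {
    if xs.is_empty() {
        return (0.0, 0.0);
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    if xs.len() < 2 {
        return (mean, 0.0);
    }
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

fn main() {
    // Stand-in workload. In a real benchmark this closure would launch the
    // converter under test, for example through std::process::Command.
    let times = time_runs(5, || {
        let total: u64 = (0..1_000_000u64).sum();
        std::hint::black_box(total);
    });
    let (mean, sd) = mean_sd(&times);
    println!("{mean:.3}s +/- {sd:.3}s over {} runs", times.len());
}
```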


Comments
  • Error parsing attribute

    Hey! Love the concept going on here and wanted to replace some perl scripts in some pipelines with the tool.

    I went to make a nf-core module for it and just got errors though.

    https://github.com/nf-core/modules/pull/4951

    Details

    thread '<unnamed>' panicked at src/main.rs:197:33:
    called `Result::unwrap()` on an `Err` value: "Error parsing attribute"

    Reproduced it locally with gxf2bed -i genome.gff3 -o genome.bed as well.

    bug 
    opened by edmundmiller 3
Releases(v0.2.1)
  • v0.2.1(Feb 27, 2024)

  • v0.2.0(Feb 20, 2024)

    What's new in v0.2.0

    • gxf2bed now provides 3 new optional arguments: parent, child and feature
    • The parent argument specifies the main node from which the new .bed file will be built. This can be read as 'which feature (3rd column) do I want my .bed file to be built from'
    • The child argument specifies the child node of the .bed file. This can be read as 'which lines will make up my parent node information'
    • The feature argument specifies what gxf2bed will use as names in the .bed file. It can be 'gene_id', 'transcript_id' or anything you want. You just need to be sure that this feature is present in all parent and child lines.
    • These new additions do not compromise the initial functionality.
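One way to picture how the three arguments fit together is the grouping sketch below. It is an illustration only, not gxf2bed's code: `Line`, `Model` and `group` are hypothetical names, each line's `name` is assumed to have been extracted already using the chosen feature key, and `std::collections::HashMap` is used to keep the example dependency-free even though the release notes mention hashbrown.

```rust
use std::collections::HashMap;

/// One annotation line reduced to what the grouping needs. `name` is the value
/// of the chosen feature key (e.g. transcript_id). Hypothetical type.
struct Line<'a> {
    feature: &'a str,
    start: u64,
    end: u64,
    name: &'a str,
}

/// What gets collected for each name: the parent span and the child blocks.
#[derive(Debug, Default, PartialEq)]
struct Model {
    span: Option<(u64, u64)>,
    blocks: Vec<(u64, u64)>,
}

/// Group lines by name: `parent` lines set the span, `child` lines become
/// blocks, and every other feature type is ignored.
fn group<'a>(lines: &[Line<'a>], parent: &str, child: &str) -> HashMap<&'a str, Model> {
    let mut models: HashMap<&str, Model> = HashMap::new();
    for l in lines {
        if l.feature == parent {
            models.entry(l.name).or_default().span = Some((l.start, l.end));
        } else if l.feature == child {
            models.entry(l.name).or_default().blocks.push((l.start, l.end));
        }
    }
    models
}

fn main() {
    // The three GTF lines from the overview example.
    let lines = [
        Line { feature: "gene", start: 17266470, end: 17285418, name: "ENSG00000151743" },
        Line { feature: "transcript", start: 17266470, end: 17281218, name: "ENST00000541931.8" },
        Line { feature: "exon", start: 17266470, end: 17266572, name: "ENST00000541931.8" },
    ];

    // Defaults from the usage text: parent = transcript, child = exon.
    for (name, model) in group(&lines, "transcript", "exon") {
        println!("{name}: {model:?}");
    }
}
```

With the defaults, the gene line is skipped because it is neither the parent nor the child feature, which is why the chosen feature key only needs to be present on parent and child lines.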
  • v0.1.0(Feb 6, 2024)

Owner
Alejandro Gonzales-Irribarren
rare case of a vet turned bioinformatician | DVM @ UNMSM