bed2gff
A Rust BED-to-GFF3 translator.
translates
chr7 56766360 56805692 ENST00000581852.25 1000 + 56766360 56805692 0,0,200 3 3,135,81, 0,496,39251,
into
chr7 bed2gff gene 56399404 56805692 . + . ID=ENSG00000166960;gene_id=ENSG00000166960
chr7 bed2gff transcript 56766361 56805692 . + . ID=ENST00000581852.25;Parent=ENSG00000166960;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25
chr7 bed2gff exon 56766361 56766363 . + . ID=exon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff CDS 56766361 56766363 . + 0 ID=CDS:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
...
chr7 bed2gff start_codon 56766361 56766363 . + 0 ID=start_codon:ENST00000581852.25.1;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=1
chr7 bed2gff stop_codon 56805690 56805692 . + 0 ID=stop_codon:ENST00000581852.25.3;Parent=ENST00000581852.25;gene_id=ENSG00000166960;transcript_id=ENST00000581852.25,exon_number=3
...
in a few seconds.
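The coordinate shift in the example above (BED start 56766360 becomes GFF3 start 56766361) follows from BED being 0-based and half-open while GFF3 is 1-based and inclusive. The snippet below is a minimal sketch of that conversion for the exon blocks of a BED12 line. The function name `bed_blocks_to_gff_exons` is hypothetical and is not part of the bed2gff API; the arithmetic is the standard BED-to-GFF3 rule, not a copy of the tool's code.

```rust
/// Convert BED12 blocks (0-based, half-open) into GFF3 exon intervals
/// (1-based, inclusive). Hypothetical helper, for illustration only.
fn bed_blocks_to_gff_exons(
    chrom_start: u64,
    block_sizes: &str,
    block_starts: &str,
) -> Vec<(u64, u64)> {
    // BED12 lists usually carry a trailing comma, so empty pieces are skipped.
    let parse = |s: &str| -> Vec<u64> {
        s.split(',')
            .filter(|p| !p.is_empty())
            .map(|p| p.parse().expect("block values must be integers"))
            .collect()
    };
    let sizes = parse(block_sizes);
    let starts = parse(block_starts);

    sizes
        .iter()
        .zip(starts.iter())
        .map(|(size, offset)| {
            // +1 moves the 0-based start to 1-based; the half-open end
            // already equals the inclusive 1-based end.
            let start = chrom_start + offset + 1;
            let end = chrom_start + offset + size;
            (start, end)
        })
        .collect()
}

fn main() {
    // Values taken from the BED line shown above.
    let exons = bed_blocks_to_gff_exons(56766360, "3,135,81,", "0,496,39251,");
    for (i, (start, end)) in exons.iter().enumerate() {
        println!("exon {}: {}-{}", i + 1, start, end);
    }
}
```

With the BED line above, the first exon comes out as 56766361-56766363 and the last one ends at 56805692, matching the exon and stop_codon lines in the example output.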
Usage
Usage: bed2gff[EXE] --bed <BED> --isoforms <ISOFORMS> --output <OUTPUT>
Arguments:
--bed <BED>: a .bed file
--isoforms <ISOFORMS>: a tab-delimited file
--output <OUTPUT>: path to output file
Options:
--help: print help
--version: print version
Warning
All the transcripts in .bed file should appear in the isoforms file.
crate: https://crates.io/crates/bed2gff
Detailed formats
bed2gff just needs two files:
- a .bed file: a tab-delimited file with 3 required and 9 optional fields:
chrom   chromStart  chromEnd    name    ...
  |         |          |          |
chr20   50222035    50222038    ENST00000595977    ...
see BED format for more information
- a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in the .bed file should appear in the isoforms file):
> cat isoforms.txt
ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
you can build a custom file for your preferred species using Ensembl BioMart.
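Since every transcript in the .bed file has to appear in the isoforms file, it can help to check this before running the conversion. The snippet below is a rough sketch of such a check, assuming one `gene<TAB>transcript` pair per line as in the example above. `parse_isoforms` and `missing_transcripts` are hypothetical helper names, not functions exported by bed2gff.

```rust
use std::collections::HashMap;

/// Build a transcript -> gene lookup from the isoforms text
/// (one `gene<TAB>transcript` pair per line). Hypothetical helper.
fn parse_isoforms(text: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        let mut fields = line.trim().split('\t');
        // Lines without two fields (e.g. blank lines) are ignored.
        if let (Some(gene), Some(transcript)) = (fields.next(), fields.next()) {
            map.insert(transcript.to_string(), gene.to_string());
        }
    }
    map
}

/// Return the BED transcript names that have no gene in the lookup.
fn missing_transcripts<'a>(
    bed_names: &[&'a str],
    isoforms: &HashMap<String, String>,
) -> Vec<&'a str> {
    bed_names
        .iter()
        .copied()
        .filter(|name| !isoforms.contains_key(*name))
        .collect()
}

fn main() {
    let text = "ENSG00000198888\tENST00000361390\nENSG00000188868\tENST00000595977\n";
    let isoforms = parse_isoforms(text);
    // The fourth BED column holds the transcript name.
    let bed_names = ["ENST00000595977", "ENST00000581852.25"];
    let missing = missing_transcripts(&bed_names, &isoforms);
    println!("transcripts without a gene: {:?}", missing);
}
```

An empty result means the two files are consistent; anything else lists the transcripts that still need a row in the isoforms file.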
Installation
to install bed2gff on your system follow these steps:
- get rust: curl https://sh.rustup.rs -sSf | sh on unix, or see the official Rust installation page for other options
- run cargo install bed2gff (make sure ~/.cargo/bin is in your $PATH before running it)
- use bed2gff with the required arguments
- enjoy!
Library
to include bed2gff as a library and use it within your project follow these steps:
- include bed2gff = "0.1.0" under [dependencies] in the Cargo.toml file
- the library name is bed2gff; to use it just write: use bed2gff::bed2gff; or use bed2gff::*;
- invoke bed2gff(bed: &String, isoforms: &String, output: &String), for example: let gff = bed2gff(&bed, &isoforms, &output);
Build
to build bed2gff from this repo, do:
- get rust (as described above)
- run git clone https://github.com/alejandrogzi/bed2gff.git && cd bed2gff
- run cargo run --release <BED> <ISOFORMS> <OUTPUT> (arguments are positional, so you do not need to specify --bed/--isoforms)
Output
bed2gff writes the output to the path you specify; to place it next to the .bed file, give an output path in the same directory:
bed2gff annotation.bed isoforms.txt output.gff3
.
├── ...
├── isoforms.txt
├── annotation.bed
└── output.gff3
where output.gff3 is the result.
FAQ
Why?
Converting between formats is a daily task in bioinformatics. It is especially common when working with gene annotations, because tools differ in their input/output layouts. GTF, GFF and BED are the most widely used formats for storing gene-related annotations, yet the available software does not cover conversion needs well.
A considerable portion of genomic tools narrow the software space by accepting only GTF/GFF3 files, forcing BED users to translate their files into other formats. While some of these issues have already been addressed for GTF files (e.g. bed2gtf), the GFF3 layout lacks stable conversion tools (1, 2).
bed2gff is presented as a straightforward option to convert BED files into ready-to-use GFF3 files, closing that gap.
How?
bed2gff builds on the code base of bed2gtf, which is essentially a reimplementation of UCSC's C binaries merged into a single step (bedToGenePred + genePredToGtf). Before any conversion, the tool sorts the .bed file internally, using an algorithmic approach similar to the one in gtfsort. This step lets bed2gff write the output file already sorted in a natural and convenient order. It then evaluates the positions of exons and other features (CDS, start/stop codons, UTRs), preserving reading frames and adjusting the coordinate indexing.
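To give an idea of what a "natural" order means here, the snippet below sketches one possible sort key: numbered chromosomes first in numeric order (chr2 before chr10), then named ones such as chrX, with records on the same chromosome ordered by start. This is an assumption made for illustration. It is not the algorithm used by gtfsort or bed2gff, and `chrom_key`, `sort_records` and the `Record` alias are hypothetical names.

```rust
/// (chrom, start, end, name), a simplified stand-in for a BED record.
type Record = (String, u64, u64, String);

/// Sort key for a chromosome name: numbered chromosomes come first in
/// numeric order, everything else (X, Y, M, scaffolds) follows alphabetically.
fn chrom_key(chrom: &str) -> (u8, u32, String) {
    let name = chrom.strip_prefix("chr").unwrap_or(chrom);
    match name.parse::<u32>() {
        Ok(n) => (0, n, String::new()),
        Err(_) => (1, 0, name.to_string()),
    }
}

/// Sort records by chromosome (natural order), then by start position.
fn sort_records(mut records: Vec<Record>) -> Vec<Record> {
    records.sort_by_key(|r| (chrom_key(&r.0), r.1));
    records
}

fn main() {
    let records = vec![
        ("chr10".to_string(), 5, 10, "a".to_string()),
        ("chrX".to_string(), 1, 2, "b".to_string()),
        ("chr2".to_string(), 100, 200, "c".to_string()),
        ("chr2".to_string(), 50, 60, "d".to_string()),
    ];
    for (chrom, start, end, name) in sort_records(records) {
        println!("{}\t{}\t{}\t{}", chrom, start, end, name);
    }
}
```

A plain string sort would put chr10 before chr2, which is why the numeric part is parsed before comparing.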
Following the rationale of bed2gtf, bed2gff produces a ready-to-use GFF3 file by using an isoforms file, which works like the refTable in the C binaries and maps each transcript to its respective gene.
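One thing the transcript-to-gene mapping makes possible is a gene line whose coordinates cover all of its transcripts, as in the example at the top where the gene starts well before the transcript shown. The snippet below is a sketch of that idea under the assumption that a gene spans from the smallest start to the largest end of its transcripts. `gene_spans` is a hypothetical name, the ids and coordinates are toy values, and the source does not state how bed2gff handles transcripts missing from the mapping (here they are simply skipped).

```rust
use std::collections::HashMap;

/// Compute each gene's span as the smallest start and largest end among
/// its transcripts. `transcripts` holds (transcript_id, start, end) and
/// `isoforms` maps transcript_id -> gene_id. Hypothetical helper.
fn gene_spans(
    transcripts: &[(&str, u64, u64)],
    isoforms: &HashMap<&str, &str>,
) -> HashMap<String, (u64, u64)> {
    let mut spans: HashMap<String, (u64, u64)> = HashMap::new();
    for (id, start, end) in transcripts {
        // Transcripts without a gene are skipped in this sketch.
        if let Some(gene) = isoforms.get(id) {
            let span = spans.entry(gene.to_string()).or_insert((*start, *end));
            span.0 = span.0.min(*start);
            span.1 = span.1.max(*end);
        }
    }
    spans
}

fn main() {
    // Toy ids and coordinates, not real annotations.
    let isoforms = HashMap::from([("tx_a", "gene_1"), ("tx_b", "gene_1")]);
    let transcripts = [("tx_a", 100, 200), ("tx_b", 50, 150)];
    for (gene, (start, end)) in gene_spans(&transcripts, &isoforms) {
        println!("{}\t{}\t{}", gene, start, end);
    }
}
```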
To Do's
- Allow users to input compressed files (e.g. .gz, .bgzip)
- Test GFF3 with different types of aligners
- Improve the error module
- Add test modules for most of the scripts
- Allow users to specify their parent/child relationships (?)