Parallel iteration of FASTA/FASTQ files, for when sequence order doesn't matter but speed does

Overview

Rust-parallelfastx

A truly parallel parser for FASTA/FASTQ files.

Principle

The input file is memory-mapped, then virtually split into N chunks. Each chunk is fed, in its own thread, to a regular FASTA/FASTQ parser (here, the excellent seq_io library: https://github.com/markschl/seq_io).
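
The chunking idea can be sketched with the standard library alone. This is a minimal illustration, not the library's actual code: it works on an in-memory byte slice standing in for the memory-mapped file, handles FASTA only (FASTQ boundaries are harder to find because `@` can also start a quality line), and counts records instead of calling seq_io. The function names `chunk_starts`, `count_records`, and `parallel_record_count` are hypothetical.

```rust
use std::thread;

/// Split `data` into at most `n` chunks and return the chunk boundaries.
/// Each tentative boundary is moved forward to the next FASTA header
/// (a '>' at the start of a line), so no record straddles two chunks.
fn chunk_starts(data: &[u8], n: usize) -> Vec<usize> {
    let mut bounds = vec![0];
    for i in 1..n {
        let mut pos = i * data.len() / n;
        while pos < data.len()
            && !(data[pos] == b'>' && (pos == 0 || data[pos - 1] == b'\n'))
        {
            pos += 1;
        }
        bounds.push(pos);
    }
    bounds.push(data.len());
    // Boundaries are non-decreasing, so dedup removes empty chunks.
    bounds.dedup();
    bounds
}

/// Stand-in for a real parser: count header lines in one chunk.
fn count_records(chunk: &[u8]) -> usize {
    chunk
        .split(|&b| b == b'\n')
        .filter(|line| line.first() == Some(&b'>'))
        .count()
}

/// Process every chunk in its own scoped thread and sum the results.
/// Record order across chunks is not preserved, only the total.
fn parallel_record_count(data: &[u8], n: usize) -> usize {
    let bounds = chunk_starts(data, n.max(1));
    thread::scope(|s| {
        let handles: Vec<_> = bounds
            .windows(2)
            .map(|w| {
                let chunk = &data[w[0]..w[1]];
                s.spawn(move || count_records(chunk))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<usize>()
    })
}
```

Calling `parallel_record_count(data, 4)` on a FASTA byte slice returns the same record count as a single-threaded pass; with a real memory map, each thread would read only its own region of the file.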

Rationale

Virtually all other "multithreaded" FASTA/FASTQ parsers use a single thread to parse the file, then hand the parsed sequences to worker threads. If your disk is fast enough (> 2 GB/s) that parsing becomes CPU-bound, you might benefit from this library, as the parsing itself is truly multithreaded.

How to use

See src/main.rs, which should be self-explanatory.

Related work

Inspiration for this repository is the amazing fastlwc-mt tool from https://github.com/expr-fi/fastlwc which does multi-threaded line counting faster than wc.

A related work on parallel FASTX parsing, which I was not aware of at the time of development, is https://github.com/natir/in_place_fastx. See natir/in_place_fastx#1 for a benchmark.

Caveat

The input file needs to be seekable, which rules out all compression methods except blocked ones. Blocked compression is not currently supported by this library, but could be in principle.

Author

Rayan Chikhi, 2022

Comments
  • compare with needletail and kseq.h

    Hello Rayan,

    It would be interesting to compare with biofast: https://github.com/lh3/biofast

    It seems that for the test dataset M_abscessus_HiSeq.fq, needletail is still faster than even the parallel version of seq_io. This repo takes about 3 seconds with 12 threads, while needletail takes 0.8 seconds for the plain FASTQ file (M_abscessus_HiSeq.fq).

    Thanks,

    Jianshu

    opened by jianshu93 4
  • parsing larger number of fasta files

    Hello Rayan,

    I have more than 1 million FASTA files to parse, each about 3-5 MB, about 3 TB in total. I am wondering how I can read such a huge number of files in parallel using all the CPU cores I have. It seems no such tool is available right now. Any suggestions?

    Thanks,

    Jianshu

    opened by jianshu93 1
  • Thoughts about the possibility of paired end parsing

    I've been very interested in fast FASTA/FASTQ parsing, but one thing that always comes up is the need to synchronize paired-end sequencing files. Any thoughts on whether that might be possible here?

    opened by rob-p 1
Owner
Rayan Chikhi