This library provides implementations of many algorithms and data structures that are useful for bioinformatics.

Overview

Crates.io Crates.io Crates.io GitHub Workflow Status (branch) Coveralls DOI

Rust-Bio logo Rust-Bio, a bioinformatics library for Rust.

This library provides Rust implementations of algorithms and data structures useful for bioinformatics. All provided implementations are rigorously tested via continuous integration.

Please see the API documentation for available features and examples of how to use them.

When using Rust-Bio, please cite the following article:

Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.

Further, you can cite the used versions via DOIs:

Rust-Bio: DOI

Contribute

Any contributions are welcome, from a simple bug report to full-blown new modules:

If you find a bug and don't have the time or in-depth knowledge to fix it, just check if you can add info to an existing issue and otherwise file a bug report with as many infos as possible. Pull requests are welcome if you want to contribute fixes, documentation, or new code. Before making commits, it would be helpful to first install pre-commit to avoid failed continuous integration builds due to issues such as formatting:

  1. Install pre-commit (see pre-commit.com/#installation)
  2. Run pre-commit install in the rust-bio base directory

Depending on your intended contribution frequency, you have two options for opening pull requests:

  1. For one-time contributions, simply fork the repository, apply your changes to a branch in your fork and then open a pull request.
  2. If you plan on contributing more than once, become a contributor by saying hi on the rust-bio Discord server, Together with a short sentence saying who you are and mentioning what you want to contribute. We'll add you to the team. Then, you don't have to create a fork, but can simply push new branches into the main repository and open pull requests there.

If you want to contribute and don't know where to start, have a look at the roadmap.

Documentation guidelines

Every public function and module should have documentation comments. Check out which types of comments to use where. In rust-bio, documentation comments should:

  • explain functionality
  • give at least one useful example of how to use it (best as doctests, that run during testing, and using descriptive expect() statements for handling any Err()s that might occur)
  • describe time and memory complexity listed (where applicable)
  • cite and link sources and explanations for data structures, algorithms or code (where applicable)

For extra credit, feel free to familiarize yourself with:

Minimum supported Rust version

Currently the minimum supported Rust version is 1.51.0.

License

Licensed under the MIT license http://opensource.org/licenses/MIT. This project may not be copied, modified, or distributed except according to those terms.

Comments
  • Join the team!

    Join the team!

    Everybody is welcome to officially join the team. This does not involve any duties. Just contribute features you need for your own work. To join, please post here.

    opened by johanneskoester 87
  • Usage survey for grant application

    Usage survey for grant application

    Dear @rust-bio/contributors, @MikkelSchubert, @mbrubeck, @veldsla, @bpr, @rednum, @natir, @flying-sheep, @althonos, @llogiq, @eclarke, @skade, @parir, @marcsch, @bow, @apirogov, @mahdi-b, @audy, @mariokostelac, @zitsen, @ghost,

    I am currently writing a grant application for the sustainable development of Rust-Bio. For this, it is beneficial to name the research projects, tools or libraries, that use the software. Would you be so kind to post stuff you know about or work on yourself? If you don't mind, please also mention unpublished projects.

    Thank you so much in advance! If we get the grant, we can hire a full-time developer.

    Best, Johannes

    opened by johanneskoester 32
  • Added a naive orf research algorithm

    Added a naive orf research algorithm

    Added an orf research algorithm.

    Matched orf are returned as &[u8]. The algorithm only searches for orf in the 3 possible reading frames of the provided sequence without examining reverse complementary strand sequence. If it did, then it would be necessary to calculate the reverse complementary strand, and orf couldn't be returned as slices because the reverse complementary strand sequence wouldn't live outside of the function.

    The complexity of this algorithm is linear, the 3 reading frames are read simultaneously which means the sequence only needs to be parsed one time.

    Refining the method could include comparing the gc content of the obtained orf to the gc content of the whole sequence (which is why a gc.rs is present in the seq_analysis folder).

    opened by althonos 27
  • Occ::get performance, redux

    Occ::get performance, redux

    I ended up writing a blog post about the tools I used for profiling the change I proposed to Occ::get. Fun story, @aatch pointed out on Reddit that the slowdown in some other versions I had tried was probably due to the overhead of dealing with alignment in a vectorized version of the function, and noted that 3 may be an odd sample rate to test with the FMIndex. As an example, bowtie2's FM index appears to default to a sampling rate of 32, and I've seen a few cases of using higher rates as well.

    The version of Occ::get proposed by Aatch is at approximate parity with the filter/count version for performance at a sampling rate of 32 (just a tiny bit slower), but it quickly gets much, much faster as the sampling rate increases. For example, with a sampling rate of 128 (in use in one of my applications):

    filter/count, k = 128

    test search_index_seeds ... bench:     179,442 ns/iter (+/- 1,947)
    

    vectorized, k = 128

    test search_index_seeds ... bench:     152,368 ns/iter (+/- 433)
    

    filter/count, k = 64

    test search_index_seeds ... bench:     105,141 ns/iter (+/- 1,665)
    

    vectorized, k = 64

    test search_index_seeds ... bench:      99,044 ns/iter (+/- 252)
    

    filter/count, k = 32

    test search_index_seeds ... bench:      74,427 ns/iter (+/- 5,166)
    

    vectorized, k = 32

    test search_index_seeds ... bench:      75,333 ns/iter (+/- 303)
    

    Seems like an overall win to me. At a k of 32, the averages are just slightly higher than the filter/count version, but well within the error margin reported by libtest.

    The one downside here is that at smaller sampling rates, a non-vectorized version wins out:

    filter/count, k = 8

    test search_index_seeds ... bench:      40,598 ns/iter (+/- 1,170)
    

    vectorized, k = 8

    test search_index_seeds ... bench:      53,842 ns/iter (+/- 224)
    

    Thoughts here? Without integer generics (at all) and specialization (in stable), I'm not sure it's possible to cleanly maximize performance of small sampling rates and large ones. I'm biased here because I'm working with very large indices that need high sampling rates, but I'm not sure what other values of k are "in the wild."

    enhancement WIP 
    opened by anp 26
  • Multiple sentinels and a sampled suffix array

    Multiple sentinels and a sampled suffix array

    In making SampledSuffixArray work, I discovered that multiple "canonical" implementations of FMIndex don't seem to support multiple sentinels in the text. Ben Langmead's Python notebooks host an implementation which can't handle multiple sentinels, and it looks like SeqAn doesn't either (although they do in a roundabout way by using a generalized suffix array).

    So the question I have is -- is it possible to support a sampled suffix array in rust-bio without breaking the FMDIndex? I see some potential options:

    • Make it clear in docs (and in code, somehow?) that only text with a single sentinel can be used in a SampledSuffixArray. Prevent FMDIndex-produced intervals from querying a SampledSuffixArray, etc.
    • Implement a generalized suffix array and sidestep the problem (although I'm not sure what interaction that has with sampling, and have only glanced at the literature on them)
    • Find a way to make a single-sequence suffix array sampleable without breaking on multiple sentinels. I haven't seen any research yet on this exact issue, in part because I think most who need to support multiple sequences are building generalized SAs.

    I'm not sure what the best course of action is, but I'm eager to push forward on sampled SAs, so I'm very curious what others thoughts are here.

    enhancement 
    opened by anp 26
  • Partial-Order Alignments

    Partial-Order Alignments

    I'd like to create a tracking-issue for the branch I'm working on to add a partial-order aligner to the repetoire of alignment tools in rust-bio: https://github.com/rust-bio/rust-bio/tree/partial-order-aligner

    To make the code reach a sort of minimum viable product-level, it still needs:

    • Convert main() and example4.fa to inline example code
    • Add unit-tests
    • Individual read-path tracking and multi-alignment sequence output
    • Make the internal aligner a member variable and make it customizabie at instantiation
    • Some form of user-specified path-pruning

    Beyond that, I am open to suggestions, particularly if some of the more experienced contributors have suggestions for improving the API, or stylistic or efficiency improvements to suggest for the code itself.

    opened by bnbowman 24
  • Use tuple structs/newtypes for data structures

    Use tuple structs/newtypes for data structures

    @cramertj reminded me of this with a comment on a PR. A number of data structures are implemented as type aliases currently:

    • https://github.com/rust-bio/rust-bio/blob/master/src/data_structures/suffix_array.rs#L26
    • https://github.com/rust-bio/rust-bio/blob/master/src/data_structures/bwt.rs#L17
    • https://github.com/rust-bio/rust-bio/blob/master/src/stats/mod.rs#L14
    • https://github.com/rust-bio/rust-bio/blob/master/src/alphabets/mod.rs#L27
    • https://github.com/rust-bio/rust-bio/blob/master/src/pattern_matching/kmp.rs#L31

    However, I think that using tuple structs/newtypes (pub struct RawSuffixArray(Vec<usize>)) would grant some extra compile-time verification to the use of these data structures (i.e. to prevent accidentally giving any old Vec to a function which needs RawSuffixArray), along with better error messages.

    This would be a significant change to the API, but would only break clients who used the underlying types in a function signature, rather than the aliases. It would also allow greater leeway in the future to change the representation of the data structures without it being a breaking change.

    Thoughts?

    opened by anp 24
  • Docathon: update documentation for IO methods

    Docathon: update documentation for IO methods

    Updates for examples to FASTA (#295) and GFF (#297) file reading and writing. I've updates FASTA and BED examples to be consistent.

    Please excuse errors as I'm new at Rust. I hope this helps. Suggestions to improve it before merging are welcome!

    If there is recommended (quick) way to test whether these examples work, that would be helpful. I have files of these format(s) that I can test it on in a Linux environment.

    The package builds locally but I think examples are commented out so of course it does.

    documentation 
    opened by TomKellyGenetics 23
  • Alignment w/o traceback

    Alignment w/o traceback

    Add a function for hamming distance hamm_dist + doc + test + bench. Add a function for edit distance lev_dist + doc + test + bench. Add the pretty() method to Alignment to return a String with pretty formatted alignments + doc. Add 3x3 easy-to-use functions for computing alignments, scores and returning pretty formatted alignments + doc + test + bench for all of them.

    Due to a bug in the global alignment method the benchmark bench_fn_global_score fails now.

    opened by vadimnazarov 20
  • Gzip compatible fasta reader

    Gzip compatible fasta reader

    Adds a method to bio::io::fasta::Reader that takes a path to a gunzip-compressed FASTA file and a single test to ensure each Record is identical between plain-text and compressed inputs. Adds flate2 as a library dependency.

    A flate2::read::GzDecoder object can already be passed to Reader::new() on the user side, but I noticed there are no library methods for this functionality (I hope I didn't miss them) and figured this could be a good first-time contribution for a rust novice if there is interest. Implementation as a trait would probably be best for introducing this functionality library-wide. Happy to take that on if this sounds like a good idea to folks.

    opened by amsesk 19
  • Logo for Rust-Bio

    Logo for Rust-Bio

    During the rust-bio docathon 2020 I created a logo for our discord channel. I finally found some time to prepare two additional versions and put on some finishing touches. If we would like to adopt a project logo, I would offer these as a suggestion (or an initial draft to iterate upon).

    All of these are based on Ferris the Rustacean designed and released under CC0 1.0 by Karen Rustad Tölva. SVG versions, some background information on how these were created and png exports can be found at https://github.com/HenningTimm/rust-bio-logo.

    Version 1: As used on the discord server

    rust_bio_logo_discord

    Version 2: As used on the discord server but with thicker lines

    rust_bio_logo_thick_strand_512

    Version 3: With a curled strand

    rust_bio_logo_alt_small

    opened by HenningTimm 17
  • chore: release 1.1.1

    chore: release 1.1.1

    :robot: I have created a release *beep* *boop*

    1.1.1 (2023-01-02)

    Performance Improvements


    This PR was generated with Release Please. See documentation.

    autorelease: pending 
    opened by github-actions[bot] 0
  • Wrap sequence in output file

    Wrap sequence in output file

    A common convention is to wrap lines at 60 characters in a FASTA file. Would it be possible to add an option to wrap the sequences when writing out to a file? For example:

    /// Write a Fasta record with given id, optional description and sequence.
        pub fn write(&mut self, id: &str, desc: Option<&str>, seq: TextSlice<'_>) -> io::Result<()> {
            self.writer.write_all(b">")?;
            self.writer.write_all(id.as_bytes())?;
            if let Some(desc) = desc {
                self.writer.write_all(b" ")?;
                self.writer.write_all(desc.as_bytes())?;
            }
            self.writer.write_all(b"\n")?;
            for chunk in seq.chunks(wrap) {
                self.writer.write_all(chunk)?;
                self.writer.write_all(b"\n")?;
            }
            self.writer.write_all(b"\n")?;
    
            Ok(())
        }
    
    opened by alvanuffelen 1
  • docs: Documenting expand kmer

    docs: Documenting expand kmer

    This is still a WIP, I've tried to understand the function and what it does and document it to the best of my abilities. To answer the original post:

    • allowed_mismatches is strictly tested >therefore there's indeed a maximum of allowed_mismatches + 1 mismatch possible.
    • Output matches don't seem to undergo any kind of overlapping verification, therefore no need to know incase of overlaping mismatches close #494
    opened by Ptipiak 0
  • DOC: Not documented that FM index / suffix array / BWT require terminal sentinel character

    DOC: Not documented that FM index / suffix array / BWT require terminal sentinel character

    I spent a good hour debugging backward_search using the code fragment from backward_search: https://github.com/rust-bio/rust-bio/blob/eb9d378c70bc43d1dcf477e48cb167802799a546/src/data_structures/fmindex.rs#L103-L145

    Turns out, the reference fasta I read in didn't have a sentinel character $ at the end. Lacking this terminal $ makes backward_search produce wrong results.

    While the example texts do contain that character (that's how I got the idea to attach it to my reference) - it's not mentioned anywhere as far as I can see.

    There should either be a note, or maybe even a check during creation of the FM index - unless there's a use case without terminal $. Or even more convenient - add a terminal $.

    opened by corneliusroemer 0
  • DOC: public fn alignment::sparse::expand_kmer_matches not documented

    DOC: public fn alignment::sparse::expand_kmer_matches not documented

    The following public function lacks a doc string: https://github.com/rust-bio/rust-bio/blob/43d198e87c69867fad39f07374d3fdcd9d879729/src/alignment/sparse.rs#L398

    Some important questions to answer:

    • exact meaning of allowed_mismatches, if I read the code correctly, extension happens independently on either side of the k-mer match, and stops after allowed_mismatches in each direction. So if you specify allowed_mismatches = 1 the extended match will actually contain 2 mismatches, one on either side - which is not what one may expect at all.
    • whether output matches can overlap or not, if they don't overlap, will adjoining matches be merged or just border directly side-by-side?
    opened by corneliusroemer 0
Releases(v1.1.0)
Owner
Rust-Bio
A bioinformatics library for the Rust language.
Rust-Bio
Radiate is a parallel genetic programming engine capable of evolving solutions to many problems as well as training learning algorithms.

Radiate Coming from Evolutionary Radiation. Evolutionary radiation is a rapid increase in the number of species with a common ancestor, characterized

kalivas 109 Dec 18, 2022
Genetic Algorithms Library

genx genx provides modular building blocks to run simulations of optimization and search problems using Genetic Algorithms (GA). The vision for genx i

Lakshya Singh 29 Aug 9, 2022
A library that can be used as a building block for high-performant graph algorithms.

graph A library that can be used as a building block for high-performant graph algorithms. Graph provides implementations for directed and undirected

Martin Junghanns 222 Jan 1, 2023
ANISE provides an open-source and open-governed library and algorithmic specification for most computations for astrodynamics

ANISE provides an open-source and open-governed library and algorithmic specification for most computations for astrodynamics. It is heavily inspired by NAIF SPICE, and may be considered as an open-source modern rewrite of SPICE.

ANISE 4 Mar 9, 2022
Fast, parallel, extensible and adaptable genetic algorithms framework written in Rust

oxigen Oxigen is a parallel genetic algorithm framework implemented in Rust. The name comes from the merge of OXIdación (Rust translated to Spanish) a

Martín Pozo 146 Dec 18, 2022
A browser app that evolves vehicles using genetic algorithms, written in Rust and Bevy

Vehicle Evolver Deluxe This is a simulation that uses AI (to be specific: genetic algorithms) to try to build better and better vehicles. The vehicles

null 95 Dec 26, 2022
zine/book about bitmap drawing algorithms and math with code examples in Rust

A Bitmapper's Companion - zine/book about bitmap drawing algorithms and math with code examples in Rust A small zine/book written in LaTeX. In progres

Manos Pitsidianakis 42 Nov 8, 2022
A Rust implementation the HOTP and TOTP Algorithms

xotp A Rust implementation the HOTP and TOTP Algorithms. HOTP was implemented in accordance with RFC4226 TOTP was implemented in accordance with RFC62

Tejas Mehta 3 Jan 21, 2022
darwin-rs, evolutionary algorithms with rust

darwin-rs This library allows you to write evolutionary algorithms (EA) using the Rust programming language. Written by Willi Kappler, License: MIT -

Willi Kappler 95 Jan 1, 2023
Compare binary files using alignment algorithms

biodiff Compare binary files using alignment algorithms. What is this This is a tool for binary diffing. The tool is able to show two binary files sid

null 483 Dec 22, 2022
A tool used to evaluate the output of retrieval algorithms. Written in Rust.

Rusteval A tool used to evaluate the output of retrieval algorithms. Written in Rust. Building Install Rust with curl -sSf https://static.rust-lang.or

Giorgos Sfikas 17 Mar 22, 2022
This crate implements fast route planning algorithms in Rust.

This crate implements fast route planning algorithms in Rust. Algorithms Currently implemented: Contraction Hierarchies: The implementat

Payas Rajan 4 Aug 15, 2021
DSP algorithms for embedded. Often integer math.

This crate contains some tuned DSP algorithms for general and especially embedded use.

QUARTIQ 12 Nov 3, 2022
ECFFT algorithms on the BN254 base field

ECFFT algorithms on the BN254 base field This crate implements structs and traits for the ECFFT algorithms from the paper Elliptic Curve Fast Fourier

null 36 Nov 28, 2022
"Algorithms for approximate string matching" in Rust, with Python bindings.

ukkonen Implementation of a bounded Levenshtein distance by Esko Ukkonen in "Algorithms for approximate string matching" in Rust, with Python bindings

Ethan Smith 1 Dec 1, 2021
Algorithms implemented in Rust, explained.

Rusty Algorithms & Data Structures for Learners This repository presents Rust implementations of common algorithms and data structures, many of which

Tianyi Shi 327 Dec 19, 2022
Bandit Algorithms in Rust

Multi-armed bandit algorithms in Rust Cargo [dependencies] bandit = "0.12.3" Description and Scope This library currently only implements the annealin

Michael Bohn 30 Nov 24, 2022
A small collection of procedural dungeon generation algorithms.

mapgen A small collection of procedural dungeon generation algorithms. Built with Rust & macroquad. WebAssembly build deployed here. Animated version

null 16 Aug 17, 2022
Library that implements different versions of PPMD algorithm (compress and decompress)

ppmd-rs Library that implements different versions of PPMD algorithm (compress and decompress) Dependencies Rust 1.58 or newer Cargo How to build Clon

Alexander Zaitsev 3 Jun 20, 2022