This library provides implementations of many algorithms and data structures that are useful for bioinformatics.

Rust-Bio

Last update: Dec 26, 2022

Related tags

Genetic algorithms rust-bio

Overview

Rust-Bio, a bioinformatics library for Rust.

This library provides Rust implementations of algorithms and data structures useful for bioinformatics. All provided implementations are rigorously tested via continuous integration.

Please see the API documentation for available features and examples of how to use them.

When using Rust-Bio, please cite the following article:

Köster, J. (2016). Rust-Bio: a fast and safe bioinformatics library. Bioinformatics, 32(3), 444-446.

Further, you can cite the used versions via DOIs:

Rust-Bio:

Contribute

Any contributions are welcome, from a simple bug report to full-blown new modules:

If you find a bug and don't have the time or in-depth knowledge to fix it, just check if you can add info to an existing issue and otherwise file a bug report with as many infos as possible. Pull requests are welcome if you want to contribute fixes, documentation, or new code. Before making commits, it would be helpful to first install pre-commit to avoid failed continuous integration builds due to issues such as formatting:

Install pre-commit (see pre-commit.com/#installation)
Run pre-commit install in the rust-bio base directory

Depending on your intended contribution frequency, you have two options for opening pull requests:

For one-time contributions, simply fork the repository, apply your changes to a branch in your fork and then open a pull request.
If you plan on contributing more than once, become a contributor by saying hi on the rust-bio Discord server, Together with a short sentence saying who you are and mentioning what you want to contribute. We'll add you to the team. Then, you don't have to create a fork, but can simply push new branches into the main repository and open pull requests there.

If you want to contribute and don't know where to start, have a look at the roadmap.

Documentation guidelines

Every public function and module should have documentation comments. Check out which types of comments to use where. In rust-bio, documentation comments should:

explain functionality
give at least one useful example of how to use it (best as doctests, that run during testing, and using descriptive expect() statements for handling any Err()s that might occur)
describe time and memory complexity listed (where applicable)
cite and link sources and explanations for data structures, algorithms or code (where applicable)

For extra credit, feel free to familiarize yourself with:

the Rust documentation conventions
the Rust API documentation guidelines

Minimum supported Rust version

Currently the minimum supported Rust version is 1.51.0.

License

Licensed under the MIT license http://opensource.org/licenses/MIT. This project may not be copied, modified, or distributed except according to those terms.

Comments

Join the team!

Everybody is welcome to officially join the team. This does not involve any duties. Just contribute features you need for your own work. To join, please post here.

opened by johanneskoester 87
Usage survey for grant application

Dear @rust-bio/contributors, @MikkelSchubert, @mbrubeck, @veldsla, @bpr, @rednum, @natir, @flying-sheep, @althonos, @llogiq, @eclarke, @skade, @parir, @marcsch, @bow, @apirogov, @mahdi-b, @audy, @mariokostelac, @zitsen, @ghost,

I am currently writing a grant application for the sustainable development of Rust-Bio. For this, it is beneficial to name the research projects, tools or libraries, that use the software. Would you be so kind to post stuff you know about or work on yourself? If you don't mind, please also mention unpublished projects.

Thank you so much in advance! If we get the grant, we can hire a full-time developer.

Best, Johannes

opened by johanneskoester 32
Added a naive orf research algorithm

Added an orf research algorithm.

Matched orf are returned as &[u8]. The algorithm only searches for orf in the 3 possible reading frames of the provided sequence without examining reverse complementary strand sequence. If it did, then it would be necessary to calculate the reverse complementary strand, and orf couldn't be returned as slices because the reverse complementary strand sequence wouldn't live outside of the function.

The complexity of this algorithm is linear, the 3 reading frames are read simultaneously which means the sequence only needs to be parsed one time.

Refining the method could include comparing the gc content of the obtained orf to the gc content of the whole sequence (which is why a gc.rs is present in the seq_analysis folder).

opened by althonos 27
Occ::get performance, redux
I ended up writing a blog post about the tools I used for profiling the change I proposed to Occ::get. Fun story, @aatch pointed out on Reddit that the slowdown in some other versions I had tried was probably due to the overhead of dealing with alignment in a vectorized version of the function, and noted that 3 may be an odd sample rate to test with the FMIndex. As an example, bowtie2's FM index appears to default to a sampling rate of 32, and I've seen a few cases of using higher rates as well.

The version of Occ::get proposed by Aatch is at approximate parity with the filter/count version for performance at a sampling rate of 32 (just a tiny bit slower), but it quickly gets much, much faster as the sampling rate increases. For example, with a sampling rate of 128 (in use in one of my applications):

filter/count, k = 128

test search_index_seeds ... bench: 179,442 ns/iter (+/- 1,947)

vectorized, k = 128

test search_index_seeds ... bench: 152,368 ns/iter (+/- 433)

filter/count, k = 64

test search_index_seeds ... bench: 105,141 ns/iter (+/- 1,665)

vectorized, k = 64

test search_index_seeds ... bench: 99,044 ns/iter (+/- 252)

filter/count, k = 32

test search_index_seeds ... bench: 74,427 ns/iter (+/- 5,166)

vectorized, k = 32

test search_index_seeds ... bench: 75,333 ns/iter (+/- 303)

Seems like an overall win to me. At a k of 32, the averages are just slightly higher than the filter/count version, but well within the error margin reported by libtest.

The one downside here is that at smaller sampling rates, a non-vectorized version wins out:

filter/count, k = 8

test search_index_seeds ... bench: 40,598 ns/iter (+/- 1,170)

vectorized, k = 8

test search_index_seeds ... bench: 53,842 ns/iter (+/- 224)

Thoughts here? Without integer generics (at all) and specialization (in stable), I'm not sure it's possible to cleanly maximize performance of small sampling rates and large ones. I'm biased here because I'm working with very large indices that need high sampling rates, but I'm not sure what other values of k are "in the wild."
enhancement WIP
opened by anp 26
Multiple sentinels and a sampled suffix array
In making SampledSuffixArray work, I discovered that multiple "canonical" implementations of FMIndex don't seem to support multiple sentinels in the text. Ben Langmead's Python notebooks host an implementation which can't handle multiple sentinels, and it looks like SeqAn doesn't either (although they do in a roundabout way by using a generalized suffix array).

So the question I have is -- is it possible to support a sampled suffix array in rust-bio without breaking the FMDIndex? I see some potential options:

Make it clear in docs (and in code, somehow?) that only text with a single sentinel can be used in a SampledSuffixArray. Prevent FMDIndex-produced intervals from querying a SampledSuffixArray, etc.

Implement a generalized suffix array and sidestep the problem (although I'm not sure what interaction that has with sampling, and have only glanced at the literature on them)

Find a way to make a single-sequence suffix array sampleable without breaking on multiple sentinels. I haven't seen any research yet on this exact issue, in part because I think most who need to support multiple sequences are building generalized SAs.

I'm not sure what the best course of action is, but I'm eager to push forward on sampled SAs, so I'm very curious what others thoughts are here.
enhancement
opened by anp 26
Partial-Order Alignments
I'd like to create a tracking-issue for the branch I'm working on to add a partial-order aligner to the repetoire of alignment tools in rust-bio: https://github.com/rust-bio/rust-bio/tree/partial-order-aligner

To make the code reach a sort of minimum viable product-level, it still needs:

Convert main() and example4.fa to inline example code

Add unit-tests

Individual read-path tracking and multi-alignment sequence output

Make the internal aligner a member variable and make it customizabie at instantiation

Some form of user-specified path-pruning

Beyond that, I am open to suggestions, particularly if some of the more experienced contributors have suggestions for improving the API, or stylistic or efficiency improvements to suggest for the code itself.
opened by bnbowman 24
Use tuple structs/newtypes for data structures
@cramertj reminded me of this with a comment on a PR. A number of data structures are implemented as type aliases currently:

https://github.com/rust-bio/rust-bio/blob/master/src/data_structures/suffix_array.rs#L26

https://github.com/rust-bio/rust-bio/blob/master/src/data_structures/bwt.rs#L17

https://github.com/rust-bio/rust-bio/blob/master/src/stats/mod.rs#L14

https://github.com/rust-bio/rust-bio/blob/master/src/alphabets/mod.rs#L27

https://github.com/rust-bio/rust-bio/blob/master/src/pattern_matching/kmp.rs#L31

However, I think that using tuple structs/newtypes (pub struct RawSuffixArray(Vec<usize>)) would grant some extra compile-time verification to the use of these data structures (i.e. to prevent accidentally giving any old Vec to a function which needs RawSuffixArray), along with better error messages.

This would be a significant change to the API, but would only break clients who used the underlying types in a function signature, rather than the aliases. It would also allow greater leeway in the future to change the representation of the data structures without it being a breaking change.

Thoughts?
opened by anp 24
Docathon: update documentation for IO methods

Updates for examples to FASTA (#295) and GFF (#297) file reading and writing. I've updates FASTA and BED examples to be consistent.

Please excuse errors as I'm new at Rust. I hope this helps. Suggestions to improve it before merging are welcome!

If there is recommended (quick) way to test whether these examples work, that would be helpful. I have files of these format(s) that I can test it on in a Linux environment.

The package builds locally but I think examples are commented out so of course it does.
documentation

opened by TomKellyGenetics 23
Alignment w/o traceback

Add a function for hamming distance hamm_dist + doc + test + bench. Add a function for edit distance lev_dist + doc + test + bench. Add the pretty() method to Alignment to return a String with pretty formatted alignments + doc. Add 3x3 easy-to-use functions for computing alignments, scores and returning pretty formatted alignments + doc + test + bench for all of them.

Due to a bug in the global alignment method the benchmark bench_fn_global_score fails now.

opened by vadimnazarov 20
Gzip compatible fasta reader

Adds a method to bio::io::fasta::Reader that takes a path to a gunzip-compressed FASTA file and a single test to ensure each Record is identical between plain-text and compressed inputs. Adds flate2 as a library dependency.

A flate2::read::GzDecoder object can already be passed to Reader::new() on the user side, but I noticed there are no library methods for this functionality (I hope I didn't miss them) and figured this could be a good first-time contribution for a rust novice if there is interest. Implementation as a trait would probably be best for introducing this functionality library-wide. Happy to take that on if this sounds like a good idea to folks.

opened by amsesk 19
Logo for Rust-Bio

During the rust-bio docathon 2020 I created a logo for our discord channel. I finally found some time to prepare two additional versions and put on some finishing touches. If we would like to adopt a project logo, I would offer these as a suggestion (or an initial draft to iterate upon).

All of these are based on Ferris the Rustacean designed and released under CC0 1.0 by Karen Rustad Tölva. SVG versions, some background information on how these were created and png exports can be found at https://github.com/HenningTimm/rust-bio-logo.

Version 1: As used on the discord server

Version 2: As used on the discord server but with thicker lines

Version 3: With a curled strand

opened by HenningTimm 17
chore: release 1.1.1
:robot: I have created a release *beep* *boop*

1.1.1 (2023-01-02)

Performance Improvements

mark complement as inline (#510) (bd08234)

This PR was generated with Release Please. See documentation.
autorelease: pending
opened by github-actions[bot] 0

Wrap sequence in output file

A common convention is to wrap lines at 60 characters in a FASTA file. Would it be possible to add an option to wrap the sequences when writing out to a file? For example:

/// Write a Fasta record with given id, optional description and sequence.
    pub fn write(&mut self, id: &str, desc: Option<&str>, seq: TextSlice<'_>) -> io::Result<()> {
        self.writer.write_all(b">")?;
        self.writer.write_all(id.as_bytes())?;
        if let Some(desc) = desc {
            self.writer.write_all(b" ")?;
            self.writer.write_all(desc.as_bytes())?;
        }
        self.writer.write_all(b"\n")?;
        for chunk in seq.chunks(wrap) {
            self.writer.write_all(chunk)?;
            self.writer.write_all(b"\n")?;
        }
        self.writer.write_all(b"\n")?;

        Ok(())
    }

opened by alvanuffelen 1

docs: Documenting expand kmer
This is still a WIP, I've tried to understand the function and what it does and document it to the best of my abilities. To answer the original post:

allowed_mismatches is strictly tested >therefore there's indeed a maximum of allowed_mismatches + 1 mismatch possible.

Output matches don't seem to undergo any kind of overlapping verification, therefore no need to know incase of overlaping mismatches close #494
opened by Ptipiak 0
DOC: Not documented that FM index / suffix array / BWT require terminal sentinel character

I spent a good hour debugging backward_search using the code fragment from backward_search: https://github.com/rust-bio/rust-bio/blob/eb9d378c70bc43d1dcf477e48cb167802799a546/src/data_structures/fmindex.rs#L103-L145

Turns out, the reference fasta I read in didn't have a sentinel character $ at the end. Lacking this terminal $ makes backward_search produce wrong results.

While the example texts do contain that character (that's how I got the idea to attach it to my reference) - it's not mentioned anywhere as far as I can see.

There should either be a note, or maybe even a check during creation of the FM index - unless there's a use case without terminal $. Or even more convenient - add a terminal $.

opened by corneliusroemer 0
DOC: public fn alignment::sparse::expand_kmer_matches not documented
The following public function lacks a doc string: https://github.com/rust-bio/rust-bio/blob/43d198e87c69867fad39f07374d3fdcd9d879729/src/alignment/sparse.rs#L398

Some important questions to answer:

exact meaning of allowed_mismatches, if I read the code correctly, extension happens independently on either side of the k-mer match, and stops after allowed_mismatches in each direction. So if you specify allowed_mismatches = 1 the extended match will actually contain 2 mismatches, one on either side - which is not what one may expect at all.

whether output matches can overlap or not, if they don't overlap, will adjoining matches be merged or just border directly side-by-side?
opened by corneliusroemer 0

Releases(v1.1.0)

v1.1.0(Dec 13, 2022)
Features

add standard derives to all public types (#505) (c08623b)

Source code(tar.gz)
Source code(zip)
v1.0.0(Sep 29, 2022)
⚠ BREAKING CHANGES

update to latest ordered-float (#503)

Features

detect nested ORFs (#501) (19b6c36)

Bug Fixes

update to latest ordered-float (#503) (63fb752)

Source code(tar.gz)
Source code(zip)
v0.42.0(Aug 31, 2022)
⚠ BREAKING CHANGES

Update strum and ordered-float dependencies and change From into TryFrom for NotNan. (#491)

Features

Update strum and ordered-float dependencies and change From into TryFrom for NotNan. (#491) (57ccf8f)

Miscellaneous Chores

widen range on statrs (#499) (0ff1e70)

Source code(tar.gz)
Source code(zip)
v0.41.0(Mar 30, 2022)
Features

adaptive integration of density functions using a binary search approach that tries to achieve good resolution around the maximum (#486) (207b76f)

Source code(tar.gz)
Source code(zip)
v0.40.0(Feb 25, 2022)
Features

base specific hop parameters in homopoly-pair-hmm (#480) (cf75b6c)

Miscellaneous Chores

release 0.40.0 (cf8ebc3)

Source code(tar.gz)
Source code(zip)
v0.39.2(Feb 11, 2022)
Bug Fixes

Make QGramIndex use less memory; fix bug; improve tests&docs (#471) (48bac1c)

overflow in qgrams for k=32 (#478) (8048eb8)

Source code(tar.gz)
Source code(zip)
v0.39.1(Jan 13, 2022)
Bug Fixes

added code to ignore commented lines in a bed file (#474) (d17f823)

Source code(tar.gz)
Source code(zip)
v0.39.0(Oct 20, 2021)
Features

Backward search api (#457) (eb9d378)

Source code(tar.gz)
Source code(zip)
v0.38.0(Oct 4, 2021)
Features

adds iupac amino acid alpha (#444) (f4b7a7c)

Allow to compute bayesian model via the exploration of the marginal distribution. (#453) (d09033c)

Bug Fixes

backward search yielding potentially incorrect positions on FM-Index (#454) (#455) (3489e6a)

Source code(tar.gz)
Source code(zip)
v0.37.1(Aug 23, 2021)
Bug Fixes

One Simple Trick to significantly improve the speed and memory usage of Occ (#448) (9aa79cb)

sampled suffix array (#447) (00f9846)

Source code(tar.gz)
Source code(zip)
v0.37.0(Jul 9, 2021)
Features

add method to retrieve all event posteriors from Bayesian model. (e5feda8)

Source code(tar.gz)
Source code(zip)
v0.36.0(Jul 6, 2021)
Features

Baum-Welch algorithm for Discrete HMM (#432) (eb8b8cb)

Source code(tar.gz)
Source code(zip)
v0.35.0(Jul 5, 2021)
Improved buffer control in Fasta and Fastq API (@natir).

Fixed an indexing bug in ArrayBackedIntervalTree (@wabain).

Fixed a corner case where the FASTX parser could have looped infinitely (@morsecodist).

Fixed compiler warnings (@fxwiegand).

Improved documentation for FASTA index (@mbhall88).

Source code(tar.gz)
Source code(zip)
v0.34.0(May 4, 2021)
Bayesian model framework now relies on Hash instead of Ord for accessing events (@johanneskoester).

Added wavelet matrix datastructure (@Identi, @tedil).

Source code(tar.gz)
Source code(zip)
v0.33.0(Mar 10, 2021)
Fixed a floating point error in gcn_content (@tedil).

Improved error messages in io module (@fxwiegand).

Better memory usage of HomopolypairHMM (@tedil).

Improved documentation (@TianShi2001, @m0ssc0de, @dcroote).

Support for reading compact BED files (@manzt).

API improvements for Alphabet and GFF reader (@tshauck).

Switched to thiserror for error handling (@delehef).

Removed unsafe code (@huonw).

Fixed overflow in ShiftAnd algorithm (@dcroote).

Various additional test cases (@dcroote).

Extended API for SMEM computation on FMDIndex (@Identi).

Added a parser for Newick phylogenetic trees (@delehef).

Source code(tar.gz)
Source code(zip)
v0.32.0(Aug 6, 2020)
This release mostly comprises of the documentation improvements made by the recent Docathon. Lots of new doctests and a bunch of time and memory complexity annotations. Big thanks to @tedil, @dcroote, @TomKellyGenetics, @natir, @thomasmulvaney, @mbhall88, @jafors, @HenningTimm, @luizirber, @dlaehnemann, @johanneskoester. Further additions in this release are:

homopolymer-error-aware pairHMM implementation (thanks to @tedil).

SIMD-accelerated edit distance routines (thanks to @Daniel-Liu-c0deb0t).

BitEnc derives more traits (thanks to @FelixMoelder).

wider pinning of some dependencies' version numbers (thanks to @pmarks).

Source code(tar.gz)
Source code(zip)
v0.31.0(Jun 2, 2020)
Bugfix for pHMM implementation (thanks to @tedil).

Sorted array-backed interval trees (thanks to @tedil).

Source code(tar.gz)
Source code(zip)
v0.30.1(May 13, 2020)
Improved occ counting speed for FM index (thanks to @thomasmulvaney)

Various small bug fixes and code cleanups and linter fixes.

Source code(tar.gz)
Source code(zip)
v0.30.0(May 13, 2020)
Bayesian models now allow to access internals.

Various small bug fixes.

Source code(tar.gz)
Source code(zip)
bio-0.29.0(Sep 27, 2019)

Apart from some fixes, this version migrates error handling to the beautiful snafu crate. See changelog for details.
Source code(tar.gz)
Source code(zip)
0.24.0(Nov 26, 2018)

This release provides various improvements to API flexibility. Moreover, it provides dramatic speed improvements for LogProb and pairHMM implementations. For details see CHANGELOG.
Source code(tar.gz)
Source code(zip)
0.23.0(Nov 8, 2018)

For details see CHANGELOG.md
Source code(tar.gz)
Source code(zip)
0.20.3(Jun 7, 2018)

This release fixes a corner case bug in RankSelect.
Source code(tar.gz)
Source code(zip)
0.20.2(Jun 6, 2018)

This release fixes a potential overflow when calculating RankSelect::rank_1.
Source code(tar.gz)
Source code(zip)
0.20.1(Jun 4, 2018)

This release fixes a bug in the select implementation of RankSelect.
Source code(tar.gz)
Source code(zip)
0.20.0(Jun 1, 2018)

This release refactors the RankSelect API.
Source code(tar.gz)
Source code(zip)
0.18.0(May 4, 2018)

This release improves the API of the FASTA reader. For details see CHANGELOG.
Source code(tar.gz)
Source code(zip)
0.17.0(Feb 23, 2018)

This release provides a bug fix and convenience improvements to the API.
Source code(tar.gz)
Source code(zip)
0.16.0(Jan 4, 2018)

This release introduces a re-implementation of pairwise alignment. The new implementation provides more flexiblity in scoring and also allows for banded alignment. For more information see change log.
Source code(tar.gz)
Source code(zip)
0.15.0(Nov 28, 2017)

This release adds a pair HMM implementation. See the change log for more information.
Source code(tar.gz)
Source code(zip)

Owner

Rust-Bio

A bioinformatics library for the Rust language.

GitHub https://rust-bio.github.io

Radiate is a parallel genetic programming engine capable of evolving solutions to many problems as well as training learning algorithms.

Radiate Coming from Evolutionary Radiation. Evolutionary radiation is a rapid increase in the number of species with a common ancestor, characterized

109 Dec 18, 2022

Genetic Algorithms Library

genx genx provides modular building blocks to run simulations of optimization and search problems using Genetic Algorithms (GA). The vision for genx i

29 Aug 9, 2022

A library that can be used as a building block for high-performant graph algorithms.

graph A library that can be used as a building block for high-performant graph algorithms. Graph provides implementations for directed and undirected

222 Jan 1, 2023

ANISE provides an open-source and open-governed library and algorithmic specification for most computations for astrodynamics

ANISE provides an open-source and open-governed library and algorithmic specification for most computations for astrodynamics. It is heavily inspired by NAIF SPICE, and may be considered as an open-source modern rewrite of SPICE.