A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

Martin Larralde

Last update: May 4, 2023

🎼 🧬 `lightmotif`

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. It can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often visualized as sequence logos.
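To make the idea concrete, here is a minimal, unoptimized sketch of motif scanning in plain Rust. It does not use `lightmotif` and is not how the library is implemented: the function names (`index`, `build_pssm`, `scan`), the log2 log-odds formula, and the uniform 0.25 background are assumptions of this sketch, so the scores it produces will not necessarily match the library's. All motif instances are assumed to have the same length.

```rust
/// Map a nucleotide to a row index (A=0, C=1, G=2, T=3).
fn index(symbol: u8) -> Option<usize> {
    match symbol {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Build log-odds scores (log2 of the frequency over a uniform 0.25
/// background) from aligned motif instances of equal length, adding
/// `pseudo` to every count.
fn build_pssm(instances: &[&str], pseudo: f32) -> Vec<[f32; 4]> {
    let width = instances[0].len();
    let mut counts = vec![[pseudo; 4]; width];
    for inst in instances {
        for (pos, &sym) in inst.as_bytes().iter().enumerate() {
            let row = index(sym).expect("unexpected symbol");
            counts[pos][row] += 1.0;
        }
    }
    counts
        .iter()
        .map(|col| {
            let total: f32 = col.iter().sum();
            let mut scores = [0.0; 4];
            for (s, c) in scores.iter_mut().zip(col) {
                *s = (c / total / 0.25).log2();
            }
            scores
        })
        .collect()
}

/// Score every window of `seq` that is as long as the motif.
fn scan(seq: &str, pssm: &[[f32; 4]]) -> Vec<f32> {
    seq.as_bytes()
        .windows(pssm.len())
        .map(|window| {
            window
                .iter()
                .zip(pssm)
                .map(|(&sym, col)| col[index(sym).expect("unexpected symbol")])
                .sum::<f32>()
        })
        .collect()
}
```

Calling `scan(seq, &build_pssm(&["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"], 0.1))` yields one score per start position; the highest score marks the window that best matches the motif.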

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2].
- Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
- Vectorized matrix row look-up using permute instructions of AVX2.
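The first three techniques can be illustrated with a small standalone sketch. The functions `encode` and `stripe` below are hypothetical and do not come from `lightmotif`; the symbol order (A, C, G, T, then a catch-all), the padding value, and the exact striped layout are assumptions chosen for illustration. The number of columns is a const generic, so it is fixed at compile time.

```rust
/// Encode nucleotides as small integers so that scores can be fetched
/// with a table look-up. Anything else maps to 4 (an assumption of
/// this sketch, standing in for an "unknown" symbol).
fn encode(seq: &str) -> Vec<u8> {
    seq.bytes()
        .map(|b| match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => 4,
        })
        .collect()
}

/// Lay out an encoded sequence in `C` columns (`C` must be non-zero) so
/// that row `r` holds positions r, r + rows, r + 2 * rows, and so on.
/// Cells past the end of the sequence are filled with `pad`.
fn stripe<const C: usize>(encoded: &[u8], pad: u8) -> Vec<[u8; C]> {
    let rows = (encoded.len() + C - 1) / C;
    let mut matrix = vec![[pad; C]; rows];
    for (i, &sym) in encoded.iter().enumerate() {
        matrix[i % rows][i / rows] = sym;
    }
    matrix
}
```

With this layout, each row groups `C` positions that are far apart in the sequence, so a SIMD implementation can load one row into a vector register and score `C` start positions with a single instruction sequence.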

💡 Example

```rust
use lightmotif::*;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna, {Dna::K}>::from_sequences(&[
    EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(),
    EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(),
]).unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::<Dna>::encode(seq).unwrap();
let mut striped = encoded.to_striped::<32>();
striped.configure(&pssm);

// Use a pipeline to compute scores for every position of the matrix
let scores = Pipeline::<Dna, f32>::score(&striped, &pssm);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);
```

To use the AVX2 implementation, create a `Pipeline<_, __m256>` instead of the `Pipeline<_, f32>`. This is only supported when the library is compiled with the `avx2` target feature; Rust's `#[cfg]` attribute can be used to select the right pipeline at compile time.
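As a rough illustration of compile-time selection with `#[cfg]`, the sketch below defines a hypothetical `backend` function twice and lets the compiler keep only the variant matching the enabled target features. The function and its return values are made up for this example; in real code, the two bodies would build a `Pipeline<_, __m256>` and a `Pipeline<_, f32>` respectively.

```rust
/// Compiled only when the `avx2` target feature is enabled, for example
/// with `RUSTFLAGS="-C target-feature=+avx2"`.
#[cfg(target_feature = "avx2")]
fn backend() -> &'static str {
    "avx2"
}

/// Fallback compiled on every other configuration.
#[cfg(not(target_feature = "avx2"))]
fn backend() -> &'static str {
    "generic"
}
```

Because the choice is made at compile time, exactly one of the two definitions exists in the final binary, and callers simply use `backend()` without any runtime check.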

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[4] and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, with the code compiled with `-C target-cpu=native`.

Score every position of the genome with the motif weight matrix:

```text
running 3 tests
test bench_avx2    ... bench:   6,948,169 ns/iter (+/- 16,477) = 668 MB/s
test bench_ssse3   ... bench:  29,079,674 ns/iter (+/- 875,880) = 159 MB/s
test bench_generic ... bench: 331,656,134 ns/iter (+/- 5,310,490) = 13 MB/s
```

Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):

```text
test bench_avx2    ... bench:      49,259 ns/iter (+/- 1,489) = 203 MB/s
test bench_bio     ... bench:   1,440,705 ns/iter (+/- 5,291) = 6 MB/s
test bench_generic ... bench:     706,361 ns/iter (+/- 1,726) = 14 MB/s
test bench_ssse3   ... bench:      94,152 ns/iter (+/- 36) = 106 MB/s
```
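The throughput column can be cross-checked from the timings. The sketch below assumes that the 10kb sequence is 10,000 bytes long and that throughput is computed as bytes times 1000 divided by nanoseconds per iteration, rounded down; both are assumptions, but they reproduce the four figures above. The helper names are hypothetical.

```rust
/// Throughput in MB/s from the number of bytes processed per iteration
/// and the time per iteration in nanoseconds (integer division).
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter.max(1)
}

/// How many times faster `candidate_ns` is compared to `baseline_ns`.
fn speedup(baseline_ns: u64, candidate_ns: u64) -> f64 {
    baseline_ns as f64 / candidate_ns as f64
}
```

For instance, `throughput_mb_s(10_000, 49_259)` gives 203, and `speedup(1_440_705, 49_259)` is about 29, which is the ratio between the `bench_bio` and `bench_avx2` timings.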

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

[1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
[2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
[3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
[4] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.

Releases (v0.1.1)

v0.1.1 (May 4, 2023)

Added

Helper crate to detect CPU features support at runtime.

Fixed

AVX2 code being imported on x86-64 platforms without checking for OS support.

AVX2-enabled extension always being compiled even on platforms with no AVX2 support.

Removed

`built` and `pyo3-built` build dependencies (causing issues with workspaces).

v0.1.0 (May 4, 2023)

Initial release.
