🎼 🧬 lightmotif

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. It can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides. Position weight matrices are often viewed as sequence logos:

[Image: MX000274.svg, a sequence logo]
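
To make the scanning step concrete, here is a minimal sketch in plain Rust that scores every window of a DNA sequence against a position weight matrix. It does not use the lightmotif API: the `encode` and `scan` functions are hypothetical names introduced for illustration, and the matrix layout (one row of four scores per motif position) is an assumption.

```rust
/// Map a DNA symbol to a column index (A=0, C=1, G=2, T=3).
/// Returns `None` for any other symbol. (Illustrative helper, not lightmotif API.)
fn encode(symbol: u8) -> Option<usize> {
    match symbol {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Score every window of `seq` against a position weight matrix.
/// `pwm[i][k]` is the score of symbol `k` at motif position `i`.
/// Returns `None` if the sequence contains a symbol outside of ACGT.
fn scan(seq: &str, pwm: &[[f32; 4]]) -> Option<Vec<f32>> {
    // Encode once, so that scoring only needs table look-ups.
    let encoded: Vec<usize> = seq.bytes().map(encode).collect::<Option<_>>()?;
    // No window fits if the motif is empty or longer than the sequence.
    if pwm.is_empty() || encoded.len() < pwm.len() {
        return Some(Vec::new());
    }
    let scores = encoded
        .windows(pwm.len())
        .map(|window| window.iter().zip(pwm).map(|(&k, row)| row[k]).sum::<f32>())
        .collect();
    Some(scores)
}
```

With a two-position matrix that rewards `A` then `C`, calling `scan("GACA", &pwm)` returns one score per window (`GA`, `AC`, `CA`), and the `AC` window gets the highest score.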

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

  • Compile-time definition of alphabets and matrix dimensions.
  • Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2].
  • Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
  • Vectorized matrix row look-up using permute instructions of AVX2.
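
The striped layout mentioned above can be sketched as follows. This is an illustration in the spirit of Farrar's striping, not the crate's actual implementation: the `stripe` function and its padding parameter are hypothetical, and the exact layout used by lightmotif may differ. The sequence is cut into `C` contiguous segments, one per column, so that each row holds `C` positions that can be scored together.

```rust
/// Rearrange an encoded sequence into a matrix with `C` columns.
/// Position `i` is stored at row `i % rows`, column `i / rows`, so each
/// column holds one contiguous segment of the sequence.
/// Unused cells are filled with `pad`. (Illustrative, not lightmotif API.)
fn stripe<const C: usize>(seq: &[u8], pad: u8) -> Vec<[u8; C]> {
    assert!(C > 0, "at least one column is required");
    // Number of rows needed to fit the whole sequence (ceiling division).
    let rows = (seq.len() + C - 1) / C;
    let mut matrix = vec![[pad; C]; rows];
    for (i, &symbol) in seq.iter().enumerate() {
        matrix[i % rows][i / rows] = symbol;
    }
    matrix
}
```

For example, striping the 10 symbols of `ACGTACGTAC` over 4 columns gives 3 rows; the first row contains positions 0, 3, 6 and 9, which a vectorized implementation could then process in a single step.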

💡 Example

use lightmotif::*;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna, {Dna::K}>::from_sequences(&[
    EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(),
    EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(),
]).unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::<Dna>::encode(seq).unwrap();
let mut striped = encoded.to_striped::<32>();
striped.configure(&pssm);

// Use a pipeline to compute scores for every position of the matrix
let scores = Pipeline::<Dna, f32>::score(&striped, &pssm);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);
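
The `to_freq(0.1).to_scoring(None)` step above turns counts into scores. A rough sketch of that kind of conversion is shown below, under stated assumptions: the pseudocount is added to every cell before normalizing, and scores are base-2 log-odds against the background. Neither detail is confirmed by this README, and `to_scores` is a hypothetical function, not part of the lightmotif API.

```rust
/// Convert per-position symbol counts into log-odds scores.
/// Assumes the pseudocount is added to every cell and that scores are
/// log2(frequency / background). (Illustrative, not lightmotif API.)
fn to_scores(counts: &[[u32; 4]], pseudocount: f32, background: [f32; 4]) -> Vec<[f32; 4]> {
    counts
        .iter()
        .map(|row| {
            // Total weight of the position, pseudocounts included.
            let total: f32 = row.iter().map(|&c| c as f32 + pseudocount).sum();
            let mut scores = [0.0f32; 4];
            for k in 0..4 {
                let freq = (row[k] as f32 + pseudocount) / total;
                scores[k] = (freq / background[k]).log2();
            }
            scores
        })
        .collect()
}
```

With a uniform background of 0.25, a position where every symbol is equally frequent scores 0 for all symbols, while a position dominated by one symbol scores positive for that symbol and negative for the others.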

To use the AVX2 implementation, simply create a Pipeline<_, __m256> instead of the Pipeline<_, f32>. This is only supported when the library is compiled with the avx2 target feature, but it can be easily configured with Rust's #[cfg] attribute.
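
As a sketch of the `#[cfg]` approach, the snippet below selects a backend at compile time depending on whether the `avx2` target feature is enabled. It only uses the standard `cfg` mechanism; the `backend` module, its constants and `selected_backend` are hypothetical names. In a real program, the two branches would be where `Pipeline<_, __m256>` or `Pipeline<_, f32>` is created.

```rust
// Compiled only when the crate is built with the `avx2` target feature,
// e.g. through `-C target-cpu=native` on a CPU that supports it.
#[cfg(target_feature = "avx2")]
mod backend {
    pub const NAME: &str = "avx2";
    /// Number of f32 lanes in a 256-bit vector.
    pub const LANES: usize = 8;
}

// Fallback used on every other build.
#[cfg(not(target_feature = "avx2"))]
mod backend {
    pub const NAME: &str = "generic";
    /// The generic code handles one value at a time.
    pub const LANES: usize = 1;
}

/// Report which backend was selected at compile time.
pub fn selected_backend() -> (&'static str, usize) {
    (backend::NAME, backend::LANES)
}
```

Because the choice is made at compile time, only one of the two modules exists in the final binary, and no runtime check is needed.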

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[4], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, compiled with --target-cpu=native.

  • Score every position of the genome with the motif weight matrix:

    running 3 tests
    test bench_avx2    ... bench:   6,948,169 ns/iter (+/- 16,477) = 668 MB/s
    test bench_ssse3   ... bench:  29,079,674 ns/iter (+/- 875,880) = 159 MB/s
    test bench_generic ... bench: 331,656,134 ns/iter (+/- 5,310,490) = 13 MB/s
  • Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):

    test bench_avx2    ... bench:      49,259 ns/iter (+/- 1,489) = 203 MB/s
    test bench_bio     ... bench:   1,440,705 ns/iter (+/- 5,291) = 6 MB/s
    test bench_generic ... bench:     706,361 ns/iter (+/- 1,726) = 14 MB/s
    test bench_sssee   ... bench:      94,152 ns/iter (+/- 36) = 106 MB/s
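
The throughput column can be checked from the timings. The sketch below assumes that the 10kb sequence is 10,000 bytes long (one byte per symbol) and that the MB/s figure is truncated to an integer; `throughput_mb_per_s` is a hypothetical helper written for this check.

```rust
/// Throughput in MB/s (10^6 bytes per second), truncated to an integer.
/// Panics if `ns_per_iter` is zero.
fn throughput_mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // bytes per nanosecond is GB/s, so scale by 1000 to get MB/s.
    bytes * 1_000 / ns_per_iter
}
```

Under these assumptions, 10,000 bytes in 49,259 ns gives 203 MB/s, which matches the AVX2 line above, and the three other timings give 6, 14 and 106 MB/s.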

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

  • [1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
  • [2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
  • [3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
  • [4] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.
Releases(v0.1.1)
  • v0.1.1(May 4, 2023)

    Added

    • Helper crate to detect CPU features support at runtime.

    Fixed

    • AVX2 code being imported on x86-64 platforms without checking for OS support.
    • AVX2-enabled extension always being compiled even on platforms with no AVX2 support.

    Removed

    • built and pyo3-built build dependencies (causing issues with workspaces).
  • v0.1.0(May 4, 2023)

Owner
Martin Larralde
PhD candidate in Bioinformatics, passionate about programming, SIMD-enthusiast, Pythonista, Rustacean. I write poems, and sometimes they are executable.