Fast DNA manipulation for Python, written in Rust.

Overview

quickdna

Quickdna is a simple, fast library for working with DNA sequences. It is up to 100x faster than Biopython for some translation tasks, in part because it uses a native Rust module (via PyO3) for the translation. However, it exposes an easy-to-use, type-annotated API that should still feel familiar for Biopython users.

# These are the two main library types. Unlike Biopython, DnaSequence and
# ProteinSequence are distinct, though they share a common BaseSequence base class
>>> from quickdna import DnaSequence, ProteinSequence

# Sequences can be constructed from strs or bytes, and are stored internally as
# ascii-encoded bytes.
>>> d = DnaSequence("taatcaagactattcaaccaa")

# Sequences can be sliced just like regular strings, and return new sequence instances.
>>> d[3:9]
DnaSequence(seq='tcaaga')

# Many other Python operations are supported on sequences as well: len, iter,
# ==, hash, concatenation with +, * a constant, etc. These operations are typed
# when appropriate and will not allow you to concatenate a ProteinSequence to a
# DnaSequence, for example

# DNA sequences can be easily translated to protein sequences with `translate()`.
# If no table=... argument is given, NCBI table 1 will be used by default...
>>> d.translate()
ProteinSequence(seq='*SRLFNQ')

# ...but any of the NCBI tables can be specified. A ValueError will be thrown
# for an invalid table.
>>> d.translate(table=22)
ProteinSequence(seq='**RLFNQ')

# This exists too! It's somewhat faster than Biopython, but not as dramatically as
# `translate()`
>>> d[3:9].reverse_complement()
DnaSequence(seq='TCTTGA')

# This method will return a tuple of all (up to 6) possible translated reading frames:
# (seq[:], seq[1:], seq[2:], seq.reverse_complement()[:], ...)
>>> d.translate_all_frames()
(ProteinSequence(seq='*SRLFNQ'), ProteinSequence(seq='NQDYST'),
ProteinSequence(seq='IKTIQP'), ProteinSequence(seq='LVE*S*L'),
ProteinSequence(seq='WLNSLD'), ProteinSequence(seq='G*IVLI'))

# translate_all_frames will return fewer than 6 frames for sequences of len < 5
>>> len(DnaSequence("AAAA").translate_all_frames())
4
>>> len(DnaSequence("AA").translate_all_frames())
0

# There is a similar method, `translate_self_frames`, that only returns the
# (up to 3) translated frames for this direction, without the reverse complement

# The IUPAC ambiguity code 'N' is supported as well.
# Codons with N will translate to a specific amino acid if it is unambiguous,
# such as GGN -> G, or the ambiguous amino acid code 'X' if there are multiple
# possible translations.
>>> DnaSequence("GGNATN").translate()
ProteinSequence(seq='GX')
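
The reading-frame behaviour of `translate_all_frames` shown above can be reproduced in a few lines of plain Python. This is only an illustrative sketch, not quickdna's implementation: the names `all_frames`, `translate`, and `reverse_complement` here are hypothetical stand-ins, the codon table is NCBI table 1 only, and ambiguity codes are not handled.

```python
# Illustrative pure-Python sketch; NOT how quickdna is implemented.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# NCBI translation table 1, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate whole codons; trailing bases that do not fill a codon are dropped."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def all_frames(seq: str) -> tuple:
    """Forward offsets 0, 1, 2 followed by reverse-complement offsets 0, 1, 2."""
    strands = (seq, reverse_complement(seq))
    frames = [strand[offset:] for strand in strands for offset in range(3)]
    # A frame shorter than one codon is omitted, so short inputs yield fewer than 6.
    return tuple(translate(frame) for frame in frames if len(frame) >= 3)


print(all_frames("taatcaagactattcaaccaa"))
print(len(all_frames("AAAA")), len(all_frames("AA")))
```

A 4-base sequence has only two usable offsets per strand, which is why it produces 4 frames, and a 2-base sequence produces none.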

Benchmarks

For regular DNA translation tasks, quickdna is faster than Biopython (see benchmarks/bench.py for the source). Machines and workloads vary, however -- always benchmark!

task                                          time                comparison
translate_quickdna(small_genome)              0.00306ms / iter
translate_biopython(small_genome)             0.05834ms / iter    1908.90%
translate_quickdna(covid_genome)              0.02959ms / iter
translate_biopython(covid_genome)             3.54413ms / iter    11979.10%
reverse_complement_quickdna(small_genome)     0.00238ms / iter
reverse_complement_biopython(small_genome)    0.00398ms / iter    167.24%
reverse_complement_quickdna(covid_genome)     0.02409ms / iter
reverse_complement_biopython(covid_genome)    0.02928ms / iter    121.55%
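
To benchmark on your own machine, a small `timeit` harness is enough. The sketch below is not the project's benchmarks/bench.py; `ms_per_iter` and `relative_percent` are hypothetical helpers, and the two functions being timed are placeholders for whichever quickdna and Biopython calls you want to compare. The comparison column is assumed to be the slower time expressed as a percentage of the quickdna time.

```python
import timeit


def ms_per_iter(func, *args, number=1000):
    """Average wall-clock time of func(*args) in milliseconds per call."""
    seconds = timeit.timeit(lambda: func(*args), number=number)
    return seconds * 1000 / number


def relative_percent(baseline_ms, other_ms):
    """Express other_ms as a percentage of baseline_ms (200.0 means twice as slow)."""
    return other_ms / baseline_ms * 100


# Placeholder workloads; swap in the real library calls you want to compare.
def fast_candidate(seq):
    return seq.upper()


def slow_candidate(seq):
    return "".join(ch.upper() for ch in seq)


genome = "acgt" * 2500
fast = ms_per_iter(fast_candidate, genome, number=200)
slow = ms_per_iter(slow_candidate, genome, number=200)
print(f"fast {fast:.5f}ms / iter")
print(f"slow {slow:.5f}ms / iter {relative_percent(fast, slow):.2f}%")
```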

Should you use quickdna?

  • Quickdna pros:
    • It's quick!
    • It's simple and small.
    • It has type annotations, including a py.typed marker file for checkers like mypy or VS Code's Pyright.
    • It makes a type distinction between DNA and protein sequences, preventing confusion.
  • Quickdna cons:
    • It's newer and less battle-tested than Biopython.
    • It's not yet 1.0 -- the API is liable to change in the future.
    • It doesn't support reading FASTA files or many of the other tasks Biopython can do, so you'll probably end up still using Biopython or something else to do those tasks.
    • It doesn't support the (rarer) IUPAC ambiguity codes like B for non-A nucleotides, instead only supporting the general N ambiguity code.
      • If support for these codes is important to you, please make an issue! It may be possible to support them, it just isn't a priority right now.

Installation

Quickdna has prebuilt wheels for Linux (manylinux2010), macOS, and Windows available on PyPI.

Development

Quickdna uses PyO3 and maturin to build and upload the wheels, and poetry for handling dependencies. This is handled via a Justfile, which requires Just, a command-runner similar to make.

Poetry

You can install poetry from https://python-poetry.org, and it will handle the other python dependencies.

Just

You can install Just with cargo install just, and then run it in the project directory to get a list of commands.

Flamegraphs

The just profile command requires cargo-flamegraph; see that repository for installation instructions.

Comments
  • Update FASTA parser to be configurable, generic, track line numbers, and other changes

    Apologies for the large diff! Much of it is comments and tests, though.

    These changes make the FASTA parser more flexible, and will allow us to handle FASTA files the way we need for SecureDNA (SDNA employees, refer to the "Notes" section in [[Alpha API endpoint]] on the internal wiki).

    Changes

    Adds 2 new settings: allow_preceding_comment and concatenate_headers. These are described in the doc comments, but briefly:

    • allow_preceding_comment = true will ignore any content preceding the first header, like before. If false, it will emit that content as a headerless (header = "") record.
    • concatenate_headers = true will glue successive headers with no content together with newlines. If false, it will use the old behavior and emit content = "" records. See the doc comment for a doctest that explains this.

    The return value of FastaParser is now a FastaRecord with header and contents fields, instead of a 2-tuple. It's also grown a line_range field which gives the line numbers of the record (including header, 1-indexed, [start, end)).
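
The two settings and the record shape described above can be modelled in Python as follows. This is a rough sketch of the described behaviour only, not the Rust parser from this PR: `parse_fasta` is a hypothetical function, blank lines are assumed to be skipped, and content lines are assumed to be joined without separators.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FastaRecord:
    header: str
    contents: str
    line_range: Tuple[int, int]  # 1-indexed, half-open: [start, end)


def parse_fasta(text: str, *, allow_preceding_comment: bool,
                concatenate_headers: bool) -> List[FastaRecord]:
    records: List[FastaRecord] = []
    headers: List[str] = []   # header lines of the record being built
    body: List[str] = []      # content lines of the record being built
    start = 1                 # first line of the record being built
    seen_header = False

    def flush(end: int) -> None:
        if seen_header:
            records.append(FastaRecord("\n".join(headers), "".join(body), (start, end)))
        elif body and not allow_preceding_comment:
            # Content before the first header becomes a headerless record.
            records.append(FastaRecord("", "".join(body), (start, end)))

    lines = text.splitlines()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line.startswith(">"):
            if concatenate_headers and seen_header and not body:
                # Successive headers with no content are glued with newlines.
                headers.append(line[1:].strip())
                continue
            flush(lineno)
            headers, body, start, seen_header = [line[1:].strip()], [], lineno, True
        elif line:
            body.append(line)  # assumption: blank lines are ignored
    flush(len(lines) + 1)
    return records


sample = "junk\n>a\n>b\nACGT\nTT\n>c\nGG\n"
for rec in parse_fasta(sample, allow_preceding_comment=True, concatenate_headers=True):
    print(rec)
```

With both settings set to false, the same input instead yields four records: a headerless one for the leading content, an empty-content record for the first header, and one record each for the remaining two headers.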

    Removes DnaFastaParser and SimpleFastaParser in favor of a single FastaParser type which is generic over FromStr types. Since String has an (infallible) FromStr impl, SimpleFastaParser is now FastaParser<String>, and DnaFastaParser is now FastaParser<DnaSequence>. We didn't have this type before, but a hypothetical ProteinFastaParser is now FastaParser<ProteinSequence>, and you could even have FastaParser<u64> if you wanted that.

    FASTA parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future if we want. It also simplified the parsing code, since handling the new special cases in the old loop was getting hairy. Luckily, most of @mkysel's parsing logic was relatively straightforward to port into the state machine without too many changes besides handling the new settings.

    The old tests are all redone to use the new API (and to test all 4 settings combinations, where it makes sense), and a bunch of new tests are added.

    Misc changes:

    parse now explicitly asks for a BufRead instead of taking Read and wrapping it internally in BufReader. I checked the ecosystem using SourceGraph and that seemed to be the way most Rust stuff works, so this brings our API in line with that.

    There's a new parse_str method that takes advantage of &[u8] having a BufRead impl. That's part of the reason for the other change, too, since before that would have needlessly allocated a BufReader buffer which would bother me :-)

    opened by vgel 7
  • Give FastaRecord a to_string() method

    This Display impl converts the parsed record back into FASTA >header\ncontents\n format. Implementing Display automatically gets us ToString and .to_string().

    (We want this because it's the easiest way to turn a Vec<FastaRecord> contained in an ELT back into a string to ASN.1-encode it.)
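
For readers coming from the Python side, the closest analogue of a Rust Display impl is `__str__`. The sketch below uses a hypothetical Python `FastaRecord` dataclass (not a type that this PR adds) to show the `>header\ncontents\n` output format described above.

```python
from dataclasses import dataclass


@dataclass
class FastaRecord:
    header: str
    contents: str

    def __str__(self) -> str:
        # Same layout as the Display impl described above: ">header\ncontents\n".
        return f">{self.header}\n{self.contents}\n"


records = [FastaRecord("seq1", "ACGT"), FastaRecord("seq2", "TTGA")]
fasta_text = "".join(str(record) for record in records)
print(fasta_text, end="")
```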

    opened by lynn 5
  • Add iterator-based API for interacting with DNA

    This provides a more polished version of the API shown in #19, hopefully making it easier to work with DNA stored in more kinds of data structures. I've also included a benchmark for using this API to generate windows with fewer allocations, though it doesn't fully satisfy #20 because we still store AAs as &strs, preventing us from using Rust's standard windowing.

    opened by swooster 4
  • Handle all ambiguity codes + add "strict" types

    Closes #18.

    Changes

    Types

    • Nucleotide no longer has N in it, and just represents one of ACTG.
    • NucleotideAmbiguous represents ACTG or one of the 11 ambiguity codes WMRYSKBVDHN.
    • There is a trait NucleotideLike for common behavior between the two, like "to ASCII" and "complement".
    • Similarly, Codon (3x Nucleotide) and CodonAmbiguous (3x NucleotideAmbiguous) are different types now.
    • The ambiguous types have possibilities() methods for iterating over the possible realizations.

    Translation

    • Ambiguity codes are properly handled instead of mapping them all to N.
    • DnaSequence is generic over the type of the contained nucleotides. So, DnaSequence::<Nucleotide>::from_str(s) is our "strict mode", and DnaSequence::<NucleotideAmbiguous>::from_str(s) is the lax mode.

    Tests

    • I added a test test_dna_parses_strict, which verifies that this "strict mode" indeed only accepts "aAtTcCgG \t".
    • I added a test test_translate_ambiguous, which verifies that TTRTTV maps to protein LX:
            // R means "A or G" and both {TTA,TTG} map to L (Leucine).
            // Thus, "TTR" should map to L.
            //
            // But V means "A or G or C", and TTC maps to F (Phenylalanine).
            // Thus, "TTV" is truly ambiguous and maps to X.
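
The reasoning in that test comment can be checked with a short pure-Python sketch. It is not the Rust code from this PR: `possibilities` and `translate_codon` are hypothetical helpers that mirror the idea described above, the codon table is NCBI table 1 only, and any codon whose realizations disagree (including amino acid versus stop) is assumed to map to X.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# NCBI translation table 1, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# The four concrete bases plus the 11 IUPAC ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possibilities(codon: str):
    """Yield every concrete codon an ambiguous codon could stand for."""
    return ("".join(p) for p in product(*(IUPAC[base] for base in codon.upper())))


def translate_codon(codon: str) -> str:
    """Return the amino acid if all realizations agree, else the ambiguous code X."""
    amino_acids = {CODON_TABLE[c] for c in possibilities(codon)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate(seq: str) -> str:
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, len(seq) - 2, 3))


print(translate("TTRTTV"))
```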
      
    opened by lynn 3
  • Update pyo3 requirement from 0.15.1 to 0.16.5

    Updates the requirements on pyo3 to permit the latest version.

    Release notes

    Sourced from pyo3's releases.

    PyO3 0.16.5

    This release contains an FFI definition correction to resolve crashes for PyOxidizer on Python 3.10, and a new generate-import-lib feature to allow easier cross-compiling to Windows.

    Thank you to the following users for the improvements:

    @​cjermain @​davidhewitt @​indygreg @​messense

    Changelog

    Sourced from pyo3's changelog.

    [0.16.5] - 2022-05-15

    Added

    • Add an experimental generate-import-lib feature to support auto-generating non-abi3 python import libraries for Windows targets. #2364
    • Add FFI definition Py_ExitStatusException. #2374

    Changed

    • Deprecate experimental generate-abi3-import-lib feature in favor of the new generate-import-lib feature. #2364

    Fixed

    • Added missing warn_default_encoding field to PyConfig on 3.10+. The previously missing field could result in incorrect behavior or crashes. #2370
    • Fixed order of pathconfig_warnings and program_name fields of PyConfig on 3.10+. Previously, the order of the fields was swapped and this could lead to incorrect behavior or crashes. #2370

    [0.16.4] - 2022-04-14

    Added

    • Add PyTzInfoAccess trait for safe access to time zone information. #2263
    • Add an experimental generate-abi3-import-lib feature to auto-generate python3.dll import libraries for Windows. #2282
    • Add FFI definitions for PyDateTime_BaseTime and PyDateTime_BaseDateTime. #2294

    Changed

    • Improved performance of failing calls to FromPyObject::extract which is common when functions accept multiple distinct types. #2279
    • Default to "m" ABI tag when choosing libpython link name for CPython 3.7 on Unix. #2288
    • Allow to compile "abi3" extensions without a working build host Python interpreter. #2293

    Fixed

    • Crates depending on PyO3 can collect code coverage via LLVM instrumentation using stable Rust. #2286
    • Fix segfault when calling FFI methods PyDateTime_DATE_GET_TZINFO or PyDateTime_TIME_GET_TZINFO on datetime or time without a tzinfo. #2289
    • Fix directory names starting with the letter n breaking serialization of the interpreter configuration on Windows since PyO3 0.16.3. #2299

    [0.16.3] - 2022-04-05

    Packaging

    • Extend parking_lot dependency supported versions to include 0.12. #2239

    Added

    • Add methods to pyo3_build_config::InterpreterConfig to run Python scripts using the configured executable. #2092
    • Add as_bytes method to Py<PyBytes>. #2235
    • Add FFI definitions for PyType_FromModuleAndSpec, PyType_GetModule, PyType_GetModuleState and PyModule_AddType. #2250
    • Add pyo3_build_config::cross_compiling_from_to as a helper to detect when PyO3 is cross-compiling. #2253
    • Add #[pyclass(mapping)] option to leave sequence slots empty in container implementations. #2265
    • Add PyString::intern to enable usage of the Python's built-in string interning. #2268

    ... (truncated)

    Commits
    • 456a96d release: 0.16.5
    • da74187 Updating debugging docs with more info on rust-gdb (#2361)
    • c4414f3 Remove #[doc(hidden)] from trait impl items (#2365)
    • 11b97d3 Fix CI for hashbrown 0.12.1
    • 1149dcf Allow false positive clippy::unnecessary-wraps lint
    • f05cc91 Auto generate Windows import libraries when using a pyo3 config file
    • f60d24b Update changelog for #2364
    • 02e4f19 Change default python lib name for Windows when cross compiling
    • 7104a0c Add Windows non-abi3 cross compile test
    • 29b8731 Add support for generating non-abi3 python import libraries for Windows
    • Additional commits viewable in compare view

    opened by dependabot[bot] 3
  • Update fasta

    Updates to fasta parsing

    • change DNAFastaParser to DnaFastaParser
    • change &mut R to R
    • handle BufReader internally, instead of passing a BufReader to parser
    opened by hwchen 1
  • explicitly prevent publishing in Cargo.toml

    We don't want to publish yet; say so explicitly in Cargo.toml (this will also help in case we use automated release tools like cargo-release)

    quickdna on  chore/no-publish [+1] > cargo publish --dry-run
    error: `quickdna` cannot be published.
    `package.publish` is set to `false` or an empty list in Cargo.toml and prevents publishing.
    
    opened by hwchen 0
  • Swooster/additional type inferences

    When writing NucleotideIter, I failed to realize that a common use-case was code generic on NucleotideLike, so it's only easy-to-use with concrete NucleotideLike types. Using it with a generic NucleotideLike requires adding a bunch of cumbersome non-obvious constraints, which in turn affects callers of your code. In order to make it convenient to use NucleotideIter in generic contexts I'm adding the ability to make the following inferences:

    • Anything NucleotideLike is ToNucleotideLike (sadly, I can't also do this for &impl NucleotideLike types generically without specialization, but that's not a big deal in practice because we can work around it with .copied() iters)
    • Any codons derived from NucleotideLike types may be passed to TranslationTable::to_fn callables.
    opened by swooster 0
  • Add Python linter to CI

    Add python linter (black) to CI. The python test suite is already being run as part of the maturin Rust build.

    I could not add pyright as there are a bunch of errors when invoked. It will be more work to enable pyright.

    opened by mkysel 0
  • Strict vs. ambiguous types for nucleotides

    Right now we have a single Nucleotide type that supports A/T/C/G/N (N being "any of ATCG").

    We'd like to move this type to be "strict" (only support ATCG), then have NucleotideAmbiguous which is any of the ambiguity codes (ATCGYRWSKMDVHBN).

    They'd have the same enum values so "converting" a sequence would just be a zero-copy check.

    Thought: assign ATCG the enum values {1,2,4,8}, and represent ambiguities as bitwise ORs.

    Consideration: CodonIdx becomes 12-bit instead of 9-bit; the translation tables will now each have length 4096, up from 293. (But this is surely still small enough to keep one cached.)
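
The bitmask idea can be sketched in Python to make the arithmetic concrete. Everything here is an assumption for illustration: the issue only says ATCG get the values {1,2,4,8}, so the specific assignment A=1, T=2, C=4, G=8 and the helper names `is_strict`, `possibilities`, and `codon_index` are hypothetical, as is the way three 4-bit values are packed into a 12-bit index.

```python
# Assumed assignment; the issue only specifies the set of values {1, 2, 4, 8}.
A, T, C, G = 1, 2, 4, 8

# Each ambiguity code is the bitwise OR of the bases it may stand for.
CODES = {
    "A": A, "T": T, "C": C, "G": G,
    "W": A | T, "S": C | G, "M": A | C, "K": G | T, "R": A | G, "Y": C | T,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | T | C | G,
}


def is_strict(mask: int) -> bool:
    """A strict nucleotide has exactly one bit set."""
    return mask in (A, T, C, G)


def possibilities(mask: int) -> list:
    """List the concrete bases an (ambiguous) nucleotide may stand for."""
    return [base for base in (A, T, C, G) if mask & base]


def codon_index(n1: int, n2: int, n3: int) -> int:
    """Pack three 4-bit nucleotides into a 12-bit index (0..4095)."""
    return (n1 << 8) | (n2 << 4) | n3


sequence = [CODES[ch] for ch in "ATRN"]
print(all(is_strict(n) for n in sequence))            # strictness is a pure check
print(possibilities(CODES["V"]))
print(codon_index(CODES["N"], CODES["N"], CODES["N"]))
```

Because the strict and ambiguous representations share values, checking whether a sequence is strict requires only a scan and no conversion, and a table indexed by `codon_index` needs 2^12 = 4096 slots.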

    opened by lynn 0
  • Allow empty spaces in DNA sequences

    This PR allows empty spaces (which get ignored) in DNA sequences.

    Considerations:

    • I did consider putting the check in nucleotides, but those should still error out. It's only DNA sequences that can contain nucleotide sequences with spaces in them.

    Peptides accept empty spaces today and do not require a change. They also accept incorrect characters, but that is not the goal of this PR.

    opened by mkysel 0
  • chore: Release quickdna version 0.4.0

    Looked into also adding automatic changelog generation, but we need to format our commits so they can be parsed.

    cargo-release also did more in the past (it created a dev-version after the released version: v0.5.0-alpha after v0.4.0), but looks like now it's just the version bump + git tag.

    opened by hwchen 1
  • Remove FastaParseSettings::strict() and ::lax()

    …so that we can unambiguously use "strict" and "lax" to instead refer to whether we allow N nucleotides or not.

    Instead, there is now a ::default() which disallows preceding file comments (i.e. treats them as sequences), but does concatenate headers. This is the parsing behavior we describe in our internal wiki. Any override of these defaults should be done explicitly when constructing the parser.

    opened by lynn 0
  • Make an AminoAcid type

    Right now a ProteinSequence is just an unchecked ASCII sequence.

    We should make an AminoAcid type like Nucleotide that represents only the valid amino acids ABCDEFGHIKLMNPQRSTVWYZ, and then AminoAcidAmbiguous which includes ambiguity code X. See #18.

    Thought: if we assign these by ASCII codes like enum AminoAcid { A = 65, B = 66, ... } then we can cast &[u8] to &[AminoAcid] after validation. But maybe that's a bit overzealous if we normally have to strip whitespace and uppercase the string anyway.

    Consideration: What about . (deletion) and * (terminator)? Should we treat those as ambiguity codes or as something else?

    opened by lynn 0
  • Zero-copy windowing

    Right now one of the biggest sources of allocations in synth_client is that DnaSequence allocates a new Vec sequence for every window. If we could just return a reference to the owned sequence that would be ideal. See #19.
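
As a language-neutral illustration of the goal, here is a Python analogue using `memoryview`, where each window is a view into the original buffer instead of a copy. This is only a sketch of the concept: the issue is about the Rust DnaSequence type, and the `windows` function below is hypothetical.

```python
def windows(seq: bytes, size: int):
    """Yield every contiguous window of `size` bytes without copying the data."""
    view = memoryview(seq)
    for start in range(len(view) - size + 1):
        # Slicing a memoryview creates a new view object, not a new buffer.
        yield view[start:start + size]


data = b"ACGTAC"
for window in windows(data, 3):
    # bytes(window) copies, so do it only when an owned value is really needed.
    print(bytes(window), window.obj is data)
```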

    opened by lynn 0
  • Allow non-owned Sequence types

    Currently DnaSequence owns a Vec<Nucleotide>. But we'd also like to be able to have a DnaSequence backed by a borrowed &[Nucleotide]. Same for ProteinSequence.

    Thought from @vgel: use generics, so DnaSequence<Vec<Nucleotide>> vs DnaSequence<&[Nucleotide]>, and then maybe nice aliases for these. Maybe even make DnaSequence<[Nucleotide; N]> work. This is what SmallVec does.

    Alternatively, follow the Rust tradition of two typenames for borrowed vs. owned, like str vs String and Path vs PathBuf.

    opened by lynn 8
  • Lockfile for python bindings

    As Martin pointed out in #14 , we should have dependabot and a lockfile for the Python bindings part of the crate, since that produces a built artifact (quickdna.so). But since we don't want that for the Rust part which is meant to be consumed as a regular Rust library (https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html), we probably want two separate crates that aren't in a workspace so that only one has a lockfile, or something like that.

    opened by vgel 0
Owner
Secure DNA