A bit-packed k-mer representation (and relevant utilities) for Rust

Overview

K-mer class for Rust

The purpose of this repository is to build a simple library that exposes a bit-packed k-mer class for use in Rust-based bioinformatics projects. The class is generic over the k-mer length through const generics (currently a relatively new feature in Rust).

Contributors

This is meant to be a community project; I originally reached out to some potentially interested parties via Twitter. If you'd like to help, please feel welcome to contribute!

Minimum supported Rust version

Currently, the minimum supported Rust version is 1.51.0.

Comments
  • Let user choose padding

    Hi,

    Currently the library stores k-mers in an array of u64. It would be nice to let the user choose the element type of that array. I work with very short k-mers, and using a u64 to store each one has a significant cost.

    opened by natir 4
  • Encoding

    In this pull request I introduce a trait, Encoder, with three methods: encoding, decoding, and rev_comp. Kmer can use these Encoder methods to perform encoding, decoding, and reverse-complement operations on the array.

    I also introduce two encoders. The first, Naive, supports any encoding; the trade-off is slower operations. The second, Xor10, is faster but supports only one encoding (A -> 00, C -> 01, T -> 10, G -> 11).
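As a rough illustration of the design described above, here is a minimal sketch of what such a trait and the two encoders might look like. The method signatures, the `alphabet` field, and the `rev_comp_str` helper are assumptions made for this example; they are not taken from the pull request.

```rust
/// Hypothetical trait: converts between ASCII nucleotides and 2-bit codes.
trait Encoder {
    /// ASCII nucleotide -> 2-bit code.
    fn encode(&self, nuc: u8) -> u8;
    /// 2-bit code -> ASCII nucleotide.
    fn decode(&self, code: u8) -> u8;
    /// Reverse complement of a sequence of 2-bit codes.
    fn rev_comp(&self, codes: &[u8]) -> Vec<u8>;
}

/// Works with any assignment of codes to nucleotides, at the cost of lookups.
struct Naive {
    /// `alphabet[code]` is the nucleotide encoded as `code` (assumed layout).
    alphabet: [u8; 4],
}

impl Naive {
    fn complement_nuc(nuc: u8) -> u8 {
        match nuc {
            b'A' => b'T',
            b'T' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            other => other,
        }
    }
}

impl Encoder for Naive {
    fn encode(&self, nuc: u8) -> u8 {
        self.alphabet
            .iter()
            .position(|&n| n == nuc)
            .expect("nucleotide not in alphabet") as u8
    }

    fn decode(&self, code: u8) -> u8 {
        self.alphabet[(code & 0b11) as usize]
    }

    fn rev_comp(&self, codes: &[u8]) -> Vec<u8> {
        // Go through the ASCII letters, since the code layout is arbitrary.
        codes
            .iter()
            .rev()
            .map(|&c| self.encode(Self::complement_nuc(self.decode(c))))
            .collect()
    }
}

/// Fixed encoding A -> 00, C -> 01, T -> 10, G -> 11.
struct Xor10;

impl Encoder for Xor10 {
    fn encode(&self, nuc: u8) -> u8 {
        // The code is bits 1 and 2 of the uppercase ASCII letter.
        (nuc >> 1) & 0b11
    }

    fn decode(&self, code: u8) -> u8 {
        b"ACTG"[(code & 0b11) as usize]
    }

    fn rev_comp(&self, codes: &[u8]) -> Vec<u8> {
        // With this encoding, complementing is a single XOR with 0b10.
        codes.iter().rev().map(|&c| c ^ 0b10).collect()
    }
}

/// Hypothetical helper: reverse complement of an ASCII sequence via an encoder.
fn rev_comp_str<E: Encoder>(enc: &E, seq: &[u8]) -> Vec<u8> {
    let codes: Vec<u8> = seq.iter().map(|&n| enc.encode(n)).collect();
    enc.rev_comp(&codes).into_iter().map(|c| enc.decode(c)).collect()
}
```

Under these assumptions, `rev_comp_str(&Xor10, b"ACTG")` and `rev_comp_str(&Naive { alphabet: *b"ACGT" }, b"ACTG")` both yield `CAGT`, even though the two encoders assign different codes to G and T.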

    We can add many other encoders, for example one that uses SIMD instructions to convert 32 nucleotides into a u64; @Daniel-Liu-c0deb0t has written some algorithms for this in nuc2bits. Using a trait as the interface lets users create an Encoder that fits their needs exactly.

    In this pull request the Kmer constructor accepts a u8 slice and an object that implements Encoder. We could instead make the Encoder a type argument of Kmer, but the type declaration becomes a bit of a nightmare:

    let kmer: Kmer<15, u16,  { word_for_k::<u16, 15>() }, Encoding::Naive::ACTG> = Kmer::new("ACTGAGAGAGACCAT");
    

    This type complexity could be reduced with type aliases, and further once Rust gains a more complete const generics interface.

    To simplify the code base I also created a trait, Data, that groups all the traits a type must implement to be used in the array, and added a to_u8 method required by the Naive encoder. This trait and this method could be moved or removed.

    My implementation could probably be improved in many places, but I think the general structure is already nice, interesting, and ready to be discussed.

    This pull request also contains PR #8.

    opened by natir 2
  • Let user choose padding

    This is an attempt to solve #7.

    I just added a generic parameter P to let the user choose which type is used to store the k-mer.

    I also added a dependency on the bit_field crate, which provides a trait for manipulating bits in a value. The trait is implemented for u8, u16, u32, u64, u128, usize, i8, i16, i32, i64, i128, and isize. I think this could be very useful for us.

    A nice side effect of using the bit_field crate is that P must implement bit_field::BitField, which restricts the types the user can choose.

    I also replaced the word_for_k macro with a const function, purely for aesthetic reasons.
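To make the storage trade-off concrete, here is a small sketch of a `word_for_k` const function that is generic over the storage type. It assumes 2 bits per nucleotide and derives the word width from `size_of`, so it does not depend on the bit_field crate; the `storage_bytes` helper is hypothetical and only there for the comparison.

```rust
use std::mem::size_of;

/// Number of words of type `P` needed to hold `K` nucleotides,
/// assuming 2 bits per nucleotide (rounded up to whole words).
const fn word_for_k<P, const K: usize>() -> usize {
    let bits_per_word = size_of::<P>() * 8;
    (2 * K + bits_per_word - 1) / bits_per_word
}

/// Hypothetical helper: bytes occupied by the backing array `[P; word_for_k]`.
const fn storage_bytes<P, const K: usize>() -> usize {
    word_for_k::<P, K>() * size_of::<P>()
}
```

With these definitions, a 3-mer needs 1 byte when `P` is `u8` but 8 bytes when `P` is `u64`, which is the cost the original issue describes for very short k-mers. Because the function is `const`, a call with concrete arguments such as `word_for_k::<u16, 15>()` can be used as an array length or as a const generic argument.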

    opened by natir 1
  • Small improvement of benchmark

    Run the benchmark on sequence lengths 2^i, for i between 8 and 16, to check how it behaves at different lengths (checking linearity).

    Add a reverse-complement benchmark.

    The benchmark code may be too rustic; I can rewrite it in a more readable way.

    opened by natir 0
  • add CI

    Mostly copied from the niffler configs.

    There are a lot of jobs defined, but the main ones are

    • tests, for running tests on stable, beta, windows (stable) and mac (stable)
    • cross_testing, to make sure it works on some non-x86 targets (s390x is especially curious, since it is big endian)
    • coverage using tarpaulin (and running on nightly)
    • publish to make sure it is publishable on crates.io
    • MSRV check (targeting 1.51 for now)
    • wasm to check if it builds for webassembly
    • test_all_feature_combinations checks that all feature combinations work together, since they should be additive
    opened by luizirber 0
  • Internal storage order

    After trying to write some code interfacing kmers with k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers the leftmost nucleotide is stored in the highest-order bits; in needletail and some of the other (C++) libraries I have used in the past, it's the opposite.
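The two conventions can be sketched as follows for a k-mer that fits in a single u64. This is only an illustration: the function names are hypothetical, the 2-bit encoding (A -> 00, C -> 01, T -> 10, G -> 11) is assumed, and "highest order" is interpreted as the highest of the 2k bits in use rather than the top of the word.

```rust
/// Assumed 2-bit encoding: A -> 00, C -> 01, T -> 10, G -> 11.
fn encode(nuc: u8) -> u64 {
    ((nuc >> 1) & 0b11) as u64
}

/// Leftmost nucleotide ends up in the highest-order used bits.
fn pack_left_high(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 nucleotides");
    seq.iter().fold(0u64, |acc, &n| (acc << 2) | encode(n))
}

/// Leftmost nucleotide ends up in the lowest-order bits.
fn pack_left_low(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 nucleotides");
    seq.iter()
        .enumerate()
        .fold(0u64, |acc, (i, &n)| acc | (encode(n) << (2 * i)))
}
```

For `ACTG`, `pack_left_high` gives `0b00_01_10_11` while `pack_left_low` gives `0b11_10_01_00`, so values produced under one convention cannot be compared directly with values produced under the other.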

    Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?

    opened by rob-p 10
  • Roadmap / TODO

    This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these tasks into short and long term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.

    documentation enhancement 
    opened by rob-p 3
  • Dealing with the (temporary) lack of constant computations in const generics

    So, it turns out that currently (as of Rust v1.51), with MVP const generics, we are not allowed to do simple computations with generic parameters. For example, it would be desirable to have something like this to automatically compute the size of the storage array based on the value of K provided to the class. However, such a capability is gated behind a feature flag on nightly rustc. We should determine how we want to handle this from a design perspective. I see a few options:

    1. Use more than one const generic parameter, to avoid having to do const-generic arithmetic on the receiving end. Once that arithmetic is available in stable, we can of course simplify the interface. This is a little onerous, since we would have to provide, e.g., a helper (const) function or macro so that the user doesn't have to work out the number of words to use for storage, which is error prone anyway.
    2. Throw caution to the wind and require nightly rustc with the feature gate to allow ourselves to do the arithmetic we want with const generic parameters.
    3. Some other and much more clever solution I've not yet considered.
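Option 1 could look roughly like the sketch below. The struct layout, the `words_for_k` helper, and the run-time check in `new` are assumptions made for illustration, not the library's actual API; the sketch fixes the storage word to u64 and 2 bits per nucleotide.

```rust
/// Hypothetical helper: number of u64 words needed for `k` nucleotides
/// at 2 bits each.
const fn words_for_k(k: usize) -> usize {
    (2 * k + 63) / 64
}

/// K-mer with two const generic parameters: the length `K` and the number
/// of storage words `W`, which stable Rust cannot derive from `K` here.
struct Kmer<const K: usize, const W: usize> {
    words: [u64; W],
}

impl<const K: usize, const W: usize> Kmer<K, W> {
    /// Creates an all-zero k-mer, checking at run time that `W` matches `K`.
    fn new() -> Self {
        assert!(W == words_for_k(K), "W must equal words_for_k(K)");
        Kmer { words: [0u64; W] }
    }

    fn num_words(&self) -> usize {
        self.words.len()
    }
}

// The helper can be called with a concrete K in a const argument,
// so type aliases hide the second parameter from the user.
type Kmer31 = Kmer<31, { words_for_k(31) }>;
type Kmer40 = Kmer<40, { words_for_k(40) }>;
```

The aliases keep call sites short (`Kmer40::new()`), and the assertion catches a hand-written `W` that does not match `K`, which is the error-prone part of this option.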

    I'd appreciate others' thoughts on this.

    opened by rob-p 3
  • library name and logo?

    As a fundamental start point for any good software project, we need to decide on the name and generate a logo. I kept the repo name simple, but I was imagining we'd write it out as something like kme-rs or kme.rs or something cute like that ;P. I just wanted to open discussion on this very important topic here to invite opinions and thoughts.

    opened by rob-p 6
  • Encoding Type

    I'm assuming we'd be using 2 bits per bp. There are a couple of common encodings, but I want to suggest:

    A -> 00
    C -> 01
    T -> 10
    G -> 11
    

    There are a couple of benefits:

    1. These are the 2nd and 3rd lowest bits of the ASCII code of the corresponding base (bits 1 and 2, counting from 0), so conversion from byte strings would be easy.
    2. The complement can be computed by XOR with the repeating mask ...10101010.
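Both benefits can be checked with a short sketch on a k-mer packed into one u64. The function names and the choice to put the leftmost base in the highest used bits are assumptions for this example only.

```rust
/// Repeating 10 pattern: XOR with it flips the high bit of every 2-bit code.
const COMPLEMENT_MASK: u64 = 0xAAAA_AAAA_AAAA_AAAA;

/// A -> 00, C -> 01, T -> 10, G -> 11: bits 1 and 2 of the uppercase ASCII code.
fn encode_base(b: u8) -> u64 {
    ((b >> 1) & 0b11) as u64
}

fn decode_base(code: u64) -> u8 {
    b"ACTG"[(code & 0b11) as usize]
}

/// Packs up to 32 bases, leftmost base in the highest used bits.
fn pack(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
    seq.iter().fold(0u64, |acc, &b| (acc << 2) | encode_base(b))
}

/// Unpacks the `k` bases stored in the low 2k bits of `word`.
fn unpack(word: u64, k: usize) -> Vec<u8> {
    (0..k)
        .map(|i| decode_base(word >> (2 * (k - 1 - i))))
        .collect()
}

/// Complements every base at once, then clears the unused high bits.
fn complement(word: u64, k: usize) -> u64 {
    let used = if k >= 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    (word ^ COMPLEMENT_MASK) & used
}
```

For example, `unpack(complement(pack(b"ACTG"), 4), 4)` yields `TGAC`: A and T swap, as do C and G, with a single XOR over the whole word.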

    On a slightly unrelated note, I've worked on some sequence manipulation code that uses SIMD (e.g., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling k-mers that are relatively short.

    opened by Daniel-Liu-c0deb0t 7
Owner
COMBINE lab
COMputational BIology and Network Evolution lab