A bit-packed k-mer representation (and relevant utilities) for rust

COMBINE lab

Last update: Dec 15, 2022

Related tags

Learning Resources kmers

Overview

K-mer class for rust

The purpose of this repository is to build a simple library that exposes a bit-packed k-mer class for use in rust-based bioinformatics projects. The class implementation is generic over the k-mer length by making use of const generics (which are currently a relatively new feature in rust).

Contributors

This is meant to be a community project, where I originally attempted to reach out to some potentially interested parties via twitter. If you'd like to help, please feel welcome to contribute!

Minimum supported Rust version

Currently the minimum supported Rust version is 1.51.0

Comments

Let user choose padding

Hi,

Currently, the library stores k-mers in an array of u64. It would be nice to let the user choose the array's element type: I work with very short k-mers, and using a u64 to store each one has a significant cost.

opened by natir 4
Encoding
In this pull request I introduce a trait, Encoder, with three methods: encoding, decoding, and rev_comp. Kmer can use these methods to perform encoding, decoding, and reverse-complement operations on the storage array.

I also introduce two encoders. The first, Naive, supports any encoding; the trade-off is slower operations. The second, Xor10, is faster but supports only one encoding (A -> 00, C -> 01, T -> 10, G -> 11).
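To make the trait idea concrete, here is a minimal sketch of what such an interface could look like. It is not the code from the pull request: the method signatures, the single-u64 storage (so k <= 32), and the `order` field are all assumptions made for illustration. It shows a table-driven encoder in the spirit of Naive, which accepts any assignment of nucleotides to 2-bit codes at the price of a lookup per base.

```rust
/// Hypothetical, simplified encoder interface (signatures are assumptions).
pub trait Encoder {
    fn encode(&self, seq: &[u8]) -> u64;
    fn decode(&self, packed: u64, k: usize) -> Vec<u8>;
    fn rev_comp(&self, packed: u64, k: usize) -> u64;
}

/// Table-driven encoder: `order[code]` is the nucleotide assigned to `code`.
pub struct Naive {
    pub order: [u8; 4],
}

impl Naive {
    /// Linear search in the table; this is what makes the encoder slow.
    fn code_of(&self, base: u8) -> u64 {
        self.order
            .iter()
            .position(|&b| b == base)
            .expect("unsupported nucleotide") as u64
    }

    fn complement_of(base: u8) -> u8 {
        match base {
            b'A' => b'T',
            b'T' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            other => panic!("unsupported nucleotide {}", other as char),
        }
    }
}

impl Encoder for Naive {
    /// Leftmost base goes into the highest-order occupied bits.
    fn encode(&self, seq: &[u8]) -> u64 {
        assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
        seq.iter().fold(0u64, |acc, &b| (acc << 2) | self.code_of(b))
    }

    fn decode(&self, packed: u64, k: usize) -> Vec<u8> {
        (0..k)
            .rev()
            .map(|i| self.order[((packed >> (2 * i)) & 0b11) as usize])
            .collect()
    }

    /// Decode, reverse, complement, and re-encode: simple but not fast.
    fn rev_comp(&self, packed: u64, k: usize) -> u64 {
        let rc: Vec<u8> = self
            .decode(packed, k)
            .iter()
            .rev()
            .map(|&b| Self::complement_of(b))
            .collect();
        self.encode(&rc)
    }
}
```

With `Naive { order: *b"ACGT" }`, encoding GATTACA, calling `rev_comp` with k = 7, and decoding the result gives TGTAATC. A fixed-encoding implementation such as Xor10 could replace the table lookups with bit operations behind the same trait.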

We can add many other encoders, for example one that uses SIMD instructions to convert 32 nucleotides into a u64; @Daniel-Liu-c0deb0t created some algorithms to do this here: nuc2bits. Using a trait as the interface lets users create an Encoder that fits their needs exactly.

In this pull request the Kmer constructor accepts a u8 slice and an object that implements Encoder. We could move the Encoder into a Kmer type argument, but the type declaration becomes a bit of a nightmare:

let kmer: Kmer<15, u16, { word_for_k::<u16, 15>() }, Encoding::Naive::ACTG> = Kmer::new("ACTGAGAGAGACCAT");

This type complexity could be reduced by type aliasing, and also once Rust gets a more complete const generics interface.

To simplify the code base I also created a trait, Data, that groups all the traits that must be implemented by the type used in the array, and adds a method to_u8 required by the Naive encoder. This trait and this method could be moved or removed.

My implementation could probably be improved in many places, but I think the general structure is already nice, interesting, and ready to be discussed.

This pull request also contains PR #8.
opened by natir 2
Let user choose padding

This is an attempt to solve #7.

I just added a generic parameter P to let the user choose which type is used to store the k-mer.

I also added a dependency on the bit_field crate. This crate provides a trait for manipulating bits in a value; the trait is implemented for u8, u16, u32, u64, u128, usize, i8, i16, i32, i64, i128 and isize. I think this could be very useful for us.

A nice side effect of using the bit_field crate is that P must implement bit_field::BitField, which restricts the number types the user can choose.

I also replaced the macro word_for_k with a const function, purely for aesthetic reasons.
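As an illustration of why the storage type matters for short k-mers, the sketch below computes how many words, and how many bytes, a k-mer needs for a given storage type. The name `word_for_k` is taken from the text above, but its body and signature here are assumptions, and `bytes_for_k` is a hypothetical helper added only for this example. Both assume 2 bits per base.

```rust
use std::mem::size_of;

/// Number of words of type `P` needed to hold `K` bases at 2 bits per base.
/// (Assumed signature; only the name comes from the pull request.)
pub const fn word_for_k<P, const K: usize>() -> usize {
    let bits_per_word = size_of::<P>() * 8;
    // Ceiling division of the 2 * K required bits by the word width.
    (2 * K + bits_per_word - 1) / bits_per_word
}

/// Hypothetical helper: total bytes occupied by the storage array.
pub const fn bytes_for_k<P, const K: usize>() -> usize {
    word_for_k::<P, K>() * size_of::<P>()
}
```

Under these assumptions, a 4-mer takes 1 byte when stored in u8 but 8 bytes when stored in u64, and a 15-mer takes 4 bytes in u16 (two words) against 8 bytes in u64.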

opened by natir 1
Small improvement of benchmark

Run the benchmarks on sequence lengths of 2^i, for i between 8 and 16, to check how they behave at different lengths (check linearity).

Add a reverse complement benchmark.

Maybe the benchmark code is too rustic; I can rewrite it in a more readable way.

opened by natir 0
add CI
Mostly copied from the niffler configs.

There are a lot of jobs defined, but the main ones are

tests, for running tests on stable, beta, windows (stable) and mac (stable)

cross_testing, to make sure it works on some non-x86 targets (s390x is especially curious, since it is big endian)

coverage using tarpaulin (and running on nightly)

publish to make sure it is publishable on crates.io

MSRV check (targeting 1.51 for now)

wasm to check if it builds for webassembly

test_all_feature_combinations checks that all feature combinations work together, since they should be additive (more info)
opened by luizirber 0
Internal storage order

So after trying to write some code interfacing kmers with some k-mers parsed using needletail, I realized we are storing the k-mers in the opposite order. In kmers the leftmost nucleotide is stored in the highest order bit, in needletail and some of the other (C++) libraries I have used in the past it's the opposite.
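The two layouts can be compared with a small sketch. It assumes a single u64 of storage, k <= 32, and the 2-bit encoding A -> 00, C -> 01, T -> 10, G -> 11 proposed in the Encoding Type issue; the function names are hypothetical and neither function is taken from kmers or needletail.

```rust
/// Assumed 2-bit encoding: bits 1 and 2 of the ASCII code of the base.
fn base_code(base: u8) -> u64 {
    ((base >> 1) & 0b11) as u64
}

/// Leftmost nucleotide ends up in the highest-order occupied bits.
fn pack_left_high(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
    seq.iter().fold(0u64, |acc, &b| (acc << 2) | base_code(b))
}

/// Leftmost nucleotide ends up in the lowest-order bits.
fn pack_left_low(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
    seq.iter()
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc | (base_code(b) << (2 * i)))
}

/// Convert between the two layouts by reversing the order of the 2-bit fields.
fn swap_order(packed: u64, k: usize) -> u64 {
    let mut rest = packed;
    let mut out = 0u64;
    for _ in 0..k {
        out = (out << 2) | (rest & 0b11);
        rest >>= 2;
    }
    out
}
```

For ACG, `pack_left_high` gives 0b000111 and `pack_left_low` gives 0b110100. Interfacing two libraries that disagree on the order therefore costs one field reversal per k-mer, as in `swap_order`.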

Are there any strong arguments for one of these schemes over the other (cc @natir, @Daniel-Liu-c0deb0t, @luizirber)? Is this something we should also consider making part of the policy of how the k-mers are constructed?

opened by rob-p 10
Roadmap / TODO

This issue will provide a roadmap for the library, along with specific tasks (TODOs). Ideally we should break these tasks into short and long term tasks and, as the library becomes more mature, tie individual tasks to specific release candidates.
documentation enhancement

opened by rob-p 3
Dealing with the (temporary) lack of constant computations in const generics
So, it turns out that currently (as of rust v1.51) with MVP const generics, we are not allowed to do simple computations with generic parameters. For example, it would be desirable to have something like this to automatically compute the size of the storage array we want to use based on the value of K provided to the class. However, such a capability is gated behind a feature flag on the nightly branch of rustc. We should determine how we want to handle this from a design perspective. I see a few options:

Go with > 1 const generic parameter, to avoid having to do const-generic arithmetic on the receiving end. Once that is available in stable, we can of course simplify the interface. This is a little bit onerous, since now we have to e.g. provide a helper (const) function or macro or some such so that the user doesn't have to think about the number of words that should be used for storage, which is, anyway, error prone.

Throw caution to the wind and require nightly rustc with the feature gate to allow ourselves to do the arithmetic we want with const generic parameters.

Some other and much more clever solution I've not yet considered.
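As a rough sketch of the first option, the code below uses two const generic parameters plus a const helper function, which compiles on stable Rust. The names `Kmer`, `words_for_k`, and `num_words` and the u64 storage are assumptions made for illustration, not the library's actual definitions.

```rust
/// Number of u64 words needed to store `k` bases at 2 bits per base.
pub const fn words_for_k(k: usize) -> usize {
    (2 * k + 63) / 64
}

/// `K` is the k-mer length and `W` the number of storage words. Stable Rust
/// cannot derive `W` from `K` inside the type, so the caller supplies both.
pub struct Kmer<const K: usize, const W: usize> {
    words: [u64; W],
}

impl<const K: usize, const W: usize> Kmer<K, W> {
    pub fn new() -> Self {
        // Runtime guard against a mismatched pair of parameters.
        assert!(W == words_for_k(K), "W must equal words_for_k(K)");
        Kmer { words: [0u64; W] }
    }

    pub fn num_words(&self) -> usize {
        self.words.len()
    }
}
```

A caller would write `let kmer: Kmer<31, { words_for_k(31) }> = Kmer::new();`. The braces are accepted because the expression involves no generic parameter. The redundancy between `K` and `W` is the error-prone part mentioned above, which is why the sketch checks it in the constructor.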

I'd appreciate others' thoughts on this.
opened by rob-p 3
library name and logo?

As a fundamental start point for any good software project, we need to decide on the name and generate a logo. I kept the repo name simple, but I was imagining we'd write it out as something like kme-rs or kme.rs or something cute like that ;P. I just wanted to open discussion on this very important topic here to invite opinions and thoughts.

opened by rob-p 6
Encoding Type
I'm assuming we'd be using 2 bits per bp. There's a couple of common encodings, but I want to suggest

A -> 00 C -> 01 T -> 10 G -> 11

There's a couple of benefits:

These are the 2nd and 3rd least-significant bits of the ASCII encoding of the corresponding bases, so conversion from byte strings would be easy.

Complement by XORing with ...101010, which flips the high bit of every 2-bit base.
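Both benefits can be checked with a few lines of code. This is a minimal sketch assuming a single u64 of storage with the leftmost base in the highest-order occupied bits; the function names are hypothetical, and no input validation is done.

```rust
/// A -> 00, C -> 01, T -> 10, G -> 11, read from bits 1 and 2 of the ASCII code.
fn encode_base(base: u8) -> u64 {
    ((base >> 1) & 0b11) as u64
}

/// Pack up to 32 bases into a u64.
fn pack(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
    seq.iter().fold(0u64, |acc, &b| (acc << 2) | encode_base(b))
}

/// Complement all `k` bases at once: XOR with the repeating pattern 10,
/// then clear the bits above the k-mer.
fn complement(packed: u64, k: usize) -> u64 {
    let mask = if k >= 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    (packed ^ 0xAAAA_AAAA_AAAA_AAAA) & mask
}
```

For example, `pack(b"ACTG")` is 0b00011011, and `complement` of that value with k = 4 equals `pack(b"TGAC")`.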

On a slightly unrelated note, I've worked on some sequence manipulation stuff that use SIMD (eg., here, here for a library that was abandoned). Many of these ideas could be applicable here as well. I'm assuming that we want scalar ops only here because SIMD registers are probably too wide (128 or 256 bits) for handling kmers that are relatively short.
opened by Daniel-Liu-c0deb0t 7

Owner

COMBINE lab

COMputational BIology and Network Evolution lab

