A Rust library for building modular, fast and compact indexes over genomic data

mazu

Mazu (媽祖)... revered as a tutelary deity of seafarers, including fishermen and sailors...

Disclaimer: this library is in alpha and under active development.

Highlights

  1. Query-ready indexes via plug-and-play k-mer-to-unitig and unitig-to-occurrence mappings.
  2. Load-only compatibility with pufferfish: deserialize pufferfish indices and work with them in Rust.
  3. Streaming queries for generic indexes, for free, with .as_streaming().
  4. An easy test-bed for new compression algorithms for unitig occurrences and k-mer dictionaries.
  5. No more CMake.

Examples

// Load a pufferfish index built by the C++ implementation
let p = to_abs_path(YEAST_CHR01_INDEX);
let pi = DenseIndex::deserialize_from_cpp(p).unwrap();
// Extract unitigs and build an SSHash
let unitig_set = pi.as_ref().clone();
let sshash = SSHash::from_unitig_set(unitig_set, 15, 32, WyHashState::default()).unwrap();

// Drop in an SSHash for a new index
let pi = ModIndex::from_parts(
    pi.base.clone(),
    sshash,
    pi.as_u2pos().clone(),
    pi.as_refseqs().clone(),
);

// Generic implementations take care of query and validation
pi.validate_self();

// Attach a streaming cache and drive the index.
let driver = pi.as_streaming();
driver.validate_self();

Comments
  • Design the degree to which `mazu` should handle versioning and provenance information

    This is a library, not a CLI, and it can support basic, known instantiations of index types, e.g. pufferfish and piscem. ModIndex should report and store only basic information, split according to the k2u and u2pos data structures. CLIs that use mazu and need fine-grained control over metadata should use the newtype idiom, i.e. struct MyIndex(ModIndex<...>), and handle metadata and provenance information as appropriate for MyIndex.
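
    A minimal sketch of that newtype idiom, under stated assumptions: the `ModIndex` below is a stand-in with made-up fields and methods, not mazu's actual type, and `MyIndex`, `TOOL`, `FORMAT_VERSION`, and `info_line` are hypothetical names that a downstream CLI might choose.

    ```rust
    use std::ops::Deref;

    // Stand-in for mazu's generic index; the real `ModIndex` differs.
    struct ModIndex<H, T> {
        k2u: H,   // k-mer-to-unitig component
        u2pos: T, // unitig-to-occurrence component
    }

    impl ModIndex<Vec<u64>, Vec<u32>> {
        fn num_kmers(&self) -> usize {
            self.k2u.len()
        }
        fn num_unitigs(&self) -> usize {
            self.u2pos.len()
        }
    }

    // Newtype owned by a hypothetical CLI: metadata handling lives here,
    // not in the library type it wraps.
    struct MyIndex(ModIndex<Vec<u64>, Vec<u32>>);

    impl MyIndex {
        const TOOL: &'static str = "my-cli";
        const FORMAT_VERSION: u32 = 1;

        // CLI-specific provenance line built from the wrapped index.
        fn info_line(&self) -> String {
            format!(
                "{} v{}: {} k-mers, {} unitigs",
                Self::TOOL,
                Self::FORMAT_VERSION,
                self.num_kmers(),
                self.num_unitigs()
            )
        }
    }

    // Deref lets callers keep using the wrapped index's methods directly.
    impl Deref for MyIndex {
        type Target = ModIndex<Vec<u64>, Vec<u32>>;
        fn deref(&self) -> &Self::Target {
            &self.0
        }
    }

    fn main() {
        let idx = MyIndex(ModIndex { k2u: vec![1, 2, 3], u2pos: vec![10, 20] });
        println!("{}", idx.info_line());
    }
    ```

    The wrapper keeps the library type free of provenance fields while each CLI decides what to record and how to serialize it.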

    opened by theJasonFan 1
  • Implement Fulgor

    Use similar abstraction as ModIndex and write a struct like:

    struct ModCCIndex<H: K2U, T> {
        k2u: H,
        colors: T,
        u2cc: BitVector,
    }
    

    or

    struct ModCCIndex<H: K2U, T> {
        k2u: H,
        u2color: T, // containing bit-vector
    }
    

    And make it easy to implement / try different color compression schemes.
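
    A rough sketch of how the first layout could make color schemes swappable. Everything here is an assumption for illustration: the `K2U` trait signature, the `ColorSets` trait, the toy `MapK2U`, and the two schemes are hypothetical, and a plain `Vec<usize>` stands in for the `BitVector` (plus rank) that would map unitigs to color classes.

    ```rust
    use std::collections::HashMap;

    // Hypothetical trait: maps a packed k-mer to a unitig ID.
    trait K2U {
        fn unitig_of(&self, kmer: u64) -> Option<usize>;
    }

    // Hypothetical trait: a store of color classes (sorted sets of reference IDs).
    trait ColorSets {
        fn colors(&self, class_id: usize) -> Vec<u32>;
    }

    struct ModCCIndex<H: K2U, T: ColorSets> {
        k2u: H,
        colors: T,
        u2cc: Vec<usize>, // unitig ID -> color class ID (stand-in for BitVector + rank)
    }

    impl<H: K2U, T: ColorSets> ModCCIndex<H, T> {
        // Generic query: works unchanged for any K2U / ColorSets pair.
        fn query(&self, kmer: u64) -> Option<Vec<u32>> {
            let unitig = self.k2u.unitig_of(kmer)?;
            let class_id = *self.u2cc.get(unitig)?;
            Some(self.colors.colors(class_id))
        }
    }

    // Toy k-mer dictionary backed by a HashMap.
    struct MapK2U(HashMap<u64, usize>);
    impl K2U for MapK2U {
        fn unitig_of(&self, kmer: u64) -> Option<usize> {
            self.0.get(&kmer).copied()
        }
    }

    // Scheme 1: color sets stored verbatim.
    struct PlainColors(Vec<Vec<u32>>);
    impl ColorSets for PlainColors {
        fn colors(&self, class_id: usize) -> Vec<u32> {
            self.0[class_id].clone()
        }
    }

    // Scheme 2: gap-encoded color sets (inputs must be sorted ascending).
    struct DeltaColors(Vec<Vec<u32>>);
    impl DeltaColors {
        fn encode(sets: &[Vec<u32>]) -> Self {
            let gaps = sets
                .iter()
                .map(|set| {
                    let mut prev = 0u32;
                    set.iter()
                        .map(|&x| {
                            let gap = x - prev;
                            prev = x;
                            gap
                        })
                        .collect::<Vec<u32>>()
                })
                .collect();
            DeltaColors(gaps)
        }
    }
    impl ColorSets for DeltaColors {
        fn colors(&self, class_id: usize) -> Vec<u32> {
            // Decode by taking prefix sums over the stored gaps.
            let mut acc = 0u32;
            self.0[class_id].iter().map(|&g| { acc += g; acc }).collect()
        }
    }

    // Builds a three-unitig toy index around any color scheme.
    fn toy_index<T: ColorSets>(colors: T) -> ModCCIndex<MapK2U, T> {
        let k2u = MapK2U(HashMap::from([(0b1011u64, 0usize), (0b0110u64, 2usize)]));
        ModCCIndex { k2u, colors, u2cc: vec![0, 0, 1] }
    }

    fn main() {
        let sets = vec![vec![0, 3, 7], vec![2, 5]];
        let plain = toy_index(PlainColors(sets.clone()));
        let delta = toy_index(DeltaColors::encode(&sets));
        println!("{:?} {:?}", plain.query(0b1011), delta.query(0b0110));
    }
    ```

    Because `query` is written once against the traits, trying a new compression scheme only requires a new `ColorSets` implementation.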

    opened by theJasonFan 0
  • Close gap to SSHash implementation

    The current implementation uses a pufferfish-style hash for k-mers whose minimizers occur more than some threshold number of times. This differs from the actual SSHash skew index, which is more optimized.

    • [ ] Point hits in the skew index to super-k-mer (minimizer) IDs instead of positions on the global unitig sequence. This would use fewer bits per k-mer in the skew index. The current implementation points k-mers directly to a position on useq.
    • [ ] Use a level-by-level bucketing scheme. The current implementation uses a single bucket for all skew k-mers.
    • [ ] #3
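
    The level-by-level idea in the checklist above can be sketched as follows. This is an illustration only, loosely modeled on a size-based partitioning of skew buckets; the function names and the thresholds (`min_log2 = 6`, `max_log2 = 12`) are assumptions, not mazu's or SSHash's actual code.

    ```rust
    /// Level of a bucket holding `size` super-k-mers, or None if the bucket is
    /// small enough to stay out of the skew index. Level `i` covers sizes in
    /// (2^(min_log2 + i), 2^(min_log2 + i + 1)]; the last level also absorbs
    /// every bucket larger than 2^max_log2. Assumes min_log2 < max_log2.
    fn skew_level(size: usize, min_log2: u32, max_log2: u32) -> Option<u32> {
        if size <= (1usize << min_log2) {
            return None;
        }
        // Smallest e such that size <= 2^e, i.e. ceil(log2(size)).
        let e = usize::BITS - (size - 1).leading_zeros();
        Some(e.min(max_log2) - min_log2 - 1)
    }

    /// Bits needed for a super-k-mer ID local to its bucket, valid for every
    /// level except the last one (whose bucket sizes are unbounded).
    fn local_id_bits(level: u32, min_log2: u32) -> u32 {
        min_log2 + level + 1
    }

    /// Groups bucket IDs by level; buckets below the threshold are skipped.
    fn group_by_level(sizes: &[usize], min_log2: u32, max_log2: u32) -> Vec<Vec<usize>> {
        let mut levels = vec![Vec::new(); (max_log2 - min_log2) as usize];
        for (bucket_id, &size) in sizes.iter().enumerate() {
            if let Some(level) = skew_level(size, min_log2, max_log2) {
                levels[level as usize].push(bucket_id);
            }
        }
        levels
    }

    fn main() {
        let sizes = [10, 70, 300, 128, 9000];
        for (level, buckets) in group_by_level(&sizes, 6, 12).iter().enumerate() {
            println!("level {level}: buckets {buckets:?}");
        }
    }
    ```

    With per-level local IDs, a k-mer in a low level needs only `local_id_bits` bits rather than enough bits to address a position on the whole unitig sequence.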
    opened by theJasonFan 0
  • To-dos

    Must

    • [ ] Basic usage / lib documentation (use rustdoc)
    • [ ] Stabilize upstream dependencies, i.e., merge the thejasonfan fork of kme.rs back into combine-lab
    • [ ] Fix or remove num_bits (size on heap) measurements
    • [ ] Note the new canonical minimizer definition

    Nice-to-haves

    • [ ] pufferfish2 sparse unitig table sampling
    • [ ] RefSeqCollection with PolyNs. We can build from FASTA files but we can't 2-bit encode and represent RefSeqs with PolyN stretches right now.
    • [ ] Officially supported modular serialization so constituent index components are easy to measure. But for now, we can let CLIs that use mazu handle provenance and serialization formats.

    New algorithms / features

    • [ ] Standardized index info as JSON/YAML for supported "default / official" indexes.
    • [ ] Generic implementation (or more boilerplate) that lets K2U implementations cache results and skip rank/select queries (which retrieve unitig length and start position) for consecutive k-mers that map to the same unitig.
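
    A small sketch of the caching idea in the last item, assuming a sorted vector of unitig end offsets as a stand-in for a rank/select structure. `UnitigTable`, `UnitigSpan`, and `CachedLocator` are hypothetical names, not part of mazu.

    ```rust
    /// Stand-in for a rank/select-backed unitig boundary structure.
    struct UnitigTable {
        ends: Vec<usize>, // exclusive end offset of each unitig on the concatenated sequence
    }

    #[derive(Clone, Copy, Debug, PartialEq)]
    struct UnitigSpan {
        id: usize,
        start: usize,
        len: usize,
    }

    impl UnitigTable {
        /// Slow path: binary search plays the role of the rank/select queries.
        fn locate(&self, pos: usize) -> Option<UnitigSpan> {
            let id = self.ends.partition_point(|&end| end <= pos);
            let end = *self.ends.get(id)?;
            let start = if id == 0 { 0 } else { self.ends[id - 1] };
            Some(UnitigSpan { id, start, len: end - start })
        }
    }

    /// Remembers the last unitig so consecutive hits on it skip the slow path.
    struct CachedLocator<'a> {
        table: &'a UnitigTable,
        last: Option<UnitigSpan>,
        slow_lookups: usize,
    }

    impl<'a> CachedLocator<'a> {
        fn new(table: &'a UnitigTable) -> Self {
            CachedLocator { table, last: None, slow_lookups: 0 }
        }

        fn locate(&mut self, pos: usize) -> Option<UnitigSpan> {
            if let Some(span) = self.last {
                if pos >= span.start && pos < span.start + span.len {
                    return Some(span); // cache hit: no rank/select needed
                }
            }
            self.slow_lookups += 1;
            self.last = self.table.locate(pos);
            self.last
        }
    }

    fn main() {
        let table = UnitigTable { ends: vec![5, 12, 20] };
        let mut locator = CachedLocator::new(&table);
        for pos in [0, 1, 2, 6, 7, 15] {
            println!("{pos} -> {:?}", locator.locate(pos));
        }
        println!("slow lookups: {}", locator.slow_lookups);
    }
    ```

    The sketch treats each query as a single offset and ignores k-mers that would span a unitig boundary, which a real implementation would have to handle.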

    Extras:

    • [ ] Pufferfish (C++) to mazu conversion for Sparse and Dense indexes.
    • [ ] #2
    • [ ] Basic benchmarks for some atomic data structures EF, rank/sel, WM etc.
    • [ ] #5

    Benchmarking / Questions:

    • [ ] Compare mazu PolyN handling vs. pufferfish's pseudorandom insertion for Ns
    • [ ] Note and benchmark differences to SSHash (canonical minimizer parsing + a more naive skew index)
    opened by theJasonFan 0
Owner
COMBINE lab
COMputational BIology and Network Evolution lab