MetagenOmic read Re-Assigner and abundance quantifier

Andrew Zheng

Last update: Nov 18, 2022

Overview

Mora

Mora is a read re-assigner that re-assigns each query read to a unique reference.

Main steps of Mora:

1. Calculate the expected abundance levels of the references based on the input SAM file.
2. Assign each query that has at least one valid mapping to a single reference, based on its mapping scores and the expected abundance levels.
3. Write the results to a txt file.
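The steps above can be illustrated with a deliberately simplified sketch. It is not Mora's actual algorithm (the paper describes that); it assumes each query spreads one unit of weight evenly over the references it maps to, and it breaks score ties using the estimated abundance. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def estimate_abundance(mappings):
    """Estimate a relative abundance for every reference.

    `mappings` maps a query name to a dict {reference: mapping score}.
    Assumption of this sketch: each query contributes one unit of weight,
    split evenly across all references it maps to.
    """
    weight = defaultdict(float)
    for refs in mappings.values():
        for ref in refs:
            weight[ref] += 1.0 / len(refs)
    total = sum(weight.values())
    return {ref: w / total for ref, w in weight.items()}


def assign_reads(mappings, abundance):
    """Assign each query with at least one mapping to exactly one reference.

    The best mapping score wins; ties go to the reference with the higher
    estimated abundance. Queries without mappings are skipped.
    """
    assignment = {}
    for query, refs in mappings.items():
        if not refs:
            continue
        assignment[query] = max(refs, key=lambda r: (refs[r], abundance[r]))
    return assignment


# Toy input: scores are made up for illustration.
mappings = {
    "read1": {"refA": 100},
    "read2": {"refA": 100},
    "read3": {"refA": 90, "refB": 90},  # tie on score
    "read4": {"refB": 80, "refA": 60},
    "read5": {},                        # no valid mapping
}
abundance = estimate_abundance(mappings)
assignment = assign_reads(mappings, abundance)
for query, ref in sorted(assignment.items()):
    print(query, ref, sep="\t")
```

In this toy input, read3 maps equally well to both references, so the abundance estimate decides where it goes, which is the situation a re-assigner has to resolve.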

For more details, please consult the (preprint) paper.

Requirements

Rust and Cargo need to be installed and added to PATH.

Installation

To be able to use the full pipeline, run the following commands. If you only want to run Mora as a Rust program with a SAM/BAM file as input, you do not need to run the bash install.sh command.

git clone https://github.com/AfZheng126/MORA.git
cd MORA
bash install.sh
cargo build --release

Running Mora

After everything in the config files (see below) is updated according to your directories, run

snakemake --snakefile MORA --cores 24 --resources mem_mb=140000

Running Mora as a Rust Program

If you already have a SAM file that has mapping scores stored in the AS:i: optional field, you can run the Rust program directly and skip the indexing and mapping steps. To do this and get outputs without taxonomic information, run the following command in the Mora directory.

cargo run --release -- -s samfile -o output

If you are running from another directory and want to call the compiled binary directly, run

target/release/mora -s samfile -o output

A sample SAM file is provided in the samples directory. To try it, run the following command.

target/release/mora -s samples/test.sam -o test.txt

For more options and customization, run

target/release/mora -h
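Because this mode relies on the AS:i optional field mentioned at the start of this section, it can help to check a SAM file before passing it to Mora. The following is a minimal sketch of such a check, not part of Mora itself; the helper names and the example record are hypothetical. It relies only on the SAM convention that the first 11 tab-separated columns are mandatory and optional TAG:TYPE:VALUE fields follow.

```python
def alignment_score(sam_line):
    """Return the integer AS:i value of one SAM alignment line, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields come after.
    for field in fields[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    return None


def count_missing_scores(lines):
    """Count alignment lines (not '@' header lines) without an AS:i field."""
    return sum(
        1
        for line in lines
        if line.strip()
        and not line.startswith("@")
        and alignment_score(line) is None
    )


# Made-up records for illustration.
with_score = "read1\t0\trefA\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:-3\tNM:i:1"
without_score = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
print(alignment_score(with_score))
print(count_missing_scores(["@HD\tVN:1.6", with_score, without_score]))
```

To check a real file, pass an open file handle, for example `count_missing_scores(open("samples/test.sam"))`.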

Config File

The parameters of the config.yaml file used for the snakemake pipeline are listed below:

Parameter	Description
BINARIES	Binary folder directory (default: binaries) - do not edit
REFERENCES	Directory to reference fasta file
SAMPLES_DIR	Directory to folder containing query fasta files
RESULTS	Directory to write the results
FILES_EXT	Query file extension, e.g. .fq or .fq.gz
MAPPING_MODE	Algorithm for the initial mapping - (pufferfish, bowtie2, minimap2)
STRATEGY	"PE" for paired-end samples or "SE" for single-end samples
TYPE	RNA or DNA host-specific samples - right now only supports DNA
MIN_CNT	Minimum number of counts for a reference to be considered valid
MIN_SCORE_DIFFERENCE	Minimum score difference for a query to be assigned second
MAX_ABUNDANCE_DIFFERENCE	Maximum difference allowed between the initial abundance estimation and the abundances created from assignments
SEGMENT_SIZE	Size to split references into bins
ABUNDANCE_OUTPUT	Whether to output estimated abundance levels
TAXONOMY	Directory of taxonomic information to write results with taxonomic classes (NA to not include taxonomic information in the results)
MEM_MB	Amount of memory to be allocated to snakemake
TPS	Number of threads to be used per sample
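A small pre-flight check can catch typos in the enumerated parameters before a long pipeline run. The sketch below is not part of Mora; it assumes the accepted spellings are exactly those shown in the table, and all paths and values in the example mapping are placeholders.

```python
# Accepted values, as listed in the parameter table (spelling is assumed).
ALLOWED = {
    "MAPPING_MODE": {"pufferfish", "bowtie2", "minimap2"},
    "STRATEGY": {"PE", "SE"},
    "TYPE": {"DNA"},  # the table notes that only DNA is supported right now
}


def check_config(config):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = config.get(key)
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems


# Hypothetical settings for illustration only.
config = {
    "REFERENCES": "data/reference.fa",
    "SAMPLES_DIR": "data/samples",
    "RESULTS": "results",
    "FILES_EXT": ".fq",
    "MAPPING_MODE": "bowtie2",
    "STRATEGY": "PE",
    "TYPE": "DNA",
}
for problem in check_config(config):
    print(problem)
```

The mapping stands in for the parsed contents of config.yaml; loading the YAML file itself is left out to keep the sketch free of third-party dependencies.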

Query Files

The program requires a list of query files. These can be .fasta or .fq files, or even compressed files. If the queries are paired-end, the file names must be of the form *_1.fq and *_2.fq, where the file extension can be something else. The directory containing these query files must be written into the config file.
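The paired-end naming rule can be checked ahead of time. The following sketch groups file names by the `_1` / `_2` suffix and reports anything left without a mate. It is an illustration under the naming convention stated above, not code from Mora, and the function name and file names are hypothetical.

```python
import re
from pathlib import Path

# <sample>_1<ext> or <sample>_2<ext>, where <ext> may be .fq, .fq.gz, etc.
PAIR_RE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])(?P<ext>\..+)$")


def pair_query_files(names):
    """Group paired-end files by sample.

    Returns (pairs, unpaired): `pairs` maps a sample name to its
    (mate 1, mate 2) file names; `unpaired` lists every other file.
    """
    mates = {}
    unpaired = []
    for name in sorted(names):
        match = PAIR_RE.match(Path(name).name)
        if match is None:
            unpaired.append(name)
            continue
        key = (match["sample"], match["ext"])
        mates.setdefault(key, {})[match["mate"]] = name
    pairs = {}
    for (sample, _ext), found in mates.items():
        if "1" in found and "2" in found:
            pairs[sample] = (found["1"], found["2"])
        else:
            unpaired.extend(found.values())
    return pairs, sorted(unpaired)


# Made-up file names for illustration.
pairs, unpaired = pair_query_files(
    ["sampleA_1.fq", "sampleA_2.fq", "sampleB_1.fq.gz", "notes.txt"]
)
print(pairs)
print(unpaired)
```

Here sampleB_1.fq.gz is reported as unpaired because its `_2` mate is missing.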

Reference File

If a reference file is provided, its directory must also be written into the config file. If you do not have a reference file, you can download the fasta file of the complete representative and reference bacterial genomes from the NCBI RefSeq database by following the Microbial reference preparation instructions in the Agamemnon Wiki. The index is built automatically when you run the program, so you do not have to build it manually.

Taxonomic Information

By default, the output of the program is two columns telling you which reference each query came from. If you also want taxonomic information about the assigned reference, extra files must be created. To do this, navigate to the scripts directory and run

bash taxonomy.sh reference.fa

where reference.fa is your reference file. After this is done, update TAXONOMY in the config file to Taxonomy.
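The default two-column output described at the start of this section can be summarized with a few lines of code. This sketch assumes each line holds the query name first and the assigned reference second, separated by whitespace; the column order and delimiter are assumptions, so check them against your own output file. The function name and sample lines are hypothetical.

```python
from collections import Counter


def tally_assignments(lines):
    """Count how many queries were assigned to each reference.

    Assumed line layout: <query> <reference>, whitespace-separated.
    Lines with fewer than two columns are ignored.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        counts[parts[1]] += 1
    return counts


# Made-up output lines for illustration.
counts = tally_assignments(["read1\trefA", "read2\trefA", "read3\trefB", ""])
for reference, n in counts.most_common():
    print(reference, n, sep="\t")
```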

Use Case

Sample data and the results from the Mora paper can be found here. To run Mora on this data, update the configuration file with the location of the downloaded data and run the snakemake command.
