Mora
Mora is a read re-assigner that re-assigns each query read to a unique reference.
Main steps of Mora:
- Calculate the expected abundance levels of the references based on the input SAM file.
- Assign each query that had at least one valid mapping to a reference based on their mapping scores and the expected abundance levels.
- Output the results into a txt file.
For more details, please consult the (preprint) paper.
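The three steps above can be illustrated with a toy sketch. This is not Mora's actual algorithm (the paper describes that); it is a deliberately simplified stand-in that assumes positive mapping scores, shares each query's weight across its mapped references to estimate abundance, and then assigns each query to its best-scoring reference, breaking ties by estimated abundance. The function names and the toy input are hypothetical.

```python
from collections import defaultdict

def estimate_abundance(mappings):
    """Step 1 (simplified): share each query's weight across its mapped references."""
    abundance = defaultdict(float)
    for hits in mappings.values():
        total = sum(score for _, score in hits)  # assumes positive scores
        for ref, score in hits:
            abundance[ref] += score / total
    n = sum(abundance.values())
    return {ref: a / n for ref, a in abundance.items()}

def assign_reads(mappings, abundance):
    """Step 2 (simplified): best score wins; ties go to the more abundant reference."""
    return {
        query: max(hits, key=lambda h: (h[1], abundance[h[0]]))[0]
        for query, hits in mappings.items()
    }

# Hypothetical toy input: query -> list of (reference, mapping score)
mappings = {
    "read1": [("refA", 100), ("refB", 100)],
    "read2": [("refA", 90)],
    "read3": [("refB", 80), ("refA", 60)],
}
abundance = estimate_abundance(mappings)
assignment = assign_reads(mappings, abundance)

# Step 3 (simplified): one "query<TAB>reference" line per assigned read
for query, ref in sorted(assignment.items()):
    print(query, ref, sep="\t")
```

Here `read1` maps equally well to both references, so the tie is resolved in favor of `refA`, which has the higher estimated abundance in this toy input.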
Requirements
Rust and Cargo need to be installed and added to PATH.
Installation
To be able to use the full pipeline, run the following commands. If you only want to run Mora as a Rust program with a SAM/BAM file as input, you do not need to run the bash install.sh command.
git clone https://github.com/AfZheng126/MORA.git
cd MORA
bash install.sh
cargo build --release
Running Mora
After everything in the config files (see below) is updated according to your directories, run
snakemake --snakefile MORA --cores 24 --resources mem_mb=140000
Running Mora as a Rust Program
If you already have a SAM file with mapping scores stored in the AS:i: optional field, you can run the Rust program directly and skip the indexing and mapping steps. To do this and get outputs without taxonomic information, run the following command in the Mora directory.
cargo run --release -- -s samfile -o output
If you are running from another directory and want to call the binary directly, run
target/release/mora -s samfile -o output
A sample SAM file is provided in the samples directory. To use it, run the following command.
target/release/mora -s samples/test.sam -o test.txt
For more options and customization, run
target/release/mora -h
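Because the Rust program relies on the AS:i: optional field, it can help to check a SAM file for that field before running Mora. The sketch below is a minimal, standalone check and is not part of Mora; the helper names and the sample records are hypothetical. It relies only on the SAM convention that the first 11 tab-separated columns are mandatory and optional TAG:TYPE:VALUE fields follow.

```python
def alignment_score(sam_line):
    """Return the integer AS:i score of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Columns 1-11 are mandatory; optional TAG:TYPE:VALUE fields come after them.
    for field in fields[11:]:
        if field.startswith("AS:i:"):
            return int(field[len("AS:i:"):])
    return None

def count_missing_scores(lines):
    """Count non-header, non-empty lines that carry no AS:i field."""
    return sum(
        1
        for line in lines
        if line.strip() and not line.startswith("@") and alignment_score(line) is None
    )

# Hypothetical records for illustration only
sample_lines = [
    "@HD\tVN:1.6",
    "r1\t0\trefA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:-3\tNM:i:1",
    "r2\t0\trefB\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]
missing = count_missing_scores(sample_lines)
print(f"{missing} alignment line(s) without AS:i")
```

To check a real file, pass an open file handle instead of the list, for example `count_missing_scores(open("samples/test.sam"))`.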
Config File
The parameters of the config.yaml file used for the snakemake pipeline are listed below:
Parameter | Description |
---|---|
BINARIES | Binary folder directory (default: binaries) - do not edit |
REFERENCES | Directory to reference fasta file |
SAMPLES_DIR | Directory to folder containing query fasta files |
RESULTS | Directory to write the results |
FILES_EXT | Query file extension, e.g. .fq, .fq.gz etc. |
MAPPING_MODE | Algorithm for the initial mapping - (pufferfish, bowtie2, minimap2) |
STRATEGY | "PE" for paired-end samples or "SE" for single-end samples |
TYPE | RNA or DNA host-specific samples - right now only supports DNA |
MIN_CNT | Minimum number of counts for a reference to be considered valid |
MIN_SCORE_DIFFERENCE | Minimum score difference for a query to be assigned second |
MAX_ABUNDANCE_DIFFERENCE | Maximum difference allowed between the initial abundance estimation and the abundances created from assignments |
SEGMENT_SIZE | Size to split references into bins |
ABUNDANCE_OUTPUT | Whether to output estimated abundance levels |
TAXONOMY | Directory of taxonomic information to write results with taxonomic classes (NA to not include taxonomic information in the results) |
MEM_MB | Amount of memory to be allocated to snakemake |
TPS | Number of threads to be used per sample |
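
Some of the parameters above accept only a fixed set of values, so a quick sanity check before launching the pipeline can catch typos early. The sketch below is an assumption-laden helper, not part of Mora: it takes the configuration as a plain Python dictionary (loading config.yaml itself would need a YAML parser such as PyYAML) and checks only the three parameters whose allowed values are stated in the table. The `check_config` name and the example dictionary are hypothetical.

```python
# Allowed values taken from the parameter table; everything else is unchecked.
ALLOWED = {
    "MAPPING_MODE": {"pufferfish", "bowtie2", "minimap2"},
    "STRATEGY": {"PE", "SE"},
    "TYPE": {"DNA"},  # the table states that only DNA is supported right now
}

def check_config(config):
    """Return a list of human-readable problems found in the config dictionary."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = config.get(key)
        if value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems

# Hypothetical example with one unsupported value
example = {"MAPPING_MODE": "bowtie2", "STRATEGY": "PE", "TYPE": "RNA"}
problems = check_config(example)
for problem in problems:
    print(problem)
```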
Query Files
The program requires a list of query files. These can be .fasta, .fq, or even compressed files. If the query files are paired-end queries, their names must be of the form *_1.fq and *_2.fq, where the file extension can be something else. The directory of these query files must be written into the config file.
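The paired-end naming rule can be checked before running the pipeline. The sketch below is a hypothetical helper, not part of Mora: it groups file names by the `_1`/`_2` suffix in front of the extension and reports any file that has no mate. It assumes both mates share the same extension.

```python
import re
from collections import defaultdict

# <sample>_<1 or 2><extension>, e.g. sampleA_1.fq or sampleA_2.fq.gz
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])(?P<ext>\..+)$")

def pair_query_files(filenames):
    """Return ({sample: (mate1, mate2)}, sorted list of files without a mate)."""
    mates = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            mates[(match["sample"], match["ext"])][match["mate"]] = name
        else:
            unmatched.append(name)
    pairs = {}
    for (sample, _ext), found in mates.items():
        if "1" in found and "2" in found:
            pairs[sample] = (found["1"], found["2"])
        else:
            unmatched.extend(found.values())
    return pairs, sorted(unmatched)

# Hypothetical file names for illustration only
files = ["sampleA_1.fq", "sampleA_2.fq", "sampleB_1.fq.gz", "notes.txt"]
pairs, unmatched = pair_query_files(files)
print(pairs)
print(unmatched)
```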
Reference File
If a reference file is provided, its directory must also be written into the config file. If there is no reference file, you can download the fasta file of the complete representative and reference bacterial genomes from the NCBI RefSeq database by following the Microbial reference preparation instructions in the Agamemnon Wiki. The index is built when you run the program, so you don't have to build it manually.
Taxonomic Information
Normally, the output of the program is two columns telling you which reference each query came from. If you also want taxonomic information about the assigned reference, extra files must be generated. To do this, navigate to the scripts directory and run
bash taxonomy.sh reference.fa
where reference.fa is your reference file. After this is done, update TAXONOMY in the config file to Taxonomy.
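The default two-column output can be summarized with a few lines of Python. The sketch below is a hypothetical post-processing helper, not part of Mora, and it rests on two assumptions that the text above does not confirm: that the columns are separated by whitespace, and that the query name comes first and the assigned reference second.

```python
from collections import Counter

def reads_per_reference(lines):
    """Count how many queries were assigned to each reference."""
    counts = Counter()
    for line in lines:
        parts = line.split()  # assumption: whitespace-separated columns
        if len(parts) >= 2:
            counts[parts[1]] += 1  # assumption: second column is the reference
    return counts

# Hypothetical output lines for illustration only
output_lines = ["read1\trefA", "read2\trefA", "read3\trefB", ""]
counts = reads_per_reference(output_lines)
for ref, n in counts.most_common():
    print(ref, n, sep="\t")
```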
Use Case
Sample data and the results from the Mora paper can be found here. To run the data, update the configuration file with the location where you downloaded the data, then run the snakemake command.