Bioinformatics tool for counting guides in CRISPR-screen studies.

guide-counter

A better, faster way to count guides in CRISPR screens.

Overview

guide-counter is a tool for processing FASTQ files from CRISPR screen experiments to generate a matrix of per-sample guide counts. It can be used as a faster, more accurate, drop-in replacement for mageck count. By default, guide-counter looks for guide sequences in the reads with 0 or 1 mismatches vs. the expected guides, but it can also be run in exact-matching mode.
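
The default matching mode (0 or 1 mismatches, no indels) can be illustrated with a precomputed lookup table. The following is a minimal Python sketch, not guide-counter's actual Rust implementation; the function name `build_lookup`, the input shape, and the rule that ambiguous variants are discarded are all assumptions made for illustration.

```python
from typing import Dict, Optional

BASES = "ACGT"


def build_lookup(guides: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each guide sequence and each of its single-base variants to a guide ID.

    `guides` maps guide ID -> guide sequence (hypothetical input shape).
    A variant that is one mismatch away from two different guides is stored
    as None, so that lookups on it do not yield a guide (assumed policy).
    """
    exact = {seq: guide_id for guide_id, seq in guides.items()}
    lookup: Dict[str, Optional[str]] = dict(exact)
    for guide_id, seq in guides.items():
        for i, original in enumerate(seq):
            for base in BASES:
                if base == original:
                    continue
                variant = seq[:i] + base + seq[i + 1:]
                if variant in exact:
                    # An exact match to a real guide always wins.
                    continue
                if variant in lookup and lookup[variant] != guide_id:
                    lookup[variant] = None  # ambiguous between guides
                else:
                    lookup[variant] = guide_id
    return lookup


lookup = build_lookup({"g1": "ACGT", "g2": "TTTT"})
```

With a table like this, matching a read window becomes a single dictionary lookup, and exact-matching mode corresponds to keeping only the unmodified guide sequences as keys.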

Why guide-counter?

If you have any experience analyzing CRISPR screens you've almost certainly tried mageck. It's widely used, highly cited and generally works well. Surprisingly though, mageck count is rather slow and fails to count a non-trivial fraction of the data.

As an example, we ran data from the Sanson et al. (2018) paper through both tools. The dataset consists of:

| Sample | Reads | Gzipped FASTQ Size |
|---|---|---|
| Plasmid | 9,821,128 | 377M |
| RepA | 76,471,324 | 2.3G |
| RepB | 85,301,059 | 2.5G |
| RepC | 75,356,900 | 2.2G |

The following plot shows the amount of data recovered per sample by each of three different analyses:

[Figure: Read counts from analyzing Sanson et al. data]

And the following plot shows the runtime for each of the three analyses performed using a single CPU core/thread on an Intel Core i9 powered MacBook Pro laptop:

[Figure: Runtimes from analyzing Sanson et al. data]

Installation

Installation can be done using conda:

conda install -c bioconda guide-counter

or with cargo if installed:

cargo install guide-counter

Example Workflow

The following shows an example of running guide-counter followed by mageck test on data from the Sanson et al. 2018 paper:

guide-counter count \
  --input plasmid.fq.gz RepA.fq.gz RepB.fq.gz RepC.fq.gz \
  --control-pattern control \
  --essential-genes metadata/training_essentials.txt \
  --nonessential-genes metadata/training_nonessential.txt \
  --library metadata/broadgpp-brunello-library-corrected.txt.gz  \
  --output sanson
  
mageck test \
  --count-table sanson.counts.txt \
  --control-id plasmid \
  --treatment-id RepA,RepB,RepC \
  --norm-method median \
  --output-prefix sanson.test
  

Inputs

The full usage for guide-counter count is reproduced below; this section describes a few of the key inputs in more detail:

| Input Option | Required | Description |
|---|---|---|
| --input | Yes | FASTQ files one per sample. Files may be gzipped or uncompressed. |
| --samples | No | Names for the samples, matched positionally to the FASTQs. If not provided then the input file names minus any `.[fq |
| --essential-genes | No | An optional file of known essential genes. May be gzipped or uncompressed. May be either just gene names, one per line, or tab-delimited with the gene in the first column. If given, guides will be labeled as essential for matching genes, and mean coverage of guides for essential genes computed. |
| --nonessential-genes | No | An optional file of known nonessential genes. May be gzipped or uncompressed. May be either just gene names, one per line, or tab-delimited with the gene in the first column. If given, guides will be labeled as nonessential for matching genes, and mean coverage of guides for nonessential genes computed. |
| --control-guides | No | An optional file of guide IDs for control guides. May be gzipped or uncompressed. May be either just guide IDs, one per line, or tab-delimited data with the guide ID in the first column. If given, matching guides will be labeled as controls, and mean coverage of control guides computed. May be used alone or in conjunction with --control-pattern. |
| --control-pattern | No | An optional regular expression which is applied (case insensitive) to both guide IDs and gene names, and when a match is found, guides are labeled as controls. For example --control-pattern control works well for many human libraries. |
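
The labeling rules described for these inputs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `classify_guide` is hypothetical, the pattern is applied with a substring search, and controls are given precedence over the gene lists; the source does not say how guide-counter resolves a guide that matches more than one category.

```python
import re
from typing import Optional, Set


def classify_guide(
    guide_id: str,
    gene: str,
    essential: Set[str],
    nonessential: Set[str],
    control_ids: Set[str],
    control_pattern: Optional[str] = None,
) -> str:
    """Return one of Essential, Nonessential, Control, Other for a guide."""
    # Control guide IDs are compared exactly (case sensitive).
    if guide_id in control_ids:
        return "Control"
    # The control pattern is applied, case insensitive, to both the
    # guide ID and the gene name.
    if control_pattern is not None:
        pattern = re.compile(control_pattern, re.IGNORECASE)
        if pattern.search(guide_id) or pattern.search(gene):
            return "Control"
    # Gene names are compared exactly (case sensitive).
    if gene in essential:
        return "Essential"
    if gene in nonessential:
        return "Nonessential"
    return "Other"
```

For example, calling it with a gene named `Non-Targeting Control` and `control_pattern="control"` labels the guide as a control even though the case differs.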

Outputs

Three output files are generated:

  1. {output}.counts.txt - a standard count matrix with columns for the guide ID and gene, then one column per sample with raw/unnormalized guide counts.
  2. {output}.extended-counts.txt - an extended version of the counts matrix which includes a guide_type column which will have one of [Essential, Nonessential, Control, Other] per guide as determined based on the gene lists and control information provided.
  3. {output}.stats.txt - a file of computed statistics, one row per input sample/FASTQ.
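
For downstream scripting, the counts matrix can be read back by column position. The sketch below assumes the file is tab-delimited with a single header line; the helper name `read_counts` and the header names in the example data (`guide`, `gene`) are hypothetical and are not taken from guide-counter's output.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_counts(handle: TextIO) -> Tuple[List[str], Dict[str, Tuple[str, List[int]]]]:
    """Parse a counts matrix: guide ID, gene, then one count column per sample.

    Returns the sample names and a dict of guide ID -> (gene, counts).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # everything after guide ID and gene
    counts: Dict[str, Tuple[str, List[int]]] = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        counts[row[0]] = (row[1], [int(value) for value in row[2:]])
    return samples, counts


# Tiny in-memory example with made-up numbers.
example = io.StringIO("guide\tgene\tplasmid\tRepA\nsg_1\tA1BG\t10\t3\nsg_5\tA2M\t7\t0\n")
samples, counts = read_counts(example)
```

Reading by position rather than by header name keeps the sketch independent of the exact column labels written by the tool.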

The columns in the stats file are:

| Column | Description |
|---|---|
| file | The path to the input FASTQ file used to generate the stats. |
| label | The label or sample name given to the sample. |
| total_guides | The total number of guides in the guide library (not sample dependent). |
| total_reads | The total number of reads in the input FASTQ file. |
| mapped_reads | The number of reads that could be mapped to a guide. |
| frac_mapped | The fraction of reads (0-1) that could be mapped to a guide. |
| mean_reads_per_guide | The mean number of reads mapped to each guide in the library. |
| mean_reads_essential | The mean number of reads mapped to guides for essential genes. |
| mean_reads_nonessential | The mean number of reads mapped to guides for nonessential genes. |
| mean_reads_control | The mean number of reads mapped to control guides. |
| mean_reads_other | The mean number of reads mapped to other guides (guides not flagged as essential, nonessential or control). |
| zero_read_guides | |
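
Several of these columns follow directly from the per-guide counts and the total read count. The following Python sketch shows the arithmetic for a subset of them; `sample_stats` is a hypothetical helper, and it assumes each mapped read is counted for exactly one guide, so that `mapped_reads` equals the sum of the per-guide counts.

```python
from typing import Dict


def sample_stats(total_reads: int, guide_counts: Dict[str, int]) -> Dict[str, float]:
    """Compute basic per-sample statistics from per-guide read counts.

    `guide_counts` must contain every guide in the library, including
    guides with a count of zero.
    """
    total_guides = len(guide_counts)
    mapped_reads = sum(guide_counts.values())
    return {
        "total_guides": total_guides,
        "total_reads": total_reads,
        "mapped_reads": mapped_reads,
        # Guard against empty inputs to avoid division by zero.
        "frac_mapped": mapped_reads / total_reads if total_reads else 0.0,
        "mean_reads_per_guide": mapped_reads / total_guides if total_guides else 0.0,
    }


stats = sample_stats(100, {"sg_1": 30, "sg_2": 0, "sg_3": 30})
```

The per-category means (essential, nonessential, control, other) would use the same calculation restricted to the guides carrying that label.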

Usage

Usage for guide-counter count:

guide-counter-count

Counts the guides observed in a CRISPR screen, starting from one or more FASTQs.  FASTQs are one per
sample and currently only single-end FASTQ inputs are supported.

A set of sample IDs may be provided using `--samples id1 id2 ..`.  If provided it must have the same
number of values as input FASTQs.  If not provided the FASTQ names are used minus any fastq/fq/gz
suffixes.

Automatically determines the range of valid offsets within the sequencing reads where the guide
sequences are located, independently for each FASTQ input.  The first `offset-sample-size` reads
from each FASTQ are examined to determine the offsets at which guides are found. When processing the
full FASTQ, checks only those offsets that accounted for at least `offset-min-fraction` of the first
`offset-sample-size` reads.

Matching by default allows for one mismatch (and no indels) between the read sub-sequence and the
expected guide sequences.  Exact matching may be enabled by specifying the `--exact-match` option.

Two output files are generated.  The first is named `{output}.counts.txt` and contains columns for
the guide id, the gene targeted by the guide and one count column per input FASTQ with raw/un-
normalized counts.  The second is named `{output}.stats.txt` and contains basic QC statistics per
input FASTQ on the matching process.

USAGE:
    guide-counter count [OPTIONS] --input <INPUT>... --library <LIBRARY> --output <OUTPUT>

OPTIONS:
    -c, --control-guides <CONTROL_GUIDES>
            Optional path to file with list control guide IDs.  IDs should appear one per line and
            are case sensitive

    -C, --control-pattern <CONTROL_PATTERN>
            Optional regular expression pattern used to ID control guides. Pattern is matched, case
            insensitive, to guide IDs and Gene names

    -e, --essential-genes <ESSENTIAL_GENES>
            Optional path to file with list of essential genes.  Gene names should appear one per
            line and are case sensitive

    -f, --offset-min-fraction <OFFSET_MIN_FRACTION>
            After sampling the first `offset_sample_size` reads, use offsets that

            [default: 0.005]

    -h, --help
            Print help information

    -i, --input <INPUT>...
            Input fastq file(s)

    -l, --library <LIBRARY>
            Path to the guide library metadata.  May be a tab- or comma-separated file.  Must have a
            header line, and the first three fields must be (in order): i) the ID of the guide, ii)
            the base sequence of the guide, iii) the gene the guide targets

    -n, --nonessential-genes <NONESSENTIAL_GENES>
            Optional path to file with list of nonessential genes.  Gene names should appear one per
            line and are case sensitive

    -N, --offset-sample-size <OFFSET_SAMPLE_SIZE>
            The number of reads to be examined when determining the offsets at which guides may be
            found in the input reads

            [default: 100000]

    -o, --output <OUTPUT>
            Path prefix to use for all output files

    -s, --samples <SAMPLES>...
            Sample names corresponding to the input fastqs. If provided must be the same length as
            input.  Otherwise will be inferred from input file names

    -x, --exact-match
            Perform exact matching only, don't allow mismatches between reads and guides
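
The offset detection described in the usage text above can be sketched as follows. This is a simplified Python illustration rather than the tool's implementation: `determine_offsets` is a hypothetical name, guides are held in a plain set of exact sequences of equal length, and every offset with a hit in a read is tallied. The defaults mirror the documented `--offset-sample-size` and `--offset-min-fraction` values.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, List, Set


def determine_offsets(
    reads: Iterable[str],
    guides: Set[str],
    guide_length: int,
    sample_size: int = 100_000,
    min_fraction: float = 0.005,
) -> List[int]:
    """Return the read offsets at which guides are commonly found.

    Only the first `sample_size` reads are examined. An offset is kept if
    it yielded a guide hit in at least `min_fraction` of the sampled reads.
    """
    hits: Counter = Counter()
    sampled = 0
    for read in islice(reads, sample_size):
        sampled += 1
        # For reads shorter than a guide the range is empty, so they
        # contribute no hits but still count toward the sample.
        for offset in range(len(read) - guide_length + 1):
            if read[offset:offset + guide_length] in guides:
                hits[offset] += 1
    if sampled == 0:
        return []
    return sorted(o for o, n in hits.items() if n / sampled >= min_fraction)


reads = ["NNACGT", "NNACGT", "NACGTN", "AC"]
common = determine_offsets(reads, {"ACGT"}, guide_length=4, min_fraction=0.5)
```

Restricting the full pass to these offsets means each read needs only a handful of lookups instead of one per possible position.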
Comments
  • range end index 20 out of range for slice of length 14

    Hi!

    I'm having an issue while trying to run guide-counter count for the first time.

    Here's the command I'm trying to run:

    guide-counter count --input \
    sample1.fastq \
    sample2.fastq \
    sample3.fastq \
    sample4.fastq \
    sample5.fastq \
    sample6.fastq \
    sample7.fastq \
    sample8.fastq \
    sample9.fastq \
    sample10.fastq \
    sample11.fastq \
    --control-pattern control \
    --library /lib/library.txt \
    --output guide_counter
    

    And here's the answer:

    [2022-03-18T10:42:51Z INFO  guide_counter::guide] Loaded library with 77440 guides for 19115 genes; 0=essential, 0=nonessential, 1000=control, 76440=other.
    [2022-03-18T10:42:51Z INFO  guide_counter::commands::count] Building lookup.
    [2022-03-18T10:42:53Z INFO  guide_counter::commands::count] Lookup built with 4719224 entries.
    thread 'main' panicked at 'range end index 20 out of range for slice of length 14', src/commands/count.rs:258:38
    stack backtrace:
       0:     0x55b229a3b1fd - std::backtrace_rs::backtrace::libunwind::trace::h09f7e4e089375279
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
       1:     0x55b229a3b1fd - std::backtrace_rs::backtrace::trace_unsynchronized::h1ec96f1c7087094e
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
       2:     0x55b229a3b1fd - std::sys_common::backtrace::_print_fmt::h317b71fc9a5cf964
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:67:5
       3:     0x55b229a3b1fd - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he3555b48e7dfe7f0
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:46:22
       4:     0x55b2299bf41c - core::fmt::write::h513b07ca38f4fb1b
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/fmt/mod.rs:1149:17
       5:     0x55b229a39cf4 - std::io::Write::write_fmt::haf8c932b52111354
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/io/mod.rs:1697:15
       6:     0x55b229a3a3d0 - std::sys_common::backtrace::_print::h195c38364780a303
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:49:5
       7:     0x55b229a3a3d0 - std::sys_common::backtrace::print::hc09dfdea923b6730
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:36:9
       8:     0x55b229a3a3d0 - std::panicking::default_hook::{{closure}}::hb2e38ec0d91046a3
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:211:50
       9:     0x55b229a397ca - std::panicking::default_hook::h60284635b0ad54a8
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:228:9
      10:     0x55b229a397ca - std::panicking::rust_panic_with_hook::ha677a669fb275654
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:606:17
      11:     0x55b229a5e298 - std::panicking::begin_panic_handler::{{closure}}::h976246fb95d93c31
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:502:13
      12:     0x55b229a5e216 - std::sys_common::backtrace::__rust_end_short_backtrace::h38077ee5b7b9f99a
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/sys_common/backtrace.rs:139:18
      13:     0x55b229a5e1d2 - rust_begin_unwind
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/std/src/panicking.rs:498:5
      14:     0x55b2299315f0 - core::panicking::panic_fmt::h35f3a62252ba0fd2
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/panicking.rs:107:14
      15:     0x55b2299316f1 - core::slice::index::slice_end_index_len_fail::h735e748f7023a8c4
                                   at /rustc/db9d1b20bba1968c1ec1fc49616d4742c1725b4b/library/core/src/slice/index.rs:41:5
      16:     0x55b229955428 - guide_counter::commands::count::Count::determine_prefixes::{{closure}}::h3e1e1356877818da
      17:     0x55b22994ff13 - <guide_counter::commands::count::Count as guide_counter::commands::command::Command>::execute::h32ae3439f4cdfe2c
      18:     0x55b22995ea0c - guide_counter::main::h56f2ffd0731cd338
      19:     0x55b229937f73 - std::sys_common::backtrace::__rust_begin_short_backtrace::h231446ae02769ddb
      20:     0x55b229963e07 - main
      21:     0x7f1f3df6ed0a - __libc_start_main
      22:     0x55b2299360e9 - <unknown> 
    

    The library is Brunello human and it's formatted like:

    sg_1	CATCTTCTTTCACCTGAACG	A1BG
    sg_2	CTCCGGGGAGAACTCCGGCG	A1BG
    sg_3	TCTCCATGGTGCATCAGCAC	A1BG
    sg_4	TGGAAGTCCACTCCACTCAG	A1BG
    sg_5	ACTGCATCTGTGCAAACGGG	A2M
    sg_6	ATGTCTCATGAACTACCCTG	A2M
    sg_7	TGAAATGAAACTTCACACTG	A2M
    sg_8	TTACTCATATAGGATCCCAA	A2M 
    

    Do you know what could be the problem?

    Thanks! Pierre

    opened by p-levy 4
  • Update clap to 3.0 rc9

    Somewhere between "beta5" and "rc9", clap 3.0.0 hid the derive macros behind a feature flag. Given that cargo eagerly updates dependencies that are expected to be compatible, this breaks cargo install. This updates to rc9 and adds the feature flag.

    opened by tfenne 0
  • Add a new output with correlations of guide counts between samples

    One other nice QC to have is a set of correlations between samples. This might look something like this:

    sample1   sample2   r2     control_r2   essential_r2   nonessential_r2   other_r2
    plasmid   day7      0.76   0.94         0.62           0.92              0.87
    ...
    
    opened by tfenne 0
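
The correlation output proposed in the last issue could be prototyped from the counts matrix. The sketch below assumes that `r2` means the squared Pearson correlation between the raw counts of two samples, which the issue does not state; `r_squared` is a hypothetical helper.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length count vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return (sxy * sxy) / (sxx * syy)
```

The per-category columns in the proposed table (control, essential, nonessential, other) would apply the same function to the subset of guides with each label.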
Releases(v0.1.3)
  • v0.1.3(Mar 22, 2022)

    Release 0.1.3 is a minor release to fix a bug in the handling of input reads that are shorter than the length of guides in the guide library. Prior to this release if one of the input FASTQs contained reads that are shorter than the length of the guides in use, guide-counter would fail and exit with an error. Starting with 0.1.3 reads shorter than a guide-length are ignored.

  • v0.1.2(Dec 29, 2021)

    Minor bugfix to re-enable multi-value parameters (i.e. --input 1.fq 2.fq) rather than having to specify the option many times (e.g. --input 1.fq --input 2.fq). No other changes.

  • v0.1.1(Dec 29, 2021)

    Bugfix release to fix broken cargo install due to derive macros in clap-3.0.0 RCs moving behind a feature flag. No functional differences.

  • v0.1.0(Dec 28, 2021)
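
The behavior described in the 0.1.3 release note, where reads shorter than a guide are ignored instead of causing a failure, can be illustrated with an explicit length check. This is a hedged Python sketch, not the Rust fix itself; `count_guides` is a hypothetical function, `lookup` maps read sub-sequences to guide IDs, and stopping at the first matching offset is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_guides(
    reads: Iterable[str],
    lookup: Dict[str, str],
    offsets: List[int],
    guide_length: int,
) -> Counter:
    """Count guide hits per guide ID, skipping reads too short to hold a guide."""
    counts: Counter = Counter()
    for read in reads:
        if len(read) < guide_length:
            continue  # shorter than a guide: ignore the read entirely
        for offset in offsets:
            end = offset + guide_length
            if end > len(read):
                continue  # this offset would run past the end of the read
            guide_id = lookup.get(read[offset:end])
            if guide_id is not None:
                counts[guide_id] += 1
                break  # count each read at most once (assumed)
    return counts


counts = count_guides(
    ["NNACGT", "AC", "NNACG", "ACGTNN"],
    {"ACGT": "g1"},
    offsets=[2, 0],
    guide_length=4,
)
```

In Rust, slicing past the end of a read panics, which matches the "range end index 20 out of range for slice of length 14" error in the issue above; the bounds checks here make the same guard explicit.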

Owner
Fulcrum Genomics