A tool to filter sites in a FASTA-format whole-genome pseudo-alignment

Ryan Wick

Last update: Apr 2, 2023

Core-SNP-filter

This is a tool to filter sites (i.e. columns) in a FASTA-format whole-genome pseudo-alignment based on:

Whether the site contains variation or not.
How conserved the site is, i.e. whether it contains an unambiguous base in a sufficient fraction of the sequences.

I wrote Core-SNP-filter because I was using Snippy, and the snippy-core command produces two alignments: core.full.aln (contains all sites regardless of variation and conservation) and core.aln (only contains variant sites with 100% conservation). I wanted a tool that could produce a core SNP alignment with more flexibility, e.g. including sites with ≥95% conservation.

Core-SNP-filter is efficient. On a small input alignment (2 Mbp in length, 100 sequences), it runs in seconds. On a large input alignment (5 Mbp in length, 5000 sequences, 25 GB file size), it takes less than 10 minutes and only uses ~50 MB of RAM.

Important caveat: Core-SNP-filter is only appropriate for DNA alignments, not protein alignments.

Usage

The executable named coresnpfilter takes a FASTA file as input. This must be an aligned FASTA file, i.e. all sequences must be the same length. The characters in the FASTA sequences can be bases (e.g. A or c), gaps (-) or any other ASCII character (e.g. N for ambiguous bases or X for masked bases). The input FASTA can be gzipped, and line breaks (multiple lines per sequence) are okay.
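To make the input requirements concrete, here is a minimal Python sketch of a reader that accepts the same kind of file: plain or gzipped, with sequences possibly wrapped over multiple lines, and all sequences the same length. It is only an illustration, not Core-SNP-filter's own code (which is written in Rust), and the function names open_maybe_gzip and read_aligned_fasta are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a file as text, detecting gzip by its two magic bytes."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_aligned_fasta(path):
    """Return a list of (header, sequence) pairs, joining wrapped sequence lines."""
    records, header, chunks = [], None, []
    with open_maybe_gzip(path) as fh:
        for line in fh:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    # An alignment requires every sequence to have the same length.
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("sequences differ in length, so this is not an alignment")
    return records
```

Calling read_aligned_fasta on a file where one sequence is shorter than the others raises a ValueError, which mirrors the "all sequences must be the same length" requirement.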

There are two main options:

-e/--exclude_invariant: if used, all invariant sites in the alignment are removed. A site counts as invariant if the number of unique unambiguous bases (A, C, G or T) at that site is one or zero. For example, a site with only A is invariant, but a site with both A and C is not invariant. Gaps and other characters do not count, e.g. a site with only A, N and - is invariant. Case does not matter, e.g. a site with only A and a is invariant.
-c/--core: at least this fraction of the sequences must contain an unambiguous base (A, C, G or T) at a site for the site to be included. The default is 0.0, i.e. sites are not filtered based on core fraction. If 1.0 is given, all sites with gaps or other characters will be removed, leaving an alignment containing only unambiguous bases. A more relaxed value of 0.95 will ensure that each site contains mostly unambiguous bases, but up to 5% of the sequences can have a gap or other character at that site.
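The two rules above can be sketched in a few lines of Python. This is an illustrative re-implementation based only on the descriptions of -e and -c, not the tool's actual Rust code. The names keep_site and filter_alignment are hypothetical, and the exact comparison at the threshold (fraction >= core) is an assumption taken from the phrase "at least this fraction".

```python
UNAMBIGUOUS = set("ACGT")


def keep_site(column, core=0.0, exclude_invariant=False):
    """Decide whether one alignment column passes both filters."""
    # Case does not matter; gaps, N, X, etc. are not unambiguous bases.
    bases = [ch.upper() for ch in column if ch.upper() in UNAMBIGUOUS]
    # Invariant: zero or one unique unambiguous base at the site.
    if exclude_invariant and len(set(bases)) <= 1:
        return False
    # Core: fraction of sequences with an unambiguous base (assumed >= test).
    return len(bases) / len(column) >= core


def filter_alignment(seqs, core=0.0, exclude_invariant=False):
    """Return the sequences restricted to the columns that pass the filters."""
    kept = [i for i, col in enumerate(zip(*seqs))
            if keep_site(col, core, exclude_invariant)]
    return ["".join(seq[i] for i in kept) for seq in seqs]


seqs = ["ACGTA", "AGG-a", "AcTNC", "A-TTG"]
print(filter_alignment(seqs, core=1.0, exclude_invariant=True))  # ['GA', 'Ga', 'TC', 'TG']
```

In this toy alignment, the first site (all A) and the fourth site (only T, - and N) are invariant. The second site is variable but only 75% of its sequences have an unambiguous base, so it survives -e alone but not -e -c 1.0.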

Core-SNP-filter outputs a FASTA alignment to stdout. The output will have the same number of sequences as the input, but (depending on the options used) the length of the sequences will likely be shorter. The header lines (names and descriptions) of the output will be the same as the input, and there will be no line breaks in the sequences (each sequence gets one line). Some basic information (input file, input sequence length, number of sequences and output sequence length) is printed to stderr.

Some example commands:

# Exclude invariant sites:
coresnpfilter -e core.full.aln > filtered.aln

# With a strict core threshold (same as Snippy's core.aln):
coresnpfilter -e -c 1.0 core.full.aln > filtered.aln

# With a slightly more relaxed core threshold:
coresnpfilter -e -c 0.95 core.full.aln > filtered.aln

# Use gzipped files to save disk space:
coresnpfilter -e -c 0.95 core.full.aln.gz | gzip > filtered.aln.gz

# Running without any options will work, but the output will be the same as the input:
coresnpfilter core.full.aln > filtered.aln

Full help text:

Core SNP filter

Usage: coresnpfilter [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input alignment

Options:
  -c, --core <CORE>        Restrict to core genome (0.0 to 1.0, default = 0.0) [default: 0.0]
  -e, --exclude_invariant  Exclude invariant sites
      --verbose            Verbose output
  -h, --help               Print help
  -V, --version            Print version

Installation from pre-built binaries

Core-SNP-filter compiles to a single executable binary (coresnpfilter), which makes installation easy!

You can find pre-built binaries for common OSs/CPUs on the releases page. If you use one of these OSs/CPUs, download the appropriate binary for your system and put the coresnpfilter file in a directory that's in your PATH variable, e.g. /usr/local/bin/ or ~/.local/bin/.

Alternatively, you don't need to install Core-SNP-filter at all. Instead, just run it from wherever the coresnpfilter executable happens to be, like this: /some/path/to/coresnpfilter --help.

Installation from source

If you are using incompatible hardware or a different OS, then you'll have to build Core-SNP-filter from source. Install Rust if you don't already have it. Then clone and build Core-SNP-filter like this:

git clone https://github.com/rrwick/Core-SNP-filter.git
cd Core-SNP-filter
cargo build --release

You'll find the freshly built executable in target/release/coresnpfilter, which you can then move to an appropriate location that's in your PATH variable.

Demo dataset

This repo's demo.fasta.gz file is a pseudo-alignment made from 40 Klebsiella samples (the original was ~5 Mbp long but I subsetted it down to 10 kbp to save space). It has many gaps, invariant sites and Ns, so Core-SNP-filter can make it a lot smaller.

For example, you can use Core-SNP-filter to create an invariant-free 95%-core alignment:

coresnpfilter -e -c 0.95 demo.fasta.gz > demo_core.fasta

The stderr output will look like this:

Core-SNP-filter
  input file:                    demo.fasta.gz
  number of sequences:           40
  input sequence length:         10000
  invariant-A sites removed:     1394
  invariant-C sites removed:     1763
  invariant-G sites removed:     1849
  invariant-T sites removed:     1378
  other invariant sites removed: 322
  non-core sites removed:        2143
  output sequence length:        1151
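As a quick sanity check on the numbers above, the removed-site counts add up exactly to the difference between the input and output lengths. The short Python snippet below just redoes that arithmetic. Reading this as "each removed site is counted in exactly one category" is an inference from the numbers, not something the output states.

```python
# Counts copied from the stderr output above.
removed = {
    "invariant-A": 1394,
    "invariant-C": 1763,
    "invariant-G": 1849,
    "invariant-T": 1378,
    "other invariant": 322,
    "non-core": 2143,
}
input_length = 10000

total_removed = sum(removed.values())
output_length = input_length - total_removed
print(total_removed, output_length)  # 8849 1151
```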

You can then build a tree with a program such as IQ-TREE:

iqtree2 -s demo_core.fasta -T 4

Verbose output

Using the --verbose option will make Core-SNP-filter print a per-site table to stderr. Save it to a file like this:

coresnpfilter -e -c 0.95 --verbose core.full.aln 1> filtered.aln 2> core_snp_table.tsv

This is mainly for debugging purposes, so you probably don't want to use it. But if you do, the columns are:

pos: 1-based index of the input alignment site
a: whether any sequence at this site contains A or a
c: whether any sequence at this site contains C or c
g: whether any sequence at this site contains G or g
t: whether any sequence at this site contains T or t
count: the number of sequences at this site which contain an unambiguous base
frac: the fraction of sequences at this site which contain an unambiguous base
var: whether there is any variation at this site (i.e. two or more of the a/c/g/t columns are true)
keep: whether the site passed the filter and is included in the output

Boolean columns use 0 for no and 1 for yes.

For example, you can use this table to see which sites in your input alignment have made it into the output alignment:

awk '{if ($9==1) print $1;}' core_snp_table.tsv
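The same extraction can be done in Python if you want the kept positions as a list for further processing. This is a sketch under a few assumptions: the table is whitespace-separated with the nine columns listed above, and any other lines captured from stderr (such as a header row or the basic summary information) should be skipped. The function name kept_positions is hypothetical.

```python
def kept_positions(table_path):
    """Return the 1-based input positions whose keep column is 1."""
    kept = []
    with open(table_path) as fh:
        for line in fh:
            fields = line.split()
            # Skip anything that is not a nine-column data row
            # (e.g. a header line or other stderr messages).
            if len(fields) != 9 or not fields[0].isdigit():
                continue
            if fields[8] == "1":
                kept.append(int(fields[0]))
    return kept
```

Calling kept_positions("core_snp_table.tsv") returns the positions in ascending order. Assuming the output alignment preserves site order, the i-th column of the output corresponds to the i-th entry of that list.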

License

GNU General Public License, version 3

Releases

v0.1.1 (Mar 26, 2023)
Added new information to the stderr output: how many invariant and non-core sites were removed. Otherwise, it functions the same as the previous release.

Tarballs of pre-built executable binaries are attached:

coresnpfilter-linux-x86_64-musl-v0.1.1.tar.gz: for Linux systems with x86-64 processors

coresnpfilter-macos-aarch64-v0.1.1.tar.gz: for Macs with Apple silicon processors

coresnpfilter-macos-x86_64-v0.1.1.tar.gz: for Macs with x86-64 Intel processors

If none of the above work for you, you'll need to build Core-SNP-filter from source (see the README).

v0.1.0 (Mar 25, 2023)
First release!

Tarballs of pre-built executable binaries are attached:

coresnpfilter-linux-x86_64-musl-v0.1.0.tar.gz: for Linux systems with x86-64 processors

coresnpfilter-macos-aarch64-v0.1.0.tar.gz: for Macs with Apple silicon processors

coresnpfilter-macos-x86_64-v0.1.0.tar.gz: for Macs with x86-64 Intel processors

If none of the above work for you, you'll need to build Core-SNP-filter from source (see the README).