Software

Antonie

Antonie is an integrated, robust, reliable and fast processor of DNA reads, mostly from Next Generation Sequencing platforms (typically Illumina, but we strive to be multiplatform). It is currently focussed on prokaryotic and other small genomes.

Antonie is free open source software, and we welcome contributions!

The initial focus is on automatically and quickly producing the most useful results for prokaryote-sized genomes. A second goal is to make the program robust against bad input: out of the box, it should refuse to draw conclusions from low-quality or unnaturally distributed data.

Antonie is named after Antonie van Leeuwenhoek, the Delft microscope maker and discoverer of bacteria.

Downloading

Antonie is under active development, and the latest sources can be found on GitHub (see below). For your convenience, we also regularly provide RPM, DEB, OSX and Windows versions of Antonie at http://ds9a.nl/antonie/packages/

Capabilities

Currently, Antonie can map the FASTQ output of sequencers to a FASTA reference genome. It records the mapping as a sorted and indexed BAM file. It can also exclude known contaminants, such as PhiX. Finally, if a GFF3 annotation of the reference genome is available, the features Antonie finds will be annotated.

Antonie performs functions similar to those of, for example, bowtie, but somewhat faster for small genomes. It also performs some of the analysis usually done further downstream, for example by fastqc or gatk.

So, the input of Antonie is:

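  • FASTQ or FASTQ.gz (the reads from the sequencer)
  • FASTA (the reference genome)
  • GFF3 (annotation of the reference genome, if available)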
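
To illustrate what accepting both FASTQ and FASTQ.gz involves, here is a minimal Python sketch. It is not Antonie's own code; the function names open_maybe_gzip and read_fastq are made up for this example, and it assumes the common four-line FASTQ record layout.

```python
import gzip
from typing import IO, Iterator, Tuple


def open_maybe_gzip(path: str) -> IO[str]:
    """Open a text file, transparently decompressing it if it is gzipped.

    Detection uses the two gzip magic bytes rather than the file extension.
    """
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    with open_maybe_gzip(path) as handle:
        while True:
            header = handle.readline().rstrip("\r\n")
            if not header:
                break  # end of file (or a trailing blank line)
            sequence = handle.readline().rstrip("\r\n")
            handle.readline()  # the '+' separator line, ignored here
            quality = handle.readline().rstrip("\r\n")
            # Refuse malformed input instead of guessing what was meant.
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            yield header[1:], sequence, quality
```

Calling read_fastq("reads.fastq") and read_fastq("reads.fastq.gz") on the same data should yield identical records; the file names here are placeholders.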

The output of Antonie is:

  • A JSON-compatible file with analysis, graphs, data, log, annotations
  • A pretty webpage displaying the JSON data (sample)
  • A sorted and indexed BAM file, mapping the reads to the reference genome

The analysis includes calls for:

  • SNPs ('undercovered regions')
  • Indels
  • Metagenomically variable loci

In addition, there are graphs of:

  • Distribution of reported Phred scores (global, per read position)
  • Distribution of actually measured Phred scores
  • Q-Q plot of empirical versus reported Phred scores
  • K-mer variability per read position
  • GC-content per read position
  • GC-content distribution of reads (versus genome wide)
  • Duplication count of reads
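
As a rough illustration of what lies behind two of these graphs, the Python sketch below computes the mean reported Phred score and the GC fraction per read position, and converts a measured error rate into a Phred score (Q = -10 * log10(p)). This is only a sketch under stated assumptions, not how Antonie implements it: the function names are invented for this example, and the quality strings are assumed to use the Phred+33 encoding.

```python
import math
from typing import Iterable, List, Tuple


def phred_from_error_rate(p: float) -> float:
    """Convert an error probability into a Phred score: Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error rate must be in (0, 1]")
    return -10.0 * math.log10(p)


def per_position_stats(
    reads: Iterable[Tuple[str, str]], offset: int = 33
) -> List[Tuple[float, float]]:
    """Return (mean reported Phred, GC fraction) for each read position.

    `reads` yields (sequence, quality) pairs; `offset` is the ASCII offset
    of the quality encoding (33 is an assumption, see the text above).
    """
    qual_sum: List[int] = []
    gc_count: List[int] = []
    depth: List[int] = []
    for sequence, quality in reads:
        for i, (base, qchar) in enumerate(zip(sequence, quality)):
            if i == len(depth):  # first read that reaches this position
                qual_sum.append(0)
                gc_count.append(0)
                depth.append(0)
            qual_sum[i] += ord(qchar) - offset
            gc_count[i] += base in "GCgc"
            depth[i] += 1
    return [(q / n, g / n) for q, g, n in zip(qual_sum, gc_count, depth)]
```

An empirical Phred score for a position could then be obtained by counting mismatches against the reference at that position and passing mismatches / depth to phred_from_error_rate; plotting that against the mean reported score gives the kind of comparison a Q-Q plot shows.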

So as a formula:

FASTQ + FASTA + GFF3 -> JSON + BAM + BAM.BAI -> PRETTY HTML

[Figures: first through fifth sets of graphs]

More details and code

For more details, please head to https://github.com/beaumontlab/antonie/blob/master/README.md