Titus and all the labbies are awesome mentors, and I'm making some progress while learning how do deal with this brave new world.
One thing I'm doing is checking how good is a Gallus gallus Moleculo sequencing dataset we have, which is being used for the chicken genome sequence improvement project. The first question was: How many of Moleculo raw reads align to the reference genome, and how much is the coverage?
First, let's download the reference genome.
!wget -c ftp://hgdownload.cse.ucsc.edu/goldenPath/galGal4/bigZips/galGal4.fa.masked.gz
With the reference genome available, we need to prepare it for the BWA algorithms by constructing its FM-index. The command to do this is
!bwa index galGal4.fa.masked.gz
SAMtools also require preprocessing of the original FASTA file:
!samtools faidx galGal4.fa.masked.gz
I had 10 files with Moleculo reads, varying from 500 bp to 16 Kbp. In this example let's assume all reads are in the same file, reads.fastq, but in the original analysis I ran the next commands inside a Bash for-loop.
Let's align reference and reads using the BWA-MEM algorithm:
!bwa mem galGal4.fa.masked.gz reads.fastq > reads.fastq.sam
Next we are going to optimize the reads.fastq.sam file, transforming it into the BAM format (a binary version of SAM). We also sort based on leftmost coordinates and index the file for faster access:
!samtools import galGal4.fa.masked.gz.fai reads.fastq.sam reads.fastq.bam !samtools sort reads.fastq.bam reads.fastq.sorted !samtools index reads.fastq.sorted.bam
We can query our alignments using the view commands from samtools. How many reads didn't align with the reference?
!samtools view -c -f 4 reads.fastq.sorted.bam
In my case I got 7,985 reads that didn't align to the reference, from a total of 1,579,060 reads.
!samtools view -c reads.fastq.sorted.bam
There were 4,411,380 possible alignments.
But how good is the coverage of these alignments in the reference genome? To do this calculation I refactored calc-blast-cover, resulting in bam-coverage. The idea is to create arrays initialized to zero with the size of the sequence for each sequence in the reference. We go over the alignments and set the array position to 1 if there is an alignment matching this position. After doing this we can calculate the coverage by summing the array and dividing by the sequence size.
To make it easier to use I've started a new project (oh no! Another bioinformatics scripts collection!) with my modifications of the original script. This project is called bioinfo and it is available in PyPI. Basic dependencies are just docopt (which is awesome, BTW) and tqdm (same). Additional dependencies are needed based on which command you intend to run. For example, bam_coverage needs PySAM and screed. At first this seems counter-intuitive, because the user need to explicitly install new packages, but this way avoids another problem: installing all the packages in the world just to run a subset of the program. I intend to give an informative message when the user try to run a command and dependencies are missing.
If you've never used Python before, a good way to have a working environment is to use Anaconda.
Since bioinfo is available in PyPI, you can install it with
!pip install bioinfo
To see available commands and options in bioinfo you can run
Running the bam_coverage command over the alignment generated by BWA-MEM:
!bioinfo bam_coverage galGal4.fa.masked reads.fastq.sorted.bam 200 reads.fastq --mapq=30
The same command is available as a function in bioinfo and can be run inside a Python script:
from bioinfo import bam_coverage ## same call, using the module. bam_coverage("galGal4.fa.masked", "reads.fastq.sorted.bam", 200, "reads.fastq", 30)
If you don't want to install bioinfo (why not?!??!), you can just download bam-coverage:
!wget -c https://github.com/luizirber/bioinfo/blob/master/bioinfo/bam_coverage.py
And pass the options to the script:
!python bam_coverage.py galGal4.fa.masked reads.fastq.sorted.bam 200 reads.fastq 45
The result I got for my analysis was a 82.3% coverage of the reference genome by the alignments generated with BWA-MEM.
reading query 1579060 [elapsed: 01:16, 20721.82 iters/sec] creating empty lists 15932 [elapsed: 01:08, 232.67 iters/sec] building coverage 4411380 [elapsed: 34:36, 2123.96 iters/sec] Summing stats |##########| 15932/15932 100% [elapsed: 00:07 left: 00:00, 2008.90 iters/sec] total bases in reference: 1046932099 total ref bases covered : 861340070 fraction : 0.822727730693 reference : galGal4.fa.masked alignment file : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.sorted.bam query sequences : ../moleculo/LR6000017-DNA_A01-LRAAA-AllReads.fastq