This IPython notebook contains a combination of Python, bash, and R scripts that describe the full analysis used in the
pyRAD manuscript. If you wish to execute the scripts yourself you can copy and paste commands from the input cells,
or you can download this notebook as an .ipynb file and re-execute all scripts in an IPython notebook of your own.
If you are unfamiliar with IPython notebooks (you are reading one now) you can read more about them here; briefly, the
default language in any code cell is Python, but other languages can be executed within cells as well. An easy way to do
this is to designate the language at the beginning of the cell using a %% magic. Cells that begin with %%bash are executed
like a shell script, and those that begin with %%R execute R scripts, including the ability to embed high-quality graphics
in the output below cells, which is demonstrated near the end of this notebook. Below I use the unix command ls -l to
check what is in my current working directory.
If you plan to execute scripts in this notebook you will need the following Python modules: at a minimum egglib, numpy, and rpy2 (the latter for the %%R cells), as used below.
In addition you will need PyRAD v.1.64 and Stacks 1.1 or higher (the MySQL installation is not required). You will also need
two directories containing the necessary scripts and input files, which are located in a zipped directory in the supplemental
materials. Unzip this archive and move the contents into your current working directory.
You will need to change the locations of pyRAD and Stacks in all of the following scripts to reflect their locations on your machine.
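Before starting, it can help to confirm the programs are reachable from this notebook (the pyRAD path below matches the one used in later cells; adjust it to your install):
## optional sanity check: confirm the required programs can be found
!which process_radtags
!ls ~/pyrad_v.1.62/pyRAD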
!ls -l
At this point you see there is an .ipynb file, which is this IPython notebook, as well as three directories and a Python
script for simulating RADseq data under a coalescent model. The simulation script also creates a barcodes output file for
each data set. The SIMsmall directory contains the params input files for three PyRAD analyses.
!ls -l SIMsmall/
The script simradsMANU.py (included with the supplemental materials) uses the egglib module to simulate sequences under a
coalescent model and then trims them to emulate RADseq-like data. To begin, I simulate three different data sets which I use
to compare the performance of PyRAD and Stacks on data containing indels.
The first argument to the script is the indel mutation rate, and the second is whether mutations should arise in the
restriction recognition site (causing locus dropout), which I do not use here (set to 0). The third argument is the number
of loci to simulate, and the fourth is the number of individuals to sample per species. The final argument is the name of
the output file. Data are simulated on a species tree as described in the manuscript text. All other parameters used in the
simulation are kept constant, because I am primarily interested in examining the effect of indel variation.
%%bash
## you must have the egglib Python module installed to do this step.
python simradsMANU.py 0.00 0 10000 1 SIMsmall/simRADs_no.fastq
python simradsMANU.py 0.02 0 10000 1 SIMsmall/simRADs_low.fastq
python simradsMANU.py 0.05 0 10000 1 SIMsmall/simRADs_high.fastq
The raw simulated data are in fastQ format, as shown below, with uniformly high quality scores.
!head -n12 SIMsmall/simRADs_no.fastq
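For readers unfamiliar with fastQ, each read occupies four lines; a generic record looks like this (the identifier, sequence, and quality string here are made-up placeholders, not actual output):
@some_read_identifier
TGCAGAATTCCGGAATTCCGG
+
bbbbbbbbbbbbbbbbbbbbb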
Now I run the PyRAD analysis. To begin, I create a separate working directory for each analysis, then execute PyRAD on the
three different params files, which point to the three working directories, respectively. The following cells will each
take some time to run; the %time magic measures how long, and each analysis is run in a separate cell. The params files
designate parallel processing on 12 processors.
%%bash
## create working directories for each PyRAD analysis
mkdir SIMsmall/Py_no
mkdir SIMsmall/Py_low
mkdir SIMsmall/Py_high
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_no.txt 2> SIMsmall/log.pyno
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_low.txt 2> SIMsmall/log.pylow
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_high.txt 2> SIMsmall/log.pyhigh
%%bash
## The following script removes unneeded files leftover from the analysis to save disk space.
rm -r SIMsmall/Py_no/fastq/ SIMsmall/Py_no/edits/
rm -r SIMsmall/Py_low/fastq/ SIMsmall/Py_low/edits/
rm -r SIMsmall/Py_high/fastq/ SIMsmall/Py_high/edits/
Now I run the Stacks analysis. To begin, I create three output directories, one for each analysis.
%%bash
## create working directories for each stacks analysis
mkdir SIMsmall/stackf_no
mkdir SIMsmall/stackf_low
mkdir SIMsmall/stackf_high
I then use process_radtags to sort reads by their barcodes, and then the denovo_map.pl pipeline to run the three Stacks
analyses. Parameter settings are analogous (as close as possible) to those used in the PyRAD analyses. Because the barcodes
were randomly generated by the simulation, and would otherwise have to be entered by hand into the Stacks script, I use the
commands below to generate each Stacks input script as a .sh file. I show the file and then execute it.
%%bash
## create a barcode file for stacks analyses
cut -f 2 SIMsmall/simRADs_no.fastq.barcodes > SIMsmall/simRADs_no.stacks.barcodes
cut -f 2 SIMsmall/simRADs_low.fastq.barcodes > SIMsmall/simRADs_low.stacks.barcodes
cut -f 2 SIMsmall/simRADs_high.fastq.barcodes > SIMsmall/simRADs_high.stacks.barcodes
%%bash
process_radtags -f SIMsmall/simRADs_no.fastq -o SIMsmall/stackf_no/ \
-b SIMsmall/simRADs_no.stacks.barcodes -e pstI -E phred33
The function below creates the shell script (".sh") to run a stacks analysis.
import glob

def makestackslist(dset):
    ## collect the de-multiplexed sample files for this data set
    infiles = ["-s "+ff+" " for ff in glob.glob("SIMsmall/stackf_"+dset+"/*.fq")]
    ## build the denovo_map.pl command with settings analogous to the PyRAD runs
    denovo = "denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X "+\
             "'populations:-m 6' -D 'SIMsmall_"+dset+"' -o SIMsmall/stackf_"+dset+" "+"".join(infiles)
    with open("SIMsmall/stacks_"+dset+".sh",'w') as outfile:
        outfile.write(denovo)

makestackslist('no')
For example, the stacks_no.sh script below will run stacks on the data set with no indels.
!cat SIMsmall/stacks_no.sh
%time ! sh SIMsmall/stacks_no.sh 2> SIMsmall/log.stackno
%%bash
process_radtags -f SIMsmall/simRADs_low.fastq -o SIMsmall/stackf_low/ \
-b SIMsmall/simRADs_low.stacks.barcodes -e pstI -E phred33
makestackslist("low")
!cat SIMsmall/stacks_low.sh
%time ! sh SIMsmall/stacks_low.sh 2> SIMsmall/log.stacklow
%%bash
process_radtags -f SIMsmall/simRADs_high.fastq -o SIMsmall/stackf_high/ \
-b SIMsmall/simRADs_high.stacks.barcodes -e pstI -E phred33
makestackslist("high")
!cat SIMsmall/stacks_high.sh
%time ! sh SIMsmall/stacks_high.sh 2> SIMsmall/log.stackhigh
I create a list measuring taxon coverage for each Stacks analysis, and similarly for each PyRAD analysis below. This
measures the total number of loci that have data for at least N taxa, for each value of N.
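As a toy illustration of this calculation (the counts below are made up), if cnt_by_cov holds the number of loci with exactly 1, 2, and 3 taxa, the "at least N" totals and the mean coverage follow directly:
from numpy import mean
cnt_by_cov = [5, 3, 2]                                  ## loci with exactly 1, 2, 3 taxa
atleast = [sum(cnt_by_cov[i-1:]) for i in range(1,4)]   ## same cumulative trick used below
print atleast                                           ## [10, 5, 2]
tcov = []
for cov, n in zip(range(1,4), cnt_by_cov):
    tcov += [cov]*n                                     ## expand each level by its count
print mean(tcov)                                        ## 1.7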
## Stacks results ##
lines = open("SIMsmall/stackf_high/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
shigh = [cnts.count(i) for i in range(1,13)]
SHIGH = [int(sum(shigh)-sum(shigh[:i-1])) for i in range(1,13)]
lines = open("SIMsmall/stackf_low/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
slow = [cnts.count(i) for i in range(1,13)]
SLOW = [int(sum(slow)-sum(slow[:i-1])) for i in range(1,13)]
lines = open("SIMsmall/stackf_no/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
sno = [cnts.count(i) for i in range(1,13)]
SNO = [int(sum(sno)-sum(sno[:i-1])) for i in range(1,13)]
## PyRAD results ##
infile = open("SIMsmall/Py_no/stats/c85d6m2pN.stats").readlines()
pno = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PNO = [sum(pno)-sum(pno[:i-1]) for i in range(1,13)]
infile = open("SIMsmall/Py_low/stats/c85d6m2pN.stats").readlines()
plow = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PLOW = [sum(plow)-sum(plow[:i-1]) for i in range(1,13)]
infile = open("SIMsmall/Py_high/stats/c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PHIGH = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
## push the Python result lists to R as objects ##
%load_ext rmagic
%Rpush PNO PLOW PHIGH SNO SLOW SHIGH
%%R
RES = cbind(PNO,PLOW,PHIGH,SNO,SLOW,SHIGH)
print(RES)
%%R
makeplot <- function() {
par(mfcol=c(1,3),
cex.axis=1.1,
cex.main=1.1,
cex.lab=1.1,
mar=c(4,4,3,0),
oma=c(1,2,0,2))
plot(RES[,1], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="Number of loci recovered",
main="No indels")
points(RES[,1], pch=20, col='black', cex=2.5)
lines(RES[,1], lwd=3)
points(RES[,4], pch=20, col='darkgrey', cex=1.35)
lines(RES[,4], lwd=2.5, col='darkgrey')
plot(RES[,2], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="",#loci with at least N taxa",
main="Low indels")
points(RES[,2], pch=20, col='black', cex=2.5)
lines(RES[,2], lwd=3)
points(RES[,5], pch=20, col='darkgrey', cex=1.35)
lines(RES[,5], lwd=2.5, col='darkgrey')
plot(RES[,3], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="",#loci with at least N taxa",
main="High indels")
points(RES[,3], pch=20, col='black', cex=2.5)
lines(RES[,3], lwd=3)
points(RES[,6], pch=20, col='darkgrey', cex=1.35)
lines(RES[,6], lwd=2.5, col='darkgrey')
}
%%R
pdf("SIMsmall/FIG_1.pdf",
width=6.5, height=2.8,
family="Times")
makeplot()
dev.off()
%R makeplot()
import glob
RES = []
for run in ["no", "low", "high"]:
    ## mean number of consensus loci per sample from Stacks
    handle = "SIMsmall/stackf_"+run+"/sample_*.tags.tsv"
    c = [open(ff,'r').read().count("consensus") for ff in glob.glob(handle)]
    RES.append(['stacks',run,float(sum(c))/len(c)])
    ## mean number of consensus loci per sample from PyRAD
    handle = "SIMsmall/Py_"+run+"/stats/s5.consens.txt"
    c = map(float,[line.split("\t")[3] for line in open(handle,'r').readlines()[1:-7]])
    RES.append(['pyrad',run,float(sum(c))/len(c)])

for res in RES:
    print 'mean N loci per sample from %s indel %s run \t%f' % (res[1],res[0],res[2])
infile = open("SIMsmall/stackf_high/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks highindel'
infile = open("SIMsmall/stackf_low/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks lowindel'
infile = open("SIMsmall/stackf_no/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks noindel'
infile = open("SIMsmall/Py_high/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad highindel'
infile = open("SIMsmall/Py_low/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad lowhindel'
infile = open("SIMsmall/Py_no/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad noindel'
An empirical analysis was conducted on the published data set from Eaton & Ree (2013), available de-multiplexed in fastQ
format at the NCBI Sequence Read Archive (SRA072507). This step takes many hours using 12 processors, as described in the
paper. The data are already sorted, so PyRAD is started from step 2. After trimming the barcode and adapter, read lengths
are 69 bp; at 0.85 similarity this is equivalent to allowing up to 10 base differences. The params.txt file is available in
the supplementary materials. Because Stacks and PyRAD use different quality filtering methods I employed filtering by PyRAD
only, and the data are converted back to fastQ format to be read into Stacks. This way quality filtering does not affect
the results of the comparison between the two programs.
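As a quick check of that arithmetic:
print 69 * (1 - 0.85)    ## ~10.35, i.e., up to 10 base differences at 0.85 similarity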
%time ! ~/pyrad_v.1.62/pyRAD -p REALDATA/paramsPAR.txt -s 2
To remove the effect of different error filtering between the two programs, the filtered data from PyRAD were used for both
analyses. Because PyRAD converts the data from fastQ to fasta at this step, discarding read quality scores, the data were
converted back to fastQ with arbitrarily high quality scores ("b") added for each base, for use in Stacks.
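As a toy illustration of this conversion (the header and sequence here are made up; the real loop is in the next cell), one fasta record becomes a four-line fastQ record:
name = ">sample1_0_12_34"                      ## hypothetical .edit header
seq = "AATTGGCC"                               ## hypothetical trimmed read
x, y = name.split("_")[2:4]
print "@GRC13_0027_FC:4:1:"+x+":"+y+"#0/1"     ## Illumina-style read name
print "TGCAG"+seq                              ## restore the PstI overhang
print "+"
print "b"*len("TGCAG"+seq)                     ## arbitrarily high quality scores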
! mkdir REALDATA/edits4stacks
for ff in glob.glob("REALDATA/edits/*.edit"):
    dat = open(ff).readlines()
    outfile = open("REALDATA/edits4stacks/"+ff.split("/")[-1].replace(".edit",".fq"),'w')
    writing = []
    for line in dat:
        if ">" in line:
            ## build an Illumina-style read name from the fasta header
            x,y = line.split("_")[2:4]
            nn = "@GRC13_0027_FC:4:1:"+x+":"+y+"#0/1"
            writing.append(nn.strip())
        else:
            ## restore the PstI overhang and add placeholder quality scores
            seq = "TGCAG"+line.strip()
            writing.append(seq)
            writing.append("+")
            writing.append("b"*len(seq)+"\n")
            outfile.write("\n".join(writing))
            writing = []
    outfile.close()
The PyRAD analysis is run both with and without paralog filtering.
%time ! ~/pyrad_v.1.62/pyRAD -p REALDATA/paramsNOPAR.txt -s 34567 2> REALDATA/log.py.txt
%%bash
mv REALDATA/stats REALDATA/stats.nopar
mkdir REALDATA/stats
rm REALDATA/clust.85/*.consens
~/pyrad_v.1.62/pyRAD -p REALDATA/paramsPAR.txt -s 4567 2> REALDATA/log.py2.txt
!mkdir REALDATA/stacks85
import glob
infiles = ["-s "+ff+" " for ff in glob.glob("REALDATA/edits4stacks/*")]
denovo = "denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 12 -b 1 -B 1 -S -X "+\
"'populations:-m 6' -X 'ustacks:--max_locus_stacks 2' -D 'stacks85' "+\
"-o REALDATA/stacks85 "+"".join(infiles)
with open("REALDATA/stacks85/stacks.sh",'w') as outfile:
outfile.write(denovo)
!cat REALDATA/stacks85/stacks.sh
%time !sh REALDATA/stacks85/stacks.sh 2> REALDATA/log.stacks.txt
infile = open("REALDATA/stats.nopar/c85d6m2pN.stats").readlines()
pno = [int(infile[j].strip().split("\t")[1]) for j in range(27,40)]
PEMPN = [sum(pno)-sum(pno[:i-1]) for i in range(1,14)]
infile = open("REALDATA/stats/c85d6m2p3H3N3.stats").readlines()
fno = [int(infile[j].strip().split("\t")[1]) for j in range(27,40)]
PEMPF = [sum(fno)-sum(fno[:i-1]) for i in range(1,14)]
Stacks does not have the same paralog filtering options that pyRAD does, so I implement three additional filters on the
data. First, I remove haplotypes with more than two alleles at a locus; second, I remove loci with more than three
heterozygous sites; and last, I remove loci from the final output that are heterozygous at the same site across more than
four samples.
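As a toy illustration of the heterozygous-site bookkeeping used below (the haplotype string is made up), the positions where the two alleles differ are collected like this:
loc = "ACGT/ATGT"                               ## hypothetical two-allele haplotype
a, b = loc.split("/")
h = [i for i in range(len(a)) if a[i] != b[i]]
print h                                         ## [1]: the alleles differ at position 1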
handle = "REALDATA/stacks85/batch_1.haplotypes.tsv"
lines = open(handle).readlines()
FILTER = []
NOFILTER = []
for line in lines[1:]:
dat = line.strip().split("\t")[1:]
cnt = int(dat[0])
ocnt = int(dat[0])
hets = []
for loc in dat:
" remove loci with more than 2 alleles"
if loc.count("/") > 1:
dat.remove(loc)
cnt -= 1
else:
if "/" in loc:
a,b = loc.split("/")
" remove loci with more than 4 hetero sites "
if len(a) > 3:
dat.remove(loc)
cnt -= 1
else:
" remove stack if a hetero site is shared across more than 4 samples"
h = [i for i in range(len(a)) if a[i]!=b[i]]
for i in h:
hets.append(i)
if [hets.count(i) for i in set(hets)]:
if max([hets.count(i) for i in set(hets)]) > 4:
cnt = 0
FILTER.append(cnt)
NOFILTER.append(ocnt)
filt = [FILTER.count(i) for i in range(1,14)]
nofilt = [NOFILTER.count(i) for i in range(1,14)]
SEMPF = [int(sum(filt)-sum(filt[:i-1])) for i in range(1,14)]
SEMPN = [int(sum(nofilt)-sum(nofilt[:i-1])) for i in range(1,14)]
print "mean coverage stacks no filter", mean(NOFILTER)
print "%fullcov stacks no filter",float([NOFILTER.count(13)][0])/sum([FILTER.count(i) for i in range(1,14)])*100
print ""
print "mean coverage stacks paralog filtered", mean(FILTER)
print "%fullcov stacks filtered",float([FILTER.count(13)][0])/sum([FILTER.count(i) for i in range(1,14)])*100
def meancovpyrad(lines):
    ## expand each coverage level by its locus count, then take the mean
    tcov = []
    for line in lines:
        a = line.split("\t")
        tcov += [int(a[0])]*int(a[1])
    full = 100*float(tcov.count(13))/len(tcov)
    return mean(tcov), full
infile = open("REALDATA/stats.nopar/c85d6m2pN.stats").readlines()[27:40]
nofiltpycov, fullnofiltpy = meancovpyrad(infile)
infile = open("REALDATA/stats/c85d6m2p3H3N3.stats").readlines()[27:40]
filtpycov, fullfiltpy = meancovpyrad(infile)
print "mean coverage pyrad no filter", nofiltpycov
print "%fullcov pyrad no filter", fullnofiltpy
print ""
print "mean coverage pyrad paralog filtered", filtpycov
print "%fullcov pyrad no filter", fullfiltpy
%load_ext rmagic
%Rpush SEMPF SEMPN PEMPF PEMPN
%%R
DIFFN <- c(SEMPN-PEMPN)
DIFFF <- c(SEMPF-PEMPF)
%%R
REALRES = cbind(PEMPN,SEMPN,PEMPF,SEMPF,DIFFN,DIFFF)
print(REALRES)
The difference in the number of loci recovered by the two programs is shown below, plotted on a log scale. Above the zero
line (grey) Stacks recovers more loci, and below the line (black) pyRAD recovers more loci. PyRAD returns more loci at high
coverage than Stacks, which instead recovers many more singleton and low-coverage loci. This provides a clearer way to
visualize the difference in performance between the two programs.
%%R
diffplot <- function() {
par(cex.axis=1.1,
cex.main=1.1,
cex.lab=1.6,
mar=c(4,4,3,0),
oma=c(1,2,0,2))
D <- log(abs(DIFFN)) * c(-1,-1,-1,rep(1,10))
E <- log(abs(DIFFF)) * c(-1,-1,-1,rep(1,10))
names(D) <- seq(1,13)
df.bar <- barplot(D, col=c(rep('grey',3),rep('black',10)),
ylim=c(-12,10), space=0.5,
ylab="Log difference in number of loci recovered",
xlab="Minimum taxon coverage",axes=F)
box()
abline(h=0,lwd=2,lty=2)
points(df.bar, E, bg=c(rep('grey',3),rep('black',10)),
cex=2, pch=21)
axis(2,c(-10,-5,0,5,10),c("10","5","0","5","10"))
}
diffplot()
%%R
pdf("diffplot.pdf", width=7, height=6, family='Times')
diffplot()
dev.off()
import numpy

def counts(infile, mincov):
    " estimate the proportion of loci containing an indel, for loci with at least mincov taxa "
    lines = [i.split("\n") for i in infile]
    INDS = []
    for loc in lines[:-1]:
        seqs = []
        for line in loc:
            if ">" in line:
                seqs.append(line.split(" ")[-1])
        if len(seqs) >= mincov:
            N = numpy.array([list(i) for i in seqs])
            cols = iter(N.transpose())
            inds = 0
            col = cols.next()
            while 1:
                if "-" in col:
                    ## step through a gap run and count it once when it ends
                    try: col = cols.next()
                    except StopIteration: break
                    if "-" not in col:
                        inds += 1
                else:
                    try: col = cols.next()
                    except StopIteration: break
            INDS.append(inds)
    i = [i for i in INDS if i!=0]
    return float(len(i))/len(INDS)
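To sanity-check the gap-run counting, here is a made-up two-sequence locus in the .loci-style format the function expects; its single two-column gap run should register as one indel:
toy = [">a ACT--GA\n>b ACTTTGA", ""]            ## trailing element mimics the final "//" split
print "%0.3f" % counts(toy, 2)                  ## expect 1.000: the one locus has an indel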
#infile1 = open("REALDATA/outfiles/c85d6m2pN.loci").read().split("//")
infile1 = open("REALDATA/outfiles/c85d6m2p3H3N3.loci").read().split("//")
infile2 = open("SIMsmall/Py_high/outfiles/c85d6m2pN.loci").read().split("//")
infile3 = open("SIMsmall/Py_low/outfiles/c85d6m2pN.loci").read().split("//")
for i in [13,12,10,8,6,4]:
    dat = counts(infile1,i)
    print 'cyatho',i,"%0.3f" % dat
indelesthigh = counts(infile2,12)
indelestlow = counts(infile3,12)
print 'simhigh', 12, "%0.3f" % indelesthigh
print 'simlow', 12, "%0.3f" % indelestlow
Next I simulate the large RADseq data set on the same species tree as before, with no indel variation, but allowing
mutations to arise in the restriction recognition site (a 25 bp region in this case, to simulate extensive dropout). I
simulate 500K loci to create a data set that has many fewer loci within any individual sample due to locus dropout (~150K),
such that within-sample clustering is fast, but many loci to cluster across samples (500K), such that across-sample
clustering is slow. I compare the run times needed to analyze these data in Stacks and in PyRAD, including when
implementing hierarchical clustering in PyRAD.
!python simradsMANU.py 0.0 50 500000 1 SIMbig/big12.fastq
%%bash
mkdir SIMbig/big0/
mkdir SIMbig/big4/
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbig.txt -s 12345 2> SIMbig/log.big
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbig.txt -s 67 2> SIMbig/log.big
%%bash
mv SIMbig/big0/clust.85/ SIMbig/big4/
mkdir SIMbig/big4/stats
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbigH4.txt -s 67 2> SIMbig/log.bigH4
%%bash
mkdir SIMbig/stackf
cut -f 2 SIMbig/big12.fastq.barcodes > SIMbig/big12.stacks.barcodes
! process_radtags -f SIMbig/big12.fastq -o SIMbig/stackf -b SIMbig/big12.stacks.barcodes -e pstI -E phred33
Because the barcodes were generated randomly, and every one of the 12 input files would otherwise have to be written in by
hand for the stacks command denovo_map.pl, I wrote a script here to find the sample names, as output by process_radtags,
and create an input file for denovo_map.pl, written to a bash script called stacks.sh.
makestackslist("nom")
infiles = ["-s "+ff+" " for ff in glob.glob("SIMbig/stackf/*.fq")]
denovo = "denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X "+\
"'populations:-m 6' -D 'SimRADsBIG' -o SIMbig/stackf "+"".join(infiles)
with open("SIMbig/stacks.sh",'w') as outfile:
outfile.write(denovo)
%time ! sh SIMbig/stacks.sh
Measure stats for big simulation analysis
infile = open("SIMbig/stackf/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks'
infile = open("SIMbig/big0/stats/big.c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad'
infile = open("SIMbig/big4/stats/big.c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad hierarchical'
infile = open("SIMbig/big0/stats/big.c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
P0 = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
infile = open("SIMbig/big4/stats/big.c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
P4 = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
lines = open("SIMbig/stackf/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
sno = [cnts.count(i) for i in range(1,13)]
S0 = [int(sum(sno)-sum(sno[:i-1])) for i in range(1,13)]
from numpy import array   ## may already be available if running under %pylab
print array([P0,P4,S0]).transpose()
100*(253./500000)