This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell; otherwise it is IPython code. This notebook includes code to download, assemble, and analyze a published RADseq data set.
### Notebook 9
### Data set 9 (Ohomopterus)
### Authors: Takahashi et al. 2014
### Data Location: DRP001067
Sequence data for this study are archived on the NCBI sequence read archive (SRA). Below I read in the SraRunTable.txt for this project, which contains all of the information we need to download the data.
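Reading such a run table with pandas might look like the following sketch. The two-row table is inlined here as a stand-in, and the column names (`Run`, `Experiment`) are assumptions about the real file's layout:

```python
import io
import pandas as pd

## hypothetical two-row stand-in for a tab-delimited SraRunTable.txt
table = io.StringIO(u"Run\tExperiment\nDRR021959\tDRX011602\nDRR021960\tDRX011603\n")
sra = pd.read_csv(table, sep="\t")

## the Run column holds the accessions needed for downloading
print(sra["Run"].tolist())
```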
%%bash
## make a new directory for this analysis
mkdir -p empirical_9/fastq/
import os

def wget_download_ddbj(SRR, outdir):
    """ Python function to get sra data from the DDBJ mirror and write
    to outdir using a bash call to wget """
    ## create a call string
    call = "wget -q -r -nH --cut-dirs=9 -P "+outdir+" "+\
           "ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sra/ByExp/"+\
           "sra/DRX/DRX011/DRX011{:03d}".format(SRR)
    ## run wget call
    ! $call

for ID in range(602, 634):
    wget_download_ddbj(ID, "empirical_9/fastq/")
Here we pass each DRX accession number to the wget_download_ddbj function, which downloads the corresponding .sra file for each sample into the fastq directory.
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_9/fastq/ empirical_9/fastq/*.sra
## remove .sra files
rm empirical_9/fastq/*.sra
Read 10248198 spots for empirical_9/fastq/DRR021959.sra
Written 10248198 spots for empirical_9/fastq/DRR021959.sra
Read 7839034 spots for empirical_9/fastq/DRR021960.sra
Written 7839034 spots for empirical_9/fastq/DRR021960.sra
Read 1132781 spots for empirical_9/fastq/DRR021961.sra
Written 1132781 spots for empirical_9/fastq/DRR021961.sra
Read 2609632 spots for empirical_9/fastq/DRR021962.sra
Written 2609632 spots for empirical_9/fastq/DRR021962.sra
Read 5077478 spots for empirical_9/fastq/DRR021963.sra
Written 5077478 spots for empirical_9/fastq/DRR021963.sra
Read 5173478 spots for empirical_9/fastq/DRR021964.sra
Written 5173478 spots for empirical_9/fastq/DRR021964.sra
Read 2077912 spots for empirical_9/fastq/DRR021965.sra
Written 2077912 spots for empirical_9/fastq/DRR021965.sra
Read 4930613 spots for empirical_9/fastq/DRR021966.sra
Written 4930613 spots for empirical_9/fastq/DRR021966.sra
Read 2494972 spots for empirical_9/fastq/DRR021967.sra
Written 2494972 spots for empirical_9/fastq/DRR021967.sra
Read 5276173 spots for empirical_9/fastq/DRR021968.sra
Written 5276173 spots for empirical_9/fastq/DRR021968.sra
Read 6316234 spots for empirical_9/fastq/DRR021969.sra
Written 6316234 spots for empirical_9/fastq/DRR021969.sra
Read 4871618 spots for empirical_9/fastq/DRR021970.sra
Written 4871618 spots for empirical_9/fastq/DRR021970.sra
Read 105160 spots for empirical_9/fastq/DRR021971.sra
Written 105160 spots for empirical_9/fastq/DRR021971.sra
Read 5521722 spots for empirical_9/fastq/DRR021972.sra
Written 5521722 spots for empirical_9/fastq/DRR021972.sra
Read 874466 spots for empirical_9/fastq/DRR021973.sra
Written 874466 spots for empirical_9/fastq/DRR021973.sra
Read 5376366 spots for empirical_9/fastq/DRR021974.sra
Written 5376366 spots for empirical_9/fastq/DRR021974.sra
Read 4715894 spots for empirical_9/fastq/DRR021975.sra
Written 4715894 spots for empirical_9/fastq/DRR021975.sra
Read 9091972 spots for empirical_9/fastq/DRR021976.sra
Written 9091972 spots for empirical_9/fastq/DRR021976.sra
Read 6609431 spots for empirical_9/fastq/DRR021977.sra
Written 6609431 spots for empirical_9/fastq/DRR021977.sra
Read 5827960 spots for empirical_9/fastq/DRR021978.sra
Written 5827960 spots for empirical_9/fastq/DRR021978.sra
Read 4742805 spots for empirical_9/fastq/DRR021979.sra
Written 4742805 spots for empirical_9/fastq/DRR021979.sra
Read 2231410 spots for empirical_9/fastq/DRR021980.sra
Written 2231410 spots for empirical_9/fastq/DRR021980.sra
Read 166474 spots for empirical_9/fastq/DRR021981.sra
Written 166474 spots for empirical_9/fastq/DRR021981.sra
Read 9263783 spots for empirical_9/fastq/DRR021982.sra
Written 9263783 spots for empirical_9/fastq/DRR021982.sra
Read 7912300 spots for empirical_9/fastq/DRR021983.sra
Written 7912300 spots for empirical_9/fastq/DRR021983.sra
Read 8731338 spots for empirical_9/fastq/DRR021984.sra
Written 8731338 spots for empirical_9/fastq/DRR021984.sra
Read 752632 spots for empirical_9/fastq/DRR021985.sra
Written 752632 spots for empirical_9/fastq/DRR021985.sra
Read 1482551 spots for empirical_9/fastq/DRR021986.sra
Written 1482551 spots for empirical_9/fastq/DRR021986.sra
Read 147164 spots for empirical_9/fastq/DRR021987.sra
Written 147164 spots for empirical_9/fastq/DRR021987.sra
Read 337046 spots for empirical_9/fastq/DRR021988.sra
Written 337046 spots for empirical_9/fastq/DRR021988.sra
Read 129310 spots for empirical_9/fastq/DRR021989.sra
Written 129310 spots for empirical_9/fastq/DRR021989.sra
Read 800590 spots for empirical_9/fastq/DRR021990.sra
Written 800590 spots for empirical_9/fastq/DRR021990.sra
Read 132868497 spots total
Written 132868497 spots total
%%bash
ls -lh empirical_9/fastq/
total 11G
-rw-rw-r-- 1 deren deren 801M Nov 28 18:32 DRR021959.fastq.gz
-rw-rw-r-- 1 deren deren 618M Nov 28 18:32 DRR021960.fastq.gz
-rw-rw-r-- 1 deren deren  91M Nov 28 18:32 DRR021961.fastq.gz
-rw-rw-r-- 1 deren deren 206M Nov 28 18:32 DRR021962.fastq.gz
-rw-rw-r-- 1 deren deren 399M Nov 28 18:32 DRR021963.fastq.gz
-rw-rw-r-- 1 deren deren 408M Nov 28 18:32 DRR021964.fastq.gz
-rw-rw-r-- 1 deren deren 167M Nov 28 18:32 DRR021965.fastq.gz
-rw-rw-r-- 1 deren deren 387M Nov 28 18:32 DRR021966.fastq.gz
-rw-rw-r-- 1 deren deren 197M Nov 28 18:32 DRR021967.fastq.gz
-rw-rw-r-- 1 deren deren 418M Nov 28 18:32 DRR021968.fastq.gz
-rw-rw-r-- 1 deren deren 499M Nov 28 18:32 DRR021969.fastq.gz
-rw-rw-r-- 1 deren deren 386M Nov 28 18:32 DRR021970.fastq.gz
-rw-rw-r-- 1 deren deren 8.4M Nov 28 18:32 DRR021971.fastq.gz
-rw-rw-r-- 1 deren deren 439M Nov 28 18:32 DRR021972.fastq.gz
-rw-rw-r-- 1 deren deren  70M Nov 28 18:32 DRR021973.fastq.gz
-rw-rw-r-- 1 deren deren 434M Nov 28 18:32 DRR021974.fastq.gz
-rw-rw-r-- 1 deren deren 375M Nov 28 18:32 DRR021975.fastq.gz
-rw-rw-r-- 1 deren deren 713M Nov 28 18:32 DRR021976.fastq.gz
-rw-rw-r-- 1 deren deren 520M Nov 28 18:32 DRR021977.fastq.gz
-rw-rw-r-- 1 deren deren 473M Nov 28 18:32 DRR021978.fastq.gz
-rw-rw-r-- 1 deren deren 373M Nov 28 18:32 DRR021979.fastq.gz
-rw-rw-r-- 1 deren deren 179M Nov 28 18:32 DRR021980.fastq.gz
-rw-rw-r-- 1 deren deren  14M Nov 28 18:32 DRR021981.fastq.gz
-rw-rw-r-- 1 deren deren 727M Nov 28 18:32 DRR021982.fastq.gz
-rw-rw-r-- 1 deren deren 625M Nov 28 18:32 DRR021983.fastq.gz
-rw-rw-r-- 1 deren deren 697M Nov 28 18:32 DRR021984.fastq.gz
-rw-rw-r-- 1 deren deren  60M Nov 28 18:32 DRR021985.fastq.gz
-rw-rw-r-- 1 deren deren 118M Nov 28 18:32 DRR021986.fastq.gz
-rw-rw-r-- 1 deren deren  13M Nov 28 18:32 DRR021987.fastq.gz
-rw-rw-r-- 1 deren deren  28M Nov 28 18:32 DRR021988.fastq.gz
-rw-rw-r-- 1 deren deren  11M Nov 28 18:32 DRR021989.fastq.gz
-rw-rw-r-- 1 deren deren  63M Nov 28 18:32 DRR021990.fastq.gz
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## remove old params file if it exists
rm -f params.txt
## create a new default params file
pyrad -n
new params.txt file created
The data here are from Illumina Casava <1.8, so the phred quality scores are offset by 64 instead of 33, which we account for in the params file below.
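Since a wrong offset silently distorts quality filtering, it can be worth sanity-checking a file before setting this parameter. A minimal sketch (not part of the original pipeline; the function name is mine) that guesses the offset from a single fastq quality string:

```python
## guess the phred offset from one quality string: ASCII codes below 59
## only occur under the +33 encoding, while codes above 74 imply +64
def guess_phred_offset(qual):
    codes = [ord(c) for c in qual]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None  ## ambiguous -- inspect more reads

print(guess_phred_offset("IIIHGF55:"))  ## typical phred+33 qualities -> 33
print(guess_phred_offset("ffffeedcb"))  ## typical phred+64 qualities -> 64
```

In practice one would apply this to the quality line of the first few reads in each fastq file.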
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_9/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_9_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_9/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step ==
empirical_9/           ## 1. working directory
./*.fastq.gz           ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes           ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                 ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                  ## 6. cutters
20                     ## 7. N processors
6                      ## 8. Mindepth: min coverage for a cluster             (s4,s5)
6                      ## 9. NQual
.85                    ## 10. clust threshold
rad                    ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                      ## 12. MinCov
10                     ## 13. maxSH
empirical_9_m4         ## 14. output name
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)           (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)        (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                  (s7)
empirical_9/fastq/*.gz ## 18. data location
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)         (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                   (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict  (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)              (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)  (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)        (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)                (s3,s6,s7)
2,2                    ## 29. trim overhang
p,n,s                  ## 30. output formats
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.   (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)   (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)     (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)     (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)  (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
In this data set the reads were uploaded already separated by sample, but with the barcode still attached to the sequences. The Python code below removes the 5 bp barcode from the start of each sequence and quality line.
import glob
import gzip
import os

for infile in glob.glob("empirical_9/fastq/DRR*"):
    iofile = gzip.open(infile, 'rb')
    dire, fil = os.path.split(infile)
    fastq = os.path.splitext(fil)[0]
    outhandle = os.path.join(dire, "T_"+fastq)
    outfile = open(outhandle, 'wb')
    data = iter(iofile)
    store = []
    while 1:
        try:
            line = data.next()
        except StopIteration:
            break
        ## header and "+" separator lines are short; only sequence and
        ## quality lines (>= 80 chars) carry the 5bp barcode to trim
        if len(line) < 80:
            store.append(line)
        else:
            store.append(line[5:])
        ## write to disk in chunks of 10000 lines
        if len(store) == 10000:
            outfile.write("".join(store))
            store = []
    ## flush any remaining lines in the buffer
    outfile.write("".join(store))
    iofile.close()
    outfile.close()
    ! gzip $outhandle
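A quick self-contained check of the trimming rule used above, with a made-up read rather than the real files (the helper name and the example barcode are mine): lines of 80 characters or more lose their first five characters, while shorter lines (headers and "+" separators) pass through unchanged.

```python
## same rule as the loop above: short lines kept, long lines lose the barcode
def trim_line(line, barcode_len=5, cutoff=80):
    return line if len(line) < cutoff else line[barcode_len:]

## hypothetical 95bp read: 5bp barcode, then the TGCAG cut site, then sequence
read = "AACCT" + "TGCAG" + "A" * 85
header = "@DRR021959.1 length=95"

assert trim_line(read) == "TGCAG" + "A" * 85
assert trim_line(header) == header
```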
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_9_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.
The average number of raw reads per sample is 3.89M.
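One way to quantify that relationship is to correlate per-sample raw read counts against final locus recovery. A sketch with hypothetical numbers (not the real stats files):

```python
import numpy as np

## hypothetical per-sample raw read counts and final locus counts
raw_reads = np.array([9.5e6, 5.2e6, 1.1e6, 1.0e5])
nloci = np.array([48000, 46000, 29000, 5200])

## Pearson correlation between input data and locus recovery
r = np.corrcoef(raw_reads, nloci)[0, 1]
print(round(r, 2))
```

With the real data one would pull raw_reads from the s2 stats table and nloci from the final stats output.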
import pandas as pd
import numpy as np
## read in the data
s2dat = pd.read_table("empirical_9/stats/s2.rawedit.txt", header=0, nrows=32)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         32.000000
mean     3888871.375000
std      2913183.441258
min        98285.000000
25%      1000709.500000
50%      4518550.000000
75%      5558420.000000
max      9568142.000000
Name: passed.total, dtype: float64

most raw data in sample:
0    T_DRR021959
Name: sample , dtype: object
pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std of coverages is pretty low in this data set compared to several others.
## read in the s3 results
s9dat = pd.read_table("empirical_9/stats/s3.clusters.txt", header=0, nrows=33)
## print summary stats
print "summary of means\n=================="
print s9dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s9dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s9dat['d>5.tot']/s9dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s9dat["d>5.tot"]/s9dat["total"]).max()
print "\nhighest coverage in sample:"
print s9dat['taxa'][s9dat['d>5.tot']/s9dat["total"]==max_hiprop]
summary of means
==================
count     32.000000
mean      57.046781
std       37.276956
min        5.436000
25%       23.110250
50%       65.631000
75%       77.041000
max      135.646000
Name: dpt.me, dtype: float64

summary of std
==================
count     32.000000
mean     272.580687
std      201.847071
min       16.346000
25%       94.865750
50%      258.018000
75%      399.662500
max      690.205000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    32.000000
mean      0.188730
std       0.161401
min       0.074257
25%       0.085030
50%       0.114366
75%       0.210204
max       0.655746
dtype: float64

highest coverage in sample:
7    T_DRR021966
Name: taxa, dtype: object
maxprop = (s9dat['d>5.tot']/s9dat['total']).max()
print "\nhighest prop coverage in sample:"
print s9dat['taxa'][s9dat['d>5.tot']/s9dat['total']==maxprop]
highest prop coverage in sample:
7    T_DRR021966
Name: taxa, dtype: object
## print mean and std of coverage for the highest coverage sample
with open("empirical_9/clust.85/T_DRR021966.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)

print "Means for sample T_DRR021966"
print depths.mean(), depths.std()
print depths[depths>5].mean(), depths[depths>5].std()
Means for sample T_DRR021966
72.1266330292 439.779736427
77.7021948176 456.619346077
Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for having too low coverage.
import toyplot
import toyplot.svg
import numpy as np

## read in the depth information for this sample
with open("empirical_9/clust.85/T_DRR021966.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)

## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
                   ylabel="N loci",
                   label="dataset9/sample=T_DRR021966")

## select the loci with depth > 5 (kept)
keeps = depths[depths>5]

## plot kept and discarded loci
edat = np.histogram(depths, range(30))  # density=True)
kdat = np.histogram(keeps, range(30))   # density=True)
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_9_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7f91f2120590>
%%bash
cat empirical_9/stats/empirical_9_m4.stats
76963       ## loci with > minsp containing data
76778       ## loci with > minsp containing data & paralogs removed
76778       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
T_DRR021959	40272
T_DRR021960	39500
T_DRR021961	29626
T_DRR021962	38304
T_DRR021963	40251
T_DRR021964	40440
T_DRR021965	37187
T_DRR021966	41059
T_DRR021967	38884
T_DRR021968	42626
T_DRR021969	43490
T_DRR021970	43303
T_DRR021971	5249
T_DRR021972	46589
T_DRR021973	29939
T_DRR021974	47628
T_DRR021975	46705
T_DRR021976	47963
T_DRR021977	47881
T_DRR021978	46731
T_DRR021979	46386
T_DRR021980	41322
T_DRR021981	8509
T_DRR021982	48193
T_DRR021983	47548
T_DRR021984	47715
T_DRR021985	27186
T_DRR021986	36910
T_DRR021987	6636
T_DRR021988	14945
T_DRR021989	6640
T_DRR021990	7699

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-	-
3	-	-
4	8714	*	76778
5	5780	*	68064
6	4454	*	62284
7	3488	*	57830
8	3019	*	54342
9	2568	*	51323
10	2381	*	48755
11	2254	*	46374
12	2153	*	44120
13	2052	*	41967
14	1941	*	39915
15	1933	*	37974
16	1927	*	36041
17	1955	*	34114
18	2016	*	32159
19	2150	*	30143
20	2367	*	27993
21	2645	*	25626
22	3061	*	22981
23	3563	*	19920
24	3957	*	16357
25	4030	*	12400
26	3789	*	8370
27	2673	*	4581
28	1357	*	1908
29	450	*	551
30	91	*	101
31	9	*	10
32	1	*	1

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	967	0	8521	0
1	1685	1685	8479	8479
2	2050	5785	8315	25109
3	2519	13342	7439	47426
4	2935	25082	7082	75754
5	3112	40642	6307	107289
6	3429	61216	5747	141771
7	3555	86101	5016	176883
8	3723	115885	4372	211859
9	3846	150499	3823	246266
10	3664	187139	3017	276436
11	3684	227663	2471	303617
12	3663	271619	1940	326897
13	3536	317587	1360	344577
14	3428	365579	1016	358801
15	3317	415334	705	369376
16	3081	464630	466	376832
17	2976	515222	287	381711
18	2775	565172	175	384861
19	2600	614572	95	386666
20	2356	661692	62	387906
21	2165	707157	36	388662
22	1888	748693	21	389124
23	1679	787310	13	389423
24	1444	821966	7	389591
25	1289	854191	3	389666
26	1099	882765	0	389666
27	890	906795	1	389693
28	763	928159	1	389721
29	609	945820	1	389750
30	511	961150	0	389750
31	370	972620	0	389750
32	293	981996	0	389750
33	272	990972	0	389750
34	186	997296	0	389750
35	132	1001916	0	389750
36	98	1005444	0	389750
37	49	1007257	0	389750
38	55	1009347	0	389750
39	31	1010556	0	389750
40	19	1011316	0	389750
41	18	1012054	0	389750
42	7	1012348	0	389750
43	6	1012606	0	389750
44	1	1012650	0	389750
45	1	1012695	0	389750
46	1	1012741	0	389750
47	0	1012741	0	389750
48	0	1012741	0	389750
49	0	1012741	0	389750
50	0	1012741	0	389750
51	0	1012741	0	389750
52	1	1012793	0	389750
total var= 1012793
total pis= 389750
sampled unlinked SNPs= 75811
sampled unlinked bi-allelic SNPs= 73051
%%bash
head -n 20 empirical_9/stats/empirical_9_m2.stats
119626      ## loci with > minsp containing data
119441      ## loci with > minsp containing data & paralogs removed
119441      ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
T_DRR021959	45933
T_DRR021960	45128
T_DRR021961	33226
T_DRR021962	41997
T_DRR021963	45401
T_DRR021964	45882
T_DRR021965	40633
T_DRR021966	46239
T_DRR021967	43330
T_DRR021968	47995
T_DRR021969	47919
T_DRR021970	47490
%%bash
## run raxml with rapid bootstrapping (100 reps) on the m4 data set
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_9/ \
-n empirical_9_m4 -s empirical_9/outfiles/empirical_9_m4.phy
%%bash
## run raxml with rapid bootstrapping (100 reps) on the m2 data set
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_9/ \
-n empirical_9_m2 -s empirical_9/outfiles/empirical_9_m2.phy
%%bash
head -n 20 empirical_9/RAxML_info.empirical_9_m4
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)

Alignment has 818110 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 54.64%

RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_9/RAxML_info.empirical_9_m2
Plot the tree in R using the ape package.
%load_ext rpy2.ipython
%%R -h 800 -w 800
library(ape)
tre <- read.tree("empirical_9/RAxML_bipartitions.empirical_9_m4")
ltre <- ladderize(tre)
par(mfrow=c(1,2))
plot(ltre, use.edge.length=F)
nodelabels(ltre$node.label)
plot(ltre, type='u')
%%R
mean(cophenetic.phylo(ltre))
[1] 0.04027079