This is an IPython notebook. Most of the code consists of bash scripts, indicated by %%bash at the top of the cell; the rest is IPython code. This notebook includes code to download, assemble, and analyze a published RADseq data set.
### Notebook 3
### Data set 3 (American oaks)
### Authors: Eaton et al. (2015)
### Data Location: NCBI SRA SRP055977
Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). Below I read in the SraRunTable.txt for this project, which contains all of the information we need to download the data.
%%bash
## make a new directory for this analysis
mkdir -p empirical_3/fastq/
## IPython code
import pandas as pd
import urllib2
import os
## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_3_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
    BioSample_s Library_Name_s  MBases_l  MBytes_l           Organism_s  \
0  SAMN03394519          AR_re       366       240    Quercus arizonica
1  SAMN03394520      BJSL25_re       484       313   Quercus brandegeei
2  SAMN03394521      BJVL19_re       427       275   Quercus brandegeei
3  SAMN03394522          CH_re       411       268  Quercus chrysolepis
4  SAMN03394523     CRL0001_re       339       220     Quercus oleoides

  ReleaseDate_s       Run_s SRA_Sample_s Sample_Name_s  \
0  Mar 13, 2015  SRR1915524    SRS868426         AR_re
1  Mar 13, 2015  SRR1915525    SRS874263     BJSL25_re
2  Mar 13, 2015  SRR1915526    SRS874262     BJVL19_re
3  Mar 13, 2015  SRR1915527    SRS874261         CH_re
4  Mar 13, 2015  SRR1915528    SRS874260    CRL0001_re

                           geo_loc_name_s Assay_Type_s  AssemblyName_s  \
0       USA: UMN Greenhouse, St. Paul, MN          WGS  <not provided>
1  Mexico: Sierra la Laguna, Baja del Sur          WGS  <not provided>
2     Mexico: Valle Perdido, Baja del Sur          WGS  <not provided>
3       USA: UMN Greenhouse, St. Paul, MN          WGS  <not provided>
4  Costa Rica: Rincón de la Vieja National Park     WGS  <not provided>

  BioProject_s BioSampleModel_s            Center_Name_s Consent_s  \
0  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public
1  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public
2  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public
3  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public
4  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public

   InsertSize_l LibraryLayout_s    LoadDate_s Platform_s
0             0          SINGLE  Mar 13, 2015   ILLUMINA ...
1             0          SINGLE  Mar 13, 2015   ILLUMINA ...
2             0          SINGLE  Mar 13, 2015   ILLUMINA ...
3             0          SINGLE  Mar 13, 2015   ILLUMINA ...
4             0          SINGLE  Mar 13, 2015   ILLUMINA ...

[5 rows x 27 columns]
def wget_download(SRR, outdir, outname):
    """ Python function to get sra data from ncbi and write to
    outdir with a new name using bash call wget """
    ## get output name
    output = os.path.join(outdir, outname+".sra")
    ## create a call string
    call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
           "ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
           "sra/sra-instant/reads/ByRun/sra/SRR/"+\
           "{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
    ## call bash script
    ! $call
Here we pass the SRR number and the sample name to the wget_download function so that the files are saved with their sample names.
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
    wget_download(SRR, "empirical_3/fastq/", ID)
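The FTP path built inside `wget_download` nests each run under the first six characters of its accession. A small standalone helper (hypothetical, not part of the original notebook) makes that construction explicit for the legacy sra-instant mirror used at the time:

```python
def sra_ftp_url(srr):
    """Build the legacy sra-instant FTP URL for an SRA run accession.

    Runs are grouped by the first six characters of the accession,
    e.g. SRR1915524 lives under .../SRR/SRR191/SRR1915524/.
    """
    base = ("ftp://ftp-trace.ncbi.nlm.nih.gov/"
            "sra/sra-instant/reads/ByRun/sra/SRR")
    return "{}/{}/{}/{}.sra".format(base, srr[:6], srr, srr)

print(sra_ftp_url("SRR1915524"))
```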
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_3/fastq/ empirical_3/fastq/*.sra
## remove .sra files
rm empirical_3/fastq/*.sra
Read 4046890 spots for empirical_3/fastq/AR_re.sra
Written 4046890 spots for empirical_3/fastq/AR_re.sra
Read 931926 spots for empirical_3/fastq/BJSB3_v.sra
Written 931926 spots for empirical_3/fastq/BJSB3_v.sra
Read 5352627 spots for empirical_3/fastq/BJSL25_re.sra
Written 5352627 spots for empirical_3/fastq/BJSL25_re.sra
Read 969575 spots for empirical_3/fastq/BJSL25_v.sra
Written 969575 spots for empirical_3/fastq/BJSL25_v.sra
Read 4715624 spots for empirical_3/fastq/BJVL19_re.sra
Written 4715624 spots for empirical_3/fastq/BJVL19_re.sra
Read 817443 spots for empirical_3/fastq/BJVL19_v.sra
Written 817443 spots for empirical_3/fastq/BJVL19_v.sra
Read 849191 spots for empirical_3/fastq/BZBB1_v.sra
Written 849191 spots for empirical_3/fastq/BZBB1_v.sra
Read 4539385 spots for empirical_3/fastq/CH_re.sra
Written 4539385 spots for empirical_3/fastq/CH_re.sra
Read 3742953 spots for empirical_3/fastq/CRL0001_re.sra
Written 3742953 spots for empirical_3/fastq/CRL0001_re.sra
Read 1012884 spots for empirical_3/fastq/CRL0001_v.sra
Written 1012884 spots for empirical_3/fastq/CRL0001_v.sra
Read 3242503 spots for empirical_3/fastq/CRL0030_re.sra
Written 3242503 spots for empirical_3/fastq/CRL0030_re.sra
Read 1204952 spots for empirical_3/fastq/CRL0030_v.sra
Written 1204952 spots for empirical_3/fastq/CRL0030_v.sra
Read 570148 spots for empirical_3/fastq/CUCA4_v.sra
Written 570148 spots for empirical_3/fastq/CUCA4_v.sra
Read 90661 spots for empirical_3/fastq/CUMM5_v.sra
Written 90661 spots for empirical_3/fastq/CUMM5_v.sra
Read 775357 spots for empirical_3/fastq/CUSV6_v.sra
Written 775357 spots for empirical_3/fastq/CUSV6_v.sra
Read 1130911 spots for empirical_3/fastq/CUVN10_v.sra
Written 1130911 spots for empirical_3/fastq/CUVN10_v.sra
Read 1953553 spots for empirical_3/fastq/DO_re.sra
Written 1953553 spots for empirical_3/fastq/DO_re.sra
Read 3744449 spots for empirical_3/fastq/DU_re.sra
Written 3744449 spots for empirical_3/fastq/DU_re.sra
Read 743556 spots for empirical_3/fastq/EN_re.sra
Written 743556 spots for empirical_3/fastq/EN_re.sra
Read 3219505 spots for empirical_3/fastq/FLAB109_re.sra
Written 3219505 spots for empirical_3/fastq/FLAB109_re.sra
Read 347949 spots for empirical_3/fastq/FLAB109_v.sra
Written 347949 spots for empirical_3/fastq/FLAB109_v.sra
Read 2826213 spots for empirical_3/fastq/FLBA140_re.sra
Written 2826213 spots for empirical_3/fastq/FLBA140_re.sra
Read 975393 spots for empirical_3/fastq/FLBA140_v.sra
Written 975393 spots for empirical_3/fastq/FLBA140_v.sra
Read 941880 spots for empirical_3/fastq/FLCK18_v.sra
Written 941880 spots for empirical_3/fastq/FLCK18_v.sra
Read 791393 spots for empirical_3/fastq/FLCK216_v.sra
Written 791393 spots for empirical_3/fastq/FLCK216_v.sra
Read 905151 spots for empirical_3/fastq/FLMO62_v.sra
Written 905151 spots for empirical_3/fastq/FLMO62_v.sra
Read 3976506 spots for empirical_3/fastq/FLSA185_v.sra
Written 3976506 spots for empirical_3/fastq/FLSA185_v.sra
Read 857771 spots for empirical_3/fastq/FLSF33_v.sra
Written 857771 spots for empirical_3/fastq/FLSF33_v.sra
Read 908978 spots for empirical_3/fastq/FLSF47_v.sra
Written 908978 spots for empirical_3/fastq/FLSF47_v.sra
Read 2543324 spots for empirical_3/fastq/FLSF54_re.sra
Written 2543324 spots for empirical_3/fastq/FLSF54_re.sra
Read 1115926 spots for empirical_3/fastq/FLSF54_v.sra
Written 1115926 spots for empirical_3/fastq/FLSF54_v.sra
Read 1099920 spots for empirical_3/fastq/FLWO6_v.sra
Written 1099920 spots for empirical_3/fastq/FLWO6_v.sra
Read 2777404 spots for empirical_3/fastq/HE_re.sra
Written 2777404 spots for empirical_3/fastq/HE_re.sra
Read 1112292 spots for empirical_3/fastq/HNDA09_v.sra
Written 1112292 spots for empirical_3/fastq/HNDA09_v.sra
Read 1436228 spots for empirical_3/fastq/LALC2_v.sra
Written 1436228 spots for empirical_3/fastq/LALC2_v.sra
Read 1154841 spots for empirical_3/fastq/MXED8_v.sra
Written 1154841 spots for empirical_3/fastq/MXED8_v.sra
Read 1347187 spots for empirical_3/fastq/MXGT4_v.sra
Written 1347187 spots for empirical_3/fastq/MXGT4_v.sra
Read 1046574 spots for empirical_3/fastq/MXSA3017_v.sra
Written 1046574 spots for empirical_3/fastq/MXSA3017_v.sra
Read 4482353 spots for empirical_3/fastq/NI_re.sra
Written 4482353 spots for empirical_3/fastq/NI_re.sra
Read 505504 spots for empirical_3/fastq/SCCU3_v.sra
Written 505504 spots for empirical_3/fastq/SCCU3_v.sra
Read 879229 spots for empirical_3/fastq/TXGR3_v.sra
Written 879229 spots for empirical_3/fastq/TXGR3_v.sra
Read 934386 spots for empirical_3/fastq/TXMD3_v.sra
Written 934386 spots for empirical_3/fastq/TXMD3_v.sra
Read 163427 spots for empirical_3/fastq/TXWV2_v.sra
Written 163427 spots for empirical_3/fastq/TXWV2_v.sra
Read 76783922 spots total
Written 76783922 spots total
This study includes several re-sequenced individuals. We combine them before beginning the analysis.
## IPython code
import glob

## get sample names from the fastq file names
taxa = [i.split("/")[-1].split('_')[0] for i in glob.glob("empirical_3/fastq/*.gz")]

for taxon in set(taxa):
    if taxa.count(taxon) > 1:
        print taxon, "merged"
        ## merge replicate files
        ! cat empirical_3/fastq/$taxon\_v.fastq.gz \
              empirical_3/fastq/$taxon\_re.fastq.gz \
              > empirical_3/fastq/$taxon\_me.fastq.gz
        ## remove ind replicate files
        ! rm empirical_3/fastq/$taxon\_v.fastq.gz
        ! rm empirical_3/fastq/$taxon\_re.fastq.gz
BJVL19 merged
CRL0001 merged
BJSL25 merged
FLBA140 merged
FLAB109 merged
CRL0030 merged
FLSF54 merged
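Merging replicates with a plain `cat` of the `.fastq.gz` files works because the gzip format allows multiple compressed members in one stream, and decompressors read them back to back. A minimal in-memory demonstration of that property:

```python
import gzip
import io

# two gzip members, standing in for two replicate .fastq.gz files
part1 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
part2 = gzip.compress(b"@read2\nTGCA\n+\nIIII\n")

# byte-for-byte concatenation is exactly what `cat a.gz b.gz > merged.gz` does
merged = part1 + part2

# gzip readers walk through every member in the stream
with gzip.open(io.BytesIO(merged)) as fh:
    assert fh.read() == b"@read1\nACGT\n+\nIIII\n@read2\nTGCA\n+\nIIII\n"
```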
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## remove old params file if it exists
rm -f params.txt
## create a new default params file
pyrad -n
new params.txt file created
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_3/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_3_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_3/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
empirical_3/               ## 1. working directory
./*.fastq.gz               ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes               ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                    ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                     ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                      ## 6. cutters
20                         ## 7. N processors
6                          ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                          ## 9. NQual
.85                        ## 10. clust threshold
rad                        ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                          ## 12. MinCov
10                         ## 13. maxSH
empirical_3_m4             ## 14. output name
==== optional params below this line ===================================  affected step ==
                           ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                           ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                           ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
empirical_3/fastq/*.gz     ## 18. data location
                           ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                           ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                           ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                           ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                           ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                           ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                           ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                           ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                           ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                           ## 28.opt.: random number seed (def. 112233)                 (s3,s6,s7)
2,2                        ## 29. trim overhang
p,n,s                      ## 30. output formats
                           ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                           ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                           ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                           ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                           ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                           ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                           ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
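The `sed` calls above key on the trailing `## N.` comment of each line. The same edit can be sketched in Python with a hypothetical helper, `set_param`, that rewrites any params line whose comment tag matches:

```python
def set_param(lines, tag, value, label):
    """Replace the params line whose comment contains `tag`
    (e.g. '## 12.'), mimicking sed's '/## 12./c\\...' command."""
    newline = "{:<24}{} {} ".format(value, tag, label)
    return [newline if tag in line else line for line in lines]

## a two-line excerpt of a default params.txt
params = ["6                       ## 12. MinCov ",
          "10                      ## 13. maxSH "]

## drop MinCov from 6 to 4, as in the sed command above
params = set_param(params, "## 12.", "4", "MinCov")
print(params[0])
```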
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_3_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of input (raw) data for any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating them.
The average number of raw reads per sample is 1.36M.
## read in the data
s2dat = pd.read_table("empirical_3/stats/s2.rawedit.txt", header=0, nrows=36)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         36.000000
mean     1955564.750000
std      1594388.931033
min        84505.000000
25%       806803.750000
50%      1050559.000000
75%      3425603.250000
max      5829065.000000
Name: passed.total, dtype: float64

most raw data in sample:
2    BJSL25_me
Name: sample , dtype: object
pyrad v.3.0.63 outputs depth information for each sample, which I read in here and plot. First, let's ask which sample has the highest depth of coverage. The standard deviation of coverage is fairly low in this data set compared to several others.
## read in the s3 results
s3dat = pd.read_table("empirical_3/stats/s3.clusters.txt", header=0, nrows=39)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]
summary of means
==================
count    36.00000
mean     17.80300
std      13.02276
min       3.31500
25%       9.02300
50%      11.61800
75%      25.41200
max      51.11900
Name: dpt.me, dtype: float64

summary of std
==================
count     36.000000
mean      70.155611
std       59.520107
min        3.718000
25%       28.191000
50%       47.720500
75%       98.996000
max      211.783000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    36.000000
mean      0.427849
std       0.157876
min       0.211693
25%       0.300563
50%       0.440962
75%       0.497383
max       0.860847
dtype: float64

highest coverage in sample:
0    AR_re
Name: taxa, dtype: object
import numpy as np
## print mean and std of coverage for the highest coverage sample
with open("empirical_3/clust.85/AR_re.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
37.41198396 122.321191848
Green shows the loci that were discarded and orange the loci that were retained. The majority of loci were discarded for having coverage that was too low.
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_3/clust.85/AR_re.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
ylabel="N loci",
label="dataset3/sample=AR_re")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30))  #, density=True
kdat = np.histogram(keeps, range(30))   #, density=True
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_3_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7f5de591dad0>
%%bash
cat empirical_3/stats/empirical_3_m4.stats
90871       ## loci with > minsp containing data
87969       ## loci with > minsp containing data & paralogs removed
87969       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
AR_re	54176
BJSB3_v	38030
BJSL25_me	61596
BJVL19_me	60159
BZBB1_v	35466
CH_re	51058
CRL0001_me	62704
CRL0030_me	60181
CUCA4_v	25875
CUMM5_v	2520
CUSV6_v	34898
CUVN10_v	43516
DO_re	46865
DU_re	52925
EN_re	28342
FLAB109_me	53770
FLBA140_me	59903
FLCK18_v	35834
FLCK216_v	33674
FLMO62_v	37431
FLSA185_v	29650
FLSF33_v	35589
FLSF47_v	37411
FLSF54_me	60404
FLWO6_v	35292
HE_re	42687
HNDA09_v	43262
LALC2_v	48666
MXED8_v	41094
MXGT4_v	47152
MXSA3017_v	40169
NI_re	44506
SCCU3_v	18442
TXGR3_v	34111
TXMD3_v	37451
TXWV2_v	5109

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-	-
3	-	-
4	7085	*	87969
5	5234	*	80884
6	4497	*	75650
7	3579	*	71153
8	3074	*	67574
9	2755	*	64500
10	2629	*	61745
11	2561	*	59116
12	2409	*	56555
13	2434	*	54146
14	2477	*	51712
15	2302	*	49235
16	2464	*	46933
17	2394	*	44469
18	2444	*	42075
19	2517	*	39631
20	2580	*	37114
21	2652	*	34534
22	2775	*	31882
23	2944	*	29107
24	2830	*	26163
25	2960	*	23333
26	3060	*	20373
27	3228	*	17313
28	3093	*	14085
29	3047	*	10992
30	2766	*	7945
31	2285	*	5179
32	1581	*	2894
33	941	*	1313
34	310	*	372
35	59	*	62
36	3	*	3

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	4738	0	14177	0
1	4963	4963	12086	12086
2	5663	16289	11896	35878
3	6227	34970	10819	68335
4	6540	61130	9289	105491
5	6974	96000	8034	145661
6	6661	135966	6292	183413
7	6482	181340	4703	216334
8	6305	231780	3517	244470
9	5622	282378	2420	266250
10	5119	333568	1656	282810
11	4321	381099	1104	294954
12	3815	426879	772	304218
13	3219	468726	507	310809
14	2572	504734	297	314967
15	2151	536999	165	317442
16	1660	563559	107	319154
17	1261	584996	61	320191
18	998	602960	33	320785
19	753	617267	11	320994
20	559	628447	10	321194
21	402	636889	4	321278
22	287	643203	5	321388
23	205	647918	3	321457
24	134	651134	0	321457
25	112	653934	1	321482
26	73	655832	0	321482
27	64	657560	0	321482
28	33	658484	0	321482
29	14	658890	0	321482
30	16	659370	0	321482
31	7	659587	0	321482
32	3	659683	0	321482
33	5	659848	0	321482
34	6	660052	0	321482
35	2	660122	0	321482
36	1	660158	0	321482
37	1	660195	0	321482
38	0	660195	0	321482
39	1	660234	0	321482
total var= 660234
total pis= 321482
sampled unlinked SNPs= 83231
sampled unlinked bi-allelic SNPs= 51854
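The per-taxon locus counts in the stats output can be pulled into pandas for downstream comparison. A sketch using an `io.StringIO` excerpt of the table above in place of the real stats file (the three rows shown are copied from the output):

```python
import io
import pandas as pd

## a short excerpt of the per-taxon table from the stats output
excerpt = io.StringIO(
    "taxon\tnloci\n"
    "AR_re\t54176\n"
    "BJSB3_v\t38030\n"
    "TXWV2_v\t5109\n"
)
nloci = pd.read_table(excerpt, index_col=0)["nloci"]

## summary stats and the sample with the fewest recovered loci
print(nloci.describe())
print(nloci.idxmin())   # -> TXWV2_v
```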
%%bash
head -n 10 empirical_3/stats/empirical_3_m2.stats
145112      ## loci with > minsp containing data
142210      ## loci with > minsp containing data & paralogs removed
142210      ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
AR_re	58912
BJSB3_v	39297
%%bash
## raxml arguments w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_3/ \
-n empirical_3_m4 -s empirical_3/outfiles/empirical_3_m4.phy
%%bash
## raxml arguments w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_3/ \
-n empirical_3_m2 -s empirical_3/outfiles/empirical_3_m2.phy
%%bash
head -n 20 empirical_3/RAxML_info.empirical_3_m4
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)

Alignment has 612971 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 53.74%

RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_3/RAxML_info.empirical_3_m2
head: cannot open ‘empirical_3/RAxML_info.empirical_3_m2’ for reading: No such file or directory
Plot the tree in R using the `ape` package.
%load_ext rpy2.ipython
%%R -w 400 -h 800
library(ape)
tre <- read.tree("empirical_3/RAxML_bipartitions.empirical_3_m4")
ltre <- ladderize(tre)
plot(ltre, edge.width=2)
%%R
mean(cophenetic.phylo(ltre))
[1] 0.01793173
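For reference, R's `mean(cophenetic.phylo(tre))` averages the full pairwise patristic distance matrix, zero diagonal included. A small NumPy sketch of that computation for a hypothetical three-tip tree (not the oak tree above):

```python
import numpy as np

## toy cophenetic (patristic) distances for the tree ((A:1,B:1):1,C:2);
## d(A,B) = 2, d(A,C) = 4, d(B,C) = 4
tips = ["A", "B", "C"]
dists = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}

## fill a full symmetric matrix with zeros on the diagonal
mat = np.zeros((len(tips), len(tips)))
for (a, b), val in dists.items():
    i, j = tips.index(a), tips.index(b)
    mat[i, j] = mat[j, i] = val

## averaging over all entries, diagonal included, matches R's mean();
## for this toy tree the value is 20/9
print(mat.mean())
```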