This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set.
### Notebook 6
### Data set 6 (Finches)
### Authors: DaCosta & Sorenson (2016)
### Data Location: SRP059199
Sequence data for this study are archived on the NCBI sequence read archive (SRA). Below I read in SraRunTable.txt for this project which contains all of the information we need to download the data.
%%bash
## make a new directory for this analysis
mkdir -p empirical_6/fastq/
## IPython code
import pandas as pd
import numpy as np
import urllib2
import os
## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_6_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
BioSample_s Library_Name_s MBases_l MBytes_l \ 0 SAMN03753617 90.o31 77 49 1 SAMN03753619 90.o38 84 54 2 SAMN03753618 90.o50 83 55 3 SAMN03753613 90.o53 74 46 4 SAMN03753601 A056 70 44 Organism_s Run_s SRA_Sample_s \ 0 Vidua hypocherina SRR2053224 SRS954896 1 Vidua regia SRR2053225 SRS954895 2 Vidua fischeri SRR2053226 SRS954894 3 Vidua orientalis SRR2053227 SRS954893 4 Lagonosticta senegala rhodopsis SRR2053228 SRS954892 Sample_Name_s isolate_s Assay_Type_s \ 0 Vidua hypocherina, sample 90.o31 90.o31 OTHER 1 Vidua regia, sample 90.o38 90.o38 OTHER 2 Vidua fischeri, sample 90.o50 90.o50 OTHER 3 Vidua orientalis, sample 90.o53 90.o53 OTHER 4 Lagonosticta senegala rhodopsis, sample A056 A056 OTHER ... LoadDate_s Platform_s ReleaseDate_s SRA_Study_s \ 0 ... Jun 05, 2015 ILLUMINA Jun 24, 2015 SRP059199 1 ... Jun 05, 2015 ILLUMINA Jun 24, 2015 SRP059199 2 ... Jun 05, 2015 ILLUMINA Jun 24, 2015 SRP059199 3 ... Jun 05, 2015 ILLUMINA Jun 24, 2015 SRP059199 4 ... Jun 05, 2015 ILLUMINA Jun 24, 2015 SRP059199 age_s g1k_analysis_group_s g1k_pop_code_s sex_s \ 0 after hatch year <not provided> <not provided> male 1 after hatch year <not provided> <not provided> male 2 after hatch year <not provided> <not provided> male 3 after hatch year <not provided> <not provided> male 4 after hatch year <not provided> <not provided> male source_s tissue_s 0 <not provided> muscle 1 <not provided> muscle 2 <not provided> muscle 3 <not provided> muscle 4 <not provided> muscle [5 rows x 27 columns]
def wget_download(SRR, outdir, outname):
""" Python function to get sra data from ncbi and write to
outdir with a new name using bash call wget """
## get output name
output = os.path.join(outdir, outname+".sra")
## create a call string
call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
"ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
"sra/sra-instant/reads/ByRun/sra/SRR/"+\
"{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
## call bash script
! $call
Here we pass the SRR number and the sample name to the wget_download
function so that the files are saved with their sample names.
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
wget_download(SRR, "empirical_6/fastq/", ID)
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_6/fastq/ empirical_6/fastq/*.sra
## remove .sra files
rm empirical_6/fastq/*.sra
Read 889818 spots for empirical_6/fastq/90.o31.sra Written 889818 spots for empirical_6/fastq/90.o31.sra Read 1012887 spots for empirical_6/fastq/90.o38.sra Written 1012887 spots for empirical_6/fastq/90.o38.sra Read 998161 spots for empirical_6/fastq/90.o50.sra Written 998161 spots for empirical_6/fastq/90.o50.sra Read 862619 spots for empirical_6/fastq/90.o53.sra Written 862619 spots for empirical_6/fastq/90.o53.sra Read 792799 spots for empirical_6/fastq/A056.sra Written 792799 spots for empirical_6/fastq/A056.sra Read 833981 spots for empirical_6/fastq/A081.sra Written 833981 spots for empirical_6/fastq/A081.sra Read 809186 spots for empirical_6/fastq/A082.sra Written 809186 spots for empirical_6/fastq/A082.sra Read 939649 spots for empirical_6/fastq/A089.sra Written 939649 spots for empirical_6/fastq/A089.sra Read 1020850 spots for empirical_6/fastq/A107.sra Written 1020850 spots for empirical_6/fastq/A107.sra Read 1001743 spots for empirical_6/fastq/A145.sra Written 1001743 spots for empirical_6/fastq/A145.sra Read 729322 spots for empirical_6/fastq/A147.sra Written 729322 spots for empirical_6/fastq/A147.sra Read 834688 spots for empirical_6/fastq/A167.sra Written 834688 spots for empirical_6/fastq/A167.sra Read 709824 spots for empirical_6/fastq/A177.sra Written 709824 spots for empirical_6/fastq/A177.sra Read 935241 spots for empirical_6/fastq/A1794.sra Written 935241 spots for empirical_6/fastq/A1794.sra Read 898117 spots for empirical_6/fastq/A204.sra Written 898117 spots for empirical_6/fastq/A204.sra Read 992534 spots for empirical_6/fastq/A248.sra Written 992534 spots for empirical_6/fastq/A248.sra Read 1034739 spots for empirical_6/fastq/A328.sra Written 1034739 spots for empirical_6/fastq/A328.sra Read 927116 spots for empirical_6/fastq/JMD1249.sra Written 927116 spots for empirical_6/fastq/JMD1249.sra Read 1031674 spots for empirical_6/fastq/JMD1285.sra Written 1031674 spots for empirical_6/fastq/JMD1285.sra Read 956686 spots for empirical_6/fastq/MDS004.sra Written 956686 spots for empirical_6/fastq/MDS004.sra Read 1220879 spots for empirical_6/fastq/MDS054.sra Written 1220879 spots for empirical_6/fastq/MDS054.sra Read 1064587 spots for empirical_6/fastq/MDS065.sra Written 1064587 spots for empirical_6/fastq/MDS065.sra Read 741675 spots for empirical_6/fastq/NKK511.sra Written 741675 spots for empirical_6/fastq/NKK511.sra Read 1079701 spots for empirical_6/fastq/SAR6894.sra Written 1079701 spots for empirical_6/fastq/SAR6894.sra Read 22318476 spots total Written 22318476 spots total
%%bash
ls -l empirical_6/fastq/
total 1668628 -rw-rw-r-- 1 deren deren 69481962 Nov 23 13:13 90.o31.fastq.gz -rw-rw-r-- 1 deren deren 75367262 Nov 23 13:13 90.o38.fastq.gz -rw-rw-r-- 1 deren deren 76680615 Nov 23 13:13 90.o50.fastq.gz -rw-rw-r-- 1 deren deren 66086600 Nov 23 13:13 90.o53.fastq.gz -rw-rw-r-- 1 deren deren 62055035 Nov 23 13:13 A056.fastq.gz -rw-rw-r-- 1 deren deren 64828556 Nov 23 13:13 A081.fastq.gz -rw-rw-r-- 1 deren deren 63527984 Nov 23 13:13 A082.fastq.gz -rw-rw-r-- 1 deren deren 73811356 Nov 23 13:13 A089.fastq.gz -rw-rw-r-- 1 deren deren 79426948 Nov 23 13:13 A107.fastq.gz -rw-rw-r-- 1 deren deren 78226977 Nov 23 13:13 A145.fastq.gz -rw-rw-r-- 1 deren deren 56867082 Nov 23 13:13 A147.fastq.gz -rw-rw-r-- 1 deren deren 67136950 Nov 23 13:13 A167.fastq.gz -rw-rw-r-- 1 deren deren 54625508 Nov 23 13:13 A177.fastq.gz -rw-rw-r-- 1 deren deren 68861314 Nov 23 13:13 A1794.fastq.gz -rw-rw-r-- 1 deren deren 70348288 Nov 23 13:13 A204.fastq.gz -rw-rw-r-- 1 deren deren 72142052 Nov 23 13:14 A248.fastq.gz -rw-rw-r-- 1 deren deren 73305922 Nov 23 13:14 A328.fastq.gz -rw-rw-r-- 1 deren deren 70371041 Nov 23 13:14 JMD1249.fastq.gz -rw-rw-r-- 1 deren deren 77220081 Nov 23 13:14 JMD1285.fastq.gz -rw-rw-r-- 1 deren deren 76259313 Nov 23 13:14 MDS004.fastq.gz -rw-rw-r-- 1 deren deren 87175055 Nov 23 13:14 MDS054.fastq.gz -rw-rw-r-- 1 deren deren 83991953 Nov 23 13:14 MDS065.fastq.gz -rw-rw-r-- 1 deren deren 57656728 Nov 23 13:14 NKK511.fastq.gz -rw-rw-r-- 1 deren deren 83176262 Nov 23 13:14 SAR6894.fastq.gz
This study includes several technical replicates per sequenced individuals, which we combine into a single file for each individual here.
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## remove old params file if it exists
rm params.txt
## create a new default params file
pyrad -n
new params.txt file created
The data here are from Illumina Casava <1.8, so the phred scores are offset by 64 instead of 33, so we use that in the params file below.
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_6/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\CCTGCAGG,AATTC ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_6_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_6/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step == empirical_6/ ## 1. working directory ./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1) ./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1) vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) CCTGCAGG,AATTC ## 6. cutters 20 ## 7. N processors 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 6 ## 9. NQual .85 ## 10. clust threshold rad ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all) 4 ## 12. MinCov 10 ## 13. maxSH empirical_6_m4 ## 14. output name ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) empirical_6/fastq/*.gz ## 18. data location ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) ## 20.opt.: phred Qscore offset (def= 33) (s2) ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2) ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) 2,2 ## 29. trim overhang p,n,s ## 30. output formats ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2) ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_6_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.
The average number of raw reads per sample is 1.36M.
import pandas as pd
## read in the data
s2dat = pd.read_table("empirical_6/stats/s2.rawedit.txt", header=0, nrows=25)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count 24.000000 mean 609459.416667 std 114564.902701 min 472250.000000 25% 531561.250000 50% 593659.500000 75% 648491.250000 max 961219.000000 Name: passed.total, dtype: float64 most raw data in sample: 18 JMD1285 Name: sample , dtype: object
pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std of coverages is pretty low in this data set compared to several others.
## read in the s3 results
s6dat = pd.read_table("empirical_6/stats/s3.clusters.txt", header=0, nrows=25)
## print summary stats
print "summary of means\n=================="
print s6dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s6dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s6dat['d>5.tot']/s6dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s6dat["d>5.tot"]/s6dat["total"]).max()
print "\nhighest coverage in sample:"
print s6dat['taxa'][s6dat['d>5.tot']/s6dat["total"]==max_hiprop]
summary of means ================== count 24.000000 mean 38.139250 std 5.881965 min 28.454000 25% 33.512000 50% 37.240000 75% 41.636500 max 53.215000 Name: dpt.me, dtype: float64 summary of std ================== count 24.000000 mean 112.613125 std 46.878437 min 70.168000 25% 85.451500 50% 96.007500 75% 121.462750 max 241.616000 Name: dpt.sd, dtype: float64 summary of proportion lowdepth ================== count 24.000000 mean 0.458336 std 0.074790 min 0.308587 25% 0.416855 50% 0.463149 75% 0.526700 max 0.572036 dtype: float64 highest coverage in sample: 11 A167 Name: taxa, dtype: object
import numpy as np
## print mean and std of coverage for the highest coverage sample
with open("empirical_6/clust.85/A167.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
41.6207345663 129.154402626
Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage.
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_6/clust.85/A167.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
ylabel="N loci",
label="dataset6/sample=A167")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_6_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7ffa54764090>
cat empirical_6/stats/empirical_6_m4.stats
13093 ## loci with > minsp containing data 13093 ## loci with > minsp containing data & paralogs removed 12864 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci 90.o31 6706 90.o38 6694 90.o50 6490 90.o53 7132 A056 5731 A081 7098 A082 7229 A089 6918 A107 6255 A145 6608 A147 6797 A167 6015 A177 7009 A1794 4790 A204 6040 A248 6077 A328 5912 JMD1249 7054 JMD1285 6707 MDS004 7271 MDS054 5489 MDS065 6500 NKK511 7033 SAR6894 4845 ## nloci = number of loci with data for exactly ntaxa ## ntotal = number of loci for which at least ntaxa have data ntaxa nloci saved ntotal 1 - 2 - - 3 - - 4 1345 * 12864 5 844 * 11519 6 886 * 10675 7 658 * 9789 8 695 * 9131 9 878 * 8436 10 998 * 7558 11 455 * 6560 12 567 * 6105 13 1175 * 5538 14 847 * 4363 15 268 * 3516 16 205 * 3248 17 209 * 3043 18 223 * 2834 19 232 * 2611 20 281 * 2379 21 361 * 2098 22 498 * 1737 23 674 * 1239 24 565 * 565 ## nvar = number of loci containing n variable sites (pis+autapomorphies). ## sumvar = sum of variable sites (SNPs). ## pis = number of loci containing n parsimony informative sites. ## sumpis = sum of parsimony informative sites. nvar sumvar PIS sumPIS 0 623 0 2713 0 1 903 903 2414 2414 2 1025 2953 1969 6352 3 1095 6238 1452 10708 4 1069 10514 1073 15000 5 1013 15579 757 18785 6 1002 21591 628 22553 7 900 27891 499 26046 8 745 33851 424 29438 9 697 40124 284 31994 10 610 46224 222 34214 11 545 52219 160 35974 12 510 58339 110 37294 13 438 64033 73 38243 14 372 69241 36 38747 15 271 73306 22 39077 16 240 77146 14 39301 17 200 80546 9 39454 18 173 83660 2 39490 19 113 85807 0 39490 20 103 87867 1 39510 21 73 89400 0 39510 22 55 90610 0 39510 23 29 91277 1 39533 24 22 91805 1 39557 25 15 92180 0 39557 26 8 92388 0 39557 27 8 92604 0 39557 28 4 92716 0 39557 29 0 92716 0 39557 30 2 92776 0 39557 31 1 92807 0 39557 total var= 92807 total pis= 39557 sampled unlinked SNPs= 12241 sampled unlinked bi-allelic SNPs= 10885
%%bash
head -n 20 empirical_6/stats/empirical_6_m2.stats
17797 ## loci with > minsp containing data 17797 ## loci with > minsp containing data & paralogs removed 17477 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci 90.o31 6934 90.o38 7055 90.o50 6786 90.o53 7440 A056 6719 A081 7315 A082 7503 A089 7430 A107 6794 A145 7235 A147 7216 A167 7098
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_6/ \
-n empirical_6_m4 -s empirical_6/outfiles/empirical_6_m4.phy
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_6/ \
-n empirical_6_m2 -s empirical_6/outfiles/empirical_6_m2.phy
%%bash
head -n 20 empirical_6/RAxML_info.empirical_6_m4
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014. With greatly appreciated code contributions by: Andre Aberer (HITS) Simon Berger (HITS) Alexey Kozlov (HITS) Nick Pattengale (Sandia) Wayne Pfeiffer (SDSC) Akifumi S. Tanabe (NRIFS) David Dao (KIT) Charlie Taylor (UF) Alignment has 83996 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 50.46% RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_6/RAxML_info.empirical_6_m2
ape
¶%load_ext rpy2.ipython
%%R -h 800 -w 800
library(ape)
tre <- read.tree("empirical_6/RAxML_bipartitions.empirical_6")
ltre <- ladderize(tre)
par(mfrow=c(1,2))
plot(ltre, use.edge.length=F)
nodelabels(ltre$node.label)
plot(ltre, type='u')
%%R
mean(cophenetic.phylo(ltre))
[1] 0.04078121
print pd.DataFrame([indata.Library_Name_s, indata.Organism_s]).T
Library_Name_s Organism_s 0 90.o31 Vidua hypocherina 1 90.o38 Vidua regia 2 90.o50 Vidua fischeri 3 90.o53 Vidua orientalis 4 A056 Lagonosticta senegala rhodopsis 5 A081 Vidua paradisaea 6 A082 Vidua obtusa 7 A089 Vidua macroura 8 A107 Lagonosticta rhodopareia 9 A145 Lagonosticta larvata 10 A147 Vidua macroura 11 A167 Lagonosticta senegala rendalli 12 A177 Vidua chalybeata 13 A1794 Anomalospiza imberbis 14 A204 Lagonosticta rufopicta 15 A248 Lagonosticta rara 16 A328 Lagonosticta sanguinodorsalis 17 JMD1249 Vidua purpurascens 18 JMD1285 Vidua chalybeata 19 MDS004 Vidua raricola 20 MDS054 Clytospiza monteiri 21 MDS065 Lagonosticta rubricata congica 22 NKK511 Vidua interjecta 23 SAR6894 Lagonosticta rubricata rubricata