This is an IPython notebook. Most of the code consists of bash scripts, indicated by %%bash at the top of the cell; the remainder is IPython code. This notebook includes code to download, assemble, and analyze a published RADseq data set.
### Notebook 8
### Data set 8: Barnacles
### Authors: Herrera et al. 2015
### Data Location: SRP051026
Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). Below I read in the SraRunTable.txt for this project, which contains all of the information we need to download the data.
%%bash
## make a new directory for this analysis
mkdir -p empirical_8/fastq/
## IPython code
import pandas as pd
import numpy as np
import urllib2
import os
## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_8_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
    BioSample_s  MBases_l  MBytes_l                   Organism_s  \
0  SAMN03256053        39        32    Ashinkailepas seepiophila
1  SAMN03256054        87        72  Ashinkailepas kermadecensis
2  SAMN03256055        17        14             Leucolepas longa
3  SAMN03256056        78        65        Neobrachylepas relica
4  SAMN03256057         7         6         Neolepas sp. SH-2014

  ReleaseDate_s       Run_s SRA_Sample_s Sample_Name_s    collected_by_s  \
0  Dec 10, 2014  SRR1702480    SRS785645         AsOK3  HOV Shinkai 2000
1  Dec 11, 2014  SRR1702481    SRS785647        AsNiN2    ROV Quest 4000
2  Dec 11, 2014  SRR1702482    SRS785649         VuTO2       HROV Nereus
3  Dec 11, 2014  SRR1702484    SRS785650         bar06       ROV Jason 2
4  Dec 11, 2014  SRR1702485    SRS785651         NeIn2  HOV Shinkai 6500

  collection_date_s ... Platform_s SRA_Study_s         breed_s  \
0              1997 ...   ILLUMINA   SRP051026  not applicable
1              2012 ...   ILLUMINA   SRP051026  not applicable
2              2009 ...   ILLUMINA   SRP051026  not applicable
3              2009 ...   ILLUMINA   SRP051026  not applicable
4              2009 ...   ILLUMINA   SRP051026  not applicable

  g1k_analysis_group_s g1k_pop_code_s identified_by_s       isolate_s  \
0       <not provided> <not provided>      S. Herrera  not applicable
1       <not provided> <not provided>      S. Herrera  not applicable
2       <not provided> <not provided>      S. Herrera  not applicable
3       <not provided> <not provided>      S. Herrera  not applicable
4       <not provided> <not provided>      S. Herrera  not applicable

  isolation_source_s        source_s tissue_s
0  Hydrothermal vent  <not provided>   muscle
1  Hydrothermal vent  <not provided>   muscle
2  Hydrothermal vent  <not provided>   muscle
3  Hydrothermal vent  <not provided>   muscle
4  Hydrothermal vent  <not provided>   muscle

[5 rows x 33 columns]
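As a quick sanity check on the table (a sketch using the column names shown above), we can count the samples and sum the raw data they represent:
## count samples and total raw data listed in the run table
print "N samples:", len(indata)
print "total MBases:", indata["MBases_l"].sum()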
def wget_download(SRR, outdir, outname):
""" Python function to get sra data from ncbi and write to
outdir with a new name using bash call wget """
## get output name
output = os.path.join(outdir, outname+".sra")
## create a call string
call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
"ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
"sra/sra-instant/reads/ByRun/sra/SRR/"+\
"{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
## call bash script
! $call
Here we pass the SRR number and the sample name to the wget_download
function so that the files are saved with their sample names.
for ID, SRR in zip(indata.Sample_Name_s, indata.Run_s):
wget_download(SRR, "empirical_8/fastq/", ID)
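Because wget is called with the -q (quiet) flag, a failed download is easy to miss. A simple check (a sketch, not part of the original analysis) is to confirm that one .sra file was written per row of the run table:
import glob
## confirm that one .sra file was written for each sample
sras = glob.glob("empirical_8/fastq/*.sra")
print "downloaded {} of {} files".format(len(sras), len(indata))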
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_8/fastq/ empirical_8/fastq/*.sra
## remove .sra files
rm empirical_8/fastq/*.sra
Read 1012150 spots for empirical_8/fastq/166_40.sra
Written 1012150 spots for empirical_8/fastq/166_40.sra
Read 1581225 spots for empirical_8/fastq/72638_22.sra
Written 1581225 spots for empirical_8/fastq/72638_22.sra
Read 2024006 spots for empirical_8/fastq/82121_15.sra
Written 2024006 spots for empirical_8/fastq/82121_15.sra
Read 1020530 spots for empirical_8/fastq/AsNiN2.sra
Written 1020530 spots for empirical_8/fastq/AsNiN2.sra
Read 458811 spots for empirical_8/fastq/AsOK3.sra
Written 458811 spots for empirical_8/fastq/AsOK3.sra
Read 909109 spots for empirical_8/fastq/bar06.sra
Written 909109 spots for empirical_8/fastq/bar06.sra
Read 570988 spots for empirical_8/fastq/bar22.sra
Written 570988 spots for empirical_8/fastq/bar22.sra
Read 898436 spots for empirical_8/fastq/barJC6731B1_1.sra
Written 898436 spots for empirical_8/fastq/barJC6731B1_1.sra
Read 84629 spots for empirical_8/fastq/NeIn2.sra
Written 84629 spots for empirical_8/fastq/NeIn2.sra
Read 77842 spots for empirical_8/fastq/NeOg1.sra
Written 77842 spots for empirical_8/fastq/NeOg1.sra
Read 139189 spots for empirical_8/fastq/SEPR_3.sra
Written 139189 spots for empirical_8/fastq/SEPR_3.sra
Read 845743 spots for empirical_8/fastq/VuMaU1.sra
Written 845743 spots for empirical_8/fastq/VuMaU1.sra
Read 208597 spots for empirical_8/fastq/VuTO2.sra
Written 208597 spots for empirical_8/fastq/VuTO2.sra
Read 9831255 spots total
Written 9831255 spots total
%%bash
ls -lh empirical_8/fastq/
total 724M
-rw-rw-r-- 1 deren deren  75M Nov 24 12:49 166_40.fastq.gz
-rw-rw-r-- 1 deren deren 117M Nov 24 12:49 72638_22.fastq.gz
-rw-rw-r-- 1 deren deren 151M Nov 24 12:49 82121_15.fastq.gz
-rw-rw-r-- 1 deren deren  75M Nov 24 12:49 AsNiN2.fastq.gz
-rw-rw-r-- 1 deren deren  34M Nov 24 12:49 AsOK3.fastq.gz
-rw-rw-r-- 1 deren deren  67M Nov 24 12:49 bar06.fastq.gz
-rw-rw-r-- 1 deren deren  42M Nov 24 12:49 bar22.fastq.gz
-rw-rw-r-- 1 deren deren  66M Nov 24 12:49 barJC6731B1_1.fastq.gz
-rw-rw-r-- 1 deren deren 6.3M Nov 24 12:49 NeIn2.fastq.gz
-rw-rw-r-- 1 deren deren 5.8M Nov 24 12:49 NeOg1.fastq.gz
-rw-rw-r-- 1 deren deren  11M Nov 24 12:49 SEPR_3.fastq.gz
-rw-rw-r-- 1 deren deren  63M Nov 24 12:49 VuMaU1.fastq.gz
-rw-rw-r-- 1 deren deren  15M Nov 24 12:49 VuTO2.fastq.gz
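As a spot check (a sketch, not part of the original analysis), the number of reads in a converted file should match the spot count reported by fastq-dump above; each fastq record occupies four lines:
import gzip
## count reads in one converted file (4 lines per record)
with gzip.open("empirical_8/fastq/AsOK3.fastq.gz") as infile:
    nlines = sum(1 for line in infile)
print "AsOK3 reads:", nlines/4   ## expect 458811, the spot count above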
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## remove old params file if it exists
rm -f params.txt
## create a new default params file
pyrad -n
new params.txt file created
The data here are from Illumina Casava <1.8, meaning the phred quality scores are offset by 64 instead of 33. This is controlled by the optional phred offset parameter (param 20) in the params file.
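Param 20 is not modified in the cell below, but if the offset needed to be set explicitly it could be done with the same sed idiom used for the other parameters (a sketch):
%%bash
## set the phred Qscore offset to 64 (param 20; default is 33)
sed -i '/## 20./c\64                        ## 20. phred offset ' params.txt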
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_8/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAGG ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_8_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_8/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
empirical_8/               ## 1. working directory
./*.fastq.gz               ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes               ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                    ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                     ## 5. command (or path) to call muscle                  (s3,s7)
TGCAGG                     ## 6. cutters
20                         ## 7. N processors
6                          ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                          ## 9. NQual
.85                        ## 10. clust threshold
rad                        ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                          ## 12. MinCov
10                         ## 13. maxSH
empirical_8_m4             ## 14. output name
==== optional params below this line ===================================  affected step ==
                           ## 15.opt.: select subset (prefix* only selector)           (s2-s7)
                           ## 16.opt.: add-on (outgroup) taxa (list or prefix*)        (s6,s7)
                           ## 17.opt.: exclude taxa (list or prefix*)                  (s7)
empirical_8/fastq/*.gz     ## 18. data location
                           ## 19.opt.: maxM: N mismatches in barcodes (def= 1)         (s1)
                           ## 20.opt.: phred Qscore offset (def= 33)                   (s2)
                           ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict  (s2)
                           ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated)(s5)
                           ## 23.opt.: maxN: max Ns in a cons seq (def=5)              (s5)
                           ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)  (s5)
                           ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs)(s4,s5)
                           ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)        (s7)
                           ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99)(s3,s7)
                           ## 28.opt.: random number seed (def. 112233)                (s3,s6,s7)
2,2                        ## 29. trim overhang
p,n,s                      ## 30. output formats
                           ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                           ## 32.opt.: keep trimmed reads (def=0). Enter min length.   (s2)
                           ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)   (s3)
                           ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1(s3)
                           ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)     (s6)
                           ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)     (s3,s6)
                           ## 37.opt.: vsearch max threads per job (def.=6; see docs)  (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
%%bash
## run pyrad steps 2-7 (quality filtering through final output)
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_8_m2 ## 14. output name ' params.txt
%%bash
## re-run the final step (7) to build output files with MinCov=2
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationships among the amount of input (raw) data for each sample, the average depth of coverage recovered when reads are clustered, and the phylogenetic distances separating samples.
The average number of raw reads per sample is ~0.76M (9.83M total spots across 13 samples); after quality filtering the mean is ~0.70M reads per sample, summarized below.
import pandas as pd
## read in the data
s2dat = pd.read_table("empirical_8/stats/s2.rawedit.txt", header=0, nrows=42)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         13.000000
mean      702926.615385
std       544381.441228
min        72418.000000
25%       197210.000000
50%       775823.000000
75%       945017.000000
max      1854154.000000
Name: passed.total, dtype: float64

most raw data in sample:
2    82121_15
Name: sample , dtype: object
pyRAD v.3.0.63 outputs depth information for each sample, which I read in here and plot. First let's ask which sample has the highest depth of coverage. The standard deviation of coverage is fairly low in this data set compared to several of the others.
## read in the s3 results
s8dat = pd.read_table("empirical_8/stats/s3.clusters.txt", header=0, nrows=14)
## print summary stats
print "summary of means\n=================="
print s8dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s8dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s8dat['d>5.tot']/s8dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s8dat["d>5.tot"]/s8dat["total"]).max()
print "\nhighest coverage in sample:"
print s8dat['taxa'][s8dat['d>5.tot']/s8dat["total"]==max_hiprop]
summary of means
==================
count    13.000000
mean     33.165923
std      17.043971
min      12.552000
25%      18.726000
50%      35.771000
75%      40.778000
max      67.946000
Name: dpt.me, dtype: float64

summary of std
==================
count     13.000000
mean     109.360538
std       76.550954
min       19.386000
25%       57.572000
50%       94.316000
75%      154.275000
max      256.698000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    13.000000
mean      0.241166
std       0.100536
min       0.147948
25%       0.164743
50%       0.182743
75%       0.310341
max       0.422289
dtype: float64

highest coverage in sample:
2    82121_15
Name: taxa, dtype: object
maxprop =(s8dat['d>5.tot']/s8dat['total']).max()
print "\nhighest prop coverage in sample:"
print s8dat['taxa'][s8dat['d>5.tot']/s8dat['total']==maxprop]
highest prop coverage in sample:
2    82121_15
Name: taxa, dtype: object
import numpy as np
## print mean and std of coverage for the highest coverage sample
with open("empirical_8/clust.85/82121_15.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
print "Means for sample 82121_15"
print depths.mean(), depths.std()
print depths[depths>5].mean(), depths[depths>5].std()
Means for sample 82121_15
67.9461337257 256.697997144
79.3792210284 276.499000416
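The same depths array gives the fraction of clusters falling below the mindepth threshold for this sample (clusters with depth <=5 are excluded at Mindepth=6). A quick check (a sketch reusing the array loaded above):
## proportion of clusters below the depth threshold for 82121_15
print "proportion excluded:", (depths <= 5).mean()
## expect ~0.148, the minimum in the lowdepth summary above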
Green shows the loci that were discarded and orange the loci that were retained. Averaged across samples, about 24% of clusters were discarded for having too low coverage (depth <=5); for this sample (82121_15) the discarded fraction is ~15%, the lowest in the data set.
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_8/clust.85/82121_15.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
ylabel="N loci",
label="dataset8/sample=82121_15")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_8_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7fe4346dd410>
cat empirical_8/stats/empirical_8_m4.stats
15502       ## loci with > minsp containing data
15502       ## loci with > minsp containing data & paralogs removed
15502       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon           nloci
166_40          12509
72638_22        12551
82121_15        12957
AsNiN2          8260
AsOK3           5942
NeIn2           2368
NeOg1           636
SEPR_3          2401
VuMaU1          10889
VuTO2           4745
bar06           2308
bar22           9924
barJC6731B1_1   12071

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa   nloci   saved   ntotal
1       -
2       -       -
3       -       -
4       2793    *       15502
5       2943    *       12709
6       3110    *       9766
7       2763    *       6656
8       2107    *       3893
9       1187    *       1786
10      469     *       599
11      116     *       130
12      14      *       14

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
        nvar    sumvar  PIS     sumPIS
0       356     0       3376    0
1       692     692     3028    3028
2       1043    2778    2719    8466
3       1249    6525    1967    14367
4       1419    12201   1495    20347
5       1502    19711   1012    25407
6       1441    28357   714     29691
7       1356    37849   426     32673
8       1225    47649   319     35225
9       1065    57234   169     36746
10      843     65664   120     37946
11      722     73606   76      38782
12      642     81310   47      39346
13      511     87953   17      39567
14      396     93497   7       39665
15      302     98027   3       39710
16      205     101307  7       39822
17      152     103891  0       39822
18      148     106555  0       39822
19      74      107961  0       39822
20      56      109081  0       39822
21      31      109732  0       39822
22      33      110458  0       39822
23      17      110849  0       39822
24      12      111137  0       39822
25      5       111262  0       39822
26      1       111288  0       39822
27      2       111342  0       39822
28      2       111398  0       39822
total var= 111398
total pis= 39822
sampled unlinked SNPs= 15146
sampled unlinked bi-allelic SNPs= 14830
cat empirical_8/stats/empirical_8_m2.stats
26325       ## loci with > minsp containing data
26325       ## loci with > minsp containing data & paralogs removed
26325       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon           nloci
166_40          14910
72638_22        16286
82121_15        16879
AsNiN2          12110
AsOK3           9262
NeIn2           2676
NeOg1           832
SEPR_3          2729
VuMaU1          12796
VuTO2           5385
bar06           2969
bar22           11523
barJC6731B1_1   14297

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa   nloci   saved   ntotal
1       -
2       7376    *       26325
3       3447    *       18949
4       2793    *       15502
5       2943    *       12709
6       3110    *       9766
7       2763    *       6656
8       2107    *       3893
9       1187    *       1786
10      469     *       599
11      116     *       130
12      14      *       14

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
        nvar    sumvar  PIS     sumPIS
0       3730    0       13784   0
1       2679    2679    3314    3314
2       2320    7319    2799    8912
3       2157    13790   1998    14906
4       2056    22014   1510    20946
5       2029    32159   1015    26021
6       1857    43301   714     30305
7       1653    54872   426     33287
8       1542    67208   319     35839
9       1320    79088   169     37360
10      1061    89698   120     38560
11      921     99829   76      39396
12      805     109489  47      39960
13      632     117705  17      40181
14      457     124103  7       40279
15      320     128903  3       40324
16      221     132439  7       40436
17      167     135278  0       40436
18      154     138050  0       40436
19      80      139570  0       40436
20      58      140730  0       40436
21      33      141423  0       40436
22      34      142171  0       40436
23      17      142562  0       40436
24      12      142850  0       40436
25      5       142975  0       40436
26      1       143001  0       40436
27      2       143055  0       40436
28      2       143111  0       40436
total var= 143111
total pis= 40436
sampled unlinked SNPs= 22595
sampled unlinked bi-allelic SNPs= 22162
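To compare the two assemblies directly, the per-taxon loci counts can be parsed out of both stats files. Below is a sketch that assumes the table layout shown above: a header line starting with "taxon", one taxon per line, and the table ending at a blank or comment line.
import pandas as pd

def taxon_nloci(statsfile):
    """ parse the per-taxon nloci table from a pyRAD stats file """
    counts = {}
    intable = False
    for line in open(statsfile):
        if line.startswith("taxon"):
            intable = True
        elif intable:
            ## the table ends at the first blank or comment line
            if (not line.strip()) or line.startswith("##"):
                break
            taxon, nloci = line.split()
            counts[taxon] = int(nloci)
    return pd.Series(counts)

m4 = taxon_nloci("empirical_8/stats/empirical_8_m4.stats")
m2 = taxon_nloci("empirical_8/stats/empirical_8_m2.stats")
print pd.DataFrame({"m4": m4, "m2": m2, "gain": m2 - m4})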
%%bash
## run raxml with rapid bootstrapping and ML search (-f a),
## GTRGAMMA model, 100 bootstrap replicates, 20 threads; m4 data set
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_8/ \
-n empirical_8_m4 -s empirical_8/outfiles/empirical_8_m4.phy
%%bash
## run raxml with rapid bootstrapping and ML search (-f a),
## GTRGAMMA model, 100 bootstrap replicates, 20 threads; m2 data set
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_8/ \
-n empirical_8_m2 -s empirical_8/outfiles/empirical_8_m2.phy
%%bash
head -n 20 empirical_8/RAxML_info.empirical_8
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)

Alignment has 63494 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 51.86%

RAxML rapid bootstrapping and subsequent ML search
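RAxML reports 51.86% gaps and undetermined characters in this alignment. Roughly the same figure can be recovered from the phylip file itself (a sketch; assumes the one-sequence-per-line phylip layout that pyRAD writes, here for the m4 alignment):
## count missing characters (N and -) in the concatenated alignment
with open("empirical_8/outfiles/empirical_8_m4.phy") as infile:
    infile.readline()                       ## skip the "ntax nchar" header
    seqs = "".join(line.split()[-1] for line in infile if line.strip())
nmiss = seqs.count("N") + seqs.count("-")
print "proportion missing: {:.4f}".format(nmiss / float(len(seqs)))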
Plot the inferred tree in R using the ape package, via the rpy2 IPython extension.
%load_ext rpy2.ipython
%%R -h 800 -w 800
library(ape)
tre <- read.tree("empirical_8/RAxML_bipartitions.empirical_8")
ltre <- ladderize(tre)
par(mfrow=c(1,2))
plot(ltre, use.edge.length=F)
nodelabels(ltre$node.label)
plot(ltre, type='u')
%%R
mean(cophenetic.phylo(ltre))
[1] 0.05226094