This is an Jupyter/IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set, and further code below to analyze missing data in that data set.
### Notebook 2
### Data set 2 (Phrynosomatidae)
### Authors: Leache et al. (2015)
### Data Location: NCBI SRA SRP063316
Sequence data for this study are archived on the NCBI sequence read archive (SRA). Below I read in SraRunTable.txt downloaded from the SRA website for this project. It contains all of the information on the id numbers needed to download data for all samples for this project.
%%bash
## make a new directory for this analysis
mkdir -p empirical_2/fastq/
## IPython code
## import libraries
import pandas as pd
import urllib2
import os
## read in the SRA run table from public github url
## as a pandas data frame
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_2_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
BioSample_s Library_Name_s MBases_l MBytes_l Organism_s \ 0 SAMN04027526 PHCE3 88 58 Phrynosoma cerroense 1 SAMN04027532 PHCN3 67 44 Phrynosoma cornutum 2 SAMN04027531 PHCN2 79 52 Phrynosoma cornutum 3 SAMN04027530 PHCN1 47 31 Phrynosoma cornutum 4 SAMN04027529 PHCE6 45 29 Phrynosoma cerroense Run_s SRA_Sample_s Sample_Name_s \ 0 SRR2240520 SRS1054806 PHCE3 1 SRR2240526 SRS1054807 PHCN3 2 SRR2240525 SRS1054808 PHCN2 3 SRR2240524 SRS1054809 PHCN1 4 SRR2240523 SRS1054810 PHCE6 geo_loc_name_s sex_s \ 0 Mexico:Colonia Guerrero, Baja California male 1 USA:3.4 mi (by gravel road) N Steins, Hidalgo ... not collected 2 USA:Hwy 180, 8.4 mi NW of Hwy 10 (at Deming), ... not collected 3 USA:0.5 mi E Portal, Cochise Co., Arizona not collected 4 Mexico:7.5 mi S Guerrero Negro, Baja Californi... female voucher_s Assay_Type_s AssemblyName_s BioProject_s \ 0 MVZ:Herp:161187 OTHER <not provided> PRJNA294316 1 AMCC:Herp:106073 OTHER <not provided> PRJNA294316 2 CAS:Herp:228870 OTHER <not provided> PRJNA294316 3 MVZ:Herp:238582 OTHER <not provided> PRJNA294316 4 MVZ:Herp:161210 OTHER <not provided> PRJNA294316 BioSampleModel_s Center_Name_s Consent_s InsertSize_l \ 0 Model organism or animal NREMC public 0 1 Model organism or animal NREMC public 0 2 Model organism or animal NREMC public 0 3 Model organism or animal NREMC public 0 4 Model organism or animal NREMC public 0 LibraryLayout_s LoadDate_s 0 SINGLE Sep 04, 2015 ... 1 SINGLE Sep 04, 2015 ... 2 SINGLE Sep 04, 2015 ... 3 SINGLE Sep 04, 2015 ... 4 SINGLE Sep 04, 2015 ... [5 rows x 29 columns]
This is a function to make wget
calls to download data base on SRA IDs
def wget_download(SRR, outdir, outname):
""" Python function to get sra data from ncbi and write to
outdir with a new name using bash call wget """
## get output name
output = os.path.join(outdir, outname+".sra")
## create a call string
call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
"ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
"sra/sra-instant/reads/ByRun/sra/SRR/"+\
"{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
## call bash script
! $call
Here we pass the SRR number and the sample name to the wget_download
function so that the files are saved with their sample names.
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
wget_download(SRR, "empirical_2/fastq/", ID)
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_2/fastq/ empirical_2/fastq/*.sra
## remove .sra files
rm empirical_2/fastq/*.sra
Read 1883604 spots for empirical_2/fastq/CADR1.sra Written 1883604 spots for empirical_2/fastq/CADR1.sra Read 1168210 spots for empirical_2/fastq/CADR2.sra Written 1168210 spots for empirical_2/fastq/CADR2.sra Read 1452471 spots for empirical_2/fastq/COTE1.sra Written 1452471 spots for empirical_2/fastq/COTE1.sra Read 5406187 spots for empirical_2/fastq/GAWI1.sra Written 5406187 spots for empirical_2/fastq/GAWI1.sra Read 699921 spots for empirical_2/fastq/HOMA1.sra Written 699921 spots for empirical_2/fastq/HOMA1.sra Read 2590961 spots for empirical_2/fastq/PETH1.sra Written 2590961 spots for empirical_2/fastq/PETH1.sra Read 1200924 spots for empirical_2/fastq/PHAS1.sra Written 1200924 spots for empirical_2/fastq/PHAS1.sra Read 2698423 spots for empirical_2/fastq/PHAS2.sra Written 2698423 spots for empirical_2/fastq/PHAS2.sra Read 1622179 spots for empirical_2/fastq/PHAS3.sra Written 1622179 spots for empirical_2/fastq/PHAS3.sra Read 1230297 spots for empirical_2/fastq/PHAS4.sra Written 1230297 spots for empirical_2/fastq/PHAS4.sra Read 1723043 spots for empirical_2/fastq/PHBL1.sra Written 1723043 spots for empirical_2/fastq/PHBL1.sra Read 1753039 spots for empirical_2/fastq/PHBL2.sra Written 1753039 spots for empirical_2/fastq/PHBL2.sra Read 1598240 spots for empirical_2/fastq/PHBL3.sra Written 1598240 spots for empirical_2/fastq/PHBL3.sra Read 1744633 spots for empirical_2/fastq/PHBL4.sra Written 1744633 spots for empirical_2/fastq/PHBL4.sra Read 651033 spots for empirical_2/fastq/PHBR1.sra Written 651033 spots for empirical_2/fastq/PHBR1.sra Read 1047774 spots for empirical_2/fastq/PHBR2.sra Written 1047774 spots for empirical_2/fastq/PHBR2.sra Read 2913849 spots for empirical_2/fastq/PHBR3.sra Written 2913849 spots for empirical_2/fastq/PHBR3.sra Read 3075277 spots for empirical_2/fastq/PHBR4.sra Written 3075277 spots for empirical_2/fastq/PHBR4.sra Read 1560169 spots for empirical_2/fastq/PHCE1.sra Written 1560169 spots for empirical_2/fastq/PHCE1.sra Read 1655737 spots for empirical_2/fastq/PHCE2.sra Written 1655737 spots for empirical_2/fastq/PHCE2.sra Read 2066291 spots for empirical_2/fastq/PHCE3.sra Written 2066291 spots for empirical_2/fastq/PHCE3.sra Read 1385481 spots for empirical_2/fastq/PHCE4.sra Written 1385481 spots for empirical_2/fastq/PHCE4.sra Read 2873860 spots for empirical_2/fastq/PHCE5.sra Written 2873860 spots for empirical_2/fastq/PHCE5.sra Read 1068532 spots for empirical_2/fastq/PHCE6.sra Written 1068532 spots for empirical_2/fastq/PHCE6.sra Read 1112293 spots for empirical_2/fastq/PHCN1.sra Written 1112293 spots for empirical_2/fastq/PHCN1.sra Read 1854834 spots for empirical_2/fastq/PHCN2.sra Written 1854834 spots for empirical_2/fastq/PHCN2.sra Read 1572449 spots for empirical_2/fastq/PHCN3.sra Written 1572449 spots for empirical_2/fastq/PHCN3.sra Read 1434649 spots for empirical_2/fastq/PHCN4.sra Written 1434649 spots for empirical_2/fastq/PHCN4.sra Read 755638 spots for empirical_2/fastq/PHCO1.sra Written 755638 spots for empirical_2/fastq/PHCO1.sra Read 1333197 spots for empirical_2/fastq/PHDI1.sra Written 1333197 spots for empirical_2/fastq/PHDI1.sra Read 1461823 spots for empirical_2/fastq/PHDI2.sra Written 1461823 spots for empirical_2/fastq/PHDI2.sra Read 1625284 spots for empirical_2/fastq/PHDO1.sra Written 1625284 spots for empirical_2/fastq/PHDO1.sra Read 2479285 spots for empirical_2/fastq/PHDO2.sra Written 2479285 spots for empirical_2/fastq/PHDO2.sra Read 993112 spots for empirical_2/fastq/PHGO1.sra Written 993112 spots for empirical_2/fastq/PHGO1.sra Read 848505 spots for empirical_2/fastq/PHGO2.sra Written 848505 spots for empirical_2/fastq/PHGO2.sra Read 2298845 spots for empirical_2/fastq/PHGO3.sra Written 2298845 spots for empirical_2/fastq/PHGO3.sra Read 4136233 spots for empirical_2/fastq/PHGO4.sra Written 4136233 spots for empirical_2/fastq/PHGO4.sra Read 1130365 spots for empirical_2/fastq/PHHE1.sra Written 1130365 spots for empirical_2/fastq/PHHE1.sra Read 1828106 spots for empirical_2/fastq/PHHE2.sra Written 1828106 spots for empirical_2/fastq/PHHE2.sra Read 1782490 spots for empirical_2/fastq/PHHE3.sra Written 1782490 spots for empirical_2/fastq/PHHE3.sra Read 2137199 spots for empirical_2/fastq/PHHE4.sra Written 2137199 spots for empirical_2/fastq/PHHE4.sra Read 1326138 spots for empirical_2/fastq/PHHE5.sra Written 1326138 spots for empirical_2/fastq/PHHE5.sra Read 3215345 spots for empirical_2/fastq/PHMC1.sra Written 3215345 spots for empirical_2/fastq/PHMC1.sra Read 1083680 spots for empirical_2/fastq/PHMC2.sra Written 1083680 spots for empirical_2/fastq/PHMC2.sra Read 1119901 spots for empirical_2/fastq/PHMC3.sra Written 1119901 spots for empirical_2/fastq/PHMC3.sra Read 2292044 spots for empirical_2/fastq/PHMC4.sra Written 2292044 spots for empirical_2/fastq/PHMC4.sra Read 1769810 spots for empirical_2/fastq/PHMO1.sra Written 1769810 spots for empirical_2/fastq/PHMO1.sra Read 1780446 spots for empirical_2/fastq/PHMO2.sra Written 1780446 spots for empirical_2/fastq/PHMO2.sra Read 566270 spots for empirical_2/fastq/PHOR1.sra Written 566270 spots for empirical_2/fastq/PHOR1.sra Read 2065124 spots for empirical_2/fastq/PHOR2.sra Written 2065124 spots for empirical_2/fastq/PHOR2.sra Read 2024585 spots for empirical_2/fastq/PHOR3.sra Written 2024585 spots for empirical_2/fastq/PHOR3.sra Read 718036 spots for empirical_2/fastq/PHOR4.sra Written 718036 spots for empirical_2/fastq/PHOR4.sra Read 1307372 spots for empirical_2/fastq/PHPL1.sra Written 1307372 spots for empirical_2/fastq/PHPL1.sra Read 2900231 spots for empirical_2/fastq/PHPL2.sra Written 2900231 spots for empirical_2/fastq/PHPL2.sra Read 1960451 spots for empirical_2/fastq/PHPL3.sra Written 1960451 spots for empirical_2/fastq/PHPL3.sra Read 1154546 spots for empirical_2/fastq/PHSH1.sra Written 1154546 spots for empirical_2/fastq/PHSH1.sra Read 1650407 spots for empirical_2/fastq/PHSH2.sra Written 1650407 spots for empirical_2/fastq/PHSH2.sra Read 940706 spots for empirical_2/fastq/PHSH3.sra Written 940706 spots for empirical_2/fastq/PHSH3.sra Read 814375 spots for empirical_2/fastq/PHSH4.sra Written 814375 spots for empirical_2/fastq/PHSH4.sra Read 978350 spots for empirical_2/fastq/PHSO1.sra Written 978350 spots for empirical_2/fastq/PHSO1.sra Read 1523072 spots for empirical_2/fastq/PHSO2.sra Written 1523072 spots for empirical_2/fastq/PHSO2.sra Read 1668677 spots for empirical_2/fastq/PHSO3.sra Written 1668677 spots for empirical_2/fastq/PHSO3.sra Read 1311410 spots for empirical_2/fastq/PHTA1.sra Written 1311410 spots for empirical_2/fastq/PHTA1.sra Read 1685806 spots for empirical_2/fastq/PHTA2.sra Written 1685806 spots for empirical_2/fastq/PHTA2.sra Read 1740419 spots for empirical_2/fastq/PHTA3.sra Written 1740419 spots for empirical_2/fastq/PHTA3.sra Read 1522268 spots for empirical_2/fastq/PHTA4.sra Written 1522268 spots for empirical_2/fastq/PHTA4.sra Read 1898189 spots for empirical_2/fastq/SCAN1.sra Written 1898189 spots for empirical_2/fastq/SCAN1.sra Read 1742309 spots for empirical_2/fastq/SCGA1.sra Written 1742309 spots for empirical_2/fastq/SCGA1.sra Read 1595531 spots for empirical_2/fastq/SCMA1.sra Written 1595531 spots for empirical_2/fastq/SCMA1.sra Read 1404985 spots for empirical_2/fastq/SCOC1.sra Written 1404985 spots for empirical_2/fastq/SCOC1.sra Read 806846 spots for empirical_2/fastq/UMNO1.sra Written 806846 spots for empirical_2/fastq/UMNO1.sra Read 2923832 spots for empirical_2/fastq/URBI1.sra Written 2923832 spots for empirical_2/fastq/URBI1.sra Read 3465996 spots for empirical_2/fastq/UROR1.sra Written 3465996 spots for empirical_2/fastq/UROR1.sra Read 4818547 spots for empirical_2/fastq/UTST1.sra Written 4818547 spots for empirical_2/fastq/UTST1.sra Read 131630146 spots total Written 131630146 spots total
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## create a new default params file
pyrad -n
new params.txt file created
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_2/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAGG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_2_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_2/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step == empirical_2/ ## 1. working directory ./*.fastq.gz ## 2. Loc. of non-demultiplexed files (if not line 18) (s1) ./*.barcodes ## 3. Loc. of barcode file (if not line 18) (s1) vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) TGCAGG ## 6. cutters 30 ## 7. N processors 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 6 ## 9. NQual .85 ## 10. clust threshold rad ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all) 4 ## 12. MinCov 10 ## 13. maxSH empirical_2_m4 ## 14. output name ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) empirical_2/fastq/*.gz ## 18. data location ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) ## 20.opt.: phred Qscore offset (def= 33) (s2) ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2) ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) 2,2 ## 29. trim overhang p,n,s ## 30. output formats ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2) ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
We assemble the data set at mincov=4 and mincov=2
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_2_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.
The average number of raw reads per sample is 1.5M.
## read in the data
s2dat = pd.read_table("empirical_2/stats/s2.rawedit.txt", header=0, nrows=74)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count 74.000000 mean 1500455.675676 std 753933.061083 min 469134.000000 25% 1023280.500000 50% 1366480.000000 75% 1648418.250000 max 4436366.000000 Name: passed.total, dtype: float64 most raw data in sample: 3 GAWI1 Name: sample , dtype: object
pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std here is the std in means across samples. The std of depths within individuals is much higher.
## read in the s3 results
s3dat = pd.read_table("empirical_2/stats/s3.clusters.txt", header=0, nrows=74)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]
summary of means ================== count 74.000000 mean 38.482919 std 15.210132 min 13.655000 25% 27.747500 50% 36.933500 75% 45.540000 max 97.009000 Name: dpt.me, dtype: float64 summary of std ================== count 74.000000 mean 339.613689 std 304.751343 min 108.546000 25% 172.106500 50% 235.372000 75% 349.702750 max 2090.841000 Name: dpt.sd, dtype: float64 summary of proportion lowdepth ================== count 74.000000 mean 0.716992 std 0.057099 min 0.618293 25% 0.677747 50% 0.702102 75% 0.756614 max 0.894946 dtype: float64 highest coverage in sample: 17 PHBR4 Name: taxa, dtype: object
## print mean and std of coverage for the highest coverage sample
with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
57.1020689655 368.84812359
Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage.
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
ylabel="N loci",
label="dataset2/sample=PHBR4")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_2_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7fd4eee25c10>
cat empirical_2/stats/empirical_2_m4.stats
41086 ## loci with > minsp containing data 41033 ## loci with > minsp containing data & paralogs removed 41011 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci CADR1 2794 CADR2 2368 COTE1 2169 GAWI1 770 HOMA1 1473 PETH1 2020 PHAS1 4764 PHAS2 7858 PHAS3 6246 PHAS4 5945 PHBL1 9814 PHBL2 7591 PHBL3 6895 PHBL4 6951 PHBR1 4095 PHBR2 4235 PHBR3 8615 PHBR4 9529 PHCE1 8592 PHCE2 9942 PHCE3 8272 PHCE4 7031 PHCE5 9280 PHCE6 7123 PHCN1 5035 PHCN2 6544 PHCN3 5789 PHCN4 6183 PHCO1 5207 PHDI1 6054 PHDI2 6145 PHDO1 6949 PHDO2 8862 PHGO1 5427 PHGO2 4957 PHGO3 7282 PHGO4 7148 PHHE1 6596 PHHE2 8473 PHHE3 6519 PHHE4 8101 PHHE5 5674 PHMC1 8894 PHMC2 6284 PHMC3 6505 PHMC4 6808 PHMO1 5078 PHMO2 5500 PHOR1 3478 PHOR2 8272 PHOR3 8218 PHOR4 4160 PHPL1 5890 PHPL2 8079 PHPL3 6535 PHSH1 6346 PHSH2 7586 PHSH3 4994 PHSH4 5671 PHSO1 3819 PHSO2 5372 PHSO3 6552 PHTA1 5481 PHTA2 6787 PHTA3 6527 PHTA4 7901 SCAN1 1165 SCGA1 1383 SCMA1 1489 SCOC1 1651 UMNO1 1099 URBI1 1277 UROR1 1481 UTST1 1861 ## nloci = number of loci with data for exactly ntaxa ## ntotal = number of loci for which at least ntaxa have data ntaxa nloci saved ntotal 1 - 2 - - 3 - - 4 10107 * 41011 5 5693 * 30904 6 4159 * 25211 7 3133 * 21052 8 2510 * 17919 9 1924 * 15409 10 1545 * 13485 11 1310 * 11940 12 1056 * 10630 13 986 * 9574 14 781 * 8588 15 690 * 7807 16 584 * 7117 17 529 * 6533 18 492 * 6004 19 413 * 5512 20 391 * 5099 21 329 * 4708 22 319 * 4379 23 299 * 4060 24 289 * 3761 25 264 * 3472 26 255 * 3208 27 237 * 2953 28 211 * 2716 29 185 * 2505 30 192 * 2320 31 179 * 2128 32 138 * 1949 33 119 * 1811 34 168 * 1692 35 151 * 1524 36 106 * 1373 37 107 * 1267 38 109 * 1160 39 96 * 1051 40 90 * 955 41 85 * 865 42 93 * 780 43 71 * 687 44 49 * 616 45 56 * 567 46 45 * 511 47 47 * 466 48 46 * 419 49 50 * 373 50 42 * 323 51 30 * 281 52 33 * 251 53 31 * 218 54 18 * 187 55 26 * 169 56 16 * 143 57 16 * 127 58 11 * 111 59 15 * 100 60 11 * 85 61 11 * 74 62 8 * 63 63 11 * 55 64 6 * 44 65 10 * 38 66 7 * 28 67 7 * 21 68 6 * 14 69 0 * 8 70 4 * 8 71 2 * 4 72 1 * 2 73 0 * 1 74 1 * 1 ## nvar = number of loci containing n variable sites (pis+autapomorphies). ## sumvar = sum of variable sites (SNPs). ## pis = number of loci containing n parsimony informative sites. ## sumpis = sum of parsimony informative sites. nvar sumvar PIS sumPIS 0 8048 0 16090 0 1 5782 5782 7148 7148 2 4644 15070 5054 17256 3 4018 27124 3757 28527 4 3561 41368 2869 40003 5 3270 57718 2249 51248 6 2957 75460 1492 60200 7 2201 90867 893 66451 8 1702 104483 608 71315 9 1332 116471 377 74708 10 1037 126841 211 76818 11 698 134519 128 78226 12 542 141023 71 79078 13 372 145859 36 79546 14 262 149527 14 79742 15 198 152497 8 79862 16 115 154337 2 79894 17 87 155816 2 79928 18 67 157022 1 79946 19 43 157839 1 79965 20 28 158399 0 79965 21 16 158735 0 79965 22 8 158911 0 79965 23 9 159118 0 79965 24 7 159286 0 79965 25 6 159436 0 79965 26 0 159436 0 79965 27 0 159436 0 79965 28 0 159436 0 79965 29 0 159436 0 79965 30 1 159466 0 79965 total var= 159466 total pis= 79965 sampled unlinked SNPs= 32963 sampled unlinked bi-allelic SNPs= 18577
%%bash
head -n 10 empirical_2/stats/empirical_2_m2.stats
87440 ## loci with > minsp containing data 87387 ## loci with > minsp containing data & paralogs removed 87325 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci CADR1 7329 CADR2 6453
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_2/ \
-n empirical_2_m4 -s empirical_2/outfiles/empirical_2_m4.phy
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_2/ \
-n empirical_2_m2 -s empirical_2/outfiles/empirical_2_m2.phy
%%bash
head -n 20 empirical_2/RAxML_info.empirical_2_m4
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014. With greatly appreciated code contributions by: Andre Aberer (HITS) Simon Berger (HITS) Alexey Kozlov (HITS) Nick Pattengale (Sandia) Wayne Pfeiffer (SDSC) Akifumi S. Tanabe (NRIFS) David Dao (KIT) Charlie Taylor (UF) Alignment has 296560 distinct alignment patterns Proportion of gaps and completely undetermined characters in this alignment: 86.18% RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_2/RAxML_info.empirical_2_m2
ape
¶%load_ext rpy2.ipython
%%R -w 600 -h 800
library(ape)
tre <- read.tree("empirical_2/RAxML_bipartitions.empirical_2")
ltre <- ladderize(tre)
plot(ltre, cex=0.8, edge.width=2)
#nodelabels(ltre$node.label)
%%R
mean(cophenetic.phylo(ltre))
[1] 0.06260378