### Notebook 1
### Data set 1 (Viburnum)
### Language: Bash
### Data Location: NCBI SRA PRJNA299402 & PRJNA299407
%%bash
## make a new directory for this analysis
mkdir -p empirical_1/
mkdir -p empirical_1/halfrun
mkdir -p empirical_1/fullrun
## IPython code
## import Python libraries
import pandas as pd
import numpy as np
import ipyparallel
import urllib2
import glob
import os
Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). The data come from two separate Illumina runs, combined under a single project ID.
The library contains 95 samples. We uploaded the two demultiplexed files for each individual separately, so each sample has 2 files. Below we examine just the first library (the "half" data set) and then both libraries combined (the "full" data set). We analyze 64 samples, since the remaining samples are replicate individuals within species that are part of a separate project.
You can download the data set using the script below:
%%bash
## get the data from the NCBI SRA
for run in $(seq 24 28);
do
wget -q -r -nH --cut-dirs=9 \
ftp://ftp-trace.ncbi.nlm.nih.gov/\
sra/sra-instant/reads/ByRun/sra/SRR/\
SRR191/SRR19155$run/SRR19155$run".sra";
done
%%bash
## convert sra files to fastq using fastq-dump tool
fastq-dump *.sra
## IPython code
## This reads in a table mapping sample names to SRA numbers
## that is hosted on github
## open table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/virentes/master/SraRunTable.txt"
intable = urllib2.urlopen(url)
## make name xfer dictionary
DF = pd.read_table(intable, sep="\t")
D = {DF.Run_s[i]:DF.Library_Name_s[i] for i in DF.index}
## change file names and move to fastq dir/
for fname in glob.glob("*.fastq"):
    ## map the SRA run accession to the sample's library name
    os.rename(fname, "analysis_pyrad/fastq/"+\
                     D[fname.replace(".fastq", "")]+".fq")
%%bash
mkdir -p fastq_combined
## IPython code that makes a bash call w/ (!)
## get all the data from the two libraries and concatenate it
lib1tax = glob.glob("/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.gz")
lib2tax = glob.glob("/home/deren/Documents/Vib_Lib1/fastq_Lib2/*.gz")
## names had to be modified to match
taxa = [i.split("/")[-1].split("_", 1)[1] for i in lib1tax]
for tax in taxa:
    ! cat /home/deren/Documents/Vib_Lib1/fastq_Lib1/Lib1_$tax \
          /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_$tax \
          > /home/deren/Documents/Vib_Lib1/fastq_combined/$tax
cat: /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_lantan_combined_R1_.gz: No such file or directory
cat: /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_propiinquum_combined_R1_.gz: No such file or directory
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## create a new default params file
rm params.txt
pyrad -n
new params.txt file created
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_1/halfrun ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_1_half_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.fastq ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s,a ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step ==
empirical_1/halfrun       ## 1. working directory
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18) (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch) (s3,s6)
muscle                    ## 5. command (or path) to call muscle (s3,s7)
TGCAG                     ## 6. cutters
30                        ## 7. N processors
6                         ## 8. Mindepth: min coverage for a cluster (s4,s5)
6                         ## 9. NQual
.85                       ## 10. clust threshold
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                         ## 12. MinCov
10                        ## 13. maxSH
empirical_1_half_m4       ## 14. output name
==== optional params below this line ===================================  affected step ==
                          ## 15.opt.: select subset (prefix* only selector) (s2-s7)
                          ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)
                          ## 17.opt.: exclude taxa (list or prefix*) (s7)
/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.fastq ## 18. data location
                          ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1)
                          ## 20.opt.: phred Qscore offset (def= 33) (s2)
                          ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2)
                          ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                          ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5)
                          ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5)
                          ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                          ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7)
                          ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                          ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)
2,2                       ## 29. trim overhang
p,n,s,a                   ## 30. output formats
                          ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                          ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2)
                          ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)
                          ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                          ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6)
                          ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6)
                          ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
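The block of sed one-liners above can also be expressed in a few lines of Python. This is only an illustrative sketch (set_params is a hypothetical helper, not part of pyRAD), shown on an invented two-line toy params file:

```python
def set_params(text, updates):
    """Replace whole lines in a params file, keyed on their '## N.' tag.

    Like the sed '/.../c\\' calls above, this replaces the entire line,
    keeping only the new value and the matched tag.
    """
    out = []
    for line in text.splitlines():
        for tag, value in updates.items():
            if tag in line:
                line = "{:<24}{}".format(value, tag)
                break
        out.append(line)
    return "\n".join(out)

## toy two-line params file (not the real file)
params = "0.90     ## 10. clust threshold\n6        ## 12. MinCov"
new = set_params(params, {"## 10.": ".85", "## 12.": "4"})
```

The same dictionary of (tag, value) pairs could then drive both the halfrun and fullrun parameter edits.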
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_1_half_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We added the 'a' option to the output formats to build an ".alleles" file, which will be used later for the mrbayes/bucky analyses.
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_1/fullrun ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_1_full_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\/home/deren/Documents/Vib_Lib1/fastq_combined/*.fastq ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s,a ## 30. output formats ' params.txt
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_1_full_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of raw input data for any two samples, the average coverage they recover when clustered together, and the phylogenetic distance separating them.
The average number of raw reads per sample is 1.37M.
## read in the data
s2dat = pd.read_table("empirical_1/halfrun/stats/s2.rawedit.txt", header=0, nrows=66)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         65.000000
mean     1372359.907692
std       640115.353508
min       324478.000000
25%       961366.000000
50%      1249609.000000
75%      1602036.000000
max      3113934.000000
Name: passed.total, dtype: float64

most raw data in sample:
54    Lib1_sulcatum_D9_MEX_003
Name: sample , dtype: object
The average number of reads is now 2.74M.
## read in the data
s2dat = pd.read_table("empirical_1/fullrun/stats/s2.rawedit.txt", header=0, nrows=66)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         65.000000
mean     2736777.215385
std      1262460.162891
min       643550.000000
25%      1918460.000000
50%      2519904.000000
75%      3215483.000000
max      6183790.000000
Name: passed.total, dtype: float64

most raw data in sample:
20    foetidum_D24_KFC_1942
Name: sample , dtype: object
pyRAD v.3.0.63 outputs depth information for each sample, which we read in and plot here. First let's ask which sample has the highest depth of coverage. The std here is the standard deviation of mean depths across samples; the std of depths within individuals is much higher.
## read in the s3 results
s3dat = pd.read_table("empirical_1/halfrun/stats/s3.clusters.txt", header=0, nrows=66)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
maxdepth = s3dat["d>5.tot"].max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']==maxdepth]
summary of means
==================
count    65.000000
mean     14.869446
std       3.033109
min      10.008000
25%      12.912000
50%      14.197000
75%      16.340000
max      23.558000
Name: dpt.me, dtype: float64

summary of std
==================
count     65.000000
mean      72.607323
std       57.365233
min        9.900000
25%       29.634000
50%       50.903000
75%      103.118000
max      262.167000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    65.000000
mean      0.282502
std       0.031652
min       0.209510
25%       0.271586
50%       0.283439
75%       0.295572
max       0.392641
dtype: float64

highest coverage in sample:
54    Lib1_sulcatum_D9_MEX_003
Name: taxa, dtype: object
## read in the s3 results
s3dat = pd.read_table("empirical_1/fullrun/stats/s3.clusters.txt", header=0, nrows=66)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]
summary of means
==================
count    65.000000
mean     18.535369
std       5.564201
min      10.011000
25%      14.403000
50%      17.576000
75%      21.119000
max      36.199000
Name: dpt.me, dtype: float64

summary of std
==================
count     65.000000
mean     108.366000
std       89.468413
min       11.761000
25%       43.009000
50%       77.229000
75%      157.560000
max      400.431000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    65.000000
mean      0.252426
std       0.041188
min       0.153852
25%       0.229203
50%       0.254315
75%       0.278021
max       0.371032
dtype: float64

highest coverage in sample:
31    lantanoides_D15_Beartown_2
Name: taxa, dtype: object
## print mean and std of coverage for the highest coverage sample
with open("empirical_1/fullrun/clust.85/lantanoides_D15_Beartown_2.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
32.3991013699 199.423257009
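The .depths files read above are plain comma-separated integers, one value per cluster, so the summary reduces to a few numpy calls. A minimal sketch on made-up values (the raw string below stands in for one file's contents):

```python
import numpy as np

## toy stand-in for the contents of a pyrad .depths file
raw = "3,12,7,1,40,9,6,2"
depths = np.array(raw.strip().split(","), dtype=int)

## mean and std of cluster depths, as printed above
mean, std = depths.mean(), depths.std()

## proportion of clusters at or below the mindepth cutoff (d <= 5)
prop_low = (depths <= 5).mean()
```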
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_1/fullrun/clust.85/lantanoides_D15_Beartown_2.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
                   ylabel="N loci",
                   label="dataset1/sample=lantanoides_D15_Beartown_2")
## select the loci with depth > 5 (kept)
keeps = depths[depths > 5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30))
kdat = np.histogram(keeps, range(30))
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_1_full_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7fdc49a9d190>
%%bash
cat empirical_1/halfrun/stats/empirical_1_half_m4.stats
144161  ## loci with > minsp containing data
143948  ## loci with > minsp containing data & paralogs removed
143948  ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon  nloci
Lib1_acerfolium_ELS88  41968
Lib1_acutifolium_DRY3_MEX_006  31990
Lib1_amplificatum_D3_SAN_156003  24211
Lib1_anamensis_C6_PWS_2094  10955
Lib1_beccarii_combinedX  26054
Lib1_betulifolium  23739
Lib1_bitchuense_combined  19268
Lib1_carlesii_D1_BP_001  42273
Lib1_cassinoides_ELS2  25549
Lib1_cinnamomifolium_PWS2105X  29976
Lib1_clemensiae_DRY6_PWS_2135  31519
Lib1_coriaceum_combined  26857
Lib1_cylindricum_DRY1_WC_268  32747
Lib1_davidii_D32_WC_269  37651
Lib1_dentatum_ELS4  40230
Lib1_dilatatum_ELS45  37037
Lib1_erosum_D23_MJDJP_4  28689
Lib1_erubescens_RCW36  22413
Lib1_farreri_RCW21  27313
Lib1_foetens_ERAD10  11437
Lib1_foetidum_D24_KFC_1942  39936
Lib1_formosanum_C7_JH_2007  3797
Lib1_furcatum_combined  21882
Lib1_glaberrimum_D34_PWS_2323  27587
Lib1_grandiflorum_ERAD11_Wendy  12655
Lib1_hanceanum_D4_PWS_2195  28252
Lib1_henryi_D22_WC_272  16184
Lib1_integrifolium_D25_KFC_1946  34016
Lib1_jamesonii_D12_PWS_1636  29455
Lib1_japonicum_D26_WC_273  46474
Lib1_lantana_combined  25166
Lib1_lantanoides_D15_Beartown_2  36966
Lib1_lentago_ELS85  25759
Lib1_lutescens_D35_PWS_2077  31943
Lib1_luzonicum_D27_9M_2005  35934
Lib1_macrocephalum_D2_WC_284  27360
Lib1_muhalla_D28_WC_274  39682
Lib1_nervosum_C4_PWS_2298  22191
Lib1_odoratissimum_combined  17368
Lib1_opulus_D6_WC_250  26534
Lib1_orientale_DRY2_MJD_GEORGIA  36299
Lib1_parvifolium_D29_KFC_1953  46443
Lib1_plicatum_C1_MJDJP_12  21943
Lib1_propinquum_DRY4_WC_276  35837
Lib1_prunifolium_ELS57  29395
Lib1_punctatum_D19_PWS_2097  35392
Lib1_recognitum_AA_1471_83B  27517
Lib1_rhytidophyllum  26189
Lib1_rufidulum_ELS25  22054
Lib1_sambucina_D20_PWS_2100  31266
Lib1_sargentii_RCW19  21166
Lib1_sempervirens_combined  20609
Lib1_setigerum  38816
Lib1_sieboldii_AA_616_6B  15182
Lib1_sulcatum_D9_MEX_003  26087
Lib1_suspensum_C5_MJD_111711  24471
Lib1_sympodiale_D18_KFC_1932  37477
Lib1_taiwanianum_TW1_KFC_1952  25293
Lib1_tashiori_D30_TET_YAH  40603
Lib1_tinus_D33_WC_277  35049
Lib1_triphyllum_D13_PWS_1783  29055
Lib1_urceolatum_MJD_Japan_8  18009
Lib1_veitchii  26094
Lib1_vernicosum_D21_PWS_2123  31675
Lib1_wrightii_D31_MJDJP_1  36417

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa  nloci  saved  ntotal
1  -
2  -  -
3  -  -
4  24191  *  143948
5  15346  *  119757
6  10842  *  104411
7  8551  *  93569
8  6854  *  85018
9  5781  *  78164
10  5066  *  72383
11  4729  *  67317
12  4282  *  62588
13  4017  *  58306
14  3664  *  54289
15  3495  *  50625
16  3397  *  47130
17  3251  *  43733
18  3065  *  40482
19  2990  *  37417
20  2911  *  34427
21  2880  *  31516
22  2732  *  28636
23  2665  *  25904
24  2697  *  23239
25  2444  *  20542
26  2379  *  18098
27  2131  *  15719
28  2087  *  13588
29  1897  *  11501
30  1686  *  9604
31  1503  *  7918
32  1299  *  6415
33  1106  *  5116
34  948  *  4010
35  748  *  3062
36  602  *  2314
37  473  *  1712
38  335  *  1239
39  242  *  904
40  185  *  662
41  122  *  477
42  98  *  355
43  52  *  257
44  47  *  205
45  30  *  158
46  22  *  128
47  15  *  106
48  11  *  91
49  9  *  80
50  10  *  71
51  7  *  61
52  9  *  54
53  6  *  45
54  10  *  39
55  5  *  29
56  4  *  24
57  4  *  20
58  5  *  16
59  3  *  11
60  2  *  8
61  3  *  6
62  1  *  3
63  0  *  2
64  2  *  2

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
n  nvar  sumvar  PIS  sumPIS
0  858  0  15884  0
1  1602  1602  14623  14623
2  2305  6212  13691  42005
3  2953  15071  13104  81317
4  3553  29283  12290  130477
5  4163  50098  11363  187292
6  4625  77848  10732  251684
7  5057  113247  9733  319815
8  5622  158223  8394  386967
9  6085  212988  7230  452037
10  6286  275848  6313  515167
11  6314  345302  5023  570420
12  6641  424994  4204  620868
13  6842  513940  3229  662845
14  6621  606634  2371  696039
15  6566  705124  1787  722844
16  6379  807188  1304  743708
17  6125  911313  925  759433
18  6037  1019979  616  770521
19  5562  1125657  442  778919
20  5251  1230677  250  783919
21  4906  1333703  165  787384
22  4581  1434485  112  789848
23  4113  1529084  61  791251
24  3708  1618076  49  792427
25  3271  1699851  24  793027
26  2950  1776551  8  793235
27  2533  1844942  12  793559
28  2263  1908306  2  793615
29  1992  1966074  1  793644
30  1662  2015934  3  793734
31  1343  2057567  3  793827
32  1144  2094175  0  793827
33  935  2125030  0  793827
34  731  2149884  0  793827
35  556  2169344  0  793827
36  502  2187416  0  793827
37  347  2200255  0  793827
38  295  2211465  0  793827
39  208  2219577  0  793827
40  128  2224697  0  793827
41  98  2228715  0  793827
42  64  2231403  0  793827
43  53  2233682  0  793827
44  46  2235706  0  793827
45  21  2236651  0  793827
46  16  2237387  0  793827
47  10  2237857  0  793827
48  12  2238433  0  793827
49  9  2238874  0  793827
50  0  2238874  0  793827
51  2  2238976  0  793827
52  2  2239080  0  793827
total var= 2239080
total pis= 793827
sampled unlinked SNPs= 143090
sampled unlinked bi-allelic SNPs= 142267
import numpy as np
indat = open("empirical_1/halfrun/stats/empirical_1_half_m4.stats").readlines()
counts = [int(i.strip().split("\t")[1]) for i in indat[8:73]]
print np.mean(counts)
print np.std(counts)
28605.4615385 8731.83305115
import numpy as np
import itertools
indat = open("empirical_1/halfrun/stats/empirical_1_half_m4.stats").readlines()
counts = [i.strip().split("\t") for i in indat[81:142]]
#print counts
ntax = [int(i[0]) for i in counts]
ncounts = [int(i[1]) for i in counts]
tots = list(itertools.chain(*[[i]*n for i,n in zip(ntax, ncounts)]))
print np.mean(tots)
print np.std(tots)
12.9168519187 9.09974422662
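The itertools.chain expression above expands the (ntaxa, nloci) table so each locus contributes one value, which makes the plain mean and std locus-weighted. np.repeat is an equivalent one-call form; the numbers below are toy values, not the real table:

```python
import numpy as np

## toy (ntaxa, nloci) rows: 2 loci share 4 taxa, 1 shares 5, 1 shares 6
ntax = [4, 5, 6]
ncounts = [2, 1, 1]

## expand to one value per locus, like the itertools.chain expression
tots = np.repeat(ntax, ncounts)

## the plain mean of the expanded array equals the ncounts-weighted mean
assert np.isclose(tots.mean(), np.average(ntax, weights=ncounts))
```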
%%bash
cat empirical_1/fullrun/stats/empirical_1_full_m4.stats
199691  ## loci with > minsp containing data
199094  ## loci with > minsp containing data & paralogs removed
199094  ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon  nloci
acerfolium_ELS88  59188
acutifolium_DRY3_MEX_006  52932
amplificatum_D3_SAN_156003  38141
anamensis_C6_PWS_2094  20597
beccarii_combinedX  41837
betulifolium  39533
bitchuense_combined  34338
carlesii_D1_BP_001  58449
cassinoides_ELS2  40767
cinnamomifolium_PWS2105X  47043
clemensiae_DRY6_PWS_2135  45144
coriaceum_combined  43822
cylindricum_DRY1_WC_268  50762
davidii_D32_WC_269  53023
dentatum_ELS4  59443
dilatatum_ELS45  57384
erosum_D23_MJDJP_4  47927
erubescens_RCW36  39132
farreri_RCW21  44604
foetens_ERAD10  21914
foetidum_D24_KFC_1942  60151
formosanum_C7_JH_2007  7638
furcatum_combined  37959
glaberrimum_D34_PWS_2323  43979
grandiflorum_ERAD11_Wendy  24058
hanceanum_D4_PWS_2195  45482
henryi_D22_WC_272  29692
integrifolium_D25_KFC_1946  54254
jamesonii_D12_PWS_1636  48113
japonicum_D26_WC_273  64083
lantana_combined  42060
lantanoides_D15_Beartown_2  52090
lentago_ELS85  42188
lutescens_D35_PWS_2077  48126
luzonicum_D27_9M_2005  54891
macrocephalum_D2_WC_284  43983
muhalla_D28_WC_274  57264
nervosum_C4_PWS_2298  37230
odoratissimum_combined  31480
opulus_D6_WC_250  41578
orientale_DRY2_MJD_GEORGIA  54201
parvifolium_D29_KFC_1953  67338
plicatum_C1_MJDJP_12  37339
propinquum_DRY4_WC_276  51434
prunifolium_ELS57  45913
punctatum_D19_PWS_2097  49862
recognitum_AA_1471_83B  45155
rhytidophyllum  43744
rufidulum_ELS25  36741
sambucina_D20_PWS_2100  47901
sargentii_RCW19  34723
sempervirens_combined  35475
setigerum  58325
sieboldii_AA_616_6B  27714
sulcatum_D9_MEX_003  45500
suspensum_C5_MJD_111711  41110
sympodiale_D18_KFC_1932  54400
taiwanianum_TW1_KFC_1952  41711
tashiori_D30_TET_YAH  58700
tinus_D33_WC_277  50680
triphyllum_D13_PWS_1783  47654
urceolatum_MJD_Japan_8  31220
veitchii  43123
vernicosum_D21_PWS_2123  48356
wrightii_D31_MJDJP_1  55891

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa  nloci  saved  ntotal
1  -
2  -  -
3  -  -
4  40036  *  199094
5  23235  *  159058
6  15581  *  135823
7  11312  *  120242
8  8843  *  108930
9  7183  *  100087
10  5858  *  92904
11  5064  *  87046
12  4440  *  81982
13  4041  *  77542
14  3703  *  73501
15  3402  *  69798
16  3177  *  66396
17  3054  *  63219
18  2789  *  60165
19  2655  *  57376
20  2534  *  54721
21  2391  *  52187
22  2361  *  49796
23  2216  *  47435
24  2123  *  45219
25  2121  *  43096
26  2074  *  40975
27  2049  *  38901
28  2023  *  36852
29  1933  *  34829
30  2020  *  32896
31  1893  *  30876
32  1916  *  28983
33  1907  *  27067
34  1916  *  25160
35  1984  *  23244
36  1857  *  21260
37  1824  *  19403
38  1755  *  17579
39  1764  *  15824
40  1724  *  14060
41  1583  *  12336
42  1567  *  10753
43  1476  *  9186
44  1347  *  7710
45  1206  *  6363
46  1098  *  5157
47  959  *  4059
48  790  *  3100
49  683  *  2310
50  468  *  1627
51  383  *  1159
52  285  *  776
53  201  *  491
54  118  *  290
55  76  *  172
56  43  *  96
57  28  *  53
58  16  *  25
59  4  *  9
60  2  *  5
61  2  *  3
62  1  *  1

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
n  nvar  sumvar  PIS  sumPIS
0  2213  0  26451  0
1  3244  3244  20005  20005
2  4394  12032  17249  54503
3  5198  27626  15381  100646
4  5702  50434  14286  157790
5  6188  81374  13036  222970
6  6480  120254  11956  294706
7  6940  168834  10930  371216
8  7207  226490  10132  452272
9  7473  293747  9442  537250
10  7456  368307  8372  620970
11  7756  453623  7508  703558
12  7885  548243  6655  783418
13  8098  653517  5735  857973
14  7717  761555  4896  926517
15  7699  877040  4029  986952
16  7562  998032  3200  1038152
17  7171  1119939  2719  1084375
18  7081  1247397  2068  1121599
19  6630  1373367  1555  1151144
20  6291  1499187  1143  1174004
21  5930  1623717  763  1190027
22  5459  1743815  565  1202457
23  5189  1863162  386  1211335
24  4862  1979850  245  1217215
25  4552  2093650  164  1221315
26  4223  2203448  94  1223759
27  3946  2309990  54  1225217
28  3610  2411070  29  1226029
29  3397  2509583  18  1226551
30  3150  2604083  11  1226881
31  2772  2690015  9  1227160
32  2500  2770015  3  1227256
33  2322  2846641  2  1227322
34  2040  2916001  3  1227424
35  1717  2976096  0  1227424
36  1482  3029448  0  1227424
37  1185  3073293  0  1227424
38  997  3111179  0  1227424
39  876  3145343  0  1227424
40  642  3171023  0  1227424
41  498  3191441  0  1227424
42  394  3207989  0  1227424
43  260  3219169  0  1227424
44  224  3229025  0  1227424
45  140  3235325  0  1227424
46  124  3241029  0  1227424
47  64  3244037  0  1227424
48  46  3246245  0  1227424
49  39  3248156  0  1227424
50  28  3249556  0  1227424
51  19  3250525  0  1227424
52  8  3250941  0  1227424
53  8  3251365  0  1227424
54  2  3251473  0  1227424
55  1  3251528  0  1227424
56  1  3251584  0  1227424
57  1  3251641  0  1227424
58  1  3251699  0  1227424
total var= 3251699
total pis= 1227424
sampled unlinked SNPs= 196881
sampled unlinked bi-allelic SNPs= 195117
import numpy as np
indat = open("empirical_1/fullrun/stats/empirical_1_full_m4.stats").readlines()
counts = [int(i.strip().split("\t")[1]) for i in indat[8:73]]
print np.mean(counts)
print np.std(counts)
44638.7619048 11037.958308
import numpy as np
import itertools
indat = open("empirical_1/fullrun/stats/empirical_1_full_m4.stats").readlines()
counts = [i.strip().split("\t") for i in indat[81:140]]
#print counts
ntax = [int(i[0]) for i in counts]
ncounts = [int(i[1]) for i in counts]
tots = list(itertools.chain(*[[i]*n for i,n in zip(ntax, ncounts)]))
print np.mean(tots)
print np.std(tots)
14.6488040825 12.690632427
%%bash
## raxml rapid bootstrapping + ML search (-f a) on the halfrun data
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 35 \
-w /home/deren/Documents/RADmissing/empirical_1/halfrun \
-n empirical_1_halfrun -s empirical_1/halfrun/outfiles/empirical_1_half_m4.phy \
-o "Lib1_clemensiae_DRY6_PWS_2135"
%%bash
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 35 \
-w /home/deren/Documents/RADmissing/empirical_1/fullrun \
-n empirical_1_fullrun -s empirical_1/fullrun/outfiles/empirical_1_full_m4.phy \
-o "clemensiae_DRY6_PWS_2135"
%%bash
head -n 20 empirical_1/halfrun/RAxML_info.empirical_1_half_m4
This is RAxML version 8.0.0 released by Alexandros Stamatakis on Dec 13 2013.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 1010208 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 80.36%

RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_1/fullrun/RAxML_info.empirical_1_full_m4
This is RAxML version 8.0.0 released by Alexandros Stamatakis on Dec 13 2013.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 1035341 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 77.64%

RAxML rapid bootstrapping and subsequent ML search
### ape
%load_ext rpy2.ipython
%%R -w 600 -h 1000
library(ape)
tre_half <- read.tree("empirical_1/halfrun/RAxML_bipartitions.empirical_1_halfrun")
#rtre <- root(tre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
#rtre <- root(rtre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
ltre_half <- ladderize(tre_half)
plot(ltre_half, cex=0.8, edge.width=2)
nodelabels(ltre_half$node.label)
%%R -w 600 -h 1000
library(ape)
svg("outtree.svg", height=11, width=8)
tre_full <- read.tree("empirical_1/fullrun/RAxML_bipartitions.empirical_1_fullrun")
#rtre <- root(tre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
#rtre <- root(rtre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
ltre_full <- ladderize(tre_full)
plot(ltre_full, cex=0.8, edge.width=3)
#nodelabels(ltre_full$node.label)
dev.off()
png 2
The functions nexmake and subsample are used to split the .loci file into individual nexus files for each locus within a new directory. Each nexus file is given a mrbayes command block to run. We then run the bucky tool mbsum to summarize the mrbayes output, and finally run bucky to infer concordance trees from the posterior distributions of trees across all loci.
Loci are selected on the basis that they have coverage across all tips of the selected subtree and that they contain at least 1 SNP.
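As a minimal sketch of that selection rule (the taxon names and sequences below are invented; the real filtering is done by the parseloci function):

```python
required = {"taxA", "taxB", "taxC"}

## one toy locus as a {name: sequence} mapping
locus = {"taxA": "ACGT", "taxB": "ACGA", "taxC": "ACGT", "taxD": "ACGT"}

def keep_locus(locus, required):
    ## coverage: every required tip must have data at this locus
    if not required.issubset(locus):
        return False
    ## variability: at least one site differs among the required tips
    seqs = [locus[t] for t in sorted(required)]
    return any(len(set(col)) > 1 for col in zip(*seqs))
```

Here the locus passes: all three required tips are present and the last site is variable among them.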
## imports used by the functions below
import numpy as np
from collections import Counter

def nexmake(taxadict, loc, nexdir, trim):
    outloc = open(nexdir+"/"+str(loc)+".nex", 'w')
    header = """
#NEXUS
begin data;
dimensions ntax={} nchar={};
format datatype=dna interleave=yes missing=N gap=-;
matrix
""".format(len(taxadict), len(taxadict.values()[0]))
    outloc.write(header)
    for tax, seq in taxadict.items():
        outloc.write("{}{}{}\n"\
                     .format(tax[trim:trim+9],
                             " "*(10-len(tax[0:9])),
                             "".join(seq)))
    mbstring = """
;
end;
begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma;
mcmc ngen=2200000 samplefreq=2000;
sump burnin=200000;
sumt burnin=200000;
end;
"""
    outloc.write(mbstring)
    outloc.close()
def unstruct(amb):
    " returns bases from ambiguity code"
    D = {"R": ["G", "A"],
         "K": ["G", "T"],
         "S": ["G", "C"],
         "Y": ["T", "C"],
         "W": ["T", "A"],
         "M": ["C", "A"]}
    if amb in D:
        return D.get(amb)
    else:
        return [amb, amb]
def resolveambig(subseq):
    N = []
    for col in subseq:
        N.append([unstruct(i)[np.random.binomial(1, 0.5)] for i in col])
    return np.array(N)
def newPIS(seqsamp):
    counts = [Counter(col) for col in seqsamp.T if not ("-" in col or "N" in col)]
    pis = [i.most_common(2)[1][1] > 1 for i in counts if len(i.most_common(2)) > 1]
    if sum(pis) >= 2:
        return sum(pis)
    else:
        return 0
def parseloci(iloci, taxadict, nexdir, trim=0):
    nloc = 0
    ## create subsampled data set
    for loc in iloci:
        ## if all tip samples have data in this locus
        names = [line.split()[0] for line in loc.split("\n")[:-1]]
        ## check that locus has required samples for each subtree
        if all([i in names for i in taxadict.values()]):
            seqs = np.array([list(line.split()[1]) for line in loc.split("\n")[:-1]])
            seqsamp = seqs[[names.index(tax) for tax in taxadict.values()]]
            seqsamp = resolveambig(seqsamp)
            pis = newPIS(seqsamp)
            if pis:
                nloc += 1
                ## remove invariable columns given this subsampling
                keep = []
                for n, col in enumerate(seqsamp.T):
                    if all([i not in ["N", "-"] for i in col]):
                        keep.append(n)
                subseq = seqsamp.T[keep].T
                ## write to a nexus file
                nexdict = dict(zip(taxadict.keys(), [i.tostring() for i in subseq]))
                nexmake(nexdict, nloc, nexdir, trim)
    print nloc, 'loci kept'
def getloci(locifile):
    ## parse the loci file by newline characters
    lines = open(locifile).readlines()
    ## add "|" to the end of lines that contain "|"
    for idx in range(len(lines)):
        if "|" in lines[idx]:
            lines[idx] = lines[idx].strip()+"|\n"
    ## join lines back together into one large string
    locistr = "".join(lines)
    ## break the string into loci at the "|\n" character
    loci = locistr.split("|\n")[:-1]
    ## how many loci?
    print len(loci), "loci"
    return loci
## run on both files
loci_full = getloci("empirical_1/fullrun/outfiles/empirical_1_full_m4.loci")
loci_half = getloci("empirical_1/halfrun/outfiles/empirical_1_half_m4.loci")
199094 loci
143948 loci
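A toy illustration of the split performed by getloci (the two-locus string below is invented, in the spirit of pyRAD's .loci format, where sample lines are followed by a "//" separator line ending in "|"):

```python
## toy stand-in for a .loci file with two loci
locitext = ("taxA     ACGT\n"
            "taxB     ACGA\n"
            "//          -*|\n"
            "taxA     TTTT\n"
            "taxB     TTTA\n"
            "//          -*|\n")

## the same split getloci performs after normalizing line endings
loci = locitext.split("|\n")[:-1]
```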
parseloci(loci_full[:], deep_dict_f, "deep_dict_full", 0)
parseloci(loci_half[:], deep_dict_h, "deep_dict_half", 0)
#parseloci(loci[:], shallow_dict, "shallow_dict", 0)
744 loci kept
71 loci kept
## create a parallel client
ipclient = ipyparallel.Client()
lview = ipclient.load_balanced_view()
## function to call mb on an engine
def mrbayes(infile):
    import subprocess
    cmd = "mb %s" % infile
    subprocess.check_call(cmd, shell=True)

## submit all nexus files to run mb
allnex = glob.glob("deep_dict_full/*.nex")
for nex in allnex:
    lview.apply(mrbayes, nex)
ipclient.wait_interactive()
744/744 tasks finished after 3759 s done
### mbsum
def mbsum(nexdir, nloci):
    import subprocess
    ## combine trees from the two replicate runs
    for n in range(1, nloci+1):
        cmd = "mbsum -n 101 -o {}{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
              format(nexdir, n, nexdir, n, nexdir, n)
        subprocess.check_call(cmd, shell=True)
import os
import numpy as np
from collections import Counter
def subsample(infile, requires, outgroup, nexdir, trim):
    """ sample n taxa from infile to create nex file"""
    ## counter
    loc = 0
    ## create output directory
    if not os.path.exists(nexdir):
        os.mkdir(nexdir)
    ## input .alleles file
    loci = open(infile, 'r').read().strip().split("//")
    ## create a dictionary of {names:seqs}
    for locus in xrange(len(loci)):
        locnex = [""]*len(requires)
        for line in loci[locus].strip().split("\n"):
            tax = line.split()[0]
            seq = line.split()[-1]
            if ">" in tax:
                if tax in requires:
                    locnex[requires.index(tax)] = seq
        ## if all tips have data for this locus
        if len([i for i in locnex if i]) == len(requires):
            ## if locus is variable:
            ## count how many times each char occurs in each site
            ccs = [Counter(i) for i in np.array([list(i) for i in locnex]).T]
            ## remove N and - characters and the first most occurring base
            for i in ccs:
                del i['-']
                del i['N']
                if i:
                    del i[i.most_common()[0][0]]
            ## is anything left occurring more than once (minor allele=ma)?
            ma = max([max(i.values()) if i else 0 for i in ccs])
            if ma > 1:
                nexmake(requires, locnex, loc, outgroup, nexdir, trim)
                loc += 1
    return loc
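The Counter logic inside subsample asks whether, after removing gaps, Ns, and the major allele, any base still occurs more than once, i.e. the minor allele is observed at least twice. A standalone toy version of that per-site test (the column strings below are invented):

```python
from collections import Counter

def minor_allele_count(column):
    counts = Counter(column)
    ## Counter deletion is a no-op for absent keys: drop gaps and Ns
    del counts["-"]
    del counts["N"]
    if not counts:
        return 0
    ## drop the most frequent (major) allele
    del counts[counts.most_common(1)[0][0]]
    return max(counts.values()) if counts else 0
```

A site is counted as carrying a shared polymorphism when this value exceeds 1, matching the `ma > 1` check above.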
## inputs
requires = [">triphyllum_D13_PWS_1783_0",
">jamesonii_D12_PWS_1636_0",
">sulcatum_D9_MEX_003_0",
">acutifolium_DRY3_MEX_006_0",
">dentatum_ELS4_0",
">recognitum_AA_1471_83B_0"]
outgroup = ""
infile = "empirical_1/fullrun/outfiles/empirical_1_full_m4.alleles"
nexdir = "nex_files1"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=1)
print nloci
3300
## inputs
requires = [">Lib1_triphyllum_D13_PWS_1783_0",
">Lib1_jamesonii_D12_PWS_1636_0",
">Lib1_sulcatum_D9_MEX_003_0",
">Lib1_acutifolium_DRY3_MEX_006_0",
">Lib1_dentatum_ELS4_0",
">Lib1_recognitum_AA_1471_83B_0"]
outgroup = ""
infile = "empirical_1/halfrun/outfiles/empirical_1_half_m4.alleles"
nexdir = "nex_files2"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=6)
print nloci
364
## inputs
requires = [">clemensiae_DRY6_PWS_2135_0",
">tinus_D33_WC_277_0",
">taiwanianum_TW1_KFC_1952_0",
">lantanoides_D15_Beartown_2_0",
">amplificatum_D3_SAN_156003_0",
">lutescens_D35_PWS_2077_0",
">lentago_ELS85_0",
">dentatum_ELS4_0"]
outgroup = ""
infile = "empirical_1/fullrun/outfiles/empirical_1_full_m4.alleles"
nexdir = "nex_files5"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=1)
print nloci
1203
## inputs
requires = [">Lib1_clemensiae_DRY6_PWS_2135_0",
">Lib1_tinus_D33_WC_277_0",
">Lib1_taiwanianum_TW1_KFC_1952_0",
">Lib1_lantanoides_D15_Beartown_2_0",
">Lib1_amplificatum_D3_SAN_156003_0",
">Lib1_lutescens_D35_PWS_2077_0",
">Lib1_lentago_ELS85_0",
">Lib1_dentatum_ELS4_0"]
outgroup = ""
infile = "empirical_1/halfrun/outfiles/empirical_1_half_m4.alleles"
nexdir = "nex_files6"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=6)
print nloci
106
import ipyparallel
import subprocess
import glob
## create a parallel client
ipclient = ipyparallel.Client()
lview = ipclient.load_balanced_view()
## function to run mrbayes; will be mapped across all engines
def mrbayes(infile):
    import subprocess
    cmd = "mb %s" % infile
    subprocess.check_call(cmd, shell=True)
## run on the full data set (nex_files1, 6-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files1/*"))
_ = res.get()
## run on the half data set (nex_files2, 6-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files2/*"))
_ = res.get()
## run on the full data set (nex_files3, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files3/*"))
_ = res.get()
## run on the half data set (nex_files4, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files4/*"))
_ = res.get()
## run on the full data set (nex_files5, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files5/*"))
_ = res.get()
## run on the half data set (nex_files6, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files6/*"))
_ = res.get()
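Before summarizing, it can be worth checking that every locus actually produced both replicate tree files (a MrBayes run that died would otherwise make the later mbsum step fail midway). This helper is my own addition, not part of the original notebook; the filename pattern matches the mbsum command used below:

```python
import os

def check_runs(nexdir, nloci):
    """Hypothetical helper: return indices of loci missing either
    replicate .t file, e.g. check_runs("nex_files1/", 3300)."""
    missing = []
    for n in range(nloci):
        for rep in (1, 2):
            fname = os.path.join(nexdir, "{}.nex.run{}.t".format(n, rep))
            if not os.path.exists(fname):
                missing.append(n)
                break
    return missing
```

An empty return value means all loci are ready for mbsum.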
import os
import subprocess

def mbsum(nexdir, nloci):
    ## create dir for bucky input files
    insdir = os.path.join(nexdir, "ins")
    if not os.path.exists(insdir):
        os.mkdir(insdir)
    ## combine trees from the two replicate runs
    for n in range(nloci):
        cmd = "mbsum -n 101 -o {}/{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
              format(insdir, n, nexdir, n, nexdir, n)
        subprocess.check_call(cmd, shell=True)
#mbsum("nex_files1/", 3300)
#mbsum("nex_files2/", 364)
#mbsum("nex_files3/", 1692)
#mbsum("nex_files4/", 169)
mbsum("nex_files5/", 1203)
mbsum("nex_files6/", 106)
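For reference, the first command that mbsum() issues for the full 8-taxon set looks like this (string construction only; as I understand the mbsum CLI, -n 101 sets the number of initial trees to skip from each run file as burn-in):

```python
## same format string as in mbsum() above, filled for locus 0
insdir, nexdir, n = "nex_files5/ins", "nex_files5/", 0
cmd = "mbsum -n 101 -o {}/{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
      format(insdir, n, nexdir, n, nexdir, n)
print(cmd)
# mbsum -n 101 -o nex_files5/ins/0.in nex_files5/0.nex.run1.t nex_files5/0.nex.run2.t
```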
args = []
for insdir in ["nex_files5/ins", "nex_files6/ins"]:
    ## independence test
    args.append("bucky --use-independence-prior -k 4 -n 500000 "
                "-o {}/BUCKY.ind {}/*.in".format(insdir, insdir))
    ## alpha at three levels
    for alpha in [0.1, 1, 10]:
        args.append("bucky -a {} -k 4 -n 500000 -c 4 -o {}/BUCKY.{} {}/*.in".\
                    format(alpha, insdir, alpha, insdir))

def bucky(arg):
    import subprocess
    subprocess.check_call(arg, shell=True)
    return arg
res = lview.map_async(bucky, args)
res.get()
['bucky --use-independence-prior -k 4 -n 500000 -o nex_files5/ins/BUCKY.ind nex_files5/ins/*.in',
 'bucky -a 0.1 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.0.1 nex_files5/ins/*.in',
 'bucky -a 1 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.1 nex_files5/ins/*.in',
 'bucky -a 10 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.10 nex_files5/ins/*.in',
 'bucky --use-independence-prior -k 4 -n 500000 -o nex_files6/ins/BUCKY.ind nex_files6/ins/*.in',
 'bucky -a 0.1 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.0.1 nex_files6/ins/*.in',
 'bucky -a 1 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.1 nex_files6/ins/*.in',
 'bucky -a 10 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.10 nex_files6/ins/*.in']
del lview
ipclient.close()
%%bash
head -n 40 nex_files1/ins/BUCKY.0.1.concordance
head: cannot open ‘nex_files1/ins/BUCKY.0.1.concordance’ for reading: No such file or directory
%%bash
head -n 40 nex_files1/ins/BUCKY.1.concordance
translate
 1 triphyllu,
 2 jamesonii,
 3 sulcatum_,
 4 acutifoli,
 5 dentatum_,
 6 recognitu;

Population Tree: (((1,2),(3,4)),5,6);
Primary Concordance Tree Topology: (((1,2),(3,4)),5,6);
Population Tree, With Branch Lengths In Estimated Coalescent Units: (((1:10.000,2:10.000):0.478,(3:10.000,4:10.000):0.156):0.840,5:10.000,6:10.000);
Primary Concordance Tree with Sample Concordance Factors: (((1:1.000,2:1.000):0.562,(3:1.000,4:1.000):0.351):0.591,5:1.000,6:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 2|3,4; 5,6} 0.586, 0.478,
{1,2; 5,6|3; 4} 0.429, 0.156,
{1,2; 3,4|5; 6} 0.712, 0.840,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,2,3,4|5,6} 0.591(0.562,0.621) 0.591(0.557,0.625) 0.001
{1,2|3,4,5,6} 0.562(0.533,0.586) 0.562(0.529,0.592) 0.001
{1,2,5,6|3,4} 0.351(0.320,0.383) 0.351(0.316,0.387) 0.001

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,4|3,5,6} 0.131(0.108,0.156) 0.131(0.105,0.159) 0.001
{1,2,3|4,5,6} 0.126(0.104,0.148) 0.126(0.101,0.151) 0.000
{1,4,5,6|2,3} 0.123(0.105,0.144) 0.123(0.102,0.147) 0.000
{1,3|2,4,5,6} 0.119(0.100,0.143) 0.119(0.097,0.145) 0.002
{1,3,5,6|2,4} 0.111(0.092,0.129) 0.111(0.090,0.132) 0.001
{1,4|2,3,5,6} 0.107(0.085,0.127) 0.107(0.083,0.130) 0.000
{1,5,6|2,3,4} 0.098(0.079,0.115) 0.098(0.077,0.118) 0.001
{1,3,4|2,5,6} 0.086(0.069,0.104) 0.086(0.067,0.107) 0.000
{1,2,4,5|3,6} 0.062(0.047,0.079) 0.063(0.045,0.082) 0.001
%%bash
head -n 40 nex_files2/ins/BUCKY.1.concordance
translate
 1 _triphyll,
 2 _jamesoni,
 3 _sulcatum,
 4 _acutifol,
 5 _dentatum,
 6 _recognit;

Population Tree: ((((1,2),3),4),5,6);
Primary Concordance Tree Topology: (((1,2),(3,4)),5,6);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((((1:10.000,2:10.000):0.315,3:10.000):0.080,4:10.000):0.540,5:10.000,6:10.000);
Primary Concordance Tree with Sample Concordance Factors: (((1:1.000,2:1.000):0.389,(3:1.000,4:1.000):0.293):0.378,5:1.000,6:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 2|3; 4,5,6} 0.513, 0.315,
{1,2,3; 4|5; 6} 0.612, 0.540,
{1,2; 3|4; 5,6} 0.385, 0.080,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,2|3,4,5,6} 0.389(0.308,0.470) 0.389(0.295,0.484) 0.003
{1,2,3,4|5,6} 0.378(0.294,0.467) 0.377(0.280,0.480) 0.002
{1,2,5,6|3,4} 0.293(0.206,0.393) 0.292(0.196,0.402) 0.006

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,3|4,5,6} 0.192(0.107,0.277) 0.191(0.101,0.289) 0.005
{1,4,5,6|2,3} 0.188(0.129,0.253) 0.188(0.120,0.267) 0.004
{1,3,5,6|2,4} 0.137(0.082,0.198) 0.137(0.074,0.208) 0.002
{1,3|2,4,5,6} 0.121(0.066,0.203) 0.122(0.059,0.211) 0.002
{1,5,6|2,3,4} 0.117(0.066,0.181) 0.117(0.059,0.190) 0.003
{1,4|2,3,5,6} 0.109(0.060,0.165) 0.109(0.053,0.175) 0.002
{1,2,4|3,5,6} 0.109(0.058,0.181) 0.109(0.051,0.189) 0.003
{1,3,4|2,5,6} 0.101(0.055,0.162) 0.101(0.048,0.172) 0.002
{1,5|2,3,4,6} 0.089(0.047,0.137) 0.089(0.040,0.148) 0.001
! head -n 45 nex_files3/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 davidii_D,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 carlesii_,
 8 dentatum_;

Population Tree: ((1,(3,(4,((5,6),7)))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,(3:10.000,(4:10.000,((5:10.000,6:10.000):0.968,7:10.000):0.092):0.029):0.134):2.022,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.201,((5:1.000,6:1.000):0.660,7:1.000):0.175):0.181):0.877,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,4,8; 7|5; 6} 0.747, 0.968,
{1; 3,4,5,6,7|2; 8} 0.912, 2.022,
{1; 2,8|3; 4,5,6,7} 0.417, 0.134,
{1,2,3,8; 4|5,6; 7} 0.392, 0.092,
{1,2,8; 3|4; 5,6,7} 0.353, 0.029,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.877(0.847,0.904) 0.877(0.842,0.908) 0.010
{1,2,3,4,7,8|5,6} 0.660(0.593,0.710) 0.660(0.590,0.715) 0.022
{1,2,5,6,7,8|3,4} 0.201(0.149,0.250) 0.201(0.146,0.255) 0.008
{1,2,8|3,4,5,6,7} 0.181(0.140,0.240) 0.181(0.136,0.244) 0.015
{1,2,3,4,8|5,6,7} 0.175(0.111,0.230) 0.175(0.109,0.233) 0.014

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,4,5,6,8|3,7} 0.140(0.096,0.205) 0.140(0.094,0.207) 0.024
{1,7|2,3,4,5,6,8} 0.137(0.067,0.194) 0.137(0.065,0.197) 0.019
{1,2,3,8|4,5,6,7} 0.136(0.088,0.184) 0.136(0.086,0.187) 0.012
{1,4|2,3,5,6,7,8} 0.132(0.089,0.181) 0.132(0.086,0.185) 0.019
{1,3,4,5,6|2,7,8} 0.131(0.061,0.194) 0.131(0.060,0.198) 0.031
{1,2,3,7,8|4,5,6} 0.124(0.053,0.175) 0.124(0.051,0.178) 0.013
{1,3|2,4,5,6,7,8} 0.122(0.082,0.183) 0.122(0.079,0.186) 0.018
{1,3,4|2,5,6,7,8} 0.120(0.046,0.196) 0.120(0.045,0.199) 0.041
! head -n 45 nex_files4/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 davidii_D,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 carlesii_,
 8 dentatum_;

Population Tree: ((1,(3,((4,(5,6)),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,(4,(5,6))),7)),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,(3:10.000,((4:10.000,(5:10.000,6:10.000):0.946):0.054,7:10.000):0.057):0.220):1.283,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,(4:1.000,(5:1.000,6:1.000):0.660):0.180):0.171,7:1.000):0.270):0.744,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,7,8; 4|5; 6} 0.741, 0.946,
{1; 3,4,5,6,7|2; 8} 0.815, 1.283,
{1; 2,8|3; 4,5,6,7} 0.465, 0.220,
{1,2,3,8; 7|4; 5,6} 0.368, 0.054,
{1,2,8; 3|4,5,6; 7} 0.370, 0.057,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.744(0.621,0.834) 0.744(0.595,0.856) 0.058
{1,2,3,4,7,8|5,6} 0.660(0.485,0.805) 0.660(0.463,0.818) 0.092
{1,2,8|3,4,5,6,7} 0.270(0.160,0.402) 0.270(0.142,0.425) 0.062
{1,2,3,7,8|4,5,6} 0.180(0.083,0.325) 0.180(0.069,0.344) 0.065
{1,2,7,8|3,4,5,6} 0.171(0.065,0.320) 0.171(0.053,0.340) 0.070

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,4,5,6,7|2,3,8} 0.180(0.089,0.278) 0.180(0.077,0.297) 0.031
{1,4|2,3,5,6,7,8} 0.164(0.101,0.225) 0.164(0.083,0.252) 0.010
{1,4,7|2,3,5,6,8} 0.160(0.089,0.225) 0.159(0.076,0.249) 0.013
{1,3,4,5,6|2,7,8} 0.134(0.018,0.237) 0.134(0.018,0.257) 0.053
{1,2,3,4,8|5,6,7} 0.121(0.000,0.260) 0.121(0.000,0.271) 0.068
{1,2,3,8|4,5,6,7} 0.110(0.000,0.195) 0.110(0.000,0.212) 0.033
{1,3,4|2,5,6,7,8} 0.106(0.000,0.213) 0.106(0.000,0.234) 0.071
{1,2,3,4,6,8|5,7} 0.103(0.000,0.201) 0.103(0.000,0.220) 0.024
! head -n 45 nex_files5/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 tinus_D33,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 lentago_E,
 8 dentatum_;

Population Tree: ((1,((3,4),((5,6),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,((3:10.000,4:10.000):0.067,((5:10.000,6:10.000):0.776,7:10.000):0.158):0.199):1.736,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.186,((5:1.000,6:1.000):0.619,7:1.000):0.181):0.228):0.844,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 3,4,5,6,7|2; 8} 0.883, 1.736,
{1,2,3,4,8; 7|5; 6} 0.693, 0.776,
{1; 2,8|3,4; 5,6,7} 0.454, 0.199,
{1,2,8; 3,4|5,6; 7} 0.431, 0.158,
{1,2,8; 5,6,7|3; 4} 0.377, 0.067,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.844(0.806,0.869) 0.844(0.801,0.877) 0.007
{1,2,3,4,7,8|5,6} 0.619(0.524,0.712) 0.619(0.518,0.719) 0.056
{1,2,8|3,4,5,6,7} 0.228(0.188,0.271) 0.228(0.182,0.278) 0.012
{1,2,5,6,7,8|3,4} 0.186(0.145,0.234) 0.186(0.140,0.240) 0.011
{1,2,3,4,8|5,6,7} 0.181(0.112,0.267) 0.181(0.108,0.272) 0.047

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,3,5,6,8|4,7} 0.166(0.111,0.220) 0.166(0.107,0.225) 0.028
{1,3|2,4,5,6,7,8} 0.165(0.113,0.229) 0.165(0.108,0.234) 0.031
{1,2,4,7,8|3,5,6} 0.128(0.076,0.186) 0.128(0.074,0.190) 0.017
{1,4|2,3,5,6,7,8} 0.124(0.062,0.180) 0.124(0.059,0.185) 0.033
{1,3,4,5,6|2,7,8} 0.121(0.076,0.171) 0.121(0.072,0.176) 0.027
{1,2,3,8|4,5,6,7} 0.119(0.082,0.171) 0.119(0.078,0.174) 0.015
{1,2,4,5,6,8|3,7} 0.112(0.072,0.162) 0.112(0.070,0.166) 0.019
{1,3,4,7|2,5,6,8} 0.088(0.055,0.141) 0.088(0.052,0.144) 0.010
! head -n 45 nex_files6/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 tinus_D33,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 lentago_E,
 8 dentatum_;

Population Tree: ((1,((3,4),((5,6),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,((3:10.000,4:10.000):0.060,((5:10.000,6:10.000):0.654,7:10.000):0.279):0.385):1.135,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.164,((5:1.000,6:1.000):0.564,7:1.000):0.327):0.288):0.706,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,4,8; 7|5; 6} 0.653, 0.654,
{1; 3,4,5,6,7|2; 8} 0.786, 1.135,
{1,2,8; 3,4|5,6; 7} 0.496, 0.279,
{1; 2,8|3,4; 5,6,7} 0.547, 0.385,
{1,2,8; 5,6,7|3; 4} 0.372, 0.060,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.706(0.594,0.849) 0.705(0.553,0.867) 0.032
{1,2,3,4,7,8|5,6} 0.564(0.434,0.708) 0.564(0.403,0.729) 0.040
{1,2,3,4,8|5,6,7} 0.327(0.170,0.481) 0.327(0.151,0.504) 0.012
{1,2,8|3,4,5,6,7} 0.288(0.104,0.453) 0.288(0.092,0.483) 0.068
{1,2,5,6,7,8|3,4} 0.164(0.000,0.321) 0.164(0.000,0.350) 0.039

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,4,5,6,7|2,3,8} 0.195(0.066,0.321) 0.195(0.054,0.353) 0.037
{1,2,3,7,8|4,5,6} 0.140(0.000,0.321) 0.139(0.000,0.343) 0.041
{1,2|3,4,5,6,7,8} 0.137(0.057,0.274) 0.137(0.038,0.285) 0.032
{1,2,4,8|3,5,6,7} 0.124(0.000,0.358) 0.124(0.000,0.367) 0.088
{1,2,3,4,5,8|6,7} 0.123(0.000,0.255) 0.123(0.000,0.273) 0.036
{1,2,3,4,6,8|5,7} 0.120(0.000,0.208) 0.120(0.000,0.239) 0.038
{1,2,3,5,6,8|4,7} 0.114(0.000,0.283) 0.114(0.000,0.298) 0.054
{1,4|2,3,5,6,7,8} 0.104(0.000,0.245) 0.104(0.000,0.269) 0.036
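The headline result in each of these .concordance files is the population tree. A small regex sketch (my own addition, not part of the pipeline) for pulling that line out programmatically; it relies only on the line format visible in the outputs above:

```python
import re

def population_tree(text):
    """Return the newick string following 'Population Tree:' in BUCKy
    .concordance output, or None if absent. Matching 'Tree:' with a
    colon skips the later 'Population Tree, With Branch Lengths' line."""
    m = re.search(r"Population Tree:\s*(\([^;]+;)", text)
    return m.group(1) if m else None

sample = "Population Tree: (((1,2),(3,4)),5,6);"
print(population_tree(sample))  # (((1,2),(3,4)),5,6);
```

Usage would be e.g. `population_tree(open("nex_files5/ins/BUCKY.0.1.concordance").read())`.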
For this I start a raxml run just to get the alignment info it prints, then kill it. A bit lazy, but simpler than calculating the statistics myself.
%%bash
## raxml arguments w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_1/fullrun \
-n empirical_1_full_m2 -s empirical_1/fullrun/outfiles/empirical_1_m2.phy
%%bash
head -n 20 empirical_1/fullrun/RAxML_info.empirical_1_full_m2
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 502491 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 88.66%

RAxML rapid bootstrapping and subsequent ML search
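Since the point of the aborted run is just that one statistic, the info file can also be parsed directly. A sketch (function name and regex are mine; only the line format comes from the RAxML output above):

```python
import re

def missing_proportion(info_text):
    """Parse the percent of gaps and completely undetermined characters
    from the text of a RAxML_info file."""
    m = re.search(r"undetermined characters in this alignment:\s*([\d.]+)%",
                  info_text)
    return float(m.group(1)) if m else None

## usage on the file written by the run above would be:
## missing_proportion(open("empirical_1/fullrun/RAxML_info.empirical_1_full_m2").read())
line = "Proportion of gaps and completely undetermined characters in this alignment: 88.66%"
print(missing_proportion(line))  # 88.66
```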
%%R
mean(cophenetic.phylo(ltre))
[1] 0.06114884