### Notebook 1
### Data set 1 (Viburnum)
### Language: Bash
### Data Location: NCBI SRA PRJNA299402 & PRJNA299407
%%bash
## make a new directory for this analysis
mkdir -p empirical_1/
mkdir -p empirical_1/halfrun
mkdir -p empirical_1/fullrun
## IPython code
## import Python libraries
import pandas as pd
import numpy as np
import ipyparallel
import urllib2
import glob
import os
Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). The data come from two separate Illumina runs, combined under a single project ID.
The library contains 95 samples. We uploaded the two demultiplexed files for each individual separately, so each sample has 2 files. Below we examine just the first library (the "half" data set) and then both libraries combined (the "full" data set). We analyze 64 samples, since the remaining samples are replicate individuals within species that are part of a separate project.
You can download the data set using the script below:
%%bash
## get the data from the NCBI SRA
for run in $(seq 24 28);
do
wget -q -r -nH --cut-dirs=9 \
ftp://ftp-trace.ncbi.nlm.nih.gov/\
sra/sra-instant/reads/ByRun/sra/SRR/\
SRR191/SRR19155$run/SRR19155$run".sra";
done
%%bash
## convert sra files to fastq using fastq-dump tool
fastq-dump *.sra
## IPython code
## This reads in a table mapping sample names to SRA numbers
## that is hosted on github
## open table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/virentes/master/SraRunTable.txt"
intable = urllib2.urlopen(url)
## make name xfer dictionary
DF = pd.read_table(intable, sep="\t")
D = {DF.Run_s[i]:DF.Library_Name_s[i] for i in DF.index}
## change file names and move to fastq dir/
for fname in glob.glob("*.fastq"):
    ## map the SRA run accession to the sample's library name
    os.rename(fname, "analysis_pyrad/fastq/"+\
                     D[fname.replace(".fastq", "")]+".fq")
%%bash
mkdir -p fastq_combined
## IPython code that makes a bash call w/ (!)
## get all the data from the two libraries and concatenate it
lib1tax = glob.glob("/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.gz")
lib2tax = glob.glob("/home/deren/Documents/Vib_Lib1/fastq_Lib2/*.gz")
## names had to be modified to match
taxa = [i.split("/")[-1].split("_", 1)[1] for i in lib1tax]
for tax in taxa:
    ! cat /home/deren/Documents/Vib_Lib1/fastq_Lib1/Lib1_$tax \
          /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_$tax \
          > /home/deren/Documents/Vib_Lib1/fastq_combined/$tax
cat: /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_lantan_combined_R1_.gz: No such file or directory
cat: /home/deren/Documents/Vib_Lib1/fastq_Lib2/Lib2_propiinquum_combined_R1_.gz: No such file or directory
%%bash
pyrad --version
pyRAD 3.0.63
%%bash
## create a new default params file
rm params.txt
pyrad -n
new params.txt file created
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_1/halfrun ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_1_half_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.fastq ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s,a ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.63 **======================== affected step ==
empirical_1/halfrun       ## 1. working directory
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18) (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18) (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch) (s3,s6)
muscle                    ## 5. command (or path) to call muscle (s3,s7)
TGCAG                     ## 6. cutters
30                        ## 7. N processors
6                         ## 8. Mindepth: min coverage for a cluster (s4,s5)
6                         ## 9. NQual
.85                       ## 10. clust threshold
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                         ## 12. MinCov
10                        ## 13. maxSH
empirical_1_half_m4       ## 14. output name
==== optional params below this line ===================================  affected step ==
                          ## 15.opt.: select subset (prefix* only selector) (s2-s7)
                          ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7)
                          ## 17.opt.: exclude taxa (list or prefix*) (s7)
/home/deren/Documents/Vib_Lib1/fastq_Lib1/*.fastq ## 18. data location
                          ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1)
                          ## 20.opt.: phred Qscore offset (def= 33) (s2)
                          ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2)
                          ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                          ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5)
                          ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5)
                          ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                          ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7)
                          ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                          ## 28.opt.: random number seed (def. 112233) (s3,s6,s7)
2,2                       ## 29. trim overhang
p,n,s,a                   ## 30. output formats
                          ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                          ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2)
                          ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3)
                          ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                          ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6)
                          ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6)
                          ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================
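The block of sed one-liners above can also be expressed in a few lines of Python. This is only an illustrative sketch (set_params is a hypothetical helper, not part of pyRAD), shown on an invented two-line toy params file:

```python
def set_params(text, updates):
    """Replace whole lines in a params file, keyed on their '## N.' tag.

    Like the sed '/.../c\\' calls above, this replaces the entire line,
    keeping only the new value and the matched tag.
    """
    out = []
    for line in text.splitlines():
        for tag, value in updates.items():
            if tag in line:
                line = "{:<24}{}".format(value, tag)
                break
        out.append(line)
    return "\n".join(out)

## toy two-line params file (not the real file)
params = "0.90     ## 10. clust threshold\n6        ## 12. MinCov"
new = set_params(params, {"## 10.": ".85", "## 12.": "4"})
```

The same dictionary of (tag, value) pairs could then drive both the halfrun and fullrun parameter edits.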
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_1_half_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We added the 'a' option to the output formats to build an ".alleles" file, which will be used later for the mrbayes/bucky analyses.
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_1/fullrun ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_1_full_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\/home/deren/Documents/Vib_Lib1/fastq_combined/*.fastq ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s,a ## 30. output formats ' params.txt
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_1_full_m2 ## 14. output name ' params.txt
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
We are interested in the relationship between the amount of raw input data for any two samples, the average coverage they recover when clustered together, and the phylogenetic distance separating them.
The average number of raw reads per sample is 1.37M.
## read in the data
s2dat = pd.read_table("empirical_1/halfrun/stats/s2.rawedit.txt", header=0, nrows=66)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         65.000000
mean     1372359.907692
std       640115.353508
min       324478.000000
25%       961366.000000
50%      1249609.000000
75%      1602036.000000
max      3113934.000000
Name: passed.total, dtype: float64

most raw data in sample:
54    Lib1_sulcatum_D9_MEX_003
Name: sample , dtype: object
The average number of reads is now 2.74M.
## read in the data
s2dat = pd.read_table("empirical_1/fullrun/stats/s2.rawedit.txt", header=0, nrows=66)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
count         65.000000
mean     2736777.215385
std      1262460.162891
min       643550.000000
25%      1918460.000000
50%      2519904.000000
75%      3215483.000000
max      6183790.000000
Name: passed.total, dtype: float64

most raw data in sample:
20    foetidum_D24_KFC_1942
Name: sample , dtype: object
pyRAD v.3.0.63 outputs depth information for each sample, which we read in and plot here. First let's ask which sample has the highest depth of coverage. The std here is the standard deviation of mean depths across samples; the std of depths within individuals is much higher.
## read in the s3 results
s3dat = pd.read_table("empirical_1/halfrun/stats/s3.clusters.txt", header=0, nrows=66)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
maxdepth = s3dat["d>5.tot"].max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']==maxdepth]
summary of means
==================
count    65.000000
mean     14.869446
std       3.033109
min      10.008000
25%      12.912000
50%      14.197000
75%      16.340000
max      23.558000
Name: dpt.me, dtype: float64

summary of std
==================
count     65.000000
mean      72.607323
std       57.365233
min        9.900000
25%       29.634000
50%       50.903000
75%      103.118000
max      262.167000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    65.000000
mean      0.282502
std       0.031652
min       0.209510
25%       0.271586
50%       0.283439
75%       0.295572
max       0.392641
dtype: float64

highest coverage in sample:
54    Lib1_sulcatum_D9_MEX_003
Name: taxa, dtype: object
## read in the s3 results
s3dat = pd.read_table("empirical_1/fullrun/stats/s3.clusters.txt", header=0, nrows=66)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]
summary of means
==================
count    65.000000
mean     18.535369
std       5.564201
min      10.011000
25%      14.403000
50%      17.576000
75%      21.119000
max      36.199000
Name: dpt.me, dtype: float64

summary of std
==================
count     65.000000
mean     108.366000
std       89.468413
min       11.761000
25%       43.009000
50%       77.229000
75%      157.560000
max      400.431000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    65.000000
mean      0.252426
std       0.041188
min       0.153852
25%       0.229203
50%       0.254315
75%       0.278021
max       0.371032
dtype: float64

highest coverage in sample:
31    lantanoides_D15_Beartown_2
Name: taxa, dtype: object
## print mean and std of coverage for the highest coverage sample
with open("empirical_1/fullrun/clust.85/lantanoides_D15_Beartown_2.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
32.3991013699 199.423257009
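The .depths files read above are plain comma-separated integers, one value per cluster, so the summary reduces to a few numpy calls. A minimal sketch on made-up values (the raw string below stands in for one file's contents):

```python
import numpy as np

## toy stand-in for the contents of a pyrad .depths file
raw = "3,12,7,1,40,9,6,2"
depths = np.array(raw.strip().split(","), dtype=int)

## mean and std of cluster depths, as printed above
mean, std = depths.mean(), depths.std()

## proportion of clusters at or below the mindepth cutoff (d <= 5)
prop_low = (depths <= 5).mean()
```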
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_1/fullrun/clust.85/lantanoides_D15_Beartown_2.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
                   ylabel="N loci",
                   label="dataset1/sample=lantanoides_D15_Beartown_2")
## select the loci with depth > 5 (kept)
keeps = depths[depths > 5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30))
kdat = np.histogram(keeps, range(30))
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_1_full_depthplot.svg")
<toyplot.mark.BarMagnitudes at 0x7fdc49a9d190>
%%bash
cat empirical_1/halfrun/stats/empirical_1_half_m4.stats
144161  ## loci with > minsp containing data
143948  ## loci with > minsp containing data & paralogs removed
143948  ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon  nloci
Lib1_acerfolium_ELS88  41968
Lib1_acutifolium_DRY3_MEX_006  31990
Lib1_amplificatum_D3_SAN_156003  24211
Lib1_anamensis_C6_PWS_2094  10955
Lib1_beccarii_combinedX  26054
Lib1_betulifolium  23739
Lib1_bitchuense_combined  19268
Lib1_carlesii_D1_BP_001  42273
Lib1_cassinoides_ELS2  25549
Lib1_cinnamomifolium_PWS2105X  29976
Lib1_clemensiae_DRY6_PWS_2135  31519
Lib1_coriaceum_combined  26857
Lib1_cylindricum_DRY1_WC_268  32747
Lib1_davidii_D32_WC_269  37651
Lib1_dentatum_ELS4  40230
Lib1_dilatatum_ELS45  37037
Lib1_erosum_D23_MJDJP_4  28689
Lib1_erubescens_RCW36  22413
Lib1_farreri_RCW21  27313
Lib1_foetens_ERAD10  11437
Lib1_foetidum_D24_KFC_1942  39936
Lib1_formosanum_C7_JH_2007  3797
Lib1_furcatum_combined  21882
Lib1_glaberrimum_D34_PWS_2323  27587
Lib1_grandiflorum_ERAD11_Wendy  12655
Lib1_hanceanum_D4_PWS_2195  28252
Lib1_henryi_D22_WC_272  16184
Lib1_integrifolium_D25_KFC_1946  34016
Lib1_jamesonii_D12_PWS_1636  29455
Lib1_japonicum_D26_WC_273  46474
Lib1_lantana_combined  25166
Lib1_lantanoides_D15_Beartown_2  36966
Lib1_lentago_ELS85  25759
Lib1_lutescens_D35_PWS_2077  31943
Lib1_luzonicum_D27_9M_2005  35934
Lib1_macrocephalum_D2_WC_284  27360
Lib1_muhalla_D28_WC_274  39682
Lib1_nervosum_C4_PWS_2298  22191
Lib1_odoratissimum_combined  17368
Lib1_opulus_D6_WC_250  26534
Lib1_orientale_DRY2_MJD_GEORGIA  36299
Lib1_parvifolium_D29_KFC_1953  46443
Lib1_plicatum_C1_MJDJP_12  21943
Lib1_propinquum_DRY4_WC_276  35837
Lib1_prunifolium_ELS57  29395
Lib1_punctatum_D19_PWS_2097  35392
Lib1_recognitum_AA_1471_83B  27517
Lib1_rhytidophyllum  26189
Lib1_rufidulum_ELS25  22054
Lib1_sambucina_D20_PWS_2100  31266
Lib1_sargentii_RCW19  21166
Lib1_sempervirens_combined  20609
Lib1_setigerum  38816
Lib1_sieboldii_AA_616_6B  15182
Lib1_sulcatum_D9_MEX_003  26087
Lib1_suspensum_C5_MJD_111711  24471
Lib1_sympodiale_D18_KFC_1932  37477
Lib1_taiwanianum_TW1_KFC_1952  25293
Lib1_tashiori_D30_TET_YAH  40603
Lib1_tinus_D33_WC_277  35049
Lib1_triphyllum_D13_PWS_1783  29055
Lib1_urceolatum_MJD_Japan_8  18009
Lib1_veitchii  26094
Lib1_vernicosum_D21_PWS_2123  31675
Lib1_wrightii_D31_MJDJP_1  36417

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa  nloci  saved  ntotal
1  -
2  -  -
3  -  -
4  24191  *  143948
5  15346  *  119757
6  10842  *  104411
7  8551  *  93569
8  6854  *  85018
9  5781  *  78164
10  5066  *  72383
11  4729  *  67317
12  4282  *  62588
13  4017  *  58306
14  3664  *  54289
15  3495  *  50625
16  3397  *  47130
17  3251  *  43733
18  3065  *  40482
19  2990  *  37417
20  2911  *  34427
21  2880  *  31516
22  2732  *  28636
23  2665  *  25904
24  2697  *  23239
25  2444  *  20542
26  2379  *  18098
27  2131  *  15719
28  2087  *  13588
29  1897  *  11501
30  1686  *  9604
31  1503  *  7918
32  1299  *  6415
33  1106  *  5116
34  948  *  4010
35  748  *  3062
36  602  *  2314
37  473  *  1712
38  335  *  1239
39  242  *  904
40  185  *  662
41  122  *  477
42  98  *  355
43  52  *  257
44  47  *  205
45  30  *  158
46  22  *  128
47  15  *  106
48  11  *  91
49  9  *  80
50  10  *  71
51  7  *  61
52  9  *  54
53  6  *  45
54  10  *  39
55  5  *  29
56  4  *  24
57  4  *  20
58  5  *  16
59  3  *  11
60  2  *  8
61  3  *  6
62  1  *  3
63  0  *  2
64  2  *  2

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
n  nvar  sumvar  PIS  sumPIS
0  858  0  15884  0
1  1602  1602  14623  14623
2  2305  6212  13691  42005
3  2953  15071  13104  81317
4  3553  29283  12290  130477
5  4163  50098  11363  187292
6  4625  77848  10732  251684
7  5057  113247  9733  319815
8  5622  158223  8394  386967
9  6085  212988  7230  452037
10  6286  275848  6313  515167
11  6314  345302  5023  570420
12  6641  424994  4204  620868
13  6842  513940  3229  662845
14  6621  606634  2371  696039
15  6566  705124  1787  722844
16  6379  807188  1304  743708
17  6125  911313  925  759433
18  6037  1019979  616  770521
19  5562  1125657  442  778919
20  5251  1230677  250  783919
21  4906  1333703  165  787384
22  4581  1434485  112  789848
23  4113  1529084  61  791251
24  3708  1618076  49  792427
25  3271  1699851  24  793027
26  2950  1776551  8  793235
27  2533  1844942  12  793559
28  2263  1908306  2  793615
29  1992  1966074  1  793644
30  1662  2015934  3  793734
31  1343  2057567  3  793827
32  1144  2094175  0  793827
33  935  2125030  0  793827
34  731  2149884  0  793827
35  556  2169344  0  793827
36  502  2187416  0  793827
37  347  2200255  0  793827
38  295  2211465  0  793827
39  208  2219577  0  793827
40  128  2224697  0  793827
41  98  2228715  0  793827
42  64  2231403  0  793827
43  53  2233682  0  793827
44  46  2235706  0  793827
45  21  2236651  0  793827
46  16  2237387  0  793827
47  10  2237857  0  793827
48  12  2238433  0  793827
49  9  2238874  0  793827
50  0  2238874  0  793827
51  2  2238976  0  793827
52  2  2239080  0  793827
total var= 2239080
total pis= 793827
sampled unlinked SNPs= 143090
sampled unlinked bi-allelic SNPs= 142267
import numpy as np
indat = open("empirical_1/halfrun/stats/empirical_1_half_m4.stats").readlines()
counts = [int(i.strip().split("\t")[1]) for i in indat[8:73]]
print np.mean(counts)
print np.std(counts)
28605.4615385 8731.83305115
import numpy as np
import itertools
indat = open("empirical_1/halfrun/stats/empirical_1_half_m4.stats").readlines()
counts = [i.strip().split("\t") for i in indat[81:142]]
#print counts
ntax = [int(i[0]) for i in counts]
ncounts = [int(i[1]) for i in counts]
tots = list(itertools.chain(*[[i]*n for i,n in zip(ntax, ncounts)]))
print np.mean(tots)
print np.std(tots)
12.9168519187 9.09974422662
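The itertools.chain expression above expands the (ntaxa, nloci) table so each locus contributes one value, which makes the plain mean and std locus-weighted. np.repeat is an equivalent one-call form; the numbers below are toy values, not the real table:

```python
import numpy as np

## toy (ntaxa, nloci) rows: 2 loci share 4 taxa, 1 shares 5, 1 shares 6
ntax = [4, 5, 6]
ncounts = [2, 1, 1]

## expand to one value per locus, like the itertools.chain expression
tots = np.repeat(ntax, ncounts)

## the plain mean of the expanded array equals the ncounts-weighted mean
assert np.isclose(tots.mean(), np.average(ntax, weights=ncounts))
```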
%%bash
cat empirical_1/fullrun/stats/empirical_1_full_m4.stats
199691  ## loci with > minsp containing data
199094  ## loci with > minsp containing data & paralogs removed
199094  ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon  nloci
acerfolium_ELS88  59188
acutifolium_DRY3_MEX_006  52932
amplificatum_D3_SAN_156003  38141
anamensis_C6_PWS_2094  20597
beccarii_combinedX  41837
betulifolium  39533
bitchuense_combined  34338
carlesii_D1_BP_001  58449
cassinoides_ELS2  40767
cinnamomifolium_PWS2105X  47043
clemensiae_DRY6_PWS_2135  45144
coriaceum_combined  43822
cylindricum_DRY1_WC_268  50762
davidii_D32_WC_269  53023
dentatum_ELS4  59443
dilatatum_ELS45  57384
erosum_D23_MJDJP_4  47927
erubescens_RCW36  39132
farreri_RCW21  44604
foetens_ERAD10  21914
foetidum_D24_KFC_1942  60151
formosanum_C7_JH_2007  7638
furcatum_combined  37959
glaberrimum_D34_PWS_2323  43979
grandiflorum_ERAD11_Wendy  24058
hanceanum_D4_PWS_2195  45482
henryi_D22_WC_272  29692
integrifolium_D25_KFC_1946  54254
jamesonii_D12_PWS_1636  48113
japonicum_D26_WC_273  64083
lantana_combined  42060
lantanoides_D15_Beartown_2  52090
lentago_ELS85  42188
lutescens_D35_PWS_2077  48126
luzonicum_D27_9M_2005  54891
macrocephalum_D2_WC_284  43983
muhalla_D28_WC_274  57264
nervosum_C4_PWS_2298  37230
odoratissimum_combined  31480
opulus_D6_WC_250  41578
orientale_DRY2_MJD_GEORGIA  54201
parvifolium_D29_KFC_1953  67338
plicatum_C1_MJDJP_12  37339
propinquum_DRY4_WC_276  51434
prunifolium_ELS57  45913
punctatum_D19_PWS_2097  49862
recognitum_AA_1471_83B  45155
rhytidophyllum  43744
rufidulum_ELS25  36741
sambucina_D20_PWS_2100  47901
sargentii_RCW19  34723
sempervirens_combined  35475
setigerum  58325
sieboldii_AA_616_6B  27714
sulcatum_D9_MEX_003  45500
suspensum_C5_MJD_111711  41110
sympodiale_D18_KFC_1932  54400
taiwanianum_TW1_KFC_1952  41711
tashiori_D30_TET_YAH  58700
tinus_D33_WC_277  50680
triphyllum_D13_PWS_1783  47654
urceolatum_MJD_Japan_8  31220
veitchii  43123
vernicosum_D21_PWS_2123  48356
wrightii_D31_MJDJP_1  55891

## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa  nloci  saved  ntotal
1  -
2  -  -
3  -  -
4  40036  *  199094
5  23235  *  159058
6  15581  *  135823
7  11312  *  120242
8  8843  *  108930
9  7183  *  100087
10  5858  *  92904
11  5064  *  87046
12  4440  *  81982
13  4041  *  77542
14  3703  *  73501
15  3402  *  69798
16  3177  *  66396
17  3054  *  63219
18  2789  *  60165
19  2655  *  57376
20  2534  *  54721
21  2391  *  52187
22  2361  *  49796
23  2216  *  47435
24  2123  *  45219
25  2121  *  43096
26  2074  *  40975
27  2049  *  38901
28  2023  *  36852
29  1933  *  34829
30  2020  *  32896
31  1893  *  30876
32  1916  *  28983
33  1907  *  27067
34  1916  *  25160
35  1984  *  23244
36  1857  *  21260
37  1824  *  19403
38  1755  *  17579
39  1764  *  15824
40  1724  *  14060
41  1583  *  12336
42  1567  *  10753
43  1476  *  9186
44  1347  *  7710
45  1206  *  6363
46  1098  *  5157
47  959  *  4059
48  790  *  3100
49  683  *  2310
50  468  *  1627
51  383  *  1159
52  285  *  776
53  201  *  491
54  118  *  290
55  76  *  172
56  43  *  96
57  28  *  53
58  16  *  25
59  4  *  9
60  2  *  5
61  2  *  3
62  1  *  1

## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
n  nvar  sumvar  PIS  sumPIS
0  2213  0  26451  0
1  3244  3244  20005  20005
2  4394  12032  17249  54503
3  5198  27626  15381  100646
4  5702  50434  14286  157790
5  6188  81374  13036  222970
6  6480  120254  11956  294706
7  6940  168834  10930  371216
8  7207  226490  10132  452272
9  7473  293747  9442  537250
10  7456  368307  8372  620970
11  7756  453623  7508  703558
12  7885  548243  6655  783418
13  8098  653517  5735  857973
14  7717  761555  4896  926517
15  7699  877040  4029  986952
16  7562  998032  3200  1038152
17  7171  1119939  2719  1084375
18  7081  1247397  2068  1121599
19  6630  1373367  1555  1151144
20  6291  1499187  1143  1174004
21  5930  1623717  763  1190027
22  5459  1743815  565  1202457
23  5189  1863162  386  1211335
24  4862  1979850  245  1217215
25  4552  2093650  164  1221315
26  4223  2203448  94  1223759
27  3946  2309990  54  1225217
28  3610  2411070  29  1226029
29  3397  2509583  18  1226551
30  3150  2604083  11  1226881
31  2772  2690015  9  1227160
32  2500  2770015  3  1227256
33  2322  2846641  2  1227322
34  2040  2916001  3  1227424
35  1717  2976096  0  1227424
36  1482  3029448  0  1227424
37  1185  3073293  0  1227424
38  997  3111179  0  1227424
39  876  3145343  0  1227424
40  642  3171023  0  1227424
41  498  3191441  0  1227424
42  394  3207989  0  1227424
43  260  3219169  0  1227424
44  224  3229025  0  1227424
45  140  3235325  0  1227424
46  124  3241029  0  1227424
47  64  3244037  0  1227424
48  46  3246245  0  1227424
49  39  3248156  0  1227424
50  28  3249556  0  1227424
51  19  3250525  0  1227424
52  8  3250941  0  1227424
53  8  3251365  0  1227424
54  2  3251473  0  1227424
55  1  3251528  0  1227424
56  1  3251584  0  1227424
57  1  3251641  0  1227424
58  1  3251699  0  1227424
total var= 3251699
total pis= 1227424
sampled unlinked SNPs= 196881
sampled unlinked bi-allelic SNPs= 195117
import numpy as np
indat = open("empirical_1/fullrun/stats/empirical_1_full_m4.stats").readlines()
counts = [int(i.strip().split("\t")[1]) for i in indat[8:73]]
print np.mean(counts)
print np.std(counts)
44638.7619048 11037.958308
import numpy as np
import itertools
indat = open("empirical_1/fullrun/stats/empirical_1_full_m4.stats").readlines()
counts = [i.strip().split("\t") for i in indat[81:140]]
#print counts
ntax = [int(i[0]) for i in counts]
ncounts = [int(i[1]) for i in counts]
tots = list(itertools.chain(*[[i]*n for i,n in zip(ntax, ncounts)]))
print np.mean(tots)
print np.std(tots)
14.6488040825 12.690632427
%%bash
## raxml rapid bootstrapping + ML search (-f a) on the halfrun data
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 35 \
-w /home/deren/Documents/RADmissing/empirical_1/halfrun \
-n empirical_1_halfrun -s empirical_1/halfrun/outfiles/empirical_1_half_m4.phy \
-o "Lib1_clemensiae_DRY6_PWS_2135"
%%bash
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 35 \
-w /home/deren/Documents/RADmissing/empirical_1/fullrun \
-n empirical_1_fullrun -s empirical_1/fullrun/outfiles/empirical_1_full_m4.phy \
-o "clemensiae_DRY6_PWS_2135"
%%bash
head -n 20 empirical_1/halfrun/RAxML_info.empirical_1_half_m4
This is RAxML version 8.0.0 released by Alexandros Stamatakis on Dec 13 2013.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 1010208 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 80.36%

RAxML rapid bootstrapping and subsequent ML search
%%bash
head -n 20 empirical_1/fullrun/RAxML_info.empirical_1_full_m4
This is RAxML version 8.0.0 released by Alexandros Stamatakis on Dec 13 2013.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 1035341 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 77.64%

RAxML rapid bootstrapping and subsequent ML search
### ape
%load_ext rpy2.ipython
%%R -w 600 -h 1000
library(ape)
tre_half <- read.tree("empirical_1/halfrun/RAxML_bipartitions.empirical_1_halfrun")
#rtre <- root(tre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
#rtre <- root(rtre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
ltre_half <- ladderize(tre_half)
plot(ltre_half, cex=0.8, edge.width=2)
nodelabels(ltre_half$node.label)
%%R -w 600 -h 1000
library(ape)
svg("outtree.svg", height=11, width=8)
tre_full <- read.tree("empirical_1/fullrun/RAxML_bipartitions.empirical_1_fullrun")
#rtre <- root(tre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
#rtre <- root(rtre, "Lib1_clemensiae_DRY6_PWS_2135", resolve.root=T)
ltre_full <- ladderize(tre_full)
plot(ltre_full, cex=0.8, edge.width=3)
#nodelabels(ltre_full$node.label)
dev.off()
png 2
The functions nexmake and subsample are used to split the .loci file into individual nexus files for each locus within a new directory. Each nexus file is given a mrbayes command block to run. We then run the bucky tool mbsum to summarize the mrbayes output, and finally run bucky to infer concordance trees from the posterior distributions of trees across all loci.
Loci are selected on the basis that they have coverage across all tips of the selected subtree and that they contain at least 1 SNP.
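As a minimal sketch of that selection rule (the taxon names and sequences below are invented; the real filtering is done by the parseloci function):

```python
required = {"taxA", "taxB", "taxC"}

## one toy locus as a {name: sequence} mapping
locus = {"taxA": "ACGT", "taxB": "ACGA", "taxC": "ACGT", "taxD": "ACGT"}

def keep_locus(locus, required):
    ## coverage: every required tip must have data at this locus
    if not required.issubset(locus):
        return False
    ## variability: at least one site differs among the required tips
    seqs = [locus[t] for t in sorted(required)]
    return any(len(set(col)) > 1 for col in zip(*seqs))
```

Here the locus passes: all three required tips are present and the last site is variable among them.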
## imports used by the functions below
import numpy as np
from collections import Counter

def nexmake(taxadict, loc, nexdir, trim):
    outloc = open(nexdir+"/"+str(loc)+".nex", 'w')
    header = """
#NEXUS
begin data;
dimensions ntax={} nchar={};
format datatype=dna interleave=yes missing=N gap=-;
matrix
""".format(len(taxadict), len(taxadict.values()[0]))
    outloc.write(header)
    for tax, seq in taxadict.items():
        outloc.write("{}{}{}\n"\
                     .format(tax[trim:trim+9],
                             " "*(10-len(tax[0:9])),
                             "".join(seq)))
    mbstring = """
;
end;
begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma;
mcmc ngen=2200000 samplefreq=2000;
sump burnin=200000;
sumt burnin=200000;
end;
"""
    outloc.write(mbstring)
    outloc.close()
def unstruct(amb):
    " returns bases from ambiguity code"
    D = {"R": ["G", "A"],
         "K": ["G", "T"],
         "S": ["G", "C"],
         "Y": ["T", "C"],
         "W": ["T", "A"],
         "M": ["C", "A"]}
    if amb in D:
        return D.get(amb)
    else:
        return [amb, amb]
def resolveambig(subseq):
    N = []
    for col in subseq:
        N.append([unstruct(i)[np.random.binomial(1, 0.5)] for i in col])
    return np.array(N)
def newPIS(seqsamp):
    counts = [Counter(col) for col in seqsamp.T if not ("-" in col or "N" in col)]
    pis = [i.most_common(2)[1][1] > 1 for i in counts if len(i.most_common(2)) > 1]
    if sum(pis) >= 2:
        return sum(pis)
    else:
        return 0
def parseloci(iloci, taxadict, nexdir, trim=0):
    nloc = 0
    ## create subsampled data set
    for loc in iloci:
        ## if all tip samples have data in this locus
        names = [line.split()[0] for line in loc.split("\n")[:-1]]
        ## check that locus has required samples for each subtree
        if all([i in names for i in taxadict.values()]):
            seqs = np.array([list(line.split()[1]) for line in loc.split("\n")[:-1]])
            seqsamp = seqs[[names.index(tax) for tax in taxadict.values()]]
            seqsamp = resolveambig(seqsamp)
            pis = newPIS(seqsamp)
            if pis:
                nloc += 1
                ## remove invariable columns given this subsampling
                keep = []
                for n, col in enumerate(seqsamp.T):
                    if all([i not in ["N", "-"] for i in col]):
                        keep.append(n)
                subseq = seqsamp.T[keep].T
                ## write to a nexus file
                nexdict = dict(zip(taxadict.keys(), [i.tostring() for i in subseq]))
                nexmake(nexdict, nloc, nexdir, trim)
    print nloc, 'loci kept'
def getloci(locifile):
    ## parse the loci file by newline characters
    lines = open(locifile).readlines()
    ## add "|" to the end of lines that contain "|"
    for idx in range(len(lines)):
        if "|" in lines[idx]:
            lines[idx] = lines[idx].strip()+"|\n"
    ## join lines back together into one large string
    locistr = "".join(lines)
    ## break the string into loci at the "|\n" character
    loci = locistr.split("|\n")[:-1]
    ## how many loci?
    print len(loci), "loci"
    return loci
## run on both files
loci_full = getloci("empirical_1/fullrun/outfiles/empirical_1_full_m4.loci")
loci_half = getloci("empirical_1/halfrun/outfiles/empirical_1_half_m4.loci")
199094 loci
143948 loci
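A toy illustration of the split performed by getloci (the two-locus string below is invented, in the spirit of pyRAD's .loci format, where sample lines are followed by a "//" separator line ending in "|"):

```python
## toy stand-in for a .loci file with two loci
locitext = ("taxA     ACGT\n"
            "taxB     ACGA\n"
            "//          -*|\n"
            "taxA     TTTT\n"
            "taxB     TTTA\n"
            "//          -*|\n")

## the same split getloci performs after normalizing line endings
loci = locitext.split("|\n")[:-1]
```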
parseloci(loci_full[:], deep_dict_f, "deep_dict_full", 0)
parseloci(loci_half[:], deep_dict_h, "deep_dict_half", 0)
#parseloci(loci[:], shallow_dict, "shallow_dict", 0)
744 loci kept
71 loci kept
## create a parallel client
ipclient = ipyparallel.Client()
lview = ipclient.load_balanced_view()
## function to call mb on an engine
def mrbayes(infile):
    import subprocess
    cmd = "mb %s" % infile
    subprocess.check_call(cmd, shell=True)

## submit all nexus files to run mb
allnex = glob.glob("deep_dict_full/*.nex")
for nex in allnex:
    lview.apply(mrbayes, nex)
ipclient.wait_interactive()
744/744 tasks finished after 3759 s done
### mbsum
def mbsum(nexdir, nloci):
    import subprocess
    ## combine trees from the two replicate runs
    for n in range(1, nloci+1):
        cmd = "mbsum -n 101 -o {}{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
              format(nexdir, n, nexdir, n, nexdir, n)
        subprocess.check_call(cmd, shell=True)
import os
import numpy as np
from collections import Counter
def subsample(infile, requires, outgroup, nexdir, trim):
    """ sample n taxa from infile to create nex file"""
    ## counter
    loc = 0
    ## create output directory
    if not os.path.exists(nexdir):
        os.mkdir(nexdir)
    ## input .alleles file
    loci = open(infile, 'r').read().strip().split("//")
    ## create a dictionary of {names:seqs}
    for locus in xrange(len(loci)):
        locnex = [""]*len(requires)
        for line in loci[locus].strip().split("\n"):
            tax = line.split()[0]
            seq = line.split()[-1]
            if ">" in tax:
                if tax in requires:
                    locnex[requires.index(tax)] = seq
        ## if all tips have data for this locus
        if len([i for i in locnex if i]) == len(requires):
            ## if locus is variable:
            ## count how many times each char occurs in each site
            ccs = [Counter(i) for i in np.array([list(i) for i in locnex]).T]
            ## remove N and - characters and the first most occurring base
            for i in ccs:
                del i['-']
                del i['N']
                if i:
                    del i[i.most_common()[0][0]]
            ## is anything left occurring more than once (minor allele=ma)?
            ma = max([max(i.values()) if i else 0 for i in ccs])
            if ma > 1:
                nexmake(requires, locnex, loc, outgroup, nexdir, trim)
                loc += 1
    return loc
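The Counter logic inside subsample asks whether, after removing gaps, Ns, and the major allele, any base still occurs more than once, i.e. the minor allele is observed at least twice. A standalone toy version of that per-site test (the column strings below are invented):

```python
from collections import Counter

def minor_allele_count(column):
    counts = Counter(column)
    ## Counter deletion is a no-op for absent keys: drop gaps and Ns
    del counts["-"]
    del counts["N"]
    if not counts:
        return 0
    ## drop the most frequent (major) allele
    del counts[counts.most_common(1)[0][0]]
    return max(counts.values()) if counts else 0
```

A site is counted as carrying a shared polymorphism when this value exceeds 1, matching the `ma > 1` check above.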
## inputs
requires = [">triphyllum_D13_PWS_1783_0",
">jamesonii_D12_PWS_1636_0",
">sulcatum_D9_MEX_003_0",
">acutifolium_DRY3_MEX_006_0",
">dentatum_ELS4_0",
">recognitum_AA_1471_83B_0"]
outgroup = ""
infile = "empirical_1/fullrun/outfiles/empirical_1_full_m4.alleles"
nexdir = "nex_files1"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=1)
print nloci
3300
## inputs
requires = [">Lib1_triphyllum_D13_PWS_1783_0",
">Lib1_jamesonii_D12_PWS_1636_0",
">Lib1_sulcatum_D9_MEX_003_0",
">Lib1_acutifolium_DRY3_MEX_006_0",
">Lib1_dentatum_ELS4_0",
">Lib1_recognitum_AA_1471_83B_0"]
outgroup = ""
infile = "empirical_1/halfrun/outfiles/empirical_1_half_m4.alleles"
nexdir = "nex_files2"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=6)
print nloci
364
## inputs
requires = [">clemensiae_DRY6_PWS_2135_0",
">tinus_D33_WC_277_0",
">taiwanianum_TW1_KFC_1952_0",
">lantanoides_D15_Beartown_2_0",
">amplificatum_D3_SAN_156003_0",
">lutescens_D35_PWS_2077_0",
">lentago_ELS85_0",
">dentatum_ELS4_0"]
outgroup = ""
infile = "empirical_1/fullrun/outfiles/empirical_1_full_m4.alleles"
nexdir = "nex_files5"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=1)
print nloci
1203
## inputs
requires = [">Lib1_clemensiae_DRY6_PWS_2135_0",
">Lib1_tinus_D33_WC_277_0",
">Lib1_taiwanianum_TW1_KFC_1952_0",
">Lib1_lantanoides_D15_Beartown_2_0",
">Lib1_amplificatum_D3_SAN_156003_0",
">Lib1_lutescens_D35_PWS_2077_0",
">Lib1_lentago_ELS85_0",
">Lib1_dentatum_ELS4_0"]
outgroup = ""
infile = "empirical_1/halfrun/outfiles/empirical_1_half_m4.alleles"
nexdir = "nex_files6"
## run function
nloci = subsample(infile, requires, outgroup, nexdir, trim=6)
print nloci
106
import ipyparallel
import subprocess
import glob
## create a parallel client
ipclient = ipyparallel.Client()
lview = ipclient.load_balanced_view()
## function to run mrbayes; will be mapped across all engines
def mrbayes(infile):
    import subprocess
    cmd = "mb %s" % infile
    subprocess.check_call(cmd, shell=True)
## run on the full data set (nex_files1, 6-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files1/*"))
_ = res.get()
## run on the half data set (nex_files2, 6-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files2/*"))
_ = res.get()
## run on the full data set (nex_files3, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files3/*"))
_ = res.get()
## run on the half data set (nex_files4, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files4/*"))
_ = res.get()
## run on the full data set (nex_files5, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files5/*"))
_ = res.get()
## run on the half data set (nex_files6, 8-taxon sampling)
res = lview.map_async(mrbayes, glob.glob("nex_files6/*"))
_ = res.get()
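Before summarizing, it can be worth checking that every locus actually produced both replicate tree files (a MrBayes run that died would otherwise make the later mbsum step fail midway). This helper is my own addition, not part of the original notebook; the filename pattern matches the mbsum command used below:

```python
import os

def check_runs(nexdir, nloci):
    """Hypothetical helper: return indices of loci missing either
    replicate .t file, e.g. check_runs("nex_files1/", 3300)."""
    missing = []
    for n in range(nloci):
        for rep in (1, 2):
            fname = os.path.join(nexdir, "{}.nex.run{}.t".format(n, rep))
            if not os.path.exists(fname):
                missing.append(n)
                break
    return missing
```

An empty return value means all loci are ready for mbsum.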
import os
import subprocess

def mbsum(nexdir, nloci):
    ## create dir for bucky input files
    insdir = os.path.join(nexdir, "ins")
    if not os.path.exists(insdir):
        os.mkdir(insdir)
    ## combine trees from the two replicate runs
    for n in range(nloci):
        cmd = "mbsum -n 101 -o {}/{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
              format(insdir, n, nexdir, n, nexdir, n)
        subprocess.check_call(cmd, shell=True)
#mbsum("nex_files1/", 3300)
#mbsum("nex_files2/", 364)
#mbsum("nex_files3/", 1692)
#mbsum("nex_files4/", 169)
mbsum("nex_files5/", 1203)
mbsum("nex_files6/", 106)
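For reference, the first command that mbsum() issues for the full 8-taxon set looks like this (string construction only; as I understand the mbsum CLI, -n 101 sets the number of initial trees to skip from each run file as burn-in):

```python
## same format string as in mbsum() above, filled for locus 0
insdir, nexdir, n = "nex_files5/ins", "nex_files5/", 0
cmd = "mbsum -n 101 -o {}/{}.in {}{}.nex.run1.t {}{}.nex.run2.t".\
      format(insdir, n, nexdir, n, nexdir, n)
print(cmd)
# mbsum -n 101 -o nex_files5/ins/0.in nex_files5/0.nex.run1.t nex_files5/0.nex.run2.t
```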
args = []
for insdir in ["nex_files5/ins", "nex_files6/ins"]:
    ## independence test
    args.append("bucky --use-independence-prior -k 4 -n 500000 "
                "-o {}/BUCKY.ind {}/*.in".format(insdir, insdir))
    ## alpha at three levels
    for alpha in [0.1, 1, 10]:
        args.append("bucky -a {} -k 4 -n 500000 -c 4 -o {}/BUCKY.{} {}/*.in".\
                    format(alpha, insdir, alpha, insdir))

def bucky(arg):
    import subprocess
    subprocess.check_call(arg, shell=True)
    return arg
res = lview.map_async(bucky, args)
res.get()
['bucky --use-independence-prior -k 4 -n 500000 -o nex_files5/ins/BUCKY.ind nex_files5/ins/*.in',
 'bucky -a 0.1 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.0.1 nex_files5/ins/*.in',
 'bucky -a 1 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.1 nex_files5/ins/*.in',
 'bucky -a 10 -k 4 -n 500000 -c 4 -o nex_files5/ins/BUCKY.10 nex_files5/ins/*.in',
 'bucky --use-independence-prior -k 4 -n 500000 -o nex_files6/ins/BUCKY.ind nex_files6/ins/*.in',
 'bucky -a 0.1 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.0.1 nex_files6/ins/*.in',
 'bucky -a 1 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.1 nex_files6/ins/*.in',
 'bucky -a 10 -k 4 -n 500000 -c 4 -o nex_files6/ins/BUCKY.10 nex_files6/ins/*.in']
del lview
ipclient.close()
%%bash
head -n 40 nex_files1/ins/BUCKY.0.1.concordance
head: cannot open ‘nex_files1/ins/BUCKY.0.1.concordance’ for reading: No such file or directory
%%bash
head -n 40 nex_files1/ins/BUCKY.1.concordance
translate
 1 triphyllu,
 2 jamesonii,
 3 sulcatum_,
 4 acutifoli,
 5 dentatum_,
 6 recognitu;

Population Tree: (((1,2),(3,4)),5,6);
Primary Concordance Tree Topology: (((1,2),(3,4)),5,6);
Population Tree, With Branch Lengths In Estimated Coalescent Units: (((1:10.000,2:10.000):0.478,(3:10.000,4:10.000):0.156):0.840,5:10.000,6:10.000);
Primary Concordance Tree with Sample Concordance Factors: (((1:1.000,2:1.000):0.562,(3:1.000,4:1.000):0.351):0.591,5:1.000,6:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 2|3,4; 5,6} 0.586, 0.478,
{1,2; 5,6|3; 4} 0.429, 0.156,
{1,2; 3,4|5; 6} 0.712, 0.840,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,2,3,4|5,6} 0.591(0.562,0.621) 0.591(0.557,0.625) 0.001
{1,2|3,4,5,6} 0.562(0.533,0.586) 0.562(0.529,0.592) 0.001
{1,2,5,6|3,4} 0.351(0.320,0.383) 0.351(0.316,0.387) 0.001

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,4|3,5,6} 0.131(0.108,0.156) 0.131(0.105,0.159) 0.001
{1,2,3|4,5,6} 0.126(0.104,0.148) 0.126(0.101,0.151) 0.000
{1,4,5,6|2,3} 0.123(0.105,0.144) 0.123(0.102,0.147) 0.000
{1,3|2,4,5,6} 0.119(0.100,0.143) 0.119(0.097,0.145) 0.002
{1,3,5,6|2,4} 0.111(0.092,0.129) 0.111(0.090,0.132) 0.001
{1,4|2,3,5,6} 0.107(0.085,0.127) 0.107(0.083,0.130) 0.000
{1,5,6|2,3,4} 0.098(0.079,0.115) 0.098(0.077,0.118) 0.001
{1,3,4|2,5,6} 0.086(0.069,0.104) 0.086(0.067,0.107) 0.000
{1,2,4,5|3,6} 0.062(0.047,0.079) 0.063(0.045,0.082) 0.001
%%bash
head -n 40 nex_files2/ins/BUCKY.1.concordance
translate
 1 _triphyll,
 2 _jamesoni,
 3 _sulcatum,
 4 _acutifol,
 5 _dentatum,
 6 _recognit;

Population Tree: ((((1,2),3),4),5,6);
Primary Concordance Tree Topology: (((1,2),(3,4)),5,6);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((((1:10.000,2:10.000):0.315,3:10.000):0.080,4:10.000):0.540,5:10.000,6:10.000);
Primary Concordance Tree with Sample Concordance Factors: (((1:1.000,2:1.000):0.389,(3:1.000,4:1.000):0.293):0.378,5:1.000,6:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 2|3; 4,5,6} 0.513, 0.315,
{1,2,3; 4|5; 6} 0.612, 0.540,
{1,2; 3|4; 5,6} 0.385, 0.080,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,2|3,4,5,6} 0.389(0.308,0.470) 0.389(0.295,0.484) 0.003
{1,2,3,4|5,6} 0.378(0.294,0.467) 0.377(0.280,0.480) 0.002
{1,2,5,6|3,4} 0.293(0.206,0.393) 0.292(0.196,0.402) 0.006

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,3|4,5,6} 0.192(0.107,0.277) 0.191(0.101,0.289) 0.005
{1,4,5,6|2,3} 0.188(0.129,0.253) 0.188(0.120,0.267) 0.004
{1,3,5,6|2,4} 0.137(0.082,0.198) 0.137(0.074,0.208) 0.002
{1,3|2,4,5,6} 0.121(0.066,0.203) 0.122(0.059,0.211) 0.002
{1,5,6|2,3,4} 0.117(0.066,0.181) 0.117(0.059,0.190) 0.003
{1,4|2,3,5,6} 0.109(0.060,0.165) 0.109(0.053,0.175) 0.002
{1,2,4|3,5,6} 0.109(0.058,0.181) 0.109(0.051,0.189) 0.003
{1,3,4|2,5,6} 0.101(0.055,0.162) 0.101(0.048,0.172) 0.002
{1,5|2,3,4,6} 0.089(0.047,0.137) 0.089(0.040,0.148) 0.001
! head -n 45 nex_files3/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 davidii_D,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 carlesii_,
 8 dentatum_;

Population Tree: ((1,(3,(4,((5,6),7)))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,(3:10.000,(4:10.000,((5:10.000,6:10.000):0.968,7:10.000):0.092):0.029):0.134):2.022,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.201,((5:1.000,6:1.000):0.660,7:1.000):0.175):0.181):0.877,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,4,8; 7|5; 6} 0.747, 0.968,
{1; 3,4,5,6,7|2; 8} 0.912, 2.022,
{1; 2,8|3; 4,5,6,7} 0.417, 0.134,
{1,2,3,8; 4|5,6; 7} 0.392, 0.092,
{1,2,8; 3|4; 5,6,7} 0.353, 0.029,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.877(0.847,0.904) 0.877(0.842,0.908) 0.010
{1,2,3,4,7,8|5,6} 0.660(0.593,0.710) 0.660(0.590,0.715) 0.022
{1,2,5,6,7,8|3,4} 0.201(0.149,0.250) 0.201(0.146,0.255) 0.008
{1,2,8|3,4,5,6,7} 0.181(0.140,0.240) 0.181(0.136,0.244) 0.015
{1,2,3,4,8|5,6,7} 0.175(0.111,0.230) 0.175(0.109,0.233) 0.014

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,4,5,6,8|3,7} 0.140(0.096,0.205) 0.140(0.094,0.207) 0.024
{1,7|2,3,4,5,6,8} 0.137(0.067,0.194) 0.137(0.065,0.197) 0.019
{1,2,3,8|4,5,6,7} 0.136(0.088,0.184) 0.136(0.086,0.187) 0.012
{1,4|2,3,5,6,7,8} 0.132(0.089,0.181) 0.132(0.086,0.185) 0.019
{1,3,4,5,6|2,7,8} 0.131(0.061,0.194) 0.131(0.060,0.198) 0.031
{1,2,3,7,8|4,5,6} 0.124(0.053,0.175) 0.124(0.051,0.178) 0.013
{1,3|2,4,5,6,7,8} 0.122(0.082,0.183) 0.122(0.079,0.186) 0.018
{1,3,4|2,5,6,7,8} 0.120(0.046,0.196) 0.120(0.045,0.199) 0.041
! head -n 45 nex_files4/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 davidii_D,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 carlesii_,
 8 dentatum_;

Population Tree: ((1,(3,((4,(5,6)),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,(4,(5,6))),7)),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,(3:10.000,((4:10.000,(5:10.000,6:10.000):0.946):0.054,7:10.000):0.057):0.220):1.283,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,(4:1.000,(5:1.000,6:1.000):0.660):0.180):0.171,7:1.000):0.270):0.744,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,7,8; 4|5; 6} 0.741, 0.946,
{1; 3,4,5,6,7|2; 8} 0.815, 1.283,
{1; 2,8|3; 4,5,6,7} 0.465, 0.220,
{1,2,3,8; 7|4; 5,6} 0.368, 0.054,
{1,2,8; 3|4,5,6; 7} 0.370, 0.057,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.744(0.621,0.834) 0.744(0.595,0.856) 0.058
{1,2,3,4,7,8|5,6} 0.660(0.485,0.805) 0.660(0.463,0.818) 0.092
{1,2,8|3,4,5,6,7} 0.270(0.160,0.402) 0.270(0.142,0.425) 0.062
{1,2,3,7,8|4,5,6} 0.180(0.083,0.325) 0.180(0.069,0.344) 0.065
{1,2,7,8|3,4,5,6} 0.171(0.065,0.320) 0.171(0.053,0.340) 0.070

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,4,5,6,7|2,3,8} 0.180(0.089,0.278) 0.180(0.077,0.297) 0.031
{1,4|2,3,5,6,7,8} 0.164(0.101,0.225) 0.164(0.083,0.252) 0.010
{1,4,7|2,3,5,6,8} 0.160(0.089,0.225) 0.159(0.076,0.249) 0.013
{1,3,4,5,6|2,7,8} 0.134(0.018,0.237) 0.134(0.018,0.257) 0.053
{1,2,3,4,8|5,6,7} 0.121(0.000,0.260) 0.121(0.000,0.271) 0.068
{1,2,3,8|4,5,6,7} 0.110(0.000,0.195) 0.110(0.000,0.212) 0.033
{1,3,4|2,5,6,7,8} 0.106(0.000,0.213) 0.106(0.000,0.234) 0.071
{1,2,3,4,6,8|5,7} 0.103(0.000,0.201) 0.103(0.000,0.220) 0.024
! head -n 45 nex_files5/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 tinus_D33,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 lentago_E,
 8 dentatum_;

Population Tree: ((1,((3,4),((5,6),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,((3:10.000,4:10.000):0.067,((5:10.000,6:10.000):0.776,7:10.000):0.158):0.199):1.736,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.186,((5:1.000,6:1.000):0.619,7:1.000):0.181):0.228):0.844,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1; 3,4,5,6,7|2; 8} 0.883, 1.736,
{1,2,3,4,8; 7|5; 6} 0.693, 0.776,
{1; 2,8|3,4; 5,6,7} 0.454, 0.199,
{1,2,8; 3,4|5,6; 7} 0.431, 0.158,
{1,2,8; 5,6,7|3; 4} 0.377, 0.067,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.844(0.806,0.869) 0.844(0.801,0.877) 0.007
{1,2,3,4,7,8|5,6} 0.619(0.524,0.712) 0.619(0.518,0.719) 0.056
{1,2,8|3,4,5,6,7} 0.228(0.188,0.271) 0.228(0.182,0.278) 0.012
{1,2,5,6,7,8|3,4} 0.186(0.145,0.234) 0.186(0.140,0.240) 0.011
{1,2,3,4,8|5,6,7} 0.181(0.112,0.267) 0.181(0.108,0.272) 0.047

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,2,3,5,6,8|4,7} 0.166(0.111,0.220) 0.166(0.107,0.225) 0.028
{1,3|2,4,5,6,7,8} 0.165(0.113,0.229) 0.165(0.108,0.234) 0.031
{1,2,4,7,8|3,5,6} 0.128(0.076,0.186) 0.128(0.074,0.190) 0.017
{1,4|2,3,5,6,7,8} 0.124(0.062,0.180) 0.124(0.059,0.185) 0.033
{1,3,4,5,6|2,7,8} 0.121(0.076,0.171) 0.121(0.072,0.176) 0.027
{1,2,3,8|4,5,6,7} 0.119(0.082,0.171) 0.119(0.078,0.174) 0.015
{1,2,4,5,6,8|3,7} 0.112(0.072,0.162) 0.112(0.070,0.166) 0.019
{1,3,4,7|2,5,6,8} 0.088(0.055,0.141) 0.088(0.052,0.144) 0.010
! head -n 45 nex_files6/ins/BUCKY.0.1.concordance
translate
 1 clemensia,
 2 tinus_D33,
 3 taiwanian,
 4 lantanoid,
 5 amplifica,
 6 lutescens,
 7 lentago_E,
 8 dentatum_;

Population Tree: ((1,((3,4),((5,6),7))),2,8);
Primary Concordance Tree Topology: ((1,((3,4),((5,6),7))),2,8);
Population Tree, With Branch Lengths In Estimated Coalescent Units: ((1:10.000,((3:10.000,4:10.000):0.060,((5:10.000,6:10.000):0.654,7:10.000):0.279):0.385):1.135,2:10.000,8:10.000);
Primary Concordance Tree with Sample Concordance Factors: ((1:1.000,((3:1.000,4:1.000):0.164,((5:1.000,6:1.000):0.564,7:1.000):0.327):0.288):0.706,2:1.000,8:1.000);

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)
{1,2,3,4,8; 7|5; 6} 0.653, 0.654,
{1; 3,4,5,6,7|2; 8} 0.786, 1.135,
{1,2,8; 3,4|5,6; 7} 0.496, 0.279,
{1; 2,8|3,4; 5,6,7} 0.547, 0.385,
{1,2,8; 5,6,7|3; 4} 0.372, 0.060,

Splits in the Primary Concordance Tree: sample-wide and genome-wide mean CF (95% credibility), SD of mean sample-wide CF across runs
{1,3,4,5,6,7|2,8} 0.706(0.594,0.849) 0.705(0.553,0.867) 0.032
{1,2,3,4,7,8|5,6} 0.564(0.434,0.708) 0.564(0.403,0.729) 0.040
{1,2,3,4,8|5,6,7} 0.327(0.170,0.481) 0.327(0.151,0.504) 0.012
{1,2,8|3,4,5,6,7} 0.288(0.104,0.453) 0.288(0.092,0.483) 0.068
{1,2,5,6,7,8|3,4} 0.164(0.000,0.321) 0.164(0.000,0.350) 0.039

Splits NOT in the Primary Concordance Tree but with estimated CF > 0.050:
{1,4,5,6,7|2,3,8} 0.195(0.066,0.321) 0.195(0.054,0.353) 0.037
{1,2,3,7,8|4,5,6} 0.140(0.000,0.321) 0.139(0.000,0.343) 0.041
{1,2|3,4,5,6,7,8} 0.137(0.057,0.274) 0.137(0.038,0.285) 0.032
{1,2,4,8|3,5,6,7} 0.124(0.000,0.358) 0.124(0.000,0.367) 0.088
{1,2,3,4,5,8|6,7} 0.123(0.000,0.255) 0.123(0.000,0.273) 0.036
{1,2,3,4,6,8|5,7} 0.120(0.000,0.208) 0.120(0.000,0.239) 0.038
{1,2,3,5,6,8|4,7} 0.114(0.000,0.283) 0.114(0.000,0.298) 0.054
{1,4|2,3,5,6,7,8} 0.104(0.000,0.245) 0.104(0.000,0.269) 0.036
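The headline result in each of these .concordance files is the population tree. A small regex sketch (my own addition, not part of the pipeline) for pulling that line out programmatically; it relies only on the line format visible in the outputs above:

```python
import re

def population_tree(text):
    """Return the newick string following 'Population Tree:' in BUCKy
    .concordance output, or None if absent. Matching 'Tree:' with a
    colon skips the later 'Population Tree, With Branch Lengths' line."""
    m = re.search(r"Population Tree:\s*(\([^;]+;)", text)
    return m.group(1) if m else None

sample = "Population Tree: (((1,2),(3,4)),5,6);"
print(population_tree(sample))  # (((1,2),(3,4)),5,6);
```

Usage would be e.g. `population_tree(open("nex_files5/ins/BUCKY.0.1.concordance").read())`.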
For this I start a raxml run just to get the alignment info it prints, then kill it. A bit lazy, but simpler than calculating the statistics myself.
%%bash
## raxml arguments w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_1/fullrun \
-n empirical_1_full_m2 -s empirical_1/fullrun/outfiles/empirical_1_m2.phy
%%bash
head -n 20 empirical_1/fullrun/RAxML_info.empirical_1_full_m2
This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer (HITS)
Simon Berger (HITS)
Alexey Kozlov (HITS)
Nick Pattengale (Sandia)
Wayne Pfeiffer (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao (KIT)
Charlie Taylor (UF)

Alignment has 502491 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 88.66%

RAxML rapid bootstrapping and subsequent ML search
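Since the point of the aborted run is just that one statistic, the info file can also be parsed directly. A sketch (function name and regex are mine; only the line format comes from the RAxML output above):

```python
import re

def missing_proportion(info_text):
    """Parse the percent of gaps and completely undetermined characters
    from the text of a RAxML_info file."""
    m = re.search(r"undetermined characters in this alignment:\s*([\d.]+)%",
                  info_text)
    return float(m.group(1)) if m else None

## usage on the file written by the run above would be:
## missing_proportion(open("empirical_1/fullrun/RAxML_info.empirical_1_full_m2").read())
line = "Proportion of gaps and completely undetermined characters in this alignment: 88.66%"
print(missing_proportion(line))  # 88.66
```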
%%R
mean(cophenetic.phylo(ltre))
[1] 0.06114884