#!/usr/bin/env python
# coding: utf-8

# # Species-tree & species-delimitation using *bpp* (BP&P)

# The program *bpp* by Rannala & Yang (2010; 2015) is a powerful tool for inferring species tree parameters and testing species delimitation hypotheses. It is *relatively* easy to use, and best of all, it's *quite fast*, although not highly parallelizable. This notebook describes a streamlined approach to easily set up input files for testing different hypotheses in *bpp*, and to do so in a clear programmatic way that makes it easy to perform many tests over many different parameter settings. We also show how to distribute many separate jobs to run in parallel on a cluster.

# ## Notebook setup

# This is a Jupyter notebook, a reproducible and executable document. The code in this notebook is Python (2.7), and should be executed in a jupyter-notebook like this one. Execute each cell in order to reproduce our entire analysis. The example data set used in this analysis is from the [empirical example ipyrad tutorial](http://ipyrad.readthedocs.io/pedicularis_.html).

# ### Install required software

# All software required for this notebook can be installed using conda.

# In[1]:

## conda install -c ipyrad ipyrad
## conda install -c ipyrad bpp
## conda install -c eaton-lab toytree

# In[2]:

import ipyrad.analysis as ipa    ## ipyrad analysis tools
import ipyparallel as ipp        ## parallelization
import pandas as pd              ## DataFrames
import numpy as np               ## numerical arrays
import toytree                   ## tree plotting
import toyplot                   ## data plotting

# ### Connect to an ipyparallel cluster

# We will use the `ipyparallel` library to submit jobs to run in parallel on a cluster. You will need to have an `ipcluster` instance running in a separate terminal on your machine (or ideally, running on your HPC cluster). The code below simply connects to that cluster and prints how many CPUs are available for use.

# In[4]:

## connect to a running ipcluster instance
ipyclient = ipp.Client()

## print information about our cluster
print "Connected to {} cores".format(len(ipyclient))

# ## Analysis setup

# You must define a tree with the "species" names in your analysis, which will act either as a fixed tree or as a guide tree for your analysis. You must also define an `IMAP` dictionary which maps sample names to "species" names. And you can optionally define a `MINMAP` dictionary, which is used to filter RAD loci to include only those loci that have data for at least N samples from each species.

# In[5]:

## set the location of our input .loci file
locifile = "./analysis-ipyrad/pedicularis_outfiles/pedicularis.alleles.loci"

## a tree hypothesis (guidetree) (here based on tetrad results)
newick = "((((((rex, lip),rck),tha),cup),(cys,(cya, sup))),prz);"

## a dictionary mapping sample names to 'species' names
imap = {
    "prz": ["32082_przewalskii", "33588_przewalskii"],
    "cys": ["41478_cyathophylloides", "41954_cyathophylloides"],
    "cya": ["30686_cyathophylla"],
    "sup": ["29154_superba"],
    "cup": ["33413_thamno"],
    "tha": ["30556_thamno"],
    "rck": ["35236_rex"],
    "rex": ["35855_rex", "40578_rex"],
    "lip": ["39618_rex", "38362_rex"],
}
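# Before going further it can be worth sanity-checking that the "species"
# names in the `imap` dictionary match the tip labels of the guide tree,
# since mismatches will cause problems when files are generated. A minimal
# sketch (this cell is an illustration, not part of the original analysis,
# and assumes toytree's `get_tip_labels()` method):

# In[ ]:

## compare imap keys to the tree's tip labels
tips = set(toytree.tree(newick).get_tip_labels())
assert set(imap.keys()) == tips, "imap species must match the tree tips"
print "imap species match the guide tree tips"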
## optional: loci will be filtered if they do not have data for at
## least N samples/individuals in each species.
minmap = {
    "prz": 2, "cys": 2, "cya": 1,
    "sup": 1, "cup": 1, "tha": 1,
    "rck": 1, "rex": 2, "lip": 2,
}

# In[8]:

## check your (starting) tree hypothesis
toytree.tree(newick).draw(use_edge_lengths=False, width=250);

# ## The *bpp* Class object

# To simplify the creation of input files for *bpp* analyses we've created a bpp job generator object that can be accessed from `ipa.bpp()`. Running *bpp* requires three input files (.ctl, .imap, and .seq), of which the .ctl file is the most important since it contains the parameters for a run and points to the location of the other two files. The `ipa.bpp()` object can be used to easily modify parameter settings for a run, to generate the input files, and, if desired, to submit the bpp jobs to run on a cluster (your ipyclient cluster).

# In[9]:

## create a bpp object to run algorithm 00
b = ipa.bpp(
    name="test",
    data=locifile,
    guidetree=newick,
    imap=imap,
    minmap=minmap,
)

# In[10]:

## set some optional params, leaving others at their defaults
b.params.burnin = 2000
b.params.nsample = 20000
b.params.sampfreq = 2
b.params.seed = 33333
b.params.tauprior = (2, 200, 1)

## print params
b.params

# In[11]:

## set some optional filters, leaving others at their defaults
b.filters.maxloci = 100
b.filters.minsnps = 0

## print filters
b.filters

# ### Generating files &/or submitting jobs

# When you create a *bpp* object you save it to a variable (here `b`), which is simply the name you will use to reference the object. You must also provide a *name* (a prefix for output files; here "test") that will be used by either of the two main functions of the bpp object: **write_bpp_files()** or **run()**. Both functions make it easy to sample different distributions of loci to include in different replicate bpp analyses. Each replicate starts from a different random seed generated from the initial `seed`. If you used the `maxloci` argument to limit the number of loci used in the analysis then you can also use the `randomize_order` argument to sample a different random set of N loci in each replicate; otherwise just the first N loci that pass filtering will be sampled.

# #### `write_bpp_files()`

# This writes the .ctl, .seq, and .imap files for the specified run.

# In[12]:

## write files (filtered based on b.filters and b.params)
b.write_bpp_files()
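# If you prefer to run *bpp* outside of this notebook, the written .ctl file
# can be passed straight to the bpp binary. A minimal sketch; the exact file
# paths and the .files attribute contents at this stage are assumptions, so
# inspect the object and output directory before relying on them:

# In[ ]:

## the paths of the written input files are stored on the object
## (an assumption; inspect the object if the attribute differs)
print b.files

## the .ctl file can then be run directly with the bpp binary, e.g.:
## $ bpp test.ctl    (filename assumed from the name="test" prefix)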
# #### `run()`

# This writes the files for each job and submits the bpp jobs to run on the cluster designated by the *ipyclient* object. You can efficiently submit many replicate jobs in this way.

# In[13]:

## or, write files and submit job to run on ipyclient
b.run(
    nreps=2,
    ipyclient=ipyclient,
    randomize_order=True,
)

# #### Accessing job results

# When you submit jobs the results files will be stored in the bpp object's `.files` attribute. In addition, the asynchronous result objects (a representation of the running job) from each submitted job will be accessible from the `.asyncs` attribute of the bpp object. You can view these objects to see whether your job has finished, or use them to trace errors if one arises, as we do below.

# In[15]:

## check async objects from the bpp object
for job in b.asyncs:
    if job.ready():
        print 'job finished'
    else:
        print 'job running'

# In[29]:

## uncomment and run this to block until all running jobs are finished
#ipyclient.wait()

# #### Results

# In[34]:

## set options to print prettier tables
## (this suppresses scientific notation)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# In[35]:

## print the full results table
b.summarize_results()

# In[33]:

## uncomment to print a subset of the table above
#b.summarize_results()[["mean", "std", "min", "max"]]

# --------------
# ## Examples
# --------------

# ## Algorithm 00 - fixed tree parameter inference

# The 00 algorithm means `'infer_sptree=0'` and `'infer_delimit=0'`, thus the tree that you enter will be treated as the fixed species tree and the analysis will infer parameters for the tree under the multispecies coalescent model. This will yield values of $\Theta$ for each branch of the tree, and divergence times ($\tau$) for each split in the tree.

# In[16]:

## create two new copies of the bpp object above, but with new names
A00 = b.copy(name="A00")
A00n = b.copy(name="A00-nodata")

## set params on the 'nodata' object to not use seq data
A00n.params.usedata = 0

# In[34]:

## submit a few replicate jobs from different random subsets of loci
A00.run(nreps=3, randomize_order=True, ipyclient=ipyclient, force=True)
A00n.run(nreps=1, randomize_order=True, ipyclient=ipyclient, force=True)

# In[35]:

## wait for jobs to finish
ipyclient.wait()

# #### Summarize results tables for algorithm 00

# Different bpp algorithms produce different types of results files. For algorithm 00 the mcmc results file is simply a table of $\Theta$ and $\tau$ values, so we can just parse it as a CSV file to summarize results. The same results are available in the `.out.txt` file, but I find that parsing the results this way is a bit easier to view. Take note that these results tables show percentiles, which are similar but not identical to posterior densities. You can get the latter from the `.out` file produced by bpp, or summarize each replicate separately using the `individual_results=True` argument to `.summarize_results()`, as shown below.

# #### The 'no-data' results

# We expect that when we use no data the priors will be returned. In this case we see no significant differences in theta values among samples, which is good. The tau values will be distributed according to the prior on the root divergence, with the other nodes distributed according to a dirichlet process. It will be of interest to compare these divergence time estimates to the ones produced by the analysis *with sequence data*, to assess whether the sequence data cause the divergence time estimates to depart significantly from their priors.

# In[17]:

## summary for the first 8 params of the no-data run
A00n.summarize_results().head(8)

# #### Results with data

# You can see here that the theta and tau values are quite different from the priors, which is good and tells us that our sequence data are informative. If you check the `.out` file produced by bpp you can also see the effective sample size (ESS) for each parameter, which indicates whether or not we've run the mcmc long enough. Ideally we would like each ESS value to be >100.

# In[18]:

## summary for the first 8 params of the with-data run
A00.summarize_results().head(8)

# In[20]:

## get individual results for each rep
reps = A00.summarize_results(individual_results=True)

## check convergence (ESS*) for each param in rep 1
reps[1][["mean", "S.D.", "ESS*"]]
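# The ESS values above come from bpp's own output, but you can also compute
# a rough estimate yourself from the raw MCMC samples using the standard
# estimator $ESS \approx N / (1 + 2\sum_t \rho_t)$, where $\rho_t$ is the
# lag-$t$ autocorrelation. A minimal sketch (the truncation rule here is a
# simplification of what bpp actually does):

# In[ ]:

def rough_ess(samples, maxlag=100):
    """crude ESS estimate from a 1-D array of MCMC samples"""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    n = len(x)
    maxlag = min(maxlag, n - 1)
    denom = x.var() * n
    ## sum positive autocorrelations until they first drop to <= 0
    rho_sum = 0.
    for lag in range(1, maxlag):
        rho = np.dot(x[:-lag], x[lag:]) / denom
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1. + 2. * rho_sum)

## usage, once an mcmc table is loaded as a DataFrame (as in the
## histogram section below): rough_ess(table[param])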
# ### Draw some histograms

# Below we draw the posterior parameter distributions for the analyses with and without sequence data. In a real analysis you would probably want to run the analyses for longer, but you can see here that the results depart significantly from the prior expectation.

# In[21]:

## get full MCMC table from the .files attribute
table_1 = pd.read_csv(A00n.files.mcmcfiles[0], sep='\t', index_col=0)
table_2 = pd.read_csv(A00.files.mcmcfiles[0], sep='\t', index_col=0)

## draw barplots
canvas = toyplot.Canvas(width=800, height=800)
for aidx in range(table_1.shape[1]):
    ## get data
    param = table_1.columns[aidx]
    hist1 = np.histogram(table_1[param], density=True)
    hist2 = np.histogram(table_2[param], density=True)

    ## build plot
    nx = int(np.ceil(np.sqrt(table_1.shape[1])))
    axes = canvas.cartesian(grid=(nx, nx, aidx), margin=25)
    axes.bars(hist1, opacity=0.65, style={"stroke-width": 0})
    axes.bars(hist2, opacity=0.65, style={"stroke-width": 0})

    ## style axes (hide tick labels for clarity)
    axes.label.text = param[:15]
    axes.label.style["font-size"] = "12px"
    axes.x.ticks.labels.show = False
    axes.y.ticks.labels.show = False

## save plot to pdf
import toyplot.pdf
toyplot.pdf.render(canvas, "analysis-bpp/bpp-A00.pdf")
canvas

# #### The FigTree file

# Only algorithm 00 produces a FigTree file, so be aware that you will not find one when you run the other algorithms; they produce other types of results files instead.

# In[40]:

## the tree file -- can be input to the program FigTree
tre = A00n.files.treefiles[0]
get_ipython().system(' cat $tre')

# ----------------------------------------

# ## Algorithm 10 - species tree inference

# Algorithm 10 aims to infer the species tree from the data by implementing a tree search, so the input tree is treated only as a starting tree. Based on the results above I am also going to set the priors on the divergence times (tau) explicitly on the new objects.

# In[55]:

## create new bpp objects
A10 = A00.copy("A10")
A10n = A00.copy("A10-nodata")

## set new params
A10.params.infer_sptree = 1
A10.params.infer_delimit = 0
A10.params.tauprior = (2, 200, 1)
A10n.params.infer_sptree = 1
A10n.params.infer_delimit = 0
A10n.params.tauprior = (2, 200, 1)

## also set no-data on the 'n' run
A10n.params.usedata = 0
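# In bpp the tau prior is a gamma distribution G(a, b) on the root
# divergence time, with mean a/b, and the third value is the dirichlet
# parameter spreading the remaining taus (consistent with the 'no-data'
# discussion above). A quick sketch of the prior mean implied by the
# settings just assigned (assumes `params.tauprior` is stored as the
# tuple we set):

# In[ ]:

## mean of the gamma prior G(a, b) on the root tau is a/b
a, bb = A10.params.tauprior[:2]
print "prior mean for root tau = {:.4f}".format(a / float(bb))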
# In[26]:

## submit job reps to the cluster
A10.run(nreps=2, randomize_order=True, ipyclient=ipyclient, force=True)
A10n.run(nreps=1, randomize_order=True, ipyclient=ipyclient, force=True)

# In[27]:

## block until finished
ipyclient.wait()

# #### Plot the distribution of species trees from algorithm 10

# Here we use the `multitree()` object of toytree to plot the posterior distribution of trees. The code for drawing 'cloudtrees' here is still in development and a little clunky, but will improve in the future. From the example here (which has not been run long enough, btw), you can see clearly that the two replicates yield quite different trees. The replicates start from different random seeds, but probably more importantly, each samples a different random subset of 100 RAD loci that met our filtering requirements (e.g., minsamples, minsnps).

# In[52]:

## load trees, slicing out every Nth: (start, end, by)
trees1 = toytree.multitree(
    newick=A10.files.mcmcfiles[0],
    treeslice=(500, 5000, 30),
)

## build canvas
canvas = toyplot.Canvas(width=800, height=400)
axes1 = canvas.cartesian(grid=(1, 2, 0), padding=25)
axes2 = canvas.cartesian(grid=(1, 2, 1), padding=25)

## draw cloud tree
trees1.draw_cloudtree(
    axes=axes1,
    use_edge_lengths=True,
    orient='right',
    edge_style={"opacity": 0.025},
);

## draw consensus (doesn't save edge lengths currently)
cons = trees1.get_consensus_tree()
cons.root(wildcard="prz")
cons.draw(
    axes=axes2,
    node_labels=cons.get_node_values("support"),
);

## style axes
axes1.y.show = False
axes2.y.show = False

# In[54]:

## same plot for the second replicate run
trees2 = toytree.multitree(
    newick=A10.files.mcmcfiles[1],
    treeslice=(500, 5000, 30),
)

## build canvas
canvas = toyplot.Canvas(width=800, height=400)
axes1 = canvas.cartesian(grid=(1, 2, 0), padding=25)
axes2 = canvas.cartesian(grid=(1, 2, 1), padding=25)

## draw cloud tree
trees2.draw_cloudtree(
    axes=axes1,
    use_edge_lengths=True,
    orient='right',
    edge_style={"opacity": 0.025},
);

## draw consensus (doesn't save edge lengths currently)
cons = trees2.get_consensus_tree()
cons.root(wildcard="prz")
cons.draw(
    axes=axes2,
    node_labels=cons.get_node_values("support"),
);

## style axes and save
axes1.y.show = False
axes2.y.show = False
import toyplot.html
toyplot.html.render(canvas, "analysis-bpp/tree.html")
canvas

# --------
# ## Algorithm 01 -- species delimitation
# --------

# In[56]:

## create new bpp objects
A01 = A00.copy("A01")
A01n = A00.copy("A01-nodata")

## set new params
A01.params.infer_sptree = 0
A01.params.infer_delimit = 1
A01.params.tauprior = (2, 200, 1)
A01n.params.infer_sptree = 0
A01n.params.infer_delimit = 1
A01n.params.tauprior = (2, 200, 1)

## also set no-data on the 'n' run
A01n.params.usedata = 0

# In[57]:

## submit job reps to the cluster
A01.run(nreps=2, randomize_order=True, ipyclient=ipyclient, force=True)
A01n.run(nreps=1, randomize_order=True, ipyclient=ipyclient, force=True)

# In[58]:

## block until jobs finish
ipyclient.wait()

# #### Summarize results

# In[64]:

## obviously this was not run long enough, b/c when there is no data
## the posterior should match the prior!
A01n.summarize_results()

# In[65]:

## results from two replicate runs
A01.summarize_results(individual_results=True)

# In[66]:

## or, averaged results across independent replicates
A01.summarize_results(individual_results=False)
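# Because each replicate sampled a different random subset of loci, it is
# worth checking how concordant the two runs are before trusting either.
# A minimal sketch comparing posterior means across the replicates (the
# "mean" column name is an assumption based on the algorithm 00 tables
# shown earlier):

# In[ ]:

## side-by-side posterior means from the two A01 replicates
reps = A01.summarize_results(individual_results=True)
pd.DataFrame({"rep0": reps[0]["mean"], "rep1": reps[1]["mean"]})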
# ## Other Examples: Setting up tests of multiple prior settings

# Rannala and Yang suggest that you try out several species delimitation algorithms, and that you do so over a range of parameter values for the two algorithms. They suggest running algorithm 0 with $\epsilon$=(2, 5, 10, 20), and algorithm 1 with $\alpha$=(1, 1.5, 2) and $m$=(1, 1.5, 2), and also doing this with different starting trees. Using our programmatic approach you can easily set up all of these tests and run them in parallel using a simple for-loop, as shown below.

# In[297]:

## create a bpp object with algorithm 01 settings
A01 = A10.copy(name="A01")
A01.params.infer_sptree = 0
A01.params.infer_delimit = 1

# In[298]:

## submit jobs using a range of param options
DELIMIT_TESTS = [
    (0, 2),
    (0, 5),
    (0, 10),
    (0, 20),
    (1, 1.0, 1.0),
    (1, 1.0, 1.5),
    (1, 1.0, 2.0),
    (1, 1.5, 1.0),
    (1, 1.5, 1.5),
    (1, 1.5, 2.0),
    (1, 2.0, 1.0),
    (1, 2.0, 1.5),
    (1, 2.0, 2.0),
]

## a dictionary to store our bpp objects
bpps = {}

## iterate over delimit args
for test in DELIMIT_TESTS:
    ## make a copy of the delimitation object
    name = "A01-{}".format("-".join([str(i) for i in test]))
    bpps[name] = A01.copy(name)

    ## set the delimit algorithm
    bpps[name].params.delimit_alg = test

    ## submit the job
    bpps[name].run(nreps=1, seed=123, ipyclient=ipyclient)
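# Once all of the submitted jobs finish, the results of each test can be
# collected from the stored objects. A minimal sketch using the same
# `summarize_results()` call demonstrated above:

# In[ ]:

## block until all jobs are done, then print a summary for each test
ipyclient.wait()
for name in sorted(bpps):
    print name
    print bpps[name].summarize_results()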