Cookbook for running BUCKy in parallel in a Jupyter notebook

This notebook uses the Pedicularis example data set from the first empirical ipyrad tutorial. Here I show how to run BUCKy on a large set of loci parsed from the .alleles.loci output file. All code in this notebook is Python. You can simply follow along and execute this same code in a Jupyter notebook of your own.

Software requirements for this notebook

All required software can be installed through conda by running the commented-out code below in a terminal.

In [1]:
## conda install -c BioBuilds mrbayes
## conda install -c ipyrad ipyrad
## conda install -c ipyrad bucky
In [2]:
## import Python libraries
import ipyrad.analysis as ipa
import ipyparallel as ipp

Cluster setup

To execute code in parallel we will use the ipyparallel Python library. A quick guide to starting a parallel cluster locally can be found here, and instructions for setting up a remote cluster on an HPC system are available here. In either case, this notebook assumes you have started an ipcluster instance that this notebook can find, which the cell below will test.

In [3]:
## look for a running ipcluster instance and connect to it
ipyclient = ipp.Client()
print("{} engines found".format(len(ipyclient)))
4 engines found

Create a bucky analysis object

The two required arguments are the name and data arguments. The data argument should be a .loci file or a .alleles.loci file. The name will be used to name output files, which will be written to {workdir}/{name}/{number}.nex. BUCKy doesn't handle missing data well, so a locus will only be included if it contains data for all samples in the analysis. By default, all samples found in the loci file will be used, unless you enter a list of names (the samples argument) to subsample taxa, which we do here. It is best to select one individual per species or subspecies. You can set a number of additional parameters in the .params dictionary.
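The two filtering rules can be sketched in plain Python. This is not ipyrad's internal code, just an illustration of the criteria: a locus must include every requested sample, and it must contain at least `minsnps` variable sites.

```python
def passes_filters(locus, samples, minsnps=2):
    """Return True if a locus (a dict of sample name -> sequence)
    covers all requested samples and has >= minsnps variable sites."""
    ## rule 1: every requested sample must be present in the locus
    if not all(name in locus for name in samples):
        return False
    ## rule 2: count alignment columns with more than one real base
    seqs = [locus[name] for name in samples]
    nsnps = sum(
        len(set(col) - set("N-")) > 1 for col in zip(*seqs)
    )
    return nsnps >= minsnps

## toy locus with one variable site among three samples
locus = {"a": "ACGT", "b": "ACGA", "c": "ACGT"}
print(passes_filters(locus, ["a", "b", "c"], minsnps=1))  # True
print(passes_filters(locus, ["a", "b", "c"], minsnps=2))  # False
print(passes_filters(locus, ["a", "b", "d"]))             # False: "d" missing
```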

In [4]:
## make a list of sample names you wish to include in your BUCKy analysis 
samples = [
    "29154_superba", 
    "30686_cyathophylla", 
    "41478_cyathophylloides", 
    "33413_thamno", 
    "30556_thamno",
    "35236_rex",
    "40578_rex", 
    "38362_rex", 
    "33588_przewalskii",
]
In [5]:
## initiate a bucky object
b = ipa.bucky(
    name="test",
    data="analysis-ipyrad/pedic_outfiles/pedic.alleles.loci",
    workdir="analysis-bucky",
    samples=samples,
    minsnps=2,
    maxloci=100,
)
In [6]:
## print the params dictionary
b.params
Out[6]:
bucky_alpha           [0.1, 1.0, 10.0]    
bucky_nchains         4                   
bucky_niter           1000000             
bucky_nreps           4                   
maxloci               100                 
mb_mcmc_burnin        100000              
mb_mcmc_ngen          1000000             
mb_mcmc_sample_freq   1000                
minsnps               2                   
seed                  869236347           
In [7]:
## This will write nexus files to {workdir}/{name}/{number}.nex
b.write_nexus_files(force=True)
wrote 100 nexus files to ~/Documents/ipyrad/tests/analysis-bucky/test

An example nexus file

In [8]:
## print an example nexus file
! cat analysis-bucky/test/1.nex
#NEXUS
begin data;
dimensions ntax=9 nchar=64;
format datatype=dna interleave=yes gap=- missing=N;
matrix
30686_cyathophylla      TCCTCGGCAGCCATTAGACCGGTGGAATATGCACCATGTACCGATCCTGGATAATCAAAACTCG
33413_thamno            TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACTGATCCTGGATAATCAAAACTTG
33588_przewalskii       TCCTCGGCAGCCATCAGACCGGTGGAGTGTGCACCATGCACCGATCCCGCATAATCAAAACTCG
29154_superba           TCCTCGGCAGCCATTAGACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG
41478_cyathophylloides  TCCTCGGCAGCCATTAGACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTCG
40578_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
30556_thamno            TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
38362_rex               TCCTCGGCAGCCATTAAACCGGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG
35236_rex               TCCTCGGCAGCCATTAAACCAGTGGAGTATGCACCATGTACCGATCCTGGATAATCAAAACTTG

    ;

begin mrbayes;
set autoclose=yes nowarn=yes;
lset nst=6 rates=gamma;
mcmc ngen=1000000 samplefreq=1000 printfreq=1000000;
sump burnin=100000;
sumt burnin=100000;
end;
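If you want to sanity-check these files programmatically, the dimensions line can be parsed with a few lines of standard-library Python. This is a minimal sketch for the simple single-matrix layout shown above, not a general NEXUS parser.

```python
import re

def nexus_dimensions(text):
    """Pull ntax and nchar from a NEXUS 'dimensions' line."""
    match = re.search(r"dimensions\s+ntax=(\d+)\s+nchar=(\d+)", text)
    if match is None:
        raise ValueError("no dimensions line found")
    return int(match.group(1)), int(match.group(2))

## a small excerpt in the same layout as the file printed above
example = """#NEXUS
begin data;
dimensions ntax=9 nchar=64;
format datatype=dna interleave=yes gap=- missing=N;
"""
print(nexus_dimensions(example))  # (9, 64)
```

To check a file on disk, pass `open("analysis-bucky/test/1.nex").read()` instead of the example string.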

Complete a BUCKy analysis

There are three parts to a BUCKy analysis, which we have split into three distinct functions: .run_mrbayes(), then .run_mbsum(), followed by .run_bucky(). Each step uses the files produced by the previous one. You can use the force flag to overwrite existing files. An ipyclient should be provided to distribute the jobs in parallel. Parallelization matters most for the mrbayes step, where additional cores yield approximately linear speedups.

In [9]:
## distributes mrbayes jobs across the parallel client
b.run_mrbayes(force=True, ipyclient=ipyclient)
[####################] 100% [mb] infer gene-tree posteriors | 0:49:01 |  
In [10]:
## this step is fast, simply summing the gene-tree posteriors
b.run_mbsum(ipyclient=ipyclient)
[####################] 100% [mbsum] sum replicate runs      | 0:00:07 |  
In [11]:
## infer concordance factors with BUCKy. This will run in parallel 
## for however many alpha values are in b.params.bucky_alpha list
b.run_bucky(ipyclient=ipyclient)
[####################] 100% [bucky] infer CF posteriors     | 1:35:16 |  

Convenient access to results

For now, parse results from the .concordance output files produced by BUCKy, which will be located in your working directory (default is "./analysis-bucky"). Convenience functions for parsing results as tables and trees will be added soon.
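In the meantime, split lines and their concordance factors can be pulled out with a small regex. The excerpt below is a hypothetical example of the split table in a .concordance file; the exact layout can differ between BUCKy versions, so adapt the pattern to your own files.

```python
import re

## hypothetical excerpt from a .concordance file: a split followed by
## sample-wide and genome-wide CFs with credibility intervals
excerpt = """\
Splits in the Primary Concordance Tree: sample-wide CF, genome-wide CF
{1,2,3|4,5,6,7,8,9} 0.770(0.682,0.899) 0.759(0.633,0.899)
{1,2|3,4,5,6,7,8,9} 0.521(0.433,0.612) 0.513(0.401,0.622)
"""

## capture the split, the sample-wide CF, and the genome-wide CF
split_pat = re.compile(r"(\{[\d,|]+\})\s+([\d.]+)\(([\d.,]+)\)\s+([\d.]+)")

rows = []
for line in excerpt.splitlines():
    m = split_pat.match(line)
    if m:
        rows.append((m.group(1), float(m.group(2)), float(m.group(4))))

for split, sample_cf, genome_cf in rows:
    print(split, sample_cf, genome_cf)
```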