#!/usr/bin/env python
# coding: utf-8

# # Eaton & Ree (2013) single-end RAD data set
# 
# Here we demonstrate a denovo assembly of an empirical RAD data set using the ipyrad Python API. This example was run on a workstation with 20 cores and takes about 10 minutes to assemble, but you should be able to run it on a 4-core laptop in ~30-60 minutes.
# 
# For our example data we will use the 13-taxon *Pedicularis* data set from [Eaton and Ree (2013) (Open Access)](https://academic.oup.com/sysbio/article/62/5/689/1684460/Inferring-Phylogeny-and-Introgression-using-RADseq). This data set is composed of single-end 75bp reads from a RAD-seq library prepared with the PstI enzyme. The data set also serves as an example for several of our [analysis cookbooks](http://ipyrad.readthedocs.io/analysis.html) that demonstrate methods for analyzing RAD-seq data files. At the end of this notebook there are also several examples of how to use the ipyrad analysis tools to run downstream analyses in parallel.
# 
# The figure below shows the ingroup taxa from this study and their sampling locations. The study includes all species within a small monophyletic clade of *Pedicularis*, including multiple individuals from 5 species and several subspecies, as well as an outgroup species. The sampling spans from population-level variation, where species boundaries are unclear, to higher-level divergence, where species boundaries are quite distinct. This is a scale at which RAD-seq data are often very useful.

# ## Setup (software and data files)
# 
# If you haven't done so yet, start by installing `ipyrad` using conda (see the ipyrad installation instructions) as well as the packages in the cell below. This is easiest to do in a terminal. Then open a jupyter-notebook, like this one, and follow along with the tutorial by copying and executing the code in the cells, adding your own documentation between them using markdown. Feel free to modify parameters to see their effects on the downstream results.

# In[1]:

## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
## conda install sra-tools -c bioconda
## conda install entrez-direct -c bioconda


# In[2]:

## imports
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp


# In contrast to the ipyrad CLI, the ipyrad API gives users much more fine-scale control over the parallelization of their analyses, but this also requires learning a little bit about the library we use to do this, called `ipyparallel`. This library is designed for use with jupyter-notebooks to allow massive-scale multi-processing while working *interactively*.
# 
# Understanding the nuts and bolts of it might take a little while, but it is fairly easy to get started, especially in the way it is integrated with ipyrad. To start a parallel client you must run the command-line program `ipcluster`. This starts a number of independent Python processes (engines) to which we can then send bits of work. The cluster can be stopped and restarted independently of this notebook, which is convenient when working on an HPC system where a large number of cores is not always immediately available.
# 
# Open a terminal and type the following command to start an ipcluster instance with N engines.

# In[3]:

## ipcluster start --n=20


# In[4]:

## connect to cluster
ipyclient = ipp.Client()
ipyclient.ids
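# As an optional quick check before sending any heavy jobs to the cluster (a rough sketch, assuming the `ipyclient` object from the cell above), you can confirm how many engines are connected and which hosts they are running on using standard `ipyparallel` calls. If fewer engines are reported than you started, give `ipcluster` a few more seconds to finish launching and re-run the cell.

# In[ ]:

## optional: confirm that all engines are connected before sending work to them
import socket

## number of connected engines
print(len(ipyclient.ids), "engines connected")

## hostnames of the engines (useful on multi-node HPC clusters)
hosts = ipyclient[:].apply_sync(socket.gethostname)
print(set(hosts))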
# ### Download the data set (Pedicularis)
# 
# These data are archived on the NCBI sequence read archive (SRA) under accession id SRP021469. As part of the ipyrad analysis tools we provide a wrapper around the sra-tools software that can be used to query NCBI and download sequence data based on accession IDs. Run the code below to download the fastq data files associated with this study. The data will be saved in the specified directory, which will be created if it does not already exist. The compressed file size of the data is a little over 1GB. If you pass your ipyclient to the `.run()` command below then the download will be parallelized.

# In[5]:

## download the Pedicularis data set from NCBI
sra = ipa.sratools(accession="SRP021469", workdir="fastqs-Ped")
sra.run(force=True, ipyclient=ipyclient)


# ## Create an Assembly object
# 
# This object stores the parameters of the assembly and the organization of data files.

# In[6]:

## you must provide a name for the Assembly
data = ip.Assembly("pedicularis")


# Set parameters for the Assembly. This will raise an error if any of the parameters are not allowed because they are the wrong type or outside the allowed range.

# In[10]:

## set parameters
data.set_params("project_dir", "analysis-ipyrad")
data.set_params("sorted_fastq_path", "fastqs-Ped/*.fastq.gz")
data.set_params("clust_threshold", "0.90")
data.set_params("filter_adapters", "2")
data.set_params("max_Hs_consens", (5, 5))
data.set_params("trim_loci", (0, 5, 0, 0))
data.set_params("output_formats", "psvnkua")

## see/print all parameters
data.get_params()


# ## Assemble the data set

# In[24]:

## run steps 1 & 2 of the assembly
data.run("12")


# In[25]:

## access the stats of the assembly (so far) from the .stats attribute
data.stats


# In[26]:

## run steps 3-6 of the assembly
data.run("3456")


# ### Branch to create several final data sets with different parameter settings

# In[11]:

## create a branch for outputs with min_samples = 4 (lots of missing data)
min4 = data.branch("min4")
min4.set_params("min_samples_locus", 4)
min4.run("7")

## create a branch for outputs with min_samples = 13 (no missing data)
min13 = data.branch("min13")
min13.set_params("min_samples_locus", 13)
min13.run("7")

## create a branch with no missing data for ingroups, but allow
## missing data in the outgroups by setting population assignments.
## The population min-sample values overrule the min_samples_locus param.
pops = data.branch("min11-pops")
pops.populations = {
    "ingroup": (11, [i for i in pops.samples if "prz" not in i]),
    "outgroup": (0, [i for i in pops.samples if "prz" in i]),
}
pops.run("7")

## create a branch with no missing data and with outgroups removed
nouts = data.branch("nouts_min11",
                    subsamples=[i for i in pops.samples if "prz" not in i])
nouts.set_params("min_samples_locus", 11)
nouts.run("7")


# ### View final stats
# 
# The .stats attribute shows a stats summary for each sample, and a number of stats dataframes can be accessed for each step from the .stats_dfs attribute of the Assembly.

# In[9]:

## we can access the stats summary as a pandas DataFrame
min4.stats


# In[25]:

## or print the full stats file
cat $min4.stats_files.s7


# In[13]:

## and we can access parts of the full stats outputs as dataframes
min4.stats_dfs.s7_samples


# In[14]:

## compare this to the one above; coverage is more equal
min13.stats_dfs.s7_samples


# In[15]:

## similarly, coverage is equal here among ingroups, but missing data is allowed in the outgroups
pops.stats_dfs.s7_samples
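# To put two branches side by side (a rough sketch, assuming the `min4` and `min13` branches created above), the per-sample stats summaries are ordinary pandas DataFrames, so they can be combined with a single `pd.concat` call.

# In[ ]:

## optional: compare per-sample stats from two branches side by side
import pandas as pd

compare = pd.concat({"min4": min4.stats, "min13": min13.stats}, axis=1)
compare.head()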
# ## Analysis tools
# 
# We have a lot more information about analysis tools in the ipyrad documentation. Here I'll show just a quick example of how you can easily access the data files for these assemblies and use them in downstream analysis software. The ipyrad analysis tools include convenient wrappers that make it easier to parallelize analyses of RAD-seq data; see the ipyrad.analysis documentation for more details.

# In[6]:

import ipyrad as ip
import ipyrad.analysis as ipa


# In[7]:

## you can re-load assemblies at a later time from their JSON files
min4 = ip.load_json("analysis-ipyrad/min4.json")
min13 = ip.load_json("analysis-ipyrad/min13.json")
nouts = ip.load_json("analysis-ipyrad/nouts_min11.json")


# #### *RAxML* -- ML concatenation tree inference

# In[17]:

## conda install raxml -c bioconda
## conda install toytree -c eaton-lab


# In[12]:

## create a raxml analysis object for the min13 data set
rax = ipa.raxml(
    name=min13.name,
    data=min13.outfiles.phy,
    workdir="analysis-raxml",
    T=20,
    N=100,
    o=[i for i in min13.samples if "prz" in i],
)


# In[19]:

## print the raxml command and call it
print(rax.command)
rax.run(force=True)


# In[13]:

## access the resulting tree files
rax.trees


# In[15]:

## plot a tree in the notebook with toytree
import toytree
tre = toytree.tree(rax.trees.bipartitions)
tre.draw(
    width=350,
    height=400,
    node_labels=tre.get_node_values("support"),
);


# #### *tetrad* -- quartet tree inference

# In[16]:

## create a tetrad analysis object
tet = ipa.tetrad(
    name=min4.name,
    seqfile=min4.outfiles.snpsphy,
    mapfile=min4.outfiles.snpsmap,
    nboots=100,
)


# In[18]:

## run tree inference
tet.run(ipyclient)


# In[19]:

## access tree files
tet.trees


# In[22]:

## plot results (just like above, but unrooted by default)
## the consensus tree here differs from the ML tree above.
import toytree
qtre = toytree.tree(tet.trees.nhx)
qtre.root(wildcard="prz")
qtre.draw(
    width=350,
    height=400,
    node_labels=qtre.get_node_values("support"),
);


# #### *STRUCTURE* -- population cluster inference

# In[ ]:

## conda install structure clumpp -c ipyrad


# In[25]:

## create a structure analysis object for the no-outgroup data set
struct = ipa.structure(
    name=nouts.name,
    data=nouts.outfiles.str,
    mapfile=nouts.outfiles.snpsmap,
)

## set params for the analysis (should be longer in real analyses)
struct.mainparams.burnin = 1000
struct.mainparams.numreps = 8000


# In[96]:

## run structure across 10 random replicates of sampled unlinked SNPs
for kpop in [2, 3, 4, 5, 6]:
    struct.run(kpop=kpop, nreps=10, ipyclient=ipyclient)


# In[115]:

## wait for all of these jobs to finish
ipyclient.wait()


# In[26]:

## collect results
tables = {}
for kpop in [2, 3, 4, 5, 6]:
    tables[kpop] = struct.get_clumpp_table(kpop)


# In[27]:

## custom sorting order
myorder = [
    "41478_cyathophylloides",
    "41954_cyathophylloides",
    "29154_superba",
    "30686_cyathophylla",
    "33413_thamno",
    "30556_thamno",
    "35236_rex",
    "40578_rex",
    "35855_rex",
    "39618_rex",
    "38362_rex",
]


# In[33]:

## import toyplot (packaged with toytree)
import toyplot

## plot bars for each K-value (mean of 10 reps)
for kpop in [2, 3, 4, 5, 6]:
    table = tables[kpop]
    table = table.loc[myorder]

    ## plot barplot w/ hover
    canvas, axes, mark = toyplot.bars(
        table,
        title=[[i] for i in table.index.tolist()],
        width=400,
        height=200,
        yshow=False,
        style={"stroke": toyplot.color.near_black},
    )
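# The barplots above render inline in the notebook. If you also want to save them to files (a rough sketch; the filenames here are just examples), toyplot canvases can be rendered to interactive HTML or static SVG. The `canvas` variable below refers to the last canvas produced by the loop above (K=6).

# In[ ]:

## optional: save the most recent STRUCTURE barplot canvas to disk
import toyplot.html
import toyplot.svg

toyplot.html.render(canvas, "structure-K6.html")
toyplot.svg.render(canvas, "structure-K6.svg")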
"cya": ["30686_cyathophylla"], "sup": ["29154_superba"], "cup": ["33413_thamno"], "tha": ["30556_thamno"], "rck": ["35236_rex"], "rex": ["35855_rex", "40578_rex"], "lip": ["39618_rex", "38362_rex"], } ## optional: loci will be filtered if they do not have data for at ## least N samples in each species. Minimums cannot be <1. minmap = { "prz": 2, "cys": 2, "cya": 1, "sup": 1, "cup": 1, "tha": 1, "rck": 1, "rex": 2, "lip": 2, } ## sets a random number seed import numpy numpy.random.seed(12349876) ## create a treemix analysis object tmix = ipa.treemix( name=min13.name, data=min13.outfiles.snpsphy, mapfile=min13.outfiles.snpsmap, imap=imap, minmap=minmap, ) ## you can set additional parameter args here tmix.params.root = "prz" tmix.params.global_ = 1 ## print the full params tmix.params # In[40]: ## a dictionary for storing treemix objects tdict = {} ## iterate over values of m for rep in xrange(4): for mig in xrange(4): ## create new treemix object copy name = "mig-{}-rep-{}".format(mig, rep) tmp = tmix.copy(name) ## set params on new object tmp.params.m = mig ## run treemix analysis tmp.run() ## store the treemix object tdict[name] = tmp # In[43]: import toyplot ## select a single result tmp = tdict["mig-1-rep-1"] ## draw the tree similar to the Treemix plotting R code ## this code is rather new and will be expanded in the future. canvas = toyplot.Canvas(width=350, height=350) axes = canvas.cartesian(padding=25, margin=75) axes = tmp.draw(axes) # In[44]: import toyplot import numpy as np ## plot many results canvas = toyplot.Canvas(width=800, height=1200) idx = 0 for mig in range(4): for rep in range(4): tmp = tdict["mig-{}-rep-{}".format(mig, rep)] ax = canvas.cartesian(grid=(4, 4, idx), padding=25, margin=(25, 50, 100, 25)) ax = tmp.draw(ax) idx += 1 # #### ABBA-BABA admixture inference # In[47]: bb = ipa.baba( data=min4.outfiles.loci, newick="analysis-raxml/RAxML_bestTree.min13", ) # In[55]: ## check params bb.params # In[53]: ## generate all tests from the tree where 32082 is p4 bb.generate_tests_from_tree( constraint_dict={ "p4": ["32082_przewalskii"], "p3": ["30556_thamno"], } ) # In[54]: ## run the tests in parallel bb.run(ipyclient=ipyclient) # In[59]: bb.results_table.sort_values(by="Z", ascending=False).head() # In[63]: ## most significant result (more ABBA than BABA) bb.tests[12] # In[64]: ## the next most signif (more BABA than ABBA) bb.tests[27] # ### BPP -- species tree inference/delim # In[68]: ## a dictionary mapping sample names to 'species' names imap = { "prz": ["32082_przewalskii", "33588_przewalskii"], "cys": ["41478_cyathophylloides", "41954_cyathophylloides"], "cya": ["30686_cyathophylla"], "sup": ["29154_superba"], "cup": ["33413_thamno"], "tha": ["30556_thamno"], "rck": ["35236_rex"], "rex": ["35855_rex", "40578_rex"], "lip": ["39618_rex", "38362_rex"], } ## optional: loci will be filtered if they do not have data for at ## least N samples/individuals in each species. minmap = { "prz": 2, "cys": 2, "cya": 1, "sup": 1, "cup": 1, "tha": 1, "rck": 1, "rex": 2, "lip": 2, } ## a tree hypothesis (guidetree) (here based on tetrad results) ## for the 'species' we've collapsed samples into. 
newick = "((((((rex, lip), rck), tha), cup), (cys, (cya, sup))), prz);" # In[69]: ## initiata a bpp object b = ipa.bpp( name=min4.name, locifile=min4.outfiles.alleles, imap=imap, minmap=minmap, guidetree=newick, ) # In[70]: ## set some optional params, leaving others at their defaults ## you should definitely run these longer for real analyses b.params.burnin = 1000 b.params.nsample = 2000 b.params.sampfreq = 20 ## print params b.params # In[71]: ## set some optional filters leaving others at their defaults b.filters.maxloci=100 b.filters.minsnps=4 ## print filters b.filters # #### run BPP # You can either call '`write_bpp_files()`' to write input files for this data set to be run in BPP, and then call BPP on those files, or you can use the '`.run()`' command to run the data files directly, and in parallel on the cluster. If you specify multiple reps then a different random sample of loci will be selected, and different random seeds applied to each replicate. # In[75]: b.write_bpp_files() # In[76]: b.run() # In[77]: ## wait for all ipyclient jobs to finish ipyclient.wait() # In[90]: ## check results ## parse the mcmc table with pandas library import pandas as pd btable = pd.read_csv(b.files.mcmcfiles[0], sep="\t", index_col=0) btable.describe().T # ## Success -- you're finished # Now that you have an idea of the myriad ways you can assembly your data, and some of the downstream analysis tools you are ready to explore RAD-seq data. #