(📗) ipyrad Cookbook: abba-baba admixture tests

The ipyrad.analysis Python module includes functions to calculate abba-baba admixture statistics (including several variants of these measures), to perform signifance tests, and to produce plots of results. All code in this notebook is written in Python, which you can copy/paste into an IPython terminal to execute, or, preferably, run in a Jupyter notebook like this one. See the other analysis cookbooks for instructions on using Jupyter notebooks. All of the software required for this tutorial is included with ipyrad (v.6.12+). Finally, we've written functions to generate plots for summarizing and interpreting results.

Load packages

In [1]:
import ipyrad.analysis as ipa
import ipyparallel as ipp
import toytree

Connect to cluster

The code can be easily parallelized across cores on your machine, or many nodes of an HPC cluster using the ipyparallel library (see our ipyparallel tutorial). An ipcluster instance must be started for you to connect to, which can be started by running 'ipcluster start' in a terminal.

In [5]:
ipyclient = ipp.Client()
len(ipyclient)
Out[5]:
4

Load in your .loci data file and a tree hypothesis

In [26]:
locifile = "./analysis-ipyrad/pedic_outfiles/pedic.loci"
newick = "./analysis-raxml/RAxML_bestTree.pedic"
In [28]:
## parse the newick tree, re-root it, and plot it.
tre = toytree.tree(newick=newick)
tre.root(wildcard="prz")
tre.draw(node_labels=True, node_size=10);

## store rooted tree back into a newick string.
newick = tre.tree.write()
32082_przewalskii33588_przewalskii41478_cyathophylloides41954_cyathophylloides29154_superba30686_cyathophylla33413_thamno30556_thamno40578_rex35855_rex35236_rex39618_rex38362_rexidx: 0 name: 0 dist: 0.0 support: 100idx: 1 name: 1 dist: 0.0358742 support: 100idx: 2 name: 2 dist: 0.0358742 support: 100idx: 3 name: 3 dist: 0.00297626 support: 100idx: 4 name: 4 dist: 0.00941021 support: 100idx: 5 name: 5 dist: 0.00237995 support: 100idx: 6 name: 6 dist: 0.00538723 support: 100idx: 7 name: 7 dist: 0.0010338 support: 100idx: 8 name: 8 dist: 0.000783365 support: 100idx: 9 name: 9 dist: 0.00223 support: 100idx: 10 name: 10 dist: 0.0007389 support: 100idx: 11 name: 11 dist: 0.00617527 support: 100idx: 12 name: 32082_przewalskii dist: 0.00259326 support: 100idx: 13 name: 33588_przewalskii dist: 0.00247134 support: 100idx: 14 name: 41478_cyathophylloides dist: 5.28218e-05 support: 100idx: 15 name: 41954_cyathophylloides dist: 8.88803e-05 support: 100idx: 16 name: 29154_superba dist: 0.00634237 support: 100idx: 17 name: 30686_cyathophylla dist: 0.00669945 support: 100idx: 18 name: 33413_thamno dist: 0.00565358 support: 100idx: 19 name: 30556_thamno dist: 0.00653218 support: 100idx: 20 name: 40578_rex dist: 0.00335406 support: 100idx: 21 name: 35855_rex dist: 0.00339963 support: 100idx: 22 name: 35236_rex dist: 0.00580525 support: 100idx: 23 name: 39618_rex dist: 0.000962081 support: 100idx: 24 name: 38362_rex dist: 0.00109218 support: 100

Short tutorial: calculating abba-baba statistics

To give a gist of what this code can do, here is a quick tutorial version, each step of which we explain in greater detail below. We first create a 'baba' analysis object that is linked to our data file, in this example we name the variable bb. Then we tell it which tests to perform, here by automatically generating a number of tests using the generate_tests_from_tree() function. And finally, we calculate the results and plot them.

In [29]:
## create a baba object linked to a data file and newick tree
bb = ipa.baba(data=locifile, newick=newick)

## generate all possible abba-baba tests meeting a set of constraints
bb.generate_tests_from_tree(
    constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["33413_thamno"],
    })

## run all tests linked to bb 
bb.run(ipyclient)
44 tests generated from tree
[####################] 100%  calculating D-stats  | 0:02:35 |  

Look at the results

In [31]:
## save the results table to a csv file
bb.results_table.to_csv("bb.abba-baba.csv", sep="\t")

## show the results table in notebook
bb.results_table
Out[31]:
dstat bootmean bootstd Z ABBA BABA nloci
0 -0.099 -0.097 0.043 2.311 258.844 315.438 6718
1 -0.114 -0.115 0.039 2.886 303.938 381.906 8221
2 -0.089 -0.090 0.042 2.131 258.750 309.500 6694
3 -0.101 -0.102 0.040 2.544 302.188 370.188 8173
4 -0.098 -0.097 0.042 2.331 238.688 290.750 6347
5 -0.122 -0.123 0.040 3.079 279.375 356.812 7725
6 0.068 0.068 0.036 1.907 408.969 356.719 8899
7 0.080 0.082 0.037 2.178 390.312 332.312 8390
8 0.089 0.091 0.044 2.022 309.500 258.750 6694
9 0.098 0.098 0.045 2.173 290.750 238.688 6347
10 0.101 0.100 0.039 2.610 370.188 302.188 8173
11 0.122 0.122 0.039 3.112 356.812 279.375 7725
12 0.129 0.128 0.031 4.129 474.500 366.031 9390
13 0.142 0.144 0.035 4.073 457.250 343.625 9151
14 0.119 0.118 0.033 3.554 434.500 342.062 9048
15 0.218 0.217 0.028 7.666 560.750 359.958 9455
16 0.278 0.277 0.032 8.650 567.500 320.750 9113
17 0.178 0.177 0.033 5.322 501.938 350.500 9211
18 0.160 0.159 0.036 4.387 403.750 292.438 7909
19 0.176 0.177 0.035 5.043 491.625 344.312 9105
20 0.164 0.163 0.033 4.987 477.984 343.266 9526
21 0.057 0.057 0.033 1.737 438.625 391.469 9671
22 0.034 0.034 0.036 0.939 357.750 334.094 8268
23 0.050 0.051 0.033 1.528 418.812 379.156 9533
24 0.173 0.173 0.034 5.070 436.375 307.625 9282
25 0.044 0.043 0.035 1.263 401.969 368.031 9394
26 0.014 0.015 0.038 0.373 324.750 315.688 8051
27 0.039 0.040 0.035 1.114 389.625 360.062 9272
28 0.188 0.189 0.036 5.257 460.688 314.875 9172
29 0.080 0.079 0.036 2.252 420.906 358.469 9292
30 0.074 0.074 0.037 2.003 349.438 301.250 7964
31 0.067 0.066 0.036 1.866 395.875 346.438 9171
32 -0.086 -0.087 0.030 2.923 373.188 443.708 9652
33 -0.118 -0.118 0.032 3.738 367.552 465.927 9537
34 -0.173 -0.174 0.034 5.129 307.625 436.375 9282
35 -0.188 -0.190 0.035 5.326 314.875 460.688 9172
36 -0.044 -0.045 0.035 1.263 368.031 401.969 9394
37 -0.080 -0.082 0.037 2.182 358.469 420.906 9292
38 -0.014 -0.014 0.037 0.382 315.688 324.750 8051
39 -0.074 -0.077 0.040 1.854 301.250 349.438 7964
40 -0.039 -0.039 0.035 1.114 360.062 389.625 9272
41 -0.067 -0.067 0.036 1.858 346.438 395.875 9171
42 -0.130 -0.131 0.040 3.215 296.938 385.750 8044
43 -0.123 -0.124 0.036 3.434 338.938 434.375 9235

Plotting and interpreting results

Interpreting the results of D-statistic tests is actually very complicated. You cannot treat every test as if it were independent because introgression between one pair of species may cause one or both of those species to appear as if they have also introgressed with other taxa in your data set. This problem is described in great detail in this paper (Eaton et al. 2015). A good place to start, then, is to perform many tests and focus on those which have the strongest signal of admixture. Then, perform additional tests, such as partitioned D-statistics (described further below) to tease apart whether a single or multiple introgression events are likely to have occurred.

In the example plot below we find evidence of admixture between the sample 33413_thamno (black) with several other samples, but the signal is strongest with respect to 30556_thamno (test 33). It also appears that admixture is consistently detected with samples of (40578_rex & 35855_rex) when contrasted against 35236_rex (tests 14, 20, 29, 32).

In [32]:
## plot the results, showing here some plotting options.
bb.plot(height=900,
        width=600,
        pct_tree_y=0.1, 
        ewidth=2, 
        alpha=4.,
        style_test_labels={"font-size":"10px"},
        );
123456789101112131415161718192021222324252627282930313233343536373839404142434432082_przewalskii33588_przewalskii41478_cyathophylloides41954_cyathophylloides29154_superba30686_cyathophylla33413_thamno30556_thamno40578_rex35855_rex35236_rex39618_rex38362_rex10.00.0-0.40.4Z-scoresD-statistics

Full Tutorial

Creating a baba object

The fundamental object for running abba-baba tests is the ipa.baba() object. This stores all of the information about the data, tests, and results of your analysis, and is used to generate plots. If you only have one data file that you want to run many tests on then you will only need to enter the path to your data once. The data file must be a '.loci' file from an ipyrad analysis. In general, you will probably want to use the largest data file possible for these tests (min_samples_locus=4), to maximize the amount of data available for any test. Once an initial baba object is created you create different copies of that object that will inherit its parameter setttings, and which you can use to perform different tests on, like below.

In [11]:
## create an initial object linked to your data in 'locifile'
aa = ipa.baba(data=locifile)

## create two other copies
bb = aa.copy()
cc = aa.copy()

## print these objects
print aa
print bb
print cc
<ipyrad.analysis.baba.Baba object at 0x7fccb85601d0>
<ipyrad.analysis.baba.Baba object at 0x7fccb8560250>
<ipyrad.analysis.baba.Baba object at 0x7fccb8560410>

Linking tests to the baba object

The next thing we need to do is to link a 'test' to each of these objects, or a list of tests. In the Short tutorial above we auto-generated a list of tests from an input tree, but to be more explicit about how things work we will write out each test by hand here. A test is described by a Python dictionary that tells it which samples (individuals) should represent the 'p1', 'p2', 'p3', and 'p4' taxa in the ABBA-BABA test. You can see in the example below that we set two samples to represent the outgroup taxon (p4). This means that the SNP frequency for those two samples combined will represent the p4 taxon. For the baba object named 'cc' below we enter two tests using a list to show how multiple tests can be linked to a single baba object.

In [12]:
aa.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["29154_superba"], 
    "p2": ["33413_thamno"], 
    "p1": ["40578_rex"],
}

bb.tests = {
    "p4": ["32082_przewalskii", "33588_przewalskii"],
    "p3": ["30686_cyathophylla"], 
    "p2": ["33413_thamno"], 
    "p1": ["40578_rex"],
}

cc.tests = [
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41954_cyathophylloides"], 
     "p2": ["33413_thamno"], 
     "p1": ["40578_rex"],
    },
    {
     "p4": ["32082_przewalskii", "33588_przewalskii"],
     "p3": ["41478_cyathophylloides"], 
     "p2": ["33413_thamno"], 
     "p1": ["40578_rex"],
    },
]

Other parameters

Each baba object has a set of parameters associated with it that are used to filter the loci that will be used in the test and to set some other optional settings. If the 'mincov' parameter is set to 1 (the default) then loci in the data set will only be used in a test if there is at least one sample from every tip of the tree that has data for that locus. For example, in the tests above where we entered two samples to represent "p4" only one of those two samples needs to be present for the locus to be included in our analysis. If you want to require that both samples have data at the locus in order for it to be included in the analysis then you could set mincov=2. However, for the test above setting mincov=2 would filter out all of the data, since it is impossible to have a coverage of 2 for 'p3', 'p2', and 'p1', since they each have only one sample. Therefore, you can also enter the mincov parameter as a dictionary setting a different minimum for each tip taxon, which we demonstrate below for the baba object 'bb'.

In [13]:
## print params for object aa
aa.params
Out[13]:
database   None                
mincov     1                   
nboots     1000                
quiet      False               
In [14]:
## set the mincov value as a dictionary for object bb
bb.params.mincov = {"p4":2, "p3":1, "p2":1, "p1":1}
bb.params
Out[14]:
database   None                
mincov     {'p2': 1, 'p3': 1, 'p1': 1, 'p4': 2}
nboots     1000                
quiet      False               

Running the tests

When you execute the 'run()' command all of the tests for the object will be distributed to run in parallel on your cluster (or the cores available on your machine) as connected to your ipyclient object. The results of the tests will be stored in your baba object under the attributes 'results_table' and 'results_boots'.

In [15]:
## run tests for each of our objects
aa.run(ipyclient)
bb.run(ipyclient)
cc.run(ipyclient)
[####################] 100%  calculating D-stats  | 0:00:06 |  
[####################] 100%  calculating D-stats  | 0:00:05 |  
[####################] 100%  calculating D-stats  | 0:00:07 |  

The results table

The results of the tests are stored as a data frame (pandas.DataFrame) in results_table, which can be easily accessed and manipulated. The tests are listed in order and can be reference by their 'index' (the number in the left-most column). For example, below we see the results for object 'cc' tests 0 and 1. You can see which taxa were used in each test by accessing them as 'cc.tests[0]' or 'cc.tests[1]'. An even better way to see which individuals were involved in each test, however, is to use our plotting functions, which we describe further below.

In [17]:
## you can sort the results by Z-score
cc.results_table.sort_values(by="Z", ascending=False)

## save the table to a file 
cc.results_table.to_csv("cc.abba-baba.csv")

## show the results in notebook
cc.results_table
Out[17]:
dstat bootmean bootstd Z ABBA BABA nloci
0 -0.022 -0.019 0.048 0.466 225.875 236.188 8439
1 0.020 0.019 0.045 0.432 245.000 235.625 8993

Auto-generating tests

Entering all of the tests by hand can be pain, which is why we wrote functions to auto-generate tests given an input rooted tree, and a number of contraints on the tests to generate from that tree. It is important to add constraints on the tests otherwise the number that can be produced becomes very large very quickly. Calculating results runs pretty fast, but summarizing and interpreting thousands of results is pretty much impossible, so it is generally better to limit the tests to those which make some intuitive sense to run. You can see in this example that implementing a few contraints reduces the number of tests from 1608 to 13.

In [18]:
## create a new 'copy' of your baba object and attach a treefile
dd = bb.copy()
dd.newick = newick

## generate all possible tests
dd.generate_tests_from_tree()

## a dict of constraints
constraint_dict={
        "p4": ["32082_przewalskii", "33588_przewalskii"],
        "p3": ["40578_rex", "35855_rex"],
    }

## generate tests with contraints
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict, 
    constraint_exact=False,
)

## 'exact' contrainst are even more constrained
dd.generate_tests_from_tree(
    constraint_dict=constraint_dict,
    constraint_exact=True,
)
1608 tests generated from tree
117 tests generated from tree
13 tests generated from tree
In [19]:
## run the dd tests
dd.run(ipyclient)
dd.plot(height=500, pct_tree_y=0.2, alpha=4., tree_style='c');
dd.results_table
[####################] 100%  calculating D-stats  | 0:00:12 |  
Out[19]:
dstat bootmean bootstd Z ABBA BABA nloci
0 -0.107 -0.107 0.035 3.095 348.016 431.828 9298
1 -0.130 -0.128 0.040 3.290 295.312 383.953 8537
2 -0.111 -0.113 0.041 2.725 266.414 333.086 6975
3 -0.138 -0.138 0.035 3.970 313.633 413.930 8715
4 -0.157 -0.156 0.041 3.865 268.891 369.172 8010
5 -0.141 -0.141 0.041 3.459 239.750 318.250 6597
6 -0.105 -0.104 0.034 3.045 343.203 423.312 9275
7 -0.129 -0.126 0.040 3.257 291.266 377.641 8517
8 -0.109 -0.111 0.041 2.694 262.266 326.672 6961
9 -0.021 -0.021 0.052 0.397 171.125 178.406 6468
10 0.141 0.142 0.055 2.569 39.234 29.516 9218
11 0.014 0.016 0.030 0.462 564.281 548.625 9456
12 -0.036 -0.038 0.059 0.607 104.875 112.656 8480
1234567891011121333588_przewalskii32082_przewalskii30686_cyathophylla29154_superba41954_cyathophylloides41478_cyathophylloides33413_thamno35236_rex30556_thamno35855_rex40578_rex38362_rex39618_rex5.00.0-0.30.3Z-scoresD-statistics

More about input file paths (i/o)

The default (required) input data file is the .loci file produced by ipyrad. When performing D-statistic calculations this file will be parsed to retain the maximal amount of information useful for each test.

An additional (optional) file to provide is a newick tree file. While you do not need a tree in order to run ABBA-BABA tests, you do need at least need a hypothesis for how your samples are related in order to setup meaningful tests. By loading in a tree for your data set we can use it to easily set up hypotheses to test, and to plot results on the tree.

In [20]:
## path to a locifile created by ipyrad
locifile = "./analysis-ipyrad/pedicularis_outfiles/pedicularis.loci"

## path to an unrooted tree inferred with tetrad
newick = "./analysis-tetrad/tutorial.tree"

Interpreting results

You can see in the results_table below that the D-statistic ranges between 0.2 and 0.4 in these tests. These values are not too terribly informative, and so we instead generally focus on the Z-score representing how far the distribution of D-statistic values across bootstrap replicates deviates from its expected value of zero. The default number of bootstrap replicates to perform per test is 1000. Each replicate resamples nloci with replacement.

In these tests ABBA and BABA occurred with pretty equal frequency. The values are calculated using SNP frequencies, which is why they are floats instead of integers, and this is also why we were able to combine multiple samples to represent a single tip in the tree (e.g., see the test we setup, above).

In [17]:
print cc.results_table
   dstat  bootmean  bootstd      Z     ABBA     BABA  nloci
0 -0.022    -0.018    0.046  0.393  225.875  236.188   8439
1  0.020     0.020    0.044  0.466  245.000  235.625   8993

Running 5-taxon (partitioned) D-statistics

Coming soon.