## Interactive visualization of RADseq data¶

RADseq data has a reputation for containing a lot of missing data, particularly among more distantly related samples, which has led to concern over how different distributions of missing data might influence downstream analyses. But there hasn't yet been a lot exploration of how missing data are distributed in empirical (or simulated) RADseq data sets (though I've been working on this), and for people working with RADseq data sets, there aren't many tools available for assessing this type of information.

At the most basic level, it would be useful to have some code available that anyone could use to parse their assembled data files, calculate data sharing, and visualize it. While there are many possible ways to visaulize data sharing, and these extend far beyond calculating pairwise data sharing (e.g., heirarchical measures would be more appropriate for phylogenetic data, but more on that another in another post) the simplest way to visualize shared data is with a heatmap showing the amount of data shared by any two samples in the data set. I will focus on this type of visualization here.

I'm also using this post to document some code using the new Python plotting library Toyplot, which I'm really starting to love. In addition to the normal output formats (e.g., PDF, PNG, SVG), the default format in toyplot is html, which allows it to create scalable, interactive plots that are also easily embeddable. The default aesthetic is also really quite pleasing, as is the minimalist ethos of the code. What's not to like.

#### Import Python libraries¶

In [1]:
import numpy as np   ## array operations (v.1.9.2)
import ete2          ## tree manipulations (v.2.2.1072)
import toyplot       ## plotting (v.0.7.0)
import urllib2       ## grabbing data from the web
import itertools     ## iterate over big files


#### Let's grab some data¶

We'll grab data from one of my recent publications that are available online. This includes a newick tree file showing the relationships of 34 oak trees, and a .loci file containing a RADseq data set assembled in pyrad that was used to infer the phylogeny for these samples.

In [2]:
## retrieve tree file from github
treedata = urllib2.urlopen("https://raw.githubusercontent.com/"+\
"dereneaton/virentes/v.1.0/analysis_raxml/"+\

## convert newick string to Tree object
tree = ete2.Tree(treedata)

In [3]:
## retrieve loci data from zenodo
locidata = urllib2.urlopen("https://zenodo.org/record/19475/files/"+\
"virentes_c85d6m20p5.loci")

## parse .loci file into a list
## NB: the "locidata" object here is equivalent to "open(locifile)"
##     if your data was in a local .loci file


#### Take a look at the tree¶

In [4]:
#print tree


#### A look at the .loci data¶

In [5]:
print loci[0]

>BJSL25      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>BJVL19      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>BZBB1       TTTATTTGTTGCAACCTTTTGATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>CH          TTTAGTTGTTGCAAACTTTTAATGTTTTCTTTTTGTAGGGTGTTTGGTCAGTGATTGTTTRAGAGAAACTGCAAAATTTGGTTGCATTTG
>CRL0001     TTTATTTGTTGCAACCTTTTGATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>CRL0030     TTTATTTGTTGCAACCTTTTGATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>CUCA4       TTTATTTGTTGCAAMCTTTTRATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>CUSV6       TTTATTTGTTGCAAMCTTTTRATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>CUVN10      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>DO          TTTATTTGTTGCAGACTTTTAATGTTTTCTTTTGGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLAB109     TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLBA140     TTTATTTGTTNCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLCK18      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLCK216     TTTATTTGTTGCAAACTTTT-ATGTTTTCTTT-TGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLSA185     TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLSF33      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTNCAAAATTTGGTTGCATTTG
>FLSF47      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLSF54      TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>FLWO6       TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>HE          TTTAGTTGTTGCAAACTTTTAATGTYTTSTTTTTGTAGGGTGTTTGGTCAGWGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>HNDA09      TTTATTTGTTGCAACCTTTTGATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>MXED8       TTTATTTGTTGCAAACTTTTAATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>MXGT4       TTTATTTGTTGCAAACTTTTAATGTTTTCTTTYTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>MXSA3017    TTTATTTGTTGCAACCTTTTGATGTTTTCTTTCTGTAGGGTGTTTGGTCAGTGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
>NI          TTTAGTTGTTGCAAACTTTTAATGTYTTSTTTTTGTAGGGTGTTTGGTCAGWGATTGTTTGAGAGAAACTGCAAAATTTGGTTGCATTTG
//               *        -*     *    *  *   *-                 *        -


### Functions to calculate shared data¶

The functions getarray and countmatrix will parse the .loci file and count the number of shared loci between any two samples in the data set. Note, the .loci format has changed slightly in pyrad v.3, so this code will need to be modified for parsing.

In [6]:
def getarray(loci, tree):
""" parse the loci list and return a
presence/absence matrix ordered by
the tips on the tree"""
## get tip names
names = tree.get_leaf_names()
## make empty matrix
lxs = np.zeros((len(names), len(loci)))
## fill the matrix
for loc in xrange(len(loci)):
for seq in loci[loc].split("\n"):
if ">" in seq:
lxs[names.index(seq.split()[0][1:]),loc] += 1
return lxs

In [7]:
def countmatrix(lxsabove, lxsbelow, max=0):
""" fill a matrix with pairwise data sharing
between each pair of samples. You could put
in two different 'share' matrices to have
different results above and below the diagonal.
Can enter a max value to limit fill along diagonal.
"""
share = np.zeros((lxsabove.shape[0],
lxsbelow.shape[0]))
## fill above
names = range(lxsabove.shape[0])
for row in lxsabove:
for samp1,samp2 in itertools.combinations(names,2):
shared = lxsabove[samp1, lxsabove[samp2,]>0].sum()
share[samp1,samp2] = shared
## fill below
for row in lxsbelow:
for samp2,samp1 in itertools.combinations(names,2):
shared = lxsabove[samp1, lxsabove[samp2,]>0].sum()
share[samp1,samp2] = shared
## fill diagonal
if not max:
for row in range(len(names)):
share[row,row] = lxsabove[row,].sum()
else:
for row in range(len(names)):
share[row,row] = max
return share


The data set here contains 27,369 loci in which at least four samples have data, out of a maximum of 34 samples. The matrix below has 34 rows and 27,369 columns, where '1' indicates a taxon has data, and '0' means it does not at that locus.

In [8]:
lxs = getarray(loci, tree)
print lxs.shape
print lxs

(34, 27369)
[[ 1.  1.  1. ...,  1.  1.  1.]
[ 1.  1.  0. ...,  1.  1.  1.]
[ 1.  1.  1. ...,  1.  1.  1.]
...,
[ 1.  1.  1. ...,  1.  1.  1.]
[ 1.  1.  1. ...,  1.  1.  1.]
[ 1.  0.  0. ...,  0.  0.  1.]]


The share matrix is 34x34, and is a symmetric matrix where the diagonal is the number of loci in the data set for any given sample, and the off-diagonal contains the number of loci that two samples share in the data set.

In [9]:
share = countmatrix(lxs, lxs)
print share.shape
print share

(34, 34)
[[ 20438.  19069.  18556. ...,  16980.  16509.  14003.]
[ 19069.  20151.  18308. ...,  16741.  16242.  13810.]
[ 18556.  18308.  22833. ...,  18991.  18539.  15694.]
...,
[ 16980.  16741.  18991. ...,  22794.  18611.  15863.]
[ 16509.  16242.  18539. ...,  18611.  22182.  15253.]
[ 14003.  13810.  15694. ...,  15863.  15253.  18902.]]


### Plot total data for each sample¶

Hold the cursor over the plot to see taxon names

In [10]:
print share.min(), share.max()

5216.0 24390.0

In [11]:
## create canvas and set dimensions
canvas = toyplot.Canvas(width=600, height=300)
## use default axes
axes = canvas.axes(label='Total loci per sample')

## Get ordered names of tips on the tree
names = tree.get_leaf_names()

## set interactive names to pop-up
floater = ["Taxon: %s" % i for i in names]

## plot bars of total data for each sample
mark = axes.bars(share.diagonal(),
baseline=None,
title=floater)

## turn off coordinates
axes.coordinates.show = False


### Plot heatmap of pairwise data sharing¶

In [12]:
## choose a colormap (or use the default, but I like this one)
colormap = toyplot.color.LinearMap(toyplot.color.brewer("Spectral"),
domain_min=share.min(),
domain_max=share.max())

## create heatmap
canvas, axtable = toyplot.matrix(share,
colormap=colormap,
step=3,
width=500, height=500)

## create interactive floating labels
for i,j in itertools.product(range(len(share)), repeat=2):
axtable.body.cell(i,j).title='%s, %s : %s' % (names[i],
names[j],
'10')

## TODO:
## - turn off axis labels on matrix
## - get real interactive values in table instead of '10'
## - plot barplot above on grid canvas with matrix
## - have barplot turned on its side


## Make combined plot (with interactive data)¶

In [13]:
colormap = toyplot.color.LinearMap(toyplot.color.brewer("Spectral"),
domain_min=share.min(),
domain_max=share.max())

canvas = toyplot.Canvas(width=600, height=400)

table = canvas.matrix(share,
colormap=colormap,
label="",
bounds=(50, 350, 50, 350),
step=5)

## make floater for grid
for i,j in itertools.product(range(len(share)), repeat=2):
table.body.cell(i,j).title='%s, %s : %s' % (names[i],
names[j],
int(share[i,j]))

## put box around grid
table.body.grid.vlines[...,[0,-1]] = 'single'
table.body.grid.hlines[[0,-1],...] = 'single'

## remove top and left labels
for j in range(share.shape[1]):
table.top.cell(0, j).data = ""

for i in range(share.shape[0]):
table.left.cell(i, 0).data = ""

## canvas for barplot
axes = canvas.axes(bounds=(400, 500, 60, 360),
label="",
xlabel="",
ylabel="")

## create barplot
axes.bars(share.diagonal()[::-1],
along="y",
title = floater)

## make floater for barplot
zf = zip(names[::-1], share.diagonal()[::-1])
barfloater = ["%s: %s" % (i,int(j)) for i,j in zf]

## Hide yspine, move labels to the left,
## use taxon names, rotate angle, align.
axes.y.spine.show = False
axes.y.ticks.labels.offset = -5
axes.y.ticks.locator = toyplot.locator.Explicit(range(34),
labels=names[::-1])
axes.y.ticks.labels.angle = 0
axes.y.ticks.labels.style = {"baseline-shift":0,
"text-anchor":"end",
"font-size":"9px"}

## Rotate xlabels, align with ticks,
## change to thousands, move up on canvas,
## show ticks, and hide popup coordinates
axes.x.ticks.labels.angle = 90
axes.x.ticks.labels.offset = 20
axes.x.ticks.locator = toyplot.locator.Explicit(
[0,5000,10000,15000,20000,25000],
["0", "5K", "10K", "15K", "20K", "25K"])
axes.x.ticks.labels.style = {"baseline-shift":0,
"text-anchor":"end",
"-toyplot-anchor-shift":"15px"}
axes.x.ticks.show = True
axes.coordinates.show = False


### Good news, tree plotting is on the way¶

Although I've explored quite a bit, I haven't yet found a tree/network plotting library in Python that I really like. Ete2 has some really great functionality, but as far as I can tell you can't combine ete2 trees with other plotted data very easily. According to the Toyplot developers they are working on incorporating igraph functionality to allow tree/network plotting. I'll definitely be trying this out. Interactive and scalable tree plots would be a huge improvement for phylogenetics.