All of the modules below are included in the ipyrad conda installation.
## import libraries
import ipyrad as ip ## ipyrad
import numpy as np ## array operations
import h5py ## access hdf5 database file
import toyplot ## my fav new plotting library
import toyplot.html ## toyplot sublib for saving html plots
## print versions for posterity
print 'ipyrad', ip.__version__
print 'numpy', np.__version__
print 'h5py', h5py.__version__
print 'toyplot', toyplot.__version__
ipyrad 0.4.1
numpy 1.11.1
h5py 2.6.0
toyplot 0.13.0
I assembled the data set under three minimum cluster depth settings (6, 10, 20), and import their Assembly objects as data1, data2, and data3. These are ipyrad Assembly class objects, which have many features and functions available to them. To see these, type the object name (e.g., data1) followed by a period (.) and then press tab to see a list of all the available options.
## import Assembly objects
data = ip.load_json("/home/deren/Downloads/pedicularis/pedictrim5.json")
loading Assembly: pedictrim5 from saved path: ~/Downloads/pedicularis/pedictrim5.json
We also open the database file for each data set. This is the file with the suffix ".hdf5" that should have a file name like: "[project_dir]/[name]_consens/[name].hdf5". The assembly class objects save the database file path under the attribute ".database", which can be used to access it more easily. Below we open a view to the hdf5 database file for each of the three assemblies.
## load the hdf5 database
io5 = h5py.File(data.database, 'r')
The hdf5 database is compressed and sometimes quite large. If you moved your JSON file from a remote machine (e.g., an HPC cluster) to a local machine, you will have to update the data.database path to point to the location of the database file on your local machine.
print 'location of my database file:\n ', data.database
print 'keys in the hdf5 database\n ', io5.keys()
location of my database file:
  /home/deren/Downloads/pedicularis/pedictrim5_outfiles/pedictrim5.hdf5
keys in the hdf5 database
  [u'edges', u'filters', u'snps']
The hdf5 database contains arrays with the dimensions shown below.
## This doesn't actually load them into memory, they can be very large.
## It just makes a reference for calling keys more easily
#hcatg = io5["catgs"] ## depth information (not edge filtered)
#hseqs = io5["seqs"] ## sequence data (not edge filtered)
hsnps = io5["snps"] ## snp locations (edge filtered)
hfilt = io5["filters"] ## locus filters
hedge = io5["edges"] ## edge filters
## arrays shapes and dtypes
#print hcatg
#print hseqs
print hsnps
print hfilt
print hedge
<HDF5 dataset "snps": shape (87955, 124, 2), type "|b1">
<HDF5 dataset "filters": shape (87955, 6), type "|b1">
<HDF5 dataset "edges": shape (87955, 5), type "<u2">
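Before walking through the filtering function, here is a toy illustration of how a boolean filters array like the one above collapses to a per-locus keep/drop mask: summing across filters and casting to bool flags any locus that failed at least one filter. The toy array below is made up for demonstration; the real one is (87955, 6).

```python
import numpy as np

## a toy "filters" array of shape (nloci, nfilters); values made up
filters = np.array([
    [0, 0, 0],   ## locus 0: passes every filter
    [1, 0, 0],   ## locus 1: fails one filter
    [0, 0, 0],   ## locus 2: passes
    [0, 1, 1],   ## locus 3: fails two filters
], dtype=bool)

## summing across filters and casting to bool flags any locus
## that failed at least one filter
locfilter = filters.sum(axis=1).astype(bool)
print(locfilter)   # [False  True False  True]

## inverting the flag keeps only the passing loci
kept = np.arange(filters.shape[0])[~locfilter]
print(kept)        # [0 2]
```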
def filter_snps(data):
    ## get h5 database
    io5 = h5py.File(data.database, 'r')
    hsnps = io5["snps"]     ## snp locations
    hfilt = io5["filters"]  ## locus filters
    hedge = io5["edges"]    ## edge filters

    ## read in a local copy of the full snps and edge arrays
    snps = hsnps[:]
    edge = hedge[:]

    ## print status
    print "prefilter {}\nshape {} = (nloci, maxlen, [var,pis])"\
          .format(data.name, snps.shape)
    print "total vars = {}".format(snps[:, :, 0].sum())

    ## apply edge filters to all loci in the snps array
    for loc in xrange(snps.shape[0]):
        a, b = edge[loc, :2]
        mask = np.invert([i in range(a, b) for i in np.arange(snps.shape[1])])
        snps[loc, mask, :] = 0

    ## get locus filter by summing across all filters
    locfilter = hfilt[:].sum(axis=1).astype(np.bool)

    ## apply filter to snps array
    fsnps = snps[~locfilter, ...]

    ## print new shape and sum
    print "postfilter {}\nshape {} = (nloci, maxlen, [var,pis])"\
          .format(data.name, fsnps.shape)
    print "total vars = {}".format(fsnps[:, :, 0].sum())

    ## clean up big objects
    del snps
    del edge

    ## return what we want
    return fsnps
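The per-locus edge masking in the loop above can also be written as a vectorized mask. A toy sketch (the array and edge values are made up) showing that sites outside the [a, b) window get zeroed:

```python
import numpy as np

## toy snps array for one locus: 10 sites x (var, pis); values made up
snps = np.ones((1, 10, 2), dtype=int)

## suppose the edge filter says good data spans sites [a, b) = [2, 8)
a, b = 2, 8

## True for sites OUTSIDE the kept window; same test as the
## list-comprehension version above, but vectorized
sites = np.arange(snps.shape[1])
mask = ~((sites >= a) & (sites < b))

## zero out the masked sites, as in the loop
snps[0, mask, :] = 0
print(snps[0, :, 0])   # [0 0 1 1 1 1 1 1 0 0]
```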
## a second version of the function without the edge-filtering step
def filter_snps(data):
    ## get h5 database
    io5 = h5py.File(data.database, 'r')
    hsnps = io5["snps"]     ## snp locations
    hfilt = io5["filters"]  ## locus filters
    #hedge = io5["edges"]   ## edge filters

    ## read in a local copy of the full snps array
    snps = hsnps[:]
    #edge = hedge[:]

    ## get locus filter by summing across all filters
    locfilter = hfilt[:].sum(axis=1).astype(np.bool)

    ## apply filter to snps array
    fsnps = snps[~locfilter, ...]

    ## clean up big objects
    del snps

    ## return what we want
    return fsnps
fsnps.sum(axis=0)
array([[   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [1258,  940], [1379, 1061], [1456, 1008], [1474,  999], [1551, 1055],
       [1596, 1089], [1532, 1103], [1530, 1180], [1594, 1116], [1532, 1128],
       [1620, 1148], [1565, 1205], [1642, 1216], [1676, 1231], [1667, 1123],
       [1625, 1115], [1620, 1217], [1614, 1185], [1594, 1247], [1548, 1215],
       [1641, 1230], [1616, 1145], [1676, 1202], [1643, 1227], [1636, 1167],
       [1641, 1237], [1661, 1304], [1640, 1165], [1696, 1272], [1680, 1173],
       [1683, 1191], [1703, 1214], [1619, 1253], [1692, 1236], [1704, 1269],
       [1773, 1164], [1641, 1236], [1710, 1308], [1646, 1255], [1671, 1232],
       [1682, 1279], [1755, 1212], [1670, 1285], [1715, 1279], [1721, 1229],
       [1808, 1251], [1757, 1294], [1725, 1259], [1731, 1227], [1765, 1279],
       [1764, 1266], [1784, 1283], [1821, 1396], [1864, 1340], [1796, 1397],
       [1921, 1362], [1902, 1363], [1936, 1432], [1990, 1511], [2060, 1471],
       [2111, 1514], [2048, 1644], [2244, 1688], [2310, 1818], [2480, 1951],
       [2554, 2053], [2634, 2105], [2776, 2237], [2971, 2473], [ 747,  552],
       [ 482,  329], [ 319,  236], [ 205,  131], [ 106,   96], [  61,   54],
       [  32,   23], [   9,    7], [   2,    1], [   1,    1], [   1,    0],
       [   1,    0], [   0,    1], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0], [   0,    0],
       [   0,    0], [   0,    0], [   0,    0], [   0,    0]])
## apply filter to each data set
fsnps = filter_snps(data)
a = np.arange(0, 40)#.reshape(10,4)
b = np.arange(10, 50)#.reshape(10, 4)
#np.concatenate((a,b), axis=1)
np.array([a,b]).max(axis=0)
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
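For the record, the stack-and-reduce trick in the scratch cell above is equivalent to numpy's element-wise np.maximum:

```python
import numpy as np

a = np.arange(0, 40)
b = np.arange(10, 50)

## stacking the two arrays and reducing along axis 0...
stacked_max = np.array([a, b]).max(axis=0)

## ...gives the same result as the element-wise maximum
same = np.maximum(a, b)
print(np.array_equal(stacked_max, same))   # True
```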
## the last dimension has two columns (var, pis)
## how many snps per locus
varlocs = fsnps[:, :, 0].sum(axis=1)
pislocs = fsnps[:, :, 1].sum(axis=1)
print varlocs[:5]
## how many snps per site (across loci)
persite = fsnps[:, :, :].sum(axis=0)
print persite[10:15]
[3 2 7 5 1]
[[1606 1095]
 [1539 1106]
 [1543 1186]
 [1604 1119]
 [1539 1132]]
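To make the axis logic explicit, here is a toy (nloci, maxlen, 2) array with made-up counts: summing over axis=1 collapses sites to give SNPs per locus, while summing over axis=0 collapses loci to give (var, pis) counts per site position.

```python
import numpy as np

## toy snps array: 3 loci x 4 sites x (var, pis); counts made up
fsnps = np.array([
    [[1, 0], [0, 0], [1, 1], [0, 0]],
    [[0, 0], [1, 1], [0, 0], [1, 0]],
    [[1, 1], [0, 0], [0, 0], [0, 0]],
])

## summing over sites (axis=1) gives variable sites per locus
print(fsnps[:, :, 0].sum(axis=1))   # [2 2 1]

## summing over loci (axis=0) gives (var, pis) counts per site
print(fsnps.sum(axis=0))
## [[2 1]
##  [1 1]
##  [1 1]
##  [1 0]]
```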
colormap = toyplot.color.Palette()
colormap
## deconstruct array into bins
vbars, vbins = np.histogram(varlocs, bins=range(0, varlocs.max()+2))
pbars, pbins = np.histogram(pislocs, bins=range(0, varlocs.max()+2))

## setup canvas and axes
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.cartesian(xlabel="n variable (or pis) sites",
                        ylabel="n loci w/ n var (or pis) sites",
                        gutter=50)

## set up x axis
axes.x.domain.max = 16
axes.x.spine.show = False
axes.x.ticks.labels.style = {"baseline-shift": "10px"}
axes.x.ticks.locator = toyplot.locator.Explicit(
    range(0, 16, 2),
    map(str, range(0, 16, 2)))

## set up y axis
axes.y.ticks.show = True
axes.y.label.style = {"baseline-shift": "35px"}
axes.y.ticks.labels.style = {"baseline-shift": "5px"}
axes.y.ticks.below = 0
axes.y.ticks.above = 5
axes.y.domain.min = 0
axes.y.domain.max = 10000
axes.y.ticks.locator = toyplot.locator.Explicit(
    range(0, 11000, 2500),
    map(str, range(0, 11000, 2500)))

## add bars
axes.bars(vbars, color=colormap[1], opacity=0.5)
axes.bars(pbars, color=colormap[0], opacity=0.5)

## or as a filled/smoothed plot
#x = np.arange(0, len(pbars))
#fill = axes.fill(x, vbars, color=colormap[0], opacity=0.5)
#fill = axes.fill(x, pbars, color=colormap[1], opacity=0.5)
<toyplot.mark.BarMagnitudes at 0x7f6f576fe250>
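The binning step that feeds the bar plot works like this; a minimal sketch with made-up per-locus SNP counts:

```python
import numpy as np

## made-up per-locus SNP counts
varlocs = np.array([3, 2, 7, 5, 1, 2, 2, 0, 3])

## one integer bin per possible count, 0 through the max; the +2 puts
## the final right edge past the largest value so it gets its own bin
bars, bins = np.histogram(varlocs, bins=range(0, varlocs.max() + 2))
print(bars)   # [1 1 3 2 0 1 0 1]
print(bins)   # [0 1 2 3 4 5 6 7 8]
```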
This data includes 75bp reads sequenced on an Illumina GAIIx. We know that the error rate increases along the length of reads, and that the error rate was a bit higher in this older type of data than it is in more recent sequencing technology.
## the snps array is longer than the actual seq length (it's padded a bit),
## so we lop the extra off the end. Get the last position with data.
maxend = np.where(fsnps[:, :, :].sum(axis=0).sum(axis=1) != 0)[0].max()
## all variables (including autapomorphies)
distvar = np.sum(fsnps[:, :maxend+1, 0].astype(np.int), axis=0)
print(distvar)
## synapomorphies (just pis)
distpis = fsnps[:, :maxend+1, 1].sum(axis=0)
print(distpis)
## how long is the longest seq
print 'maxlen = ', maxend
[0 0 0 0 0 1266 1384 1462 1485 1560 1606 1539 1543 1604 1539
 1631 1575 1655 1685 1678 1633 1624 1626 1602 1559 1648 1630 1687 1656 1641
 1651 1673 1646 1708 1694 1689 1712 1624 1696 1712 1778 1638 1719 1653 1675
 1685 1757 1670 1724 1730 1809 1760 1732 1736 1764 1769 1794 1827 1865 1784
 1910 1889 1912 1954 2032 2042 528 334 255 166 83 47 16 7 1
 1 0]
[0 0 0 0 0 941 1068 1008 1001 1060 1095 1106 1186 1119 1132
 1151 1209 1221 1235 1128 1118 1225 1192 1252 1226 1241 1153 1206 1232 1172
 1241 1306 1169 1277 1179 1192 1222 1262 1248 1277 1170 1243 1315 1265 1232
 1279 1223 1289 1287 1232 1257 1297 1260 1232 1285 1269 1289 1397 1341 1407
 1372 1358 1425 1492 1443 1485 433 251 179 87 66 38 21 8 2
 0 1]
maxlen =  76
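The trimming trick used above (finding the last position with any data and slicing off the zero padding) in isolation, with a made-up per-site array:

```python
import numpy as np

## made-up per-site (var, pis) totals: real data through site 5,
## zero padding after that
persite = np.array([[0, 0], [4, 2], [3, 1], [5, 3],
                    [2, 1], [1, 0], [0, 0], [0, 0]])

## last index where either column is nonzero
maxend = np.where(persite.sum(axis=1) != 0)[0].max()
print(maxend)            # 5

## slice off the zero padding (slice is inclusive of maxend)
trimmed = persite[:maxend + 1]
print(trimmed.shape)     # (6, 2)
```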
def SNP_position_plot(distvar, distpis):
    ## set color theme
    colormap = toyplot.color.Palette()

    ## make a canvas
    canvas = toyplot.Canvas(width=800, height=300)

    ## make axes (maxend is taken from the enclosing namespace)
    axes = canvas.cartesian(xlabel="Position along RAD loci",
                            ylabel="N variable sites",
                            gutter=65)

    ## x-axis
    axes.x.ticks.show = True
    axes.x.label.style = {"baseline-shift": "-40px", "font-size": "16px"}
    axes.x.ticks.labels.style = {"baseline-shift": "-2.5px", "font-size": "12px"}
    axes.x.ticks.below = 5
    axes.x.ticks.above = 0
    axes.x.domain.max = maxend
    axes.x.ticks.locator = toyplot.locator.Explicit(
        range(0, maxend, 5),
        map(str, range(0, maxend, 5)))

    ## y-axis
    axes.y.ticks.show = True
    axes.y.label.style = {"baseline-shift": "40px", "font-size": "16px"}
    axes.y.ticks.labels.style = {"baseline-shift": "5px", "font-size": "12px"}
    axes.y.ticks.below = 0
    axes.y.ticks.above = 5

    ## add fill plots
    x = np.arange(0, maxend + 1)
    f1 = axes.fill(x, distvar, color=colormap[0], opacity=0.5,
                   title="total variable sites")
    f2 = axes.fill(x, distpis, color=colormap[1], opacity=0.5,
                   title="parsimony informative sites")

    ## add a horizontal dashed line at the median Nsnps per site
    axes.hlines(np.median(distvar), opacity=0.9, style={"stroke-dasharray": "4, 4"})
    axes.hlines(np.median(distpis), opacity=0.9, style={"stroke-dasharray": "4, 4"})

    return canvas, axes
canvas, axes = SNP_position_plot(distvar, distpis)
## save fig
toyplot.html.render(canvas, 'snp_positions.html')
## show fig
canvas
#axes
I think the more likely explanation is that poor alignment towards the ends of reads is causing the excess SNPs, for two reasons. First, if it were sequencing errors, then we would expect an excess of autapomorphies at the ends of reads equal to or greater than the excess of synapomorphies, but that is not the case. And second, increasing the minimum depth does not fix the problem.
Next I test whether trimming the read edges fixes the problem: a negative value in the edge filter trims bases from the right side of reads.
## make a new branch of cyatho-d6-min4
data5 = data1.branch('cyatho-d6-min4-trim')
## edge filter order is (R1left, R1right, R2left, R2right)
## we set the R1right to -10, which trims 10 bases from the right side
data5.set_params('edge_filter', ("4, -10, 4, 4"))
## run step7 to fill new data base w/ filters
data5.run('7', force=True)
data5 = ip.load_json("~/Downloads/pedicularis/cyatho-d12-min4.json")
loading Assembly: cyatho-d12-min4 from saved path: ~/Downloads/pedicularis/cyatho-d12-min4.json
## filter the snps
fsnps = filter_snps(data5)
## trim non-data
maxend = np.where(fsnps[:, :, :].sum(axis=0).sum(axis=1) != 0)[0].max()
## all variables (including autapomorphies)
distvar = np.sum(fsnps[:, :maxend+1, 0].astype(np.int), axis=0)
print(distvar)
## synapomorphies (just pis)
distpis = fsnps[:, :maxend+1, 1].sum(axis=0)
print(distpis)
## how long is the longest seq
print 'maxlen = ', maxend
prefilter cyatho-d12-min4
shape (54652, 150, 2) = (nloci, locus_len, [var,pis])
total vars = 158199
postfilter cyatho-d12-min4
shape (29859, 150, 2) = (nloci, locus_len, [var,pis])
total vars = 86868
[0 0 0 0 0 922 1050 1022 982 1066 1108 1018 1090 1089 1104
 1134 1082 1085 1182 1131 1133 1134 1127 1119 1157 1168 1133 1150 1174 1082
 1162 1161 1084 1178 1186 1146 1128 1151 1149 1177 1218 1142 1196 1146 1172
 1226 1211 1167 1215 1181 1239 1244 1188 1180 1217 1242 1222 1297 1321 1245
 1313 1337 1322 1404 1436 1507 1519 1603 1624 1727 1871 1878 2041 2199 377
 182 132 78 41 24 16 4]
[0 0 0 0 0 428 503 479 484 544 535 522 608 554 523
 556 595 588 628 535 535 584 579 576 576 541 551 629 555 597
 584 602 543 596 585 540 584 591 616 601 578 588 597 597 553
 591 599 606 631 580 617 623 625 594 646 593 606 672 599 673
 657 659 675 717 721 686 800 776 844 957 954 1024 1120 1182 198
 100 65 33 20 11 6 1]
maxlen =  81
SNP_position_plot(distvar, distpis)
I found early on that leaving the cut site attached to the left side of reads improved assemblies by acting as an anchor, which allowed gap openings to arise in the early part of the read without being treated as terminal gap openings. For some reason it didn't occur to me to similarly anchor the right side of reads. Let's see what happens if I add an invariant anchor to the right side of reads.
Another point to note, though I don't show the results here: this increase in variation along the length of reads is not observed in simulated data, suggesting it is inherent to real data.
simdata = ip.load_json("~/Documents/ipyrad/tests/cli/cli.json")
## filter the snps
fsnps = filter_snps(simdata)
## trim non-data
maxend = np.where(fsnps[:, :, :].sum(axis=0).sum(axis=1) != 0)[0].max()
## all variables (including autapomorphies)
distvar = np.sum(fsnps[:, :maxend+1, 0].astype(np.int), axis=0)
## synapomorphies (just pis)
distpis = fsnps[:, :maxend+1, 1].sum(axis=0)
SNP_position_plot(distvar, distpis)
loading Assembly: cli from saved path: ~/Documents/ipyrad/tests/cli/cli.json
(<toyplot.canvas.Canvas at 0x7f7b1feb3dd0>, <toyplot.axes.Cartesian at 0x7f7b1feb3b90>)
I now add a five-base anchor just before the muscle alignment, and then strip it off again after aligning. I also added a check to make sure that the anchor is definitely not left on the read in case muscle tried to insert a gap opening into the anchor.
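A minimal sketch of the anchoring idea in plain Python, not ipyrad's actual implementation; the ANCHOR string, read sequences, and mock "aligned" output are all made up:

```python
## pad each read with an invariant 5-base anchor before alignment, then
## strip it (verifying it survived intact) afterwards
ANCHOR = "SSSSS"
reads = ["ACGTACGT", "ACGTAC"]

## add the anchor to the right side before alignment
padded = [r + ANCHOR for r in reads]

## ...muscle would align the padded reads here; the aligned output may
## gain gaps ("-") but the anchor should stay intact on the right edge...
aligned = ["ACGTACGT" + ANCHOR, "ACGTAC--" + ANCHOR]

## strip the anchor and check that no anchor bases were left behind
stripped = []
for seq in aligned:
    assert seq.endswith(ANCHOR), "anchor was disrupted during alignment"
    seq = seq[:-len(ANCHOR)]
    assert ANCHOR[0] not in seq, "anchor bases left on the read"
    stripped.append(seq)

print(stripped)   # ['ACGTACGT', 'ACGTAC--']
```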
## import the new version of ipyrad w/ this update.
import ipyrad as ip
print ip.__version__
0.2.1