```python
import time
import pandas as pd
import numpy as np
import gc
import MDAnalysis as mda
from MDAnalysis.topology.GROParser import GROParser
```
We will benchmark `AtomGroup`s, `ResidueGroup`s, and `SegmentGroup`s in attribute access and assignment on the current development branch of MDAnalysis and on our issue-363 branch, which uses an entirely new topology system.
These benchmarks were carried out on a Thinkpad X260 with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz. We also used:

```python
np.__version__
```

'1.12.0'
Our test systems were built from repeats of vesicles taken from the vesicle library publicly hosted on GitHub. We used three systems, with approximately 10 million, 3.5 million, and 1.5 million atoms.
```python
systems = {'10M' : 'systems/vesicles/10M/system.gro',
           '3.5M': 'systems/vesicles/3_5M/system.gro',
           '1.5M': 'systems/vesicles/1_5M/system.gro'}
```
How long does the GRO parser take? The new implementation does no mass guessing, so it is already lighter; guess methods are deferred to the `Universe` after the `Topology` has been built and attached. The final result is also a `Topology` object instead of a list of `Atom` objects.
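This division of labour can be pictured with a toy structure-of-arrays container; the class and function names below are illustrative stand-ins, not the actual MDAnalysis API:

```python
import numpy as np

class ToyTopology(object):
    """Illustrative stand-in: one array per attribute, no per-atom objects."""
    def __init__(self, names):
        self.names = np.asarray(names)

def parse(names):
    # the parser records only what the file contains; no guessing here
    return ToyTopology(names)

def guess_masses(topology, table):
    # deferred step: runs later, when the Universe actually needs masses
    return np.array([table.get(name[0], 0.0) for name in topology.names])

top = parse(['CA', 'CB', 'OW', 'G1'])
masses = guess_masses(top, {'C': 12.011, 'O': 15.999})
print(masses)
```

Keeping guessing out of the parser means the parse cost is paid once, while the (often unneeded) guessing cost is only paid on demand.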
```python
def time_GROParser(gro, parser):
    """Time how long it takes to parse a GRO topology with a given parser.

    Parameters
    ----------
    gro
        path to the GRO file to parse
    parser
        the parser class to use

    Returns
    -------
    dt
        total time in seconds required to parse the file
    """
    start = time.time()
    parser(gro).parse()
    dt = time.time() - start
    gc.collect()
    return dt
```
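Wall-clock timing with `time.time` is simple but noisy; the stdlib `timeit` module repeats the call and lets us take the best run. A sketch with a stand-in workload (a real benchmark would pass `parser(gro).parse` instead):

```python
import timeit

def workload():
    # stand-in for parser(gro).parse(); any deterministic work will do
    return sum(i * i for i in range(100000))

# three repeats of five calls each; the minimum is least affected by noise
best = min(timeit.repeat(workload, number=5, repeat=3)) / 5
print("best per-call time: {:.6f} s".format(best))
```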
```python
def time_multiple_GROParser(grofiles, iterations, parser):
    """Get parse timings for multiple GRO files over multiple iterations.

    Parameters
    ----------
    grofiles
        dictionary giving GRO file paths as values
    iterations
        number of timings to do for each GRO file
    parser
        GRO parser to use

    Returns
    -------
    data
        DataFrame giving the timings for each run
    """
    data = {'system': [],
            'time': []}
    for gro in grofiles:
        print("gro file: {}".format(gro))
        for i in range(iterations):
            print("\riteration: {}".format(i), end='')
            out = time_GROParser(grofiles[gro], parser)
            data['system'].append(gro)
            data['time'].append(out)
        print('\n')
    return pd.DataFrame(data)
```
```python
df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)
```

gro file: 3.5M iteration: 1 gro file: 1.5M iteration: 1 gro file: 10M iteration: 1

```python
df
```

|   | system | time |
|---|--------|-----------|
| 0 | 3.5M | 25.463573 |
| 1 | 3.5M | 25.045070 |
| 2 | 1.5M | 12.297968 |
| 3 | 1.5M | 12.277537 |
| 4 | 10M | 70.155502 |
| 5 | 10M | 69.792134 |
```python
df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)
```

gro file: 3.5M iteration: 0

/home/alter/Library/mdanalysis/MDAnalysis/package/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: G "".format(', '.join(misses)))

iteration: 1 gro file: 1.5M iteration: 1 gro file: 10M iteration: 1

```python
df
```

|   | system | time |
|---|--------|-----------|
| 0 | 3.5M | 16.211191 |
| 1 | 3.5M | 15.910637 |
| 2 | 1.5M | 7.958214 |
| 3 | 1.5M | 7.935534 |
| 4 | 10M | 45.729256 |
| 5 | 10M | 45.610785 |
Our new parser is about 1.5 times faster, though the comparison isn't entirely direct, since we made different choices about what each parser should do. Memory is more telling: the old parser yields a data structure that occupies about 3.6 GB, while the new parser's result is only 1.3 GB. The new `Topology` object is a lot smaller than a list of `Atom`s.
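The size difference is inherent to the two layouts. A rough, illustrative comparison of per-atom Python objects against a structure-of-arrays layout, using toy attributes rather than the real classes:

```python
import sys
import numpy as np

n = 100000

# old layout: one Python object per atom
class Atom(object):
    __slots__ = ('name', 'resid', 'mass')
    def __init__(self, name, resid, mass):
        self.name = name
        self.resid = resid
        self.mass = mass

atoms = [Atom('CA', i // 10, 12.011) for i in range(n)]

# new layout: one array per attribute (structure of arrays)
names = np.full(n, 'CA', dtype='U4')
resids = np.arange(n) // 10
masses = np.full(n, 12.011)

# rough accounting: getsizeof ignores the attribute values themselves,
# so this actually understates the cost of the object-per-atom layout
per_object = sys.getsizeof(atoms) + sum(sys.getsizeof(a) for a in atoms)
per_array = names.nbytes + resids.nbytes + masses.nbytes
print(per_object, per_array)
```

Even with `__slots__`, the object-per-atom layout pays per-instance overhead for every atom; the arrays pay only for the raw values.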
Creating `AtomGroup`s in the old implementation requires indexing a list of `Atom` objects. This can get expensive for a large number of atoms.
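The gap can be illustrated outside MDAnalysis with stand-ins for the two storage schemes: fancy-indexing a Python list takes one indexing operation per selected element, while a NumPy array resolves the whole index in a single vectorized call (the data here is a stand-in, not real `Atom` objects):

```python
import numpy as np

n = 1000000
idx = np.arange(3, n, 7)

atom_list = list(range(n))   # stand-in for the old list of Atom objects
atom_array = np.arange(n)    # stand-in for the new array-backed storage

# old scheme: one Python-level indexing operation per selected atom
subset_list = [atom_list[i] for i in idx]

# new scheme: the whole fancy index is resolved in one vectorized call
subset_array = atom_array[idx]

print(len(subset_list), len(subset_array))
```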
```python
def time_AtomGroup_slice(universe, slice_):
    """Time how long it takes to slice an AtomGroup out of all atoms in the system.

    Parameters
    ----------
    universe
        Universe whose atoms will be sliced
    slice_
        the slice to apply; can also be a fancy or boolean index

    Returns
    -------
    dt
        total time in seconds required to create the AtomGroup
    """
    start = time.time()
    universe.atoms[slice_]
    dt = time.time() - start
    return dt
```
```python
def time_multiple_AtomGroup_slice(grofiles, iterations, slice_):
    """Get slice timings for multiple GRO files over multiple iterations.

    Parameters
    ----------
    grofiles
        dictionary giving GRO file paths as values
    iterations
        number of timings to do for each GRO file
    slice_
        AtomGroup slicing to use; can be a fancy or boolean index

    Returns
    -------
    data
        DataFrame giving the timings for each run
    """
    data = {'system': [],
            'time': []}
    for gro in grofiles:
        print("gro file: {}".format(gro))
        u = mda.Universe(grofiles[gro])
        for i in range(iterations):
            print("\riteration: {}".format(i), end='')
            out = time_AtomGroup_slice(u, slice_)
            data['system'].append(gro)
            data['time'].append(out)
        print('\n')
    return pd.DataFrame(data)
```
```python
df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.017631 |
| 10M | 0.016662 |
| 3.5M | 0.017449 |
```python
df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.000441 |
| 10M | 0.000424 |
| 3.5M | 0.000420 |
About 40 times faster! Note that because the index we chose selects the same number of atoms regardless of system size, the time taken doesn't scale with system size here. Other indices and slices could be applied, but since this is a fancy index, it should be the worst case for speed in our new scheme.
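One way to see why a fancy index is the worst case: NumPy basic slices return views, which cost almost nothing, while fancy indexing must copy the selected elements into a new array:

```python
import numpy as np

a = np.arange(10)

view = a[2:8:2]                  # basic slice: a view onto a, no data copied
fancy = a[np.array([2, 4, 6])]   # fancy index: copies the selected elements

a[2] = 99
print(view[0])   # -> 99, the view reflects the change
print(fancy[0])  # -> 2, the copy does not
```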
We often want to get attributes of an `AtomGroup`'s atoms. In the old scheme, this required iterating through the list of `Atom` objects, filling an array with each atom's attribute value. In our new scheme, the `AtomGroup`'s indices are used to slice the corresponding `TopologyAttr` array. Getting a residue-level attribute such as `resids` from an `AtomGroup` uses the `AtomGroup`'s indices to slice a translation table in the `Topology` to get the corresponding residue indices, and then uses these to slice the `resids` array, which stores one resid per residue.
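The two-step lookup can be sketched with plain NumPy arrays (the data here is made up for illustration):

```python
import numpy as np

# translation table: atom index -> residue index (one entry per atom)
atom2res = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
# resids array: one entry per residue
resids = np.array([101, 102, 103])

atom_indices = np.array([1, 4, 8])    # some AtomGroup's indices
res_indices = atom2res[atom_indices]  # first slice: residue index per atom
print(resids[res_indices])            # second slice: resid per atom -> [101 102 103]
```

Both steps are vectorized fancy indexes, so the whole lookup stays in NumPy with no per-atom Python overhead.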
```python
def time_AtomGroup_attr(atomgroup, attribute):
    """Time how long it takes to get an attribute of an AtomGroup.

    Parameters
    ----------
    atomgroup
        AtomGroup to use
    attribute
        name of the attribute to get

    Returns
    -------
    dt
        total time in seconds required to get the attribute
    """
    start = time.time()
    getattr(atomgroup, attribute)
    dt = time.time() - start
    return dt
```
```python
def time_multiple_AtomGroup_attr(grofiles, iterations, attribute):
    """Get attribute-access timings for multiple GRO files over multiple iterations.

    Parameters
    ----------
    grofiles
        dictionary giving GRO file paths as values
    iterations
        number of timings to do for each GRO file
    attribute
        name of the attribute to get

    Returns
    -------
    data
        DataFrame giving the timings for each run
    """
    data = {'system': [],
            'time': []}
    for gro in grofiles:
        print("gro file: {}".format(gro))
        u = mda.Universe(grofiles[gro])
        ag = u.atoms
        for i in range(iterations):
            print("\riteration: {}".format(i), end='')
            out = time_AtomGroup_attr(ag, attribute)
            data['system'].append(gro)
            data['time'].append(out)
        print('\n')
    return pd.DataFrame(data)
```
```python
df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.241742 |
| 10M | 1.538812 |
| 3.5M | 0.597206 |
```python
df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.037288 |
| 10M | 0.244270 |
| 3.5M | 0.076125 |
About a 6x-8x speedup for accessing attributes.
```python
def time_AtomGroup_setattr(atomgroup, attribute, values):
    """Time how long it takes to set an attribute of an AtomGroup.

    Parameters
    ----------
    atomgroup
        AtomGroup to use
    attribute
        name of the attribute to set
    values
        values to set with

    Returns
    -------
    dt
        total time in seconds required to set the attribute
    """
    start = time.time()
    setattr(atomgroup, attribute, values)
    dt = time.time() - start
    return dt
```
```python
def time_multiple_AtomGroup_setatomids(grofiles, iterations):
    """Get attribute-setting timings for multiple GRO files over multiple iterations.

    Sets ``names`` on the full AtomGroup using its current values.

    Parameters
    ----------
    grofiles
        dictionary giving GRO file paths as values
    iterations
        number of timings to do for each GRO file

    Returns
    -------
    data
        DataFrame giving the timings for each run
    """
    data = {'system': [],
            'time': []}
    for gro in grofiles:
        print("gro file: {}".format(gro))
        u = mda.Universe(grofiles[gro])
        ag = u.atoms
        for i in range(iterations):
            print("\riteration: {}".format(i), end='')
            out = time_AtomGroup_setattr(ag, 'names', ag.names)
            data['system'].append(gro)
            data['time'].append(out)
        print('\n')
    return pd.DataFrame(data)
```
```python
df = time_multiple_AtomGroup_setatomids(systems, iterations=10)
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.751237 |
| 10M | 3.789588 |
| 3.5M | 1.318672 |
```python
df = time_multiple_AtomGroup_setatomids(systems, iterations=10)
```

gro file: 3.5M iteration: 9 gro file: 1.5M iteration: 9 gro file: 10M iteration: 9

```python
df.groupby('system').mean()
```

| system | time |
|--------|----------|
| 1.5M | 0.017210 |
| 10M | 0.097012 |
| 3.5M | 0.035642 |
The new scheme gives about a 40x speedup for setting!
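Setting mirrors getting: instead of looping over `Atom` objects, the new values land in the attribute array with a single scatter assignment. A minimal NumPy sketch with made-up names:

```python
import numpy as np

names = np.array(['N', 'CA', 'C', 'O', 'CB'])
indices = np.array([1, 4])      # an AtomGroup's indices into the Universe

# one vectorized scatter assignment replaces a per-Atom loop;
# note the fixed-width dtype ('<U2' here) would truncate longer names
names[indices] = ['XX', 'YY']
print(names)  # -> ['N' 'XX' 'C' 'O' 'YY']
```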