In [1]:
import time
import pandas as pd
import numpy as np
import gc

import MDAnalysis as mda
from MDAnalysis.topology.GROParser import GROParser

We will benchmark attribute access and assignment for AtomGroups, ResidueGroups, and SegmentGroups in the current development branch of MDAnalysis and in our issue-363 branch, which uses an entirely new topology system.

These benchmarks were carried out on a Thinkpad X260 with an Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz. We also used:

In [2]:
np.__version__
Out[2]:
'1.12.0'

Our test systems were built from repeats of vesicles taken from the vesicle library publicly hosted on GitHub. We used three systems, with approximately 10 million, 3.5 million, and 1.5 million atoms.

In [3]:
systems = {'10M'  : 'systems/vesicles/10M/system.gro',
           '3.5M' : 'systems/vesicles/3_5M/system.gro',
           '1.5M' : 'systems/vesicles/1_5M/system.gro'}

Building the topology

How long does the GRO parser take? In the new implementation we don't do any mass guessing during parsing, so the parser is already lighter; guess methods are deferred to the Universe after the Topology has been built and attached. The final result is also a Topology object instead of a list of Atom objects.
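To make the structural difference concrete, here is a minimal sketch of the two layouts. The `Atom` and `Topology` classes below are toy stand-ins, not the real MDAnalysis classes: the old parser built one Python object per atom, while the new parser collects each attribute into a single numpy array.

```python
import numpy as np

class Atom(object):
    """Old-style per-atom record (one Python object per atom)."""
    def __init__(self, name, resid):
        self.name = name
        self.resid = resid

class Topology(object):
    """New-style struct-of-arrays container (one array per attribute)."""
    def __init__(self, names, resids):
        self.names = np.asarray(names)
        self.resids = np.asarray(resids)

names = ['OW', 'HW1', 'HW2'] * 4
resids = [i // 3 + 1 for i in range(12)]

old = [Atom(n, r) for n, r in zip(names, resids)]  # list of 12 objects
new = Topology(names, resids)                      # two arrays of length 12

# same information, very different access patterns
assert [a.resid for a in old] == list(new.resids)
```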

In [4]:
def time_GROParser(gro, parser):
    """Time how long it takes to parse a GRO topology with a given parser.
    
    :Arguments:
        *gro*
            path to the GRO file to parse
        *parser*
            the parser class to use
            
    :Returns:
        *dt*
            total time in seconds required to parse file
    
    """
    start = time.time()
    parser(gro).parse()
    dt = time.time() - start
    gc.collect()
    return dt

In [5]:
def time_multiple_GROParser(grofiles, iterations, parser):
    """Get parse timings for multiple grofiles over multiple iterations.
    
    :Arguments:
        *grofiles*
            dictionary giving as values grofile paths
        *iterations*
            number of timings to do for each gro file
        *parser*
            GRO parser to use
            
    :Returns:
        *data*
            dataframe giving the timings for each run
    """
    data = {
            'system': [],
            'time': []
           }

    for gro in grofiles:
        print "gro file: {}".format(gro)
        for i in range(iterations):
            print "\riteration: {}".format(i),

            out = time_GROParser(grofiles[gro], parser)

            data['system'].append(gro)
            data['time'].append(out)
        print '\n'

    return pd.DataFrame(data)

the old GRO parser

In [6]:
df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)
gro file: 3.5M
iteration: 1  

gro file: 1.5M
iteration: 1  

gro file: 10M
iteration: 1  


In [7]:
df
Out[7]:
  system       time
0   3.5M  25.463573
1   3.5M  25.045070
2   1.5M  12.297968
3   1.5M  12.277537
4    10M  70.155502
5    10M  69.792134

the new GRO parser

In [6]:
df = time_multiple_GROParser(systems, iterations=2, parser=GROParser)
gro file: 3.5M
iteration: 0
/home/alter/Library/mdanalysis/MDAnalysis/package/MDAnalysis/topology/guessers.py:56: UserWarning: Failed to guess the mass for the following atom types: G
  "".format(', '.join(misses)))
iteration: 1 

gro file: 1.5M
iteration: 1  

gro file: 10M
iteration: 1  

In [7]:
df
Out[7]:
  system       time
0   3.5M  16.211191
1   3.5M  15.910637
2   1.5M   7.958214
3   1.5M   7.935534
4    10M  45.729256
5    10M  45.610785

Our new parser is about 1.5 times faster. The comparison isn't entirely apples-to-apples, however, since the two parsers make different choices about what work to do.

Our old parser yields a data structure that is about 3.6 GB in memory, and our new parser gives one that is only 1.3 GB. The new Topology object is a lot smaller than a list of Atoms.
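The size gap follows from the storage layouts. The sketch below is illustrative only, using a toy `Atom` class and `sys.getsizeof` rather than the real MDAnalysis structures; it does not count the (shared) float objects themselves, and the exact bytes are machine-dependent, so the numbers will differ from the 3.6 GB and 1.3 GB above.

```python
import sys
import numpy as np

class Atom(object):
    __slots__ = ('mass',)
    def __init__(self, mass):
        self.mass = mass

n = 100000
as_objects = [Atom(1.008) for _ in range(n)]  # old: one object per atom
as_array = np.full(n, 1.008)                  # new: one float64 array

# list overhead (pointers) plus per-object overhead, versus raw array bytes
object_bytes = sys.getsizeof(as_objects) + sum(sys.getsizeof(a) for a in as_objects)
array_bytes = as_array.nbytes

print("list of objects: ~{:.1f} MB".format(object_bytes / 1e6))
print("numpy array:     ~{:.1f} MB".format(array_bytes / 1e6))
assert array_bytes < object_bytes
```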

Creating AtomGroups

Creating AtomGroups in the old implementation requires indexing a list of Atom objects. This can get expensive for a large number of atoms.
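The cost difference can be sketched with plain lists and arrays as stand-ins for Atom lists and topology index arrays (this is not the real implementation): the list path touches one Python object per selected atom, while the array path is a single vectorized gather.

```python
import time
import numpy as np

n_atoms = 1000000
atom_list = list(range(n_atoms))   # stand-in for a list of Atom objects
atom_array = np.arange(n_atoms)    # stand-in for a topology index array
idx = np.arange(3, n_atoms, 7)     # a fancy index, as in the benchmark above

# old-style: index the list one element at a time
start = time.time()
selected_list = [atom_list[i] for i in idx]
t_list = time.time() - start

# new-style: one vectorized fancy-index operation
start = time.time()
selected_array = atom_array[idx]
t_array = time.time() - start

assert list(selected_array) == selected_list
print("list: {:.4f}s  array: {:.4f}s".format(t_list, t_array))
```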

In [8]:
def time_AtomGroup_slice(universe, slice_):
    """Time how long it takes to slice an AtomGroup out of all atoms in the system.
    
    Parameters
    ----------
    universe
        Universe whose atoms will be sliced
    slice_
        the slice to apply; can also be a fancy or boolean index
        
    :Returns:
    df
        total time in seconds required to create AtomGroup
    
    """
    start = time.time()
    universe.atoms[slice_]
    dt = time.time() - start
    return dt

In [9]:
def time_multiple_AtomGroup_slice(grofiles, iterations, slice_):
    """Get parse timings for multiple grofiles over multiple iterations.
    
    :Arguments:
    grofiles
        dictionary giving as values grofile paths
    iterations
        number of timings to do for each gro file
    slice_
        AtomGroup slicing to use; can be a fancy or boolean index
            
    :Returns:
    data
        dataframe giving the timings for each run
    """
    data = {
            'system': [],
            'time': []
           }

    for gro in grofiles:
        print "gro file: {}".format(gro)
        u = mda.Universe(grofiles[gro])
        for i in range(iterations):
            print "\riteration: {}".format(i),

            out = time_AtomGroup_slice(u, slice_)

            data['system'].append(gro)
            data['time'].append(out)
        print '\n'

    return pd.DataFrame(data)

old implementation

In [10]:
df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))
gro file: 3.5M
iteration: 9 

gro file: 1.5M
iteration: 9 

gro file: 10M
iteration: 9 


In [11]:
df.groupby('system').mean()
Out[11]:
            time
system
1.5M    0.017631
10M     0.016662
3.5M    0.017449

new implementation

In [10]:
df = time_multiple_AtomGroup_slice(systems, iterations=10, slice_=np.arange(3, 574393, 7))
gro file: 3.5M
iteration: 9 

gro file: 1.5M
iteration: 9 

gro file: 10M
iteration: 9 

In [11]:
df.groupby('system').mean()
Out[11]:
            time
system
1.5M    0.000441
10M     0.000424
3.5M    0.000420

About 40 times faster! Note that since the index we chose selects the same number of atoms for every system size, the time taken doesn't scale with system size here. Other indices or slices could be used, but a fancy index like this one should be the worst case for speed in the new scheme in any event.

Getting attributes

We often want to get attributes of an AtomGroup's atoms. In the old scheme, this required iterating through the list of Atom objects, filling an array with each atom's value of the attribute. In our new scheme, the AtomGroup's indices are used to slice the corresponding TopologyAttr array directly. Getting a residue-level attribute such as resids from an AtomGroup uses the AtomGroup's indices to slice a translation table in the Topology to get the corresponding residue indices, and then uses these to slice the resids array, yielding the resid for each atom in the group.
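The two-step lookup for residue-level attributes can be sketched with plain numpy arrays (the array names here are hypothetical stand-ins, not the actual Topology internals):

```python
import numpy as np

# per-atom translation table: which residue index each atom belongs to
atom2res = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
# per-residue attribute: the resid of each residue
resids = np.array([101, 102, 103])

# an AtomGroup is essentially a set of atom indices into these arrays
ag_indices = np.array([1, 4, 5, 8])

# step 1: atom indices -> residue indices
# step 2: residue indices -> resids, one per atom in the group
ag_resids = resids[atom2res[ag_indices]]
print(ag_resids)
```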

In [12]:
def time_AtomGroup_attr(atomgroup, attribute):
    """Time how long it takes to get an attribute of an AtomGroup.
    
    Parameters
    ----------
    atomgroup
        atomgroup to use
    attribute
        attribute to get
        
    :Returns:
    df
        total time in seconds required to get attribute
    
    """
    start = time.time()
    getattr(atomgroup, attribute)
    dt = time.time() - start
    return dt

In [13]:
def time_multiple_AtomGroup_attr(grofiles, iterations, attribute):
    """Get parse timings for multiple grofiles over multiple iterations.
    
    Arguments
    ---------
    grofiles
        dictionary giving as values grofile paths
    iterations
        number of timings to do for each gro file
    attribute
        attribute to get
            
    Returns
    -------
    data
        dataframe giving the timings for each run
    """
    data = {
            'system': [],
            'time': []
           }

    for gro in grofiles:
        print "gro file: {}".format(gro)
        u = mda.Universe(grofiles[gro])
        ag = u.atoms
        for i in range(iterations):
            print "\riteration: {}".format(i),

            out = time_AtomGroup_attr(ag, attribute)

            data['system'].append(gro)
            data['time'].append(out)
        print '\n'

    return pd.DataFrame(data)

old attribute access

In [14]:
df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')
gro file: 3.5M
iteration: 9          

gro file: 1.5M
iteration: 9         

gro file: 10M
iteration: 9          


In [15]:
df.groupby('system').mean()
Out[15]:
            time
system
1.5M    0.241742
10M     1.538812
3.5M    0.597206

new attribute access

In [14]:
df = time_multiple_AtomGroup_attr(systems, iterations=10, attribute='resnames')
gro file: 3.5M
iteration: 9    

gro file: 1.5M
iteration: 9  

gro file: 10M
iteration: 9        

In [15]:
df.groupby('system').mean()
Out[15]:
            time
system
1.5M    0.037288
10M     0.244270
3.5M    0.076125

About a 6x-8x speedup for attribute access.

Setting atom attributes

In [16]:
def time_AtomGroup_setattr(atomgroup, attribute, values):
    """Time how long it takes to set an attribute of an AtomGroup.
    
    Parameters
    ----------
    atomgroup
        atomgroup to use
    attribute
        attribute to set
    values
        values to set with
        
    :Returns:
    df
        total time in seconds required to set attribute
    
    """
    start = time.time()
    setattr(atomgroup, attribute, values)
    dt = time.time() - start
    return dt

In [17]:
def time_multiple_AtomGroup_setatomids(grofiles, iterations):
    """Get timings for multiple grofiles over multiple iterations.
    
    Arguments
    ---------
    grofiles
        dictionary giving as values grofile paths
    iterations
        number of timings to do for each gro file
            
    Returns
    -------
    data
        dataframe giving the timings for each run
    """
    data = {
            'system': [],
            'time': []
           }

    for gro in grofiles:
        print "gro file: {}".format(gro)
        u = mda.Universe(grofiles[gro])
        ag = u.atoms
        for i in range(iterations):
            print "\riteration: {}".format(i),
            
            out = time_AtomGroup_setattr(ag, 'names', ag.names)

            data['system'].append(gro)
            data['time'].append(out)
        print '\n'

    return pd.DataFrame(data)

old implementation

In [18]:
df = time_multiple_AtomGroup_setatomids(systems, iterations=10)
gro file: 3.5M
iteration: 9          

gro file: 1.5M
iteration: 9          

gro file: 10M
iteration: 9          


In [19]:
df.groupby('system').mean()
Out[19]:
            time
system
1.5M    0.751237
10M     3.789588
3.5M    1.318672

new implementation

In [19]:
df = time_multiple_AtomGroup_setatomids(systems, iterations=10)
gro file: 3.5M
iteration: 9    

gro file: 1.5M
iteration: 9   

gro file: 10M
iteration: 9  

In [20]:
df.groupby('system').mean()
Out[20]:
            time
system
1.5M    0.017210
10M     0.097012
3.5M    0.035642

The new scheme gives about a 40x speedup for setting!
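The reason for the gap can be sketched with toy classes and arrays (not the real implementation): the old scheme performs one attribute assignment per Atom object in a Python loop, while the new scheme does a single vectorized write into the attribute array.

```python
import numpy as np

class Atom(object):
    def __init__(self, name):
        self.name = name

atoms = [Atom('OW') for _ in range(6)]      # old: list of objects
names = np.array(['OW'] * 6, dtype=object)  # new: one attribute array
new_names = ['O1', 'O2', 'O3', 'O4', 'O5', 'O6']

# old scheme: one attribute assignment per atom
for atom, name in zip(atoms, new_names):
    atom.name = name

# new scheme: a single vectorized assignment into the array
names[:] = new_names

assert [a.name for a in atoms] == list(names)
```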