Using Sumatra with Pandas in IPython

This notebook demonstrates how to use Sumatra to capture simulation input data and metadata, and then export these records into a Pandas data frame. Sumatra has a standalone web interface, built with Django, which allows users to view the data. The records can also be imported into Python, but manipulating and displaying them in useful custom formats requires a lot of code. Pandas is well suited to manipulating Sumatra's data. In particular, the ability to quickly combine input data, metadata and output data into custom data frames is powerful for data analysis, reproducibility and sharing.

The first step in using Sumatra is to set up a simulation. Here the simulation just runs a diffusion problem using FiPy and outputs the time taken for a time step. The goal of the work is to test FiPy's parallel speed up for different input parameters.

Set Up the Simulations

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt

Sumatra requires a file with the parameters specified.

In [2]:
import json

params = {'N' : 10, 'suite' : 'trilinos', 'iterations' : 100}

with open('params.json', 'w') as fp:
    json.dump(params, fp)

The script file for running the simulation is fipy_timing.py. It reads the JSON file, runs the simulation and then stores the run times in data.txt.

In [3]:
%%writefile fipy_timing.py

"""
Usage: fipy_timing.py [<jsonfile>]

"""

from docopt import docopt
import json
import timeit
import numpy as np
import fipy as fp
import os

arguments = docopt(__doc__, version='Run FiPy timing')
jsonfile = arguments['<jsonfile>']

if jsonfile:
    with open(jsonfile, 'rb') as ff:
        params = json.load(ff)
else:
    params = dict()
    
N = params.get('N', 10)
iterations = params.get('iterations', 100)
suite = params.get('suite', 'trilinos')
sumatra_label = params.get('sumatra_label', '')

attempts = 3

setup_str = '''
import fipy as fp
import numpy as np
np.random.seed(1)
L = 1.
N = {N:d}
m = fp.GmshGrid3D(nx=N, ny=N, nz=N, dx=L / N, dy=L / N, dz=L / N)
v0 = np.random.random(m.numberOfCells)
v = fp.CellVariable(mesh=m)
v0 = np.resize(v0, len(v)) ## Gmsh doesn't always give us the correct sized grid!
eqn = fp.TransientTerm(1e-3) == fp.DiffusionTerm()
v[:] = v0.copy()

import fipy.solvers.{suite} as solvers
solver = solvers.linearPCGSolver.LinearPCGSolver(precon=None, iterations={iterations}, tolerance=1e-100)

eqn.solve(v, dt=1., solver=solver)
v[:] = v0.copy()
'''

timeit_str = '''
eqn.solve(v, dt=1., solver=solver)
fp.parallelComm.Barrier()
'''

timer = timeit.Timer(timeit_str, setup=setup_str.format(N=N, suite=suite, iterations=iterations))
times = timer.repeat(attempts, 1)

if fp.parallelComm.procID == 0:
    filepath = os.path.join('Data', sumatra_label)
    filename = 'data.txt'
    np.savetxt(os.path.join(filepath, filename), times)
Overwriting fipy_timing.py
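
The timing pattern used in the script can be tried on its own, without FiPy. The following is a minimal sketch and not part of the original notebook: the setup and statement strings are hypothetical stand-ins for the FiPy problem, while timeit.Timer and repeat are called in the same way as in fipy_timing.py.

```python
import timeit

# Hypothetical stand-in for the FiPy setup string: a flat array with N**3 entries.
setup_str = '''
import numpy as np
np.random.seed(1)
N = {N:d}
v0 = np.random.random(N ** 3)
'''

# Hypothetical stand-in for the statement being timed.
timeit_str = '''
v0.sum()
'''

attempts = 3
timer = timeit.Timer(timeit_str, setup=setup_str.format(N=10))
# repeat(attempts, 1) runs the setup followed by a single execution of the
# statement, `attempts` times over, and returns one time per attempt.
times = timer.repeat(attempts, 1)
print(times)
```

Because the setup string is formatted before it is handed to the timer, the parameters read from the JSON file never need to be visible inside the timed statement.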

Without using Sumatra and in serial this is run with

In [145]:
!python fipy_timing.py params.json

and the output data file is

In [146]:
!more Data/data.txt
1.253199577331542969e-02
1.225900650024414062e-02
1.175403594970703125e-02

Create a Git Repository

In this demo, I'm assuming that the working directory is a Git repository set up with

$ git init

$ git add fipy_timing.py

$ git commit -m "Add timing script."

Sumatra requires that the script sits in a working copy of a repository.

In [3]:
!git log -1
commit 6a830dac2ea45ea090ec91a4a0f5263be10e95f3
Author: Daniel Wheeler <[email protected]>
Date:   Wed Feb 26 13:50:21 2014 -0500

    Fix README.

Configure Sumatra

Once the Git repository is set up, the Sumatra project can be configured. Here we use the distributed launch mode because we want Sumatra to launch and record parallel jobs.

In [148]:
%%bash

\rm -rf .smt
smt init smt-demo
smt configure --executable=python --main=fipy_timing.py
smt configure --launch_mode=distributed
smt configure -g uuid
smt configure -c store-diff
smt configure --addlabel=parameters
Sumatra project successfully set up
Multiple versions found, using /home/wd15/anaconda/bin/python. If you wish to use a different version, please specify it explicitly
Multiple versions found, using /home/wd15/anaconda/bin/mpirun. If you wish to use a different version, please specify it explicitly

Sumatra requires that a Data/ directory exists in the working copy.

In [ ]:
!mkdir Data

If we were not using Sumatra, we would launch the job with

$ mpirun -n 2 python fipy_timing.py params.json

The equivalent command using Sumatra is

$ smt run -n 2 params.json

Run Simulations

In the following cell we just run a batch of simulations with varying parameters.

In [4]:
import itertools

nprocs = (1, 2, 4, 8)
iterations_ = (100,)
Ns = (10, 40)
suites = ('trilinos',)
tag='demo4'

for nproc, iterations, N, suite in itertools.product(nprocs, iterations_, Ns, suites):
    !smt run --tag=$tag -n $nproc params.json N=$N iterations=$iterations suite=$suite
/home/wd15/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:740: UserWarning: Found matplotlib configuration in ~/.matplotlib/. To conform with the XDG base directory standard, this configuration location has been deprecated on Linux, and the new location is now '/home/wd15/.config'/matplotlib/. Please move your configuration there to ensure that matplotlib will continue to find it in the future.
  _get_xdg_config_dir())
/home/wd15/hg/sumatra/sumatra/launch.py:263: UserWarning: mpi4py is not available, so Sumatra is not able to obtain platform information for remote nodes.
  warnings.warn("mpi4py is not available, so Sumatra is not able to obtain platform information for remote nodes.")
Data keys are [data.txt(5a633ee751a2043deb828d28d4daefc0372c5b63)]
Data keys are [data.txt(f66b8ea1a596f8f06321d389fc412e4809d50697)]
Data keys are [data.txt(1d0effa7003d8e75bf998cce5a4f71a72dff7025)]
Data keys are [data.txt(38d19ef93141666ffbabcb30c700028a87631ced)]
Data keys are [data.txt(3fa95ba72c140b64dc07d4d2a7a9d0af38da06c8)]
Data keys are [data.txt(2c702e292ec4da225cdfc414d29f80e9df2ccfd7)]
Data keys are [data.txt(c63d4c44770cdc4329b13dc863762c8b5069dd2e)]
Data keys are [data.txt(c4e026e885b64122b4b578c334ef9f078eac3df1)]
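
The `!` shell escape only works inside IPython. As a rough sketch of how the same batch could be driven from a plain Python script, the code below builds one smt run argument list per parameter combination. The helper name build_commands is hypothetical, and the actual launch is left commented out because it needs a configured Sumatra project.

```python
import itertools


def build_commands(nprocs, iterations_, Ns, suites, tag):
    """Return one `smt run` argument list per parameter combination."""
    commands = []
    for nproc, iterations, N, suite in itertools.product(nprocs, iterations_, Ns, suites):
        commands.append([
            'smt', 'run', '--tag={0}'.format(tag), '-n', str(nproc), 'params.json',
            'N={0}'.format(N),
            'iterations={0}'.format(iterations),
            'suite={0}'.format(suite),
        ])
    return commands


commands = build_commands((1, 2, 4, 8), (100,), (10, 40), ('trilinos',), 'demo4')
for cmd in commands:
    print(' '.join(cmd))
    # To launch for real (assumes a configured Sumatra project):
    # import subprocess; subprocess.check_call(cmd)
```

Passing the command as a list rather than a single string avoids any shell quoting issues if a parameter value ever contains spaces.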

Import into Pandas Dataframe

The important part of this story is importing the data into a Pandas data frame. This is straightforward because Sumatra's default export format is a JSON file containing all the records.

In [5]:
import json
import pandas

!smt export
with open('.smt/records_export.json') as ff:
    data = json.load(ff)

df = pandas.DataFrame(data)
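
To see what pandas.DataFrame does with a list of records, here is a small sketch using made-up data. The two records are hypothetical and far smaller than real Sumatra records; only the field names mirror ones that appear in this notebook. Each key becomes a column, and nested dictionaries are kept as Python objects inside the cells.

```python
import json
import pandas

# Made-up stand-in for the contents of .smt/records_export.json.
raw = '''
[
  {"label": "aaa", "duration": 1.5, "repeats": null,
   "launch_mode": {"parameters": {"n": 2}}},
  {"label": "bbb", "duration": 2.5, "repeats": null,
   "launch_mode": {"parameters": {"n": 4}}}
]
'''
records = json.loads(raw)
toy = pandas.DataFrame(records)

# Scalar fields get a numeric or object dtype; nested fields stay as dicts.
print(toy.dtypes)
print(type(toy.loc[0, 'launch_mode']))
```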

The Sumatra data is now in a Pandas data frame, albeit a touch raw.

In [6]:
print(df)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 23 columns):
datastore           18  non-null values
dependencies        18  non-null values
diff                18  non-null values
duration            18  non-null values
executable          18  non-null values
input_data          18  non-null values
input_datastore     18  non-null values
label               18  non-null values
launch_mode         18  non-null values
main_file           18  non-null values
outcome             18  non-null values
output_data         18  non-null values
parameters          18  non-null values
platforms           18  non-null values
reason              18  non-null values
repeats             0  non-null values
repository          18  non-null values
script_arguments    18  non-null values
stdout_stderr       18  non-null values
tags                18  non-null values
timestamp           18  non-null values
user                18  non-null values
version             18  non-null values
dtypes: float64(1), object(22)
In [7]:
print(df[['label', 'duration']])
           label   duration
0   ac32b9fc6df4  32.574247
1   f372719a0648   6.860735
2   9695c2529109  24.854690
3   bf710e5339ff   3.592334
4   179099003946  28.189878
5   eeffa50a08bd   3.995513
6   180b9c889f94  34.474531
7   0732f6d89fc4   4.214891
8   f2073ab41bd7  34.305814
9   27bb809fa8ad  24.290209
10  179247440765  28.439126
11  0731f5a8e231  32.452093
12  0330697ac505   2.569647
13  a04a49a2107b   1.873670
14  6b3d5ac075a6   1.283991
15  1b124fc57ced   5.125001
16  6b04488b14ed   5.207692
17  5cc0546270c9   5.027596

Reformat the Raw Imported Dataframe

While all the metadata is important, we often want the input and output data combined into a data frame in a digestible form. Typically, we want a graph of reduced input versus reduced output.

The first step is to introduce columns in the data frame for each of the input parameters (input data). The input data is buried in the launch_mode and parameters columns of the raw data frame.

In [8]:
import json
df = df.copy()
df['nproc'] = df.launch_mode.map(lambda x: x['parameters']['n'])
for p in 'N', 'iterations', 'suite':
    df[p] = df.parameters.map(lambda x: json.loads(x['content'])[p])
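
The same extraction can be tried on a toy frame without a Sumatra project. This is only a sketch: the records are made up and contain just the two nested fields used above, with the parameters stored as a JSON string under content as in the cell above. Here the JSON string is decoded once per record rather than once per parameter.

```python
import json
import pandas

# Made-up records that mimic only the two nested fields used above.
toy = pandas.DataFrame([
    {'launch_mode': {'parameters': {'n': 2}},
     'parameters': {'content': json.dumps({'N': 10, 'iterations': 100, 'suite': 'trilinos'})}},
    {'launch_mode': {'parameters': {'n': 4}},
     'parameters': {'content': json.dumps({'N': 40, 'iterations': 100, 'suite': 'trilinos'})}},
])

# Decode each JSON parameter string once, then pull out the individual keys.
decoded = toy['parameters'].map(lambda x: json.loads(x['content']))
toy['nproc'] = toy['launch_mode'].map(lambda x: x['parameters']['n'])
for p in ('N', 'iterations', 'suite'):
    toy[p] = decoded.map(lambda d, p=p: d[p])

print(toy[['nproc', 'N', 'iterations', 'suite']])
```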

We now have the input data exposed as columns in the data frame.

In [9]:
columns = ['label', 'nproc', 'N', 'iterations', 'suite', 'tags']
print(df[columns].sort_values('nproc'))
           label  nproc   N  iterations     suite     tags
15  1b124fc57ced      1  10         100  trilinos  [demo2]
6   180b9c889f94      1  40         100  trilinos  [demo4]
7   0732f6d89fc4      1  10         100  trilinos  [demo4]
11  0731f5a8e231      1  40         100  trilinos  [demo3]
17  5cc0546270c9      2  10         100  trilinos       []
4   179099003946      2  40         100  trilinos  [demo4]
5   eeffa50a08bd      2  10         100  trilinos  [demo4]
16  6b04488b14ed      2  10         100  trilinos   [test]
10  179247440765      2  40         100  trilinos  [demo3]
14  6b3d5ac075a6      2  10         100  trilinos  [demo2]
2   9695c2529109      4  40         100  trilinos  [demo4]
3   bf710e5339ff      4  10         100  trilinos  [demo4]
9   27bb809fa8ad      4  40         100  trilinos  [demo3]
13  a04a49a2107b      4  10         100  trilinos  [demo2]
0   ac32b9fc6df4      8  40         100  trilinos  [demo4]
1   f372719a0648      8  10         100  trilinos  [demo4]
12  0330697ac505      8  10         100  trilinos  [demo2]
8   f2073ab41bd7      8  40         100  trilinos  [demo3]

The following pulls out the run times stored in the output files from each simulation into a run_time column, keeping the minimum over the repeated attempts.

In [10]:
import os

datafiles = df['output_data'].map(lambda x: x[0]['path'])
datapaths = df['datastore'].map(lambda x: x['parameters']['root'])
data = [np.loadtxt(os.path.join(x, y)) for x, y in zip(datapaths, datafiles)]
df['run_time'] = [min(d) for d in data]
In [11]:
columns.append('run_time')
print(df[columns].sort_values('nproc'))
           label  nproc   N  iterations     suite     tags  run_time
15  1b124fc57ced      1  10         100  trilinos  [demo2]  0.012017
6   180b9c889f94      1  40         100  trilinos  [demo4]  0.419316
7   0732f6d89fc4      1  10         100  trilinos  [demo4]  0.012037
11  0731f5a8e231      1  40         100  trilinos  [demo3]  0.402522
17  5cc0546270c9      2  10         100  trilinos       []  0.011014
4   179099003946      2  40         100  trilinos  [demo4]  0.252318
5   eeffa50a08bd      2  10         100  trilinos  [demo4]  0.011214
16  6b04488b14ed      2  10         100  trilinos   [test]  0.010802
10  179247440765      2  40         100  trilinos  [demo3]  0.253387
14  6b3d5ac075a6      2  10         100  trilinos  [demo2]  0.011340
2   9695c2529109      4  40         100  trilinos  [demo4]  0.173505
3   bf710e5339ff      4  10         100  trilinos  [demo4]  0.010188
9   27bb809fa8ad      4  40         100  trilinos  [demo3]  0.179195
13  a04a49a2107b      4  10         100  trilinos  [demo2]  0.010196
0   ac32b9fc6df4      8  40         100  trilinos  [demo4]  0.178471
1   f372719a0648      8  10         100  trilinos  [demo4]  0.016224
12  0330697ac505      8  10         100  trilinos  [demo2]  0.016702
8   f2073ab41bd7      8  40         100  trilinos  [demo3]  0.184142
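
The file handling in the run_time step can be checked in isolation. The sketch below writes two made-up timing files into a temporary directory, laid out like Data/<label>/data.txt, and then reduces each file to its fastest attempt. The labels and timings are invented for illustration. Taking the minimum follows the usual timeit advice that slower repeats mostly reflect interference from other processes.

```python
import os
import tempfile
import numpy as np

# Write two made-up timing files laid out like Data/<label>/data.txt.
root = tempfile.mkdtemp()
fake_times = {'run_a': [0.30, 0.25, 0.27], 'run_b': [0.12, 0.15, 0.11]}
for label, times in fake_times.items():
    os.makedirs(os.path.join(root, label))
    np.savetxt(os.path.join(root, label, 'data.txt'), times)

# Read each file back and keep the fastest of the repeated attempts.
run_time = {}
for label in fake_times:
    data = np.loadtxt(os.path.join(root, label, 'data.txt'))
    run_time[label] = float(data.min())

print(run_time)
```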

Create a mask that selects the simulation records tagged with demo4, then split those records by system size (N = 10 and N = 40). We want to plot these results as different curves on the same graph.

In [17]:
tag_mask = df.tags.map(lambda x: 'demo4' in x)
df_tmp = df[tag_mask]
m10 = df_tmp.N.map(lambda x: x == 10)
m40 = df_tmp.N.map(lambda x: x == 40)
df_N10 = df_tmp[m10]
df_N40 = df_tmp[m40]
print(df_N10[columns].sort_values('nproc'))
print(df_N40[columns].sort_values('nproc'))
          label  nproc   N  iterations     suite     tags  run_time
7  0732f6d89fc4      1  10         100  trilinos  [demo4]  0.012037
5  eeffa50a08bd      2  10         100  trilinos  [demo4]  0.011214
3  bf710e5339ff      4  10         100  trilinos  [demo4]  0.010188
1  f372719a0648      8  10         100  trilinos  [demo4]  0.016224
          label  nproc   N  iterations     suite     tags  run_time
6  180b9c889f94      1  40         100  trilinos  [demo4]  0.419316
4  179099003946      2  40         100  trilinos  [demo4]  0.252318
2  9695c2529109      4  40         100  trilinos  [demo4]  0.173505
0  ac32b9fc6df4      8  40         100  trilinos  [demo4]  0.178471
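
As an alternative sketch, the two selection steps can be combined into a single boolean mask. The toy frame below holds a few rows copied from the tables above, with the label column left out. Membership in the list-valued tags column still needs map, while the scalar N column can be compared directly.

```python
import pandas

# A few rows copied from the tables above (label column omitted).
toy = pandas.DataFrame({
    'nproc': [8, 8, 4, 4, 8],
    'N': [40, 10, 40, 10, 10],
    'tags': [['demo4'], ['demo4'], ['demo4'], ['demo4'], ['demo2']],
    'run_time': [0.178471, 0.016224, 0.173505, 0.010188, 0.016702],
})

# Combine the tag mask and the size comparison with the element-wise & operator.
tag_mask = toy['tags'].map(lambda tags: 'demo4' in tags)
toy_N10 = toy[tag_mask & (toy['N'] == 10)]
toy_N40 = toy[tag_mask & (toy['N'] == 40)]

print(toy_N10)
print(toy_N40)
```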

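We can plot the results we're interested in. The larger system gives better parallel speed up: for N = 40 the run time drops from about 0.42 s on one process to about 0.17 s on four, while for N = 10 it barely changes and becomes slower on eight processes.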

In [18]:
ax = df_N10.plot('nproc', 'run_time', label='N={0}'.format(df_N10.N.iat[0]))
df_N40.plot('nproc', 'run_time', ylim=0, ax=ax, label='N={0}'.format(df_N40.N.iat[0]))
plt.ylabel('Run Time (s)')
plt.xlabel('Number of Processes')
plt.legend()
Out[18]:
<matplotlib.legend.Legend at 0x46e9390>
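
Since the goal is to measure parallel speed up, it can help to turn the run times into speed up and efficiency figures. The following sketch is not part of the original analysis. It uses the demo4 run times copied from the tables above and takes speed up as the single-process time divided by the time on p processes, and efficiency as speed up divided by p.

```python
import pandas

# Run times for the demo4 records, copied from the tables above.
nproc = [1, 2, 4, 8]
run_times = pandas.DataFrame({
    'N=10': [0.012037, 0.011214, 0.010188, 0.016224],
    'N=40': [0.419316, 0.252318, 0.173505, 0.178471],
}, index=pandas.Index(nproc, name='nproc'))

# Speed up: single-process time divided by each run time, column by column.
speedup = run_times.rdiv(run_times.iloc[0], axis='columns')
# Efficiency: speed up divided by the number of processes, row by row.
efficiency = speedup.div(pandas.Series(nproc, index=speedup.index), axis='index')

print(speedup)
print(efficiency)
```

A speed up below one, as for N = 10 on eight processes, means the parallel run is slower than the serial one.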

Using Pandas it is easy to store a custom data frame.

In [19]:
df.to_hdf('store.h5', 'df')
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->axis0] [items->None]

  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_items] [items->None]

  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->[u'datastore', u'dependencies', u'diff', u'executable', u'input_data', u'input_datastore', u'label', u'launch_mode', u'main_file', u'outcome', u'output_data', u'parameters', u'platforms', u'reason', u'repeats', u'repository', u'script_arguments', u'stdout_stderr', u'tags', u'timestamp', u'user', u'version', 'suite']]

  warnings.warn(ws, PerformanceWarning)
/home/wd15/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py:1992: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_items] [items->None]

  warnings.warn(ws, PerformanceWarning)
In [21]:
store = pandas.HDFStore('store.h5')
print(store.df.dependencies)
0     [{u'name': u'IPython', u'module': u'python', u...
1     [{u'name': u'IPython', u'module': u'python', u...
2     [{u'name': u'IPython', u'module': u'python', u...
3     [{u'name': u'IPython', u'module': u'python', u...
4     [{u'name': u'IPython', u'module': u'python', u...
5     [{u'name': u'IPython', u'module': u'python', u...
6     [{u'name': u'IPython', u'module': u'python', u...
7     [{u'name': u'IPython', u'module': u'python', u...
8     [{u'name': u'IPython', u'module': u'python', u...
9     [{u'name': u'IPython', u'module': u'python', u...
10    [{u'name': u'IPython', u'module': u'python', u...
11    [{u'name': u'IPython', u'module': u'python', u...
12    [{u'name': u'IPython', u'module': u'python', u...
13    [{u'name': u'IPython', u'module': u'python', u...
14    [{u'name': u'IPython', u'module': u'python', u...
15    [{u'name': u'IPython', u'module': u'python', u...
16    [{u'name': u'IPython', u'module': u'python', u...
17    [{u'name': u'IPython', u'module': u'python', u...
Name: dependencies, dtype: object

Conclusion

Sumatra stores its records in an SQL-style database, which isn't ideal for pulling data into Python for manipulation. Pandas is good at data manipulation, and getting the records out of Sumatra and into Pandas is very easy.