Process Tabular Data with Pandas

In this notebook you will:

  • Load a data from a scan.
  • Visualize it.
  • Explore the dataset, performing basic filtering and statistic operations using pandas.

Recommended Prerequisites:

Configuration

Below, we will connect to EPICS IOC(s) controlling simulated hardware in lieu of actual motors, detectors. The IOCs should already be running in the background. Run this command to verify that they are running: it should produce output with RUNNING on each line. In the event of a problem, edit this command to replace status with restart all and run again.

In [ ]:
!supervisorctl -c supervisor/supervisord.conf status
In [ ]:
%run scripts/beamline_configuration.py

Quickly access and explore the data

Let's take sample data.

In [ ]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure('I slit_det vs motor_slit')
In [ ]:
RE(scan([slit, I], motor_slit, -10, 10, 15))
In [ ]:
plt.gcf()  # Display a snapshot of the current state of the figure.

Previously we used the syntax db(...) to search for Headers. We looped through each Header and extracted some of the metadata encapsulated in it. We use the shorthand db[-N] to retrieve exactly one header, the one corresponding to N runs ago.

In [ ]:
header = db[-1]  # Header for the most recent one (i.e. 1 run ago)

We can access the data as a table. This object is called a pandas.DataFrame, a high-performance, spreadsheet-like object in Python.

In [ ]:
data = header.table()

It display prettily, truncating if needed.

In [ ]:
data

For large tables, it's handy to know how to ask for just the first or last couple rows.

It is easy to visualize.

In [ ]:
data.plot(x='motor_slit', y='slit_det')
In [ ]:
plt.gcf()  # Display a snapshot of the current state of the figure.
In [ ]:
data.head()
In [ ]:
data.tail()

Statistics on columns are simple to compute.

In [ ]:
data.mean()
In [ ]:
data.min()

Or just ask for all the common stats at once:

In [ ]:
data.describe()

Do analysis, and plot

Normalize by beam current

We can do math on the columns, and we can add new columns.

In [ ]:
data['normalized'] = data['slit_det'] / data['I'] * data['I'].mean()

We can plot them conveniently, addressing the columns by name.

In [ ]:
plt.figure()
In [ ]:
plt.plot('motor_slit', 'slit_det', data=data, label='raw')
plt.plot('motor_slit', 'normalized', data=data, label='normalized')
plt.legend()
plt.gcf()

Select only rows with high beam current

Select rows where the current is above 500.

In [ ]:
data[data['I'] > 500]

...or, where the the current is above the 30th percentile.

In [ ]:
data[data['I'] > data['I'].quantile(0.30)]

...and plot just those rows, using a marker so it is easy to see which points have been disqualified due to low current.

In [ ]:
data[data['I'] > data['I'].quantile(0.30)].plot(x='motor_slit', y='slit_det', marker='o')

Histogram the current

In [ ]:
plt.figure()
In [ ]:
plt.hist(data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()

Not enough data to be interesting. Try collecting a bunch of current data by itself and then try this again.

In [ ]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure("I vs time")
In [ ]:
RE(count([I], num=200, delay=0.1))
In [ ]:
plt.gcf()  # Display a snapshot of the current state of the figure.
In [ ]:
plt.figure()
In [ ]:
more_current_data = db[-1].table()
plt.hist(more_current_data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()

Exercises

Spend some time reading 10 minutes to pandas, and use this notebook to try some of the examples there.

In [ ]:
# Hack away....