Access Saved Data

Data Broker solves two problems:

  • Search for data based on time, a unique identifier, or some other query.
  • Load data into standard scientific Python data structures without worrying about file formats.

We have used Bluesky to acquire several Runs and made them available in this example Catalog.

In [ ]:
from bluesky_tutorial_utils import get_example_catalog

catalog = get_example_catalog()

What can you do with a Bluesky Catalog?

A Catalog has a length.

In [ ]:
len(catalog)

Iterating over a Catalog gives the names of its entries.

In [ ]:
for name in catalog:
    print(name)  # each entry's name

As with dict objects in Python, iterating over a Catalog's items() gives (name, entry) pairs.

In [ ]:
for name, entry in catalog.items():
    print(name, entry)

Catalogs support lookup by recency, by scan_id, and by globally unique ID.

catalog[-1]  # the most recent Run
catalog[-5]  # the fifth-most-recent Run
catalog[3]  # 'scan_id' == 3 (if ambiguous, returns the most recent match)
catalog["6f3ee9a1-ff4b-47ba-a439-9027cd9e6ced"]  # a full globally unique ID...
catalog["6f3ee9"]  # ...or just enough characters to uniquely identify it (6-8 usually suffices)

The globally unique ID is best for use in scripts, but the others are nice for interactive use. All of these incantations return a BlueskyRun.

In [ ]:
run = catalog[-1]
run
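
The globally unique ID of a Run is recorded in its 'start' metadata under the key "uid", so a script can capture it and look up the same Run later. A small sketch:

In [ ]:
uid = run.metadata["start"]["uid"]  # the Run's globally unique ID
catalog[uid]  # retrieves the same Run again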

Catalogs also support search.

In [ ]:
results = catalog.search({"plan_name": "count"})
len(results)

When you search on a Catalog, you get another Catalog with a subset of the entries. You can search on this in turn, progressively narrowing the results.

In [ ]:
from databroker.queries import TimeRange

jan_results = results.search(TimeRange(since="2020-01-01", until="2020-02-01", timezone="US/Eastern"))
len(jan_results)

The syntax of these queries is MongoDB's. It is powerful and flexible, but it takes some getting used to, so databroker is growing higher-level utilities like TimeRange that compose common queries in a user-friendly way. If we like, we can peek inside to see the MongoDB query that TimeRange generates.

In [ ]:
dict(TimeRange(since="2020-01-01", until="2020-02-01", timezone="US/Eastern"))
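
The same MongoDB-style operators can be used directly in hand-written queries. A rough sketch (the fields "plan_name" and "num_points" appear in this example Catalog's 'start' metadata; support for raw queries depends on the catalog's backend):

In [ ]:
# "$gte" is MongoDB's "greater than or equal" operator.
catalog.search({"plan_name": "count", "num_points": {"$gte": 3}})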

Exercise

Build some TimeRange queries, filling in ... below. You can give the time with more or less precision: try just YYYY or YYYY-MM, or add a time of day. Note that all of the parameters are optional.

In [ ]:
# catalog.search(TimeRange(...))
In [ ]:
# catalog.search(TimeRange(...))
In [ ]:
# catalog.search(TimeRange(...))
In [ ]:
# catalog.search(TimeRange(...))

What can you do with a BlueskyRun?

A BlueskyRun bundles together some metadata and several logical tables ("streams") of data. First, the metadata. It always comes in two sections, "start" and "stop".

In [ ]:
run.metadata["start"]  # Everything we know before the measurements start.

The above contains a mixture of things that bluesky automatically recorded (e.g. the time), things the bluesky plan reported (e.g. which motor(s) are scanned), and things the user told us (e.g. the name of the operator).
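
To see exactly which keys were recorded here, we can list them (the exact set depends on the plan and on user-supplied metadata):

In [ ]:
list(run.metadata["start"])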

In [ ]:
run.metadata["stop"]  # Everything we only know after the measurements stop.

These Start and Stop documents are just dictionaries. You can dig into their contents in the usual way.

In [ ]:
run.metadata["start"]["num_points"]
In [ ]:
run.metadata["stop"]["exit_status"] == "success"

As we said, a Run bundles together any number of "streams" of data. Picture these as tables or spreadsheets. The stream names are shown when we print run.

In [ ]:
run

We can also list them programmatically.

In [ ]:
list(run)

We can access a particular stream like run["primary"].read(). Dot access also works — run.primary.read() — if the stream name is a valid Python identifier and does not collide with any other attributes.

In [ ]:
ds = run["primary"].read()
ds

This is an xarray.Dataset. At this point Bluesky and Data Broker have served their purpose and handed us a useful, general-purpose scientific Python data structure with our data in it.

What can you do with an xarray.Dataset?

We can easily generate scatter plots of one variable against another.

In [ ]:
ds.plot.scatter(x="time", y="ns_gap")

We can pull out specific columns. (Each column in an xarray.Dataset is called an xarray.DataArray.)

In [ ]:
image = ds["ns_image"]
image

Inside this xarray.DataArray is a plain old numpy array.

In [ ]:
type(image.values)

The extra context provided by xarray is very useful. Notice that the dimensions have names, so we can perform aggregations over named axes without remembering the order of the dimensions.

In [ ]:
image.sum("time")  # With just plain numpy, this would be image.sum(0) and we'd have to keep track ourselves that 0 = "time".
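
Named dimensions also make selection readable. A small sketch using xarray's built-in selection methods:

In [ ]:
image.isel(time=0)  # the first image, selected by integer position along the "time" dimension
# image.sel(time=some_timestamp, method="nearest")  # or select by coordinate value ("some_timestamp" is a hypothetical variable)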

The plot method on xarray.DataArray often just "does the right thing" based on the dimensionality of the data. It even labels our axes for us!

In [ ]:
image.sum("time").plot()
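
Beyond plotting, the Dataset plugs into the rest of the scientific Python ecosystem. As a sketch (the file name below is only an illustration), a column can be converted to a pandas DataFrame, or the whole Dataset written out with xarray's own I/O:

In [ ]:
df = ds["ns_gap"].to_dataframe()  # one column as a pandas DataFrame
# ds.to_netcdf("example_run.nc")  # or save everything to a NetCDF file (requires a netCDF backend such as netCDF4)
df.head()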

For a quick overview of xarray, see the xarray documentation; the tutorials there explore further interesting usages of xarray.

Exercises

  1. Coming back to our run
In [ ]:
run

read the "baseline" stream. The baseline stream conventionally includes readings taken just before and after a scan, recording all potentially relevant positions and temperatures and noting whether they have drifted.

In [ ]:
# Try your solution here.
In [ ]:
%load solutions/access_baseline_data.py