Telemetry Hello World

This is a very brief introduction to Spark and Telemetry in Python. If you are interested in learning more about Spark, have a look at the tutorial in Scala and the associated talk. The goal of this example is to plot the startup time distribution for each OS.

In [1]:
import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py

from moztelemetry import get_pings, get_pings_properties, get_one_ping_per_client

%pylab inline
Populating the interactive namespace from numpy and matplotlib

Let's see how many parallel workers we have at our disposal:

In [2]:
sc.defaultParallelism
Out[2]:
16

Let's fetch 10% of the Telemetry submissions from the nightly channel whose build-id falls within a given range (here, a single day)...

In [3]:
pings = get_pings(sc, app="Firefox", channel="nightly", build_id=("20150401000000", "20150401999999"), fraction=0.1)

... and extract only the attributes we need from the Telemetry submissions:

In [4]:
subset = get_pings_properties(pings, ["clientID", "info/OS", "simpleMeasurements/firstPaint"])

Let's filter out submissions with an invalid startup time:

In [5]:
subset = subset.filter(lambda p: p.get("firstPaint", -1) >= 0)
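
The predicate above only supplies a default when the key is missing. If a submission carries the key with a null value, `p.get` returns `None`, and in Python 3 comparing `None` with an integer raises a `TypeError`. A more defensive predicate could look like the sketch below; `has_valid_startup` is a hypothetical helper name, not part of moztelemetry.

```python
def has_valid_startup(ping, key="firstPaint"):
    """Return True only when the startup time is a non-negative number."""
    value = ping.get(key)
    # bool is a subclass of int, so exclude it explicitly
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and value >= 0

checks = [
    has_valid_startup({"firstPaint": 1200}),  # valid measurement
    has_valid_startup({"firstPaint": -5}),    # negative value
    has_valid_startup({"firstPaint": None}),  # key present, value missing
    has_valid_startup({}),                    # key absent
]
```

It could then be passed to the filter in place of the lambda, e.g. `subset.filter(has_valid_startup)`.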

To prevent pseudoreplication, let's consider only a single submission for each client. Because this step requires a distributed shuffle, run it only after extracting the attributes of interest with get_pings_properties, so that less data has to be moved between workers.

In [6]:
subset = get_one_ping_per_client(subset)
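
To make the idea concrete, here is a minimal local sketch of per-client deduplication on a plain list of dicts. It is not how moztelemetry implements the step (the real function operates on a distributed RDD), and keeping the first record seen is an assumption made for illustration; `one_record_per_client` is a hypothetical name.

```python
def one_record_per_client(records, key="clientID"):
    """Keep the first record seen for each client id."""
    seen = {}
    for record in records:
        # setdefault stores the record only if the client id is new
        seen.setdefault(record[key], record)
    return list(seen.values())

sample = [
    {"clientID": "a", "OS": "WINNT", "firstPaint": 1200},
    {"clientID": "a", "OS": "WINNT", "firstPaint": 900},
    {"clientID": "b", "OS": "Linux", "firstPaint": 700},
]
deduped = one_record_per_client(sample)
```

Without this step, a client that submits many pings would contribute many correlated data points to the distribution.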

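Caching is fundamental as it allows for an iterative, real-time development workflow: once the data has been computed by a first action, later actions reuse the in-memory copy instead of recomputing every step above: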

In [7]:
cached = subset.cache()

How many pings are we looking at?

In [8]:
cached.count()
Out[8]:
9381

Let's group the startup timings by OS:

In [9]:
grouped = cached.map(lambda p: (p["OS"], p["firstPaint"])).groupByKey().collectAsMap()
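
For readers new to Spark, the result of `groupByKey().collectAsMap()` is a dictionary from each key to its values. The sketch below mimics that shape locally with the standard library; `group_by_key` is a hypothetical helper, and the sample pairs are made up for illustration.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

pairs = [("WINNT", 1200), ("Linux", 700), ("WINNT", 950)]
local_grouped = group_by_key(pairs)
```

Note that collecting brings all grouped values to the driver, which is fine here because only a few thousand pings remain.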

And finally plot the data:

In [10]:
frame = pd.DataFrame({x: np.log(pd.Series(list(y))) for x, y in grouped.items()})
plt.figure(figsize=(18, 7))
frame.boxplot(return_type="axes")
plt.ylabel("log(firstPaint)")
plt.show()
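
One caveat: the earlier filter keeps submissions with `firstPaint` equal to 0, and the logarithm of zero is not finite (`np.log(0)` evaluates to `-inf`). If that matters for the plot, one option is to drop non-positive values before the transform. The sketch below uses only the standard library, and `log_positive` is a hypothetical helper name.

```python
import math

def log_positive(values):
    """Return the natural log of each strictly positive value, dropping the rest."""
    return [math.log(v) for v in values if v > 0]

logs = log_positive([0, 1, math.e])
```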

You can also create interactive plots with plotly:

In [11]:
fig = plt.figure(figsize=(18, 7))
frame["WINNT"].plot(kind="hist", bins=50)
plt.title("startup distribution for Windows")
plt.ylabel("count")
plt.xlabel("log(firstPaint)")
py.iplot_mpl(fig, strip_style=True)