import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py
from moztelemetry import get_pings, get_pings_properties, get_one_ping_per_client
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Let's see how many parallel workers we have at our disposal:
sc.defaultParallelism
16
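Let's fetch 10% of the Telemetry submissions for the Nightly builds of a single day (build IDs are timestamps, so the range below covers every build from 2015-04-01)...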
pings = get_pings(sc, app="Firefox", channel="nightly", build_id=("20150401000000", "20150401999999"), fraction=0.1)
... and extract only the attributes we need from the Telemetry submissions:
subset = get_pings_properties(pings, ["clientID", "info/OS", "simpleMeasurements/firstPaint"])
Let's filter out submissions with an invalid startup time:
subset = subset.filter(lambda p: p.get("firstPaint", -1) >= 0)
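To see what the filter keeps, here is a minimal local sketch of the same predicate applied to plain dictionaries instead of an RDD. The sample pings and the helper name has_valid_first_paint are made up for illustration. The explicit None check is an assumption added so the sketch runs on Python 3, where comparing None with a number raises a TypeError.

```python
def has_valid_first_paint(ping):
    """Return True when the ping carries a non-negative firstPaint value."""
    value = ping.get("firstPaint", -1)
    # Assumption: a None value counts as invalid (explicit check for Python 3).
    return value is not None and value >= 0


# Hypothetical pings, reduced to the fields used here.
sample_pings = [
    {"clientID": "a", "firstPaint": 1200},
    {"clientID": "b", "firstPaint": -5},    # negative timing: dropped
    {"clientID": "c"},                      # missing key: dropped
    {"clientID": "d", "firstPaint": None},  # null value: dropped
]

valid = [p for p in sample_pings if has_valid_first_paint(p)]
print([p["clientID"] for p in valid])
```

Note that a firstPaint of exactly 0 passes this check, just as it passes the filter above.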
To prevent pseudoreplication, let's consider only a single submission for each client. Because this step requires a distributed shuffle, run it only after extracting the attributes of interest with get_pings_properties, so that far less data has to move across the cluster.
subset = get_one_ping_per_client(subset)
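The idea behind keeping one submission per client can be sketched locally as follows. This is not the implementation of get_one_ping_per_client: the function name one_ping_per_client is hypothetical, and keeping the first ping seen for each client is an assumption, since the library may pick a different one.

```python
def one_ping_per_client(pings, key="clientID"):
    """Keep a single ping per client (here: the first one encountered)."""
    seen = {}
    for ping in pings:
        # setdefault only stores the ping if this client is new.
        seen.setdefault(ping[key], ping)
    return list(seen.values())


# Hypothetical pings: client "a" submitted twice.
pings_with_duplicates = [
    {"clientID": "a", "firstPaint": 100},
    {"clientID": "b", "firstPaint": 200},
    {"clientID": "a", "firstPaint": 300},
]

unique_pings = one_ping_per_client(pings_with_duplicates)
print(unique_pings)
```

On a cluster, pings from the same client can sit on different partitions, which is why the distributed version needs a shuffle to bring them together.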
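Caching is fundamental: it keeps the dataset in memory once it has been computed, so later steps don't rebuild it from scratch, which allows for an iterative, interactive development workflow: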
cached = subset.cache()
How many pings are we looking at?
cached.count()
9381
Let's group the startup timings by OS:
grouped = cached.map(lambda p: (p["OS"], p["firstPaint"])).groupByKey().collectAsMap()
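The map, groupByKey and collectAsMap chain produces a dictionary from OS name to the startup timings observed on that OS. A rough local equivalent, using made-up pings and a hypothetical helper called group_by_os, might look like this:

```python
from collections import defaultdict


def group_by_os(pings):
    """Collect the firstPaint timings of each OS into a list."""
    grouped = defaultdict(list)
    for ping in pings:
        grouped[ping["OS"]].append(ping["firstPaint"])
    return dict(grouped)


# Hypothetical pings, already filtered and deduplicated.
example_pings = [
    {"OS": "WINNT", "firstPaint": 1200},
    {"OS": "Linux", "firstPaint": 700},
    {"OS": "WINNT", "firstPaint": 900},
]

timings_by_os = group_by_os(example_pings)
print(timings_by_os)
```

Unlike this sketch, collectAsMap pulls every grouped value back to the driver, so it is only appropriate when the result fits in memory, as it does here with a few thousand pings.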
And finally plot the data:
frame = pd.DataFrame({x: np.log(pd.Series(list(y))) for x, y in grouped.items()})
plt.figure(figsize=(18, 7))
frame.boxplot(return_type="axes")
plt.ylabel("log(firstPaint)")
plt.show()
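A box plot summarizes each group by its quartiles. As a sketch of the numbers behind one box, the snippet below computes them on the log scale with the standard library only. The function name log_box_summary and the sample values are invented for illustration. Dropping non-positive values is an assumption of this sketch, not something the notebook does: the earlier filter lets a firstPaint of 0 through, and the logarithm of 0 is not finite.

```python
import math
import statistics


def log_box_summary(values):
    """Return min, quartiles and max of log(values), ignoring values <= 0."""
    logs = sorted(math.log(v) for v in values if v > 0)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(logs, n=4, method="inclusive")
    return {"min": logs[0], "q1": q1, "median": median, "q3": q3, "max": logs[-1]}


# Hypothetical timings chosen so that the logs are close to 1, 2, 3, 4, 5.
timings = [math.exp(k) for k in range(1, 6)] + [0]
summary = log_box_summary(timings)
print(summary)
```

The plotted whiskers and outliers follow matplotlib's own rules, so this summary is only an approximation of what the figure shows.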
You can also create interactive plots with plotly, here by converting a matplotlib figure:
fig = plt.figure(figsize=(18, 7))
frame["WINNT"].plot(kind="hist", bins=50)
plt.title("startup distribution for Windows")
plt.ylabel("count")
plt.xlabel("log(firstPaint)")
py.iplot_mpl(fig, strip_style=True)