# Probabilistic Programming for Sports Analytics¶

## @AustinRochford¶

### [email protected] • [email protected]¶

• Founded 2008, web optimization and personalization SaaS
• Observed 5B impressions and \$4.1B in revenue during Cyber Week 2017

## Agenda¶

• Sports analytics
• Heuristics and uncertainty
• First steps
• Batting average
• Save percentage
• NBA foul calls
• Future work

# Sports Analytics¶

## Heuristics and Uncertainty¶

The basic blueprint for predicting a player’s upcoming season from his past is the Marcel method, introduced by prominent baseball and hockey analyst Tom Tango back in 2005. It’s a three-step process that

1. starts with a weighted average of a player’s three most recent seasons;
2. regresses that player’s performance toward the league mean, based on games played (we’ll explain why and how in a moment); and
3. applies an age adjustment to account for developing rookies and declining veterans. With regard to the first step, Tango proposes a weighting of 5-4-3, while I prefer a 4-2-1 approach that assumes that every season’s data has twice the predictive power as the season previous.

Vollman, Rob, Awad, Tom, Fyffe, Iain. Hockey Abstract Presents Stat Shot. ECW Press. Kindle Edition.

# First Steps¶

## Batting Average¶

In [10]:
mlb_df = load_data(get_data_url('2018_batting.csv'), 'Name', ['AB', 'H'])

In [11]:
mlb_df.head()

Out[11]:
name ab h
player_id
abreujo02 Jose Abreu 499 132
acunaro01 Ronald Acuna 433 127
In [13]:
def hierarchical_normal(name, shape, μ=None):
if μ is None:
μ = pm.Normal(f"μ_{name}", 0., 5.)

Δ = pm.Normal(f"Δ_{name}", shape=shape)
σ = pm.HalfNormal(f"σ_{name}", 2.5)

return pm.Deterministic(name, μ + Δ * σ)

In [14]:
with pm.Model() as mlb_model:
η = hierarchical_normal("η", n_player)
ba = pm.Deterministic("ba", pm.math.sigmoid(η))

hits = pm.Binomial("hits", batter_df['ab'], ba, observed=batter_df['h'])

In [16]:
with mlb_model:
mlb_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:49<00:00, 120.10draws/s]

In [17]:
az.plot_energy(mlb_trace);

In [18]:
(az.rhat(mlb_trace).max().to_array().max())

Out[18]:
<xarray.DataArray ()>
array(1.01)

### Mike Trout¶

In [20]:
fig

Out[20]:

### Mike Trout vs. Bryce Harper¶

In [22]:
fig

Out[22]:
In [24]:
fig

Out[24]:
In [25]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, trout_ix]).mean()

Out[25]:
0.96666666666666667

### Bryce Harper vs. Rhys Hoskins¶

In [27]:
fig

Out[27]:
In [28]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, hoskins_ix]).mean()

Out[28]:
0.45700000000000002

### Uncertainty in batting average¶

In [32]:
ax.figure

Out[32]:

## Save Percentage¶

In [33]:
nhl_df = load_data(get_data_url('2017_2018_goalies.csv'), 'Player', ['SA', 'SV'])

In [34]:
nhl_df.head()

Out[34]:
name sa sv
player_id
allenja01 Jake Allen 1614 1462
andercr01 Craig Anderson 1768 1588
anderfr01 Frederik Andersen 2211 2029
appleke01 Ken Appleby 55 52
bernijo01 Jonathan Bernier 1092 997
In [36]:
with pm.Model() as nhl_model:
η = hierarchical_normal("η", n_goalie)
svp = pm.Deterministic("svp", pm.math.sigmoid(η))

saves = pm.Binomial("saves", nhl_df['sa'], svp, observed=nhl_df['sv'])

In [37]:
with nhl_model:
nhl_trace = pm.sample(nuts_kwargs={'target_accept': 0.9}, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:21<00:00, 276.80draws/s]


### Sergei Bobrovsky¶

In [41]:
fig

Out[41]:

### Uncertainty in save percentage¶

In [44]:
ax.figure

Out[44]:
In [ ]: