A Dockerfile that will produce a container with all the dependencies necessary to run this notebook is available here.
Clear the theano cache before each run to avoid subtle bugs.
!rm -rf /home/jovyan/.theano/
%matplotlib inline
import datetime
from itertools import product
import logging
import pickle
from warnings import filterwarnings
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
from matplotlib.ticker import FuncFormatter, StrMethodFormatter
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from theano import shared, tensor as tt
# keep theano from complaining about compile locks for small models
(logging.getLogger('theano.gof.compilelock')
.setLevel(logging.CRITICAL))
# silence PyMC3 warnings (there aren't many)
filterwarnings(
'ignore', ".*diverging samples after tuning.*",
module='pymc3'
)
filterwarnings(
'ignore', ".*acceptance probability.*",
module='pymc3'
)
pct_formatter = StrMethodFormatter('{x:.1%}')
blue, green, *_ = sns.color_palette()
# configure pyplot for readability when rendered as a slideshow and projected
sns.set(color_codes=True)
plt.rc('figure', figsize=(8, 6))
LABELSIZE = 14
plt.rc('axes', labelsize=LABELSIZE)
plt.rc('axes', titlesize=LABELSIZE)
plt.rc('figure', titlesize=LABELSIZE)
plt.rc('legend', fontsize=LABELSIZE)
plt.rc('xtick', labelsize=LABELSIZE)
plt.rc('ytick', labelsize=LABELSIZE)
SEED = 207183 # from random.org, for reproducibility
Since late in the 2014-2015 season, the NBA has issued last two minute reports. These reports give the league's assessment of the correctness of foul calls and non-calls in the last two minutes of any game where the score difference was three or fewer points at any point in the last two minutes.
These reports are notably different from play-by-play logs, in that they include information on non-calls for notable on-court interactions. This non-call information presents a unique opportunity to study the factors that impact foul calls. There is a level of subjectivity inherent in the NBA's definition of notable on-court interactions, which we attempt to mitigate later using season-specific factors.
We download the data locally to be kind to GitHub.
%%bash
DATA_URI=https://raw.githubusercontent.com/polygraphcool/lasttwominutereport/32f1c43dfa06c2e7652cc51ea65758007f2a1a01/output/all_games.csv
DATA_DEST=/tmp/all_games.csv
if [[ ! -e $DATA_DEST ]];
then
    wget -q -O $DATA_DEST $DATA_URI
fi
We use only a subset of the columns in the source data set.
USECOLS = [
'period',
'seconds_left',
'call_type',
'committing_player',
'disadvantaged_player',
'review_decision',
'play_id',
'away',
'home',
'date',
'score_away',
'score_home',
'disadvantaged_team',
'committing_team'
]
orig_df = pd.read_csv(
'/tmp/all_games.csv',
usecols=USECOLS,
index_col='play_id',
parse_dates=['date']
)
orig_df.shape
Each row of the DataFrame represents a play and each column describes an attribute of the play:
period is the period of the game,
seconds_left is the number of seconds remaining in the game,
call_type is the type of call,
committing_player and disadvantaged_player are the names of the players involved in the play,
review_decision is the opinion of the league reviewer on whether or not the play was called correctly:
    review_decision = "INC" means the call was an incorrect non-call,
    review_decision = "CNC" means the call was a correct non-call,
    review_decision = "IC" means the call was an incorrect call,
    review_decision = "CC" means the call was a correct call,
away and home are the abbreviations of the teams involved in the game,
date is the date on which the game was played,
score_away and score_home are the scores of the away and home team during the play, respectively,
disadvantaged_team and committing_team indicate how each team is involved in the play.
orig_df.head(n=2).T
First we examine the types of calls present in the data set.
orig_df.call_type.value_counts()
The portion of call_type before the colon is the general category of the call. We count the occurrences of these categories below.
(orig_df.call_type
.str.split(':', expand=True).iloc[:, 0]
.value_counts()
.plot(kind='bar', color=blue, logy=True, title="Call types")
.set_ylabel("Frequency"));
We restrict our attention to foul calls, though other call types would be interesting to study in the future.
foul_df = orig_df[
orig_df.call_type
.fillna("UNKNOWN")
.str.startswith("Foul")
]
We count the foul call types below.
(foul_df.call_type
.str.split(': ', expand=True).iloc[:, 1]
.value_counts()
.plot(kind='bar', color=blue, logy=True, title="Foul Types")
.set_ylabel("Frequency"));
We restrict our attention to the five foul types below, which generally involve two players. This subset of fouls allows us to pursue our second research question in the most direct manner.
FOULS = [
f"Foul: {foul_type}"
for foul_type in [
"Personal",
"Shooting",
"Offensive",
"Loose Ball",
"Away from Play"
]
]
There are a number of misspelled team names in the data, which we correct.
TEAM_MAP = {
"NKY": "NYK",
"COS": "BOS",
"SAT": "SAS",
"CHi": "CHI",
"LA)": "LAC",
"AT)": "ATL",
"ARL": "ATL"
}
def correct_team_name(col):
    def _correct_team_name(df):
        return df[col].apply(lambda team_name: TEAM_MAP.get(team_name, team_name))
    return _correct_team_name
We also convert each game date to NBA season.
def date_to_season(date):
    if date >= datetime.datetime(2017, 10, 17):
        return '2017-2018'
    elif date >= datetime.datetime(2016, 10, 25):
        return '2016-2017'
    elif date >= datetime.datetime(2015, 10, 27):
        return '2015-2016'
    else:
        return '2014-2015'
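As a rough illustration of the date-to-season mapping, the sketch below performs the same lookup with a sorted table of season start dates and `bisect`. The start dates are the ones used in `date_to_season`; the names `SEASON_STARTS`, `START_DATES`, and `season_for` are hypothetical and not part of the notebook.

```python
import bisect
import datetime

# Season start dates taken from date_to_season; the names are illustrative.
SEASON_STARTS = [
    (datetime.datetime(2015, 10, 27), '2015-2016'),
    (datetime.datetime(2016, 10, 25), '2016-2017'),
    (datetime.datetime(2017, 10, 17), '2017-2018'),
]
START_DATES = [start for start, _ in SEASON_STARTS]


def season_for(date):
    """Return the season label for a game date."""
    # bisect_right counts how many season starts fall on or before `date`.
    ix = bisect.bisect_right(START_DATES, date)
    if ix == 0:
        # Earlier than every listed start: treated as 2014-2015.
        return '2014-2015'
    return SEASON_STARTS[ix - 1][1]


print(season_for(datetime.datetime(2016, 11, 1)))
```

A table like this keeps the boundaries in one place, which may be easier to extend than a chain of `elif` branches when another season is added.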
We clean the data by
restricting to plays that occurred in the fourth quarter,
restricting to games played during the 2015-2016 and 2016-2017 seasons,
imputing incorrect non-calls when review_decision is missing,
correcting team names,
converting game dates to seasons, and
restricting to the five foul types listed above.
clean_df = (foul_df.where(lambda df: df.period == "Q4")
                   .where(lambda df: (df.date.between(datetime.datetime(2016, 10, 25),
                                                      datetime.datetime(2017, 4, 12))
                                      | df.date.between(datetime.datetime(2015, 10, 27),
                                                        datetime.datetime(2016, 5, 30)))
                   )
                   .assign(
                       review_decision=lambda df: df.review_decision.fillna("INC"),
                       committing_team=correct_team_name('committing_team'),
                       disadvantaged_team=correct_team_name('disadvantaged_team'),
                       away=correct_team_name('away'),
                       home=correct_team_name('home'),
                       season=lambda df: df.date.apply(date_to_season)
                   )
                   .where(lambda df: df.call_type.isin(FOULS))
                   .dropna()
                   .drop('period', axis=1)
                   .assign(call_type=lambda df: (df.call_type
                                                   .str.split(': ', expand=True)
                                                   .iloc[:, 1])))
About 50% of the rows in the full data set remain.
clean_df.shape[0] / orig_df.shape[0]
clean_df.head(n=2).T
We use scikit-learn's LabelEncoder to transform categorical features (call type, player, and season) to integer ids.
call_type_enc = LabelEncoder().fit(
clean_df.call_type
)
n_call_type = call_type_enc.classes_.size
player_enc = LabelEncoder().fit(
np.concatenate((
clean_df.committing_player,
clean_df.disadvantaged_player
))
)
n_player = player_enc.classes_.size
season_enc = LabelEncoder().fit(
clean_df.season
)
n_season = season_enc.classes_.size
We transform the data by
rounding seconds_left to the nearest second (purely for convenience),
setting foul_called equal to one if a foul was called or, per the league's review, should have been called (review decisions CC and INC), and zero otherwise,
setting score_committing and score_disadvantaged to the score of the committing and disadvantaged teams, respectively.
df = (clean_df[['seconds_left']]
.round(0)
.assign(
call_type=call_type_enc.transform(clean_df.call_type),
foul_called=1. * clean_df.review_decision.isin(['CC', 'INC']),
player_committing=player_enc.transform(clean_df.committing_player),
player_disadvantaged=player_enc.transform(clean_df.disadvantaged_player),
score_committing=clean_df.score_home.where(
clean_df.committing_team == clean_df.home,
clean_df.score_away
),
score_disadvantaged=clean_df.score_home.where(
clean_df.disadvantaged_team == clean_df.home,
clean_df.score_away
),
season=season_enc.transform(clean_df.season)
))
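To make the transformation concrete, here is a minimal pure-Python sketch that derives the same features for a single play stored as a dict. The function name `transform_play` and the example play are hypothetical and invented for illustration; the logic mirrors the `isin(['CC', 'INC'])` and `where` calls above.

```python
def transform_play(play):
    """Derive model features from one play, given as a plain dict."""
    # foul_called is 1 for a correct call (CC) or an incorrect non-call (INC).
    foul_called = 1.0 if play['review_decision'] in ('CC', 'INC') else 0.0
    committing_is_home = play['committing_team'] == play['home']
    disadvantaged_is_home = play['disadvantaged_team'] == play['home']
    return {
        'seconds_left': round(play['seconds_left']),
        'foul_called': foul_called,
        'score_committing': (play['score_home'] if committing_is_home
                             else play['score_away']),
        'score_disadvantaged': (play['score_home'] if disadvantaged_is_home
                                else play['score_away']),
    }


# Hypothetical play, invented purely for illustration.
example_play = {
    'seconds_left': 12.4,
    'review_decision': 'INC',
    'home': 'BOS',
    'away': 'NYK',
    'committing_team': 'NYK',
    'disadvantaged_team': 'BOS',
    'score_home': 101,
    'score_away': 99,
}
features = transform_play(example_play)
print(features)
```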
The resulting DataFrame
is ready for analysis.
df.head(n=2).T
George Box (via Dustin Tran):

def make_foul_rate_yaxis(ax, label="Observed foul call rate"):
    ax.yaxis.set_major_formatter(pct_formatter)
    ax.set_ylabel(label)
    return ax
Below we examine the foul call rate by season.
make_foul_rate_yaxis(
df.pivot_table('foul_called', 'season')
.rename(index=season_enc.inverse_transform)
.rename_axis("Season")
.plot(kind='bar', rot=0, legend=False)
);
There is a pronounced difference between the foul call rate in the 2015-2016 and 2016-2017 NBA seasons. This change in foul call rates is due to a rule change between these seasons meant to cut down on hack-a-Shaq fouls.
Our first model accounts for this difference.
We use pymc3 to specify our models. Our first model is given by
$$ \begin{align*} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} \\ p_k & = \textrm{sigm}\left(\eta^{\textrm{game}}_k\right). \end{align*} $$
We use a logistic regression model with different factors for each season.
import pymc3 as pm
with pm.Model() as base_model:
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    p = pm.Deterministic('p', pm.math.sigmoid(β_season))
$$y_k \sim \textrm{Bernoulli}(p_k)$$
When building models, we will wrap each feature in a Theano shared variable in order to eventually facilitate posterior predictive sampling.
season = shared(df.season.values)
with base_model:
    y = pm.Bernoulli(
        'y', p[season],
        observed=df.foul_called.values
    )
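The relationship between a season effect and a foul call probability can be sketched without PyMC3. The example below ignores the prior and works with a single season: it applies the inverse-logit (sigmoid) link and checks that the log-likelihood peaks where the sigmoid of the effect equals the observed rate. The outcome data and helper names are hypothetical.

```python
import math


def sigmoid(eta):
    """Inverse-logit link: maps a real-valued effect to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


def bernoulli_loglik(beta, outcomes):
    """Log-likelihood of 0/1 outcomes when every play has p = sigmoid(beta)."""
    p = sigmoid(beta)
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in outcomes)


# Hypothetical outcomes for one season: 3 fouls called in 10 plays.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
beta_mle = logit(sum(outcomes) / len(outcomes))

print(beta_mle, sigmoid(beta_mle))
print(bernoulli_loglik(beta_mle, outcomes))
```

In the actual model the Normal(0, 5) prior is weak on this scale, so the posterior for each season effect should sit close to the logit of that season's observed rate.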
PyMC3 provides an accessible interface to state-of-the-art Bayesian inference algorithms. Throughout this talk, we will use PyMC3 to perform Hamiltonian Monte Carlo (HMC) inference.
Unfortunately there is not enough time in this talk to do these deep topics justice. For the curious:
NJOBS = 3
SAMPLE_KWARGS = {
'draws': 1000,
'njobs': NJOBS,
'random_seed': [
SEED + i for i in range(NJOBS)
]
}
with base_model:
    base_trace = pm.sample(**SAMPLE_KWARGS)
The folk theorem [of statistical computing] is this: When you have computational problems, often there’s a problem with your model.
We rely on three diagnostics to ensure that our samples have converged to the posterior distribution: energy plots, the Bayesian fraction of missing information (BFMI), and Gelman-Rubin statistics.
For more information on energy plots and BFMI consult Robust Statistical Workflow with PyStan.
bfmi = pm.bfmi(base_trace)
max_gr = max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(base_trace).values())
CONVERGENCE_TITLE = lambda: f"BFMI = {bfmi:.2f}\nGelman-Rubin = {max_gr:.3f}"
(pm.energyplot(base_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
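For intuition about the Gelman-Rubin statistic, the sketch below computes the classic potential scale reduction factor for a few chains of a scalar parameter using only the standard library. It is an illustration of the textbook formula, not a claim about how `pm.gelman_rubin` is implemented; the simulated chains and the function name are hypothetical.

```python
import math
import random
import statistics


def gelman_rubin_stat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.mean(chain) for chain in chains]
    # W: average within-chain variance.
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    var_hat = (n - 1) / n * within + between / n
    return math.sqrt(var_hat / within)


rng = random.Random(0)
# Three chains drawn from the same distribution: should look converged.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Shift one chain far away: the chains now disagree about the mean.
stuck = mixed[:2] + [[x + 5.0 for x in mixed[2]]]

print(gelman_rubin_stat(mixed), gelman_rubin_stat(stuck))
```

Values near one indicate that the chains agree with one another; values well above one suggest the chains have not mixed.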
We use the samples from p's posterior distribution to calculate residuals, which we use to criticize our models. These residuals allow us to assess how well our model describes the data-generation process and to discover unmodeled sources of variation.
base_trace['p']
base_trace['p'].shape
resid_df = (df.assign(p_hat=base_trace['p'][:, df.season].mean(axis=0))
              .assign(resid=lambda df: df.foul_called - df.p_hat))
resid_df[['foul_called', 'p_hat', 'resid']].head()
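The residual calculation can be sketched on a tiny hypothetical example: average the posterior draws of p for each season, look up the average for each play's season, and subtract it from the observed outcome. All numbers below are invented for illustration.

```python
# Hypothetical posterior draws of p: one row per draw, one column per season.
p_samples = [
    [0.30, 0.24],
    [0.32, 0.26],
    [0.28, 0.25],
]
# Hypothetical plays: season index and observed outcome for each.
season_ix = [0, 0, 1, 1]
foul_called = [1.0, 0.0, 0.0, 1.0]

n_draws = len(p_samples)
n_season = len(p_samples[0])

# Posterior mean of p for each season.
p_mean = [sum(draw[s] for draw in p_samples) / n_draws for s in range(n_season)]
# Posterior mean probability for each play, then the residual.
p_hat = [p_mean[s] for s in season_ix]
resid = [y - p for y, p in zip(foul_called, p_hat)]

print(p_hat)
print(resid)
```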
The per-season residuals are quite small, which is to be expected.
(resid_df.pivot_table('resid', 'season')
.rename(index=season_enc.inverse_transform))
Anyone who has watched a close basketball game will realize that we have neglected an important factor in late-game foul calls: intentional fouls. Near the end of the game, the losing team commits intentional fouls when it is on defense to end the leading team's possession as quickly as possible.
The influence of intentional fouls is shown in the plot below by the rapid increase in the residuals as the number of seconds left in the game decreases.
def make_time_axes(ax,
                   xlabel="Seconds remaining in game",
                   ylabel="Observed foul call rate"):
    ax.invert_xaxis()
    ax.set_xlabel(xlabel)
    return make_foul_rate_yaxis(ax, label=ylabel)
make_time_axes(
resid_df.pivot_table('resid', 'seconds_left')
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'),
ylabel="Residual"
);
df['trailing_committing'] = (df.score_committing
.lt(df.score_disadvantaged)
.mul(1.)
.astype(np.int64))
The following plot illustrates the fact that only the trailing team has any incentive to commit intentional fouls.
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'trailing_committing')
.rolling(20).mean()
.rename(columns={0: "No", 1: "Yes"})
.rename_axis("Committing team is trailing", axis=1)
.plot()
);
Intentional fouls are only useful when the trailing (and committing) team is on defense. The plot below reflects this fact; shooting and personal fouls are almost always called against the defensive player; we see that they are called at a much higher rate than offensive fouls.
ax = (df.pivot_table('foul_called', 'call_type')
.rename(index=call_type_enc.inverse_transform)
.rename_axis("Call type", axis=0)
.plot(kind='barh', legend=False))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Observed foul call rate");
We continue to model the difference in foul call rates between seasons.
with pm.Model() as poss_model:
    β_season = pm.Normal('β_season', 0., 5., shape=2)
Throughout this talk, we will use hierarchical distributions to model the variation of foul call rates across different categories (in this instance, call types). For much more information on hierarchical models, consult Data Analysis Using Regression and Multilevel/Hierarchical Models.
$$ \begin{align*} \sigma_{\textrm{call}} & \sim \operatorname{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim \operatorname{HierarchicalNormal}(0, \sigma_{\textrm{call}}^2) \end{align*} $$
For sampling efficiency, we use an offset parametrization of the hierarchical normal distribution.
def hierarchical_normal(name, shape, σ_shape=1):
    Δ = pm.Normal(
        f'Δ_{name}', 0., 1., shape=shape
    )
    σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)
    return pm.Deterministic(name, Δ * σ)
with poss_model:
    β_call = hierarchical_normal('β_call', n_call_type)
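The offset parametrization rests on the fact that scaling a standard normal draw by σ gives a draw from Normal(0, σ). The sketch below checks this by simulation with a fixed, hypothetical value of σ; in the model σ is itself a random variable, and the benefit for the sampler comes from sampling Δ and σ separately.

```python
import random
import statistics

rng = random.Random(42)
sigma = 2.5  # hypothetical fixed scale, chosen only for illustration
n_draws = 20000

# Centered form: draw beta directly from Normal(0, sigma).
centered = [rng.gauss(0.0, sigma) for _ in range(n_draws)]
# Offset form: draw delta from Normal(0, 1), then scale by sigma.
offset = [rng.gauss(0.0, 1.0) * sigma for _ in range(n_draws)]

print(statistics.stdev(centered), statistics.stdev(offset))
```

Both sets of draws should have a standard deviation close to σ, which is why the two forms describe the same prior.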
We add score difference and the number of possessions by which the committing team is trailing to the DataFrame. We assume that at most three points can be scored in a single possession (while this is not quite correct, four-point plays are rare enough that we do not account for them in our analysis).
df['score_diff'] = (df.score_disadvantaged
.sub(df.score_committing))
df['trailing_poss'] = (df.score_diff
.div(3)
.apply(np.ceil))
trailing_poss_enc = LabelEncoder().fit(df.trailing_poss)
trailing_poss = shared(
trailing_poss_enc.transform(df.trailing_poss)
)
n_trailing_poss = trailing_poss_enc.classes_.size
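The quantization of score difference into possessions can be checked on a few values. This sketch reimplements the `div(3)` and `np.ceil` steps with `math.ceil`; the function name `trailing_possessions` is hypothetical.

```python
import math


def trailing_possessions(score_committing, score_disadvantaged):
    """Possessions by which the committing team trails (negative if leading)."""
    score_diff = score_disadvantaged - score_committing
    # Ceiling division by three points per possession.
    return math.ceil(score_diff / 3)


for committing, disadvantaged in [(98, 100), (96, 100), (100, 100),
                                  (100, 98), (100, 97)]:
    print(committing, disadvantaged,
          trailing_possessions(committing, disadvantaged))
```

Because the ceiling rounds negative values toward zero, a lead of one or two points maps to zero possessions, the same value as a tie.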
The plot below shows that the foul call rate (over time) varies based on the score difference (quantized into possessions) between the disadvantaged team and the committing team.
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'trailing_poss')
.loc[:, 1:3]
.rolling(20).mean()
.rename_axis(
"Trailing possessions\n(committing team)",
axis=1
)
.plot()
);
The plot below reflects the fact that intentional fouls are disproportionately personal fouls; the rate at which personal fouls are called increases drastically as the game nears its end.
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'call_type')
.rolling(20).mean()
.rename(columns=call_type_enc.inverse_transform)
.rename_axis(None, axis=1)
.plot()
);
Due to the NBA's shot clock, the natural timescale of a basketball game is possessions, not seconds, remaining.
df['remaining_poss'] = (df.seconds_left
.floordiv(25)
.add(1))
remaining_poss_enc = LabelEncoder().fit(df.remaining_poss)
remaining_poss = shared(
remaining_poss_enc.transform(df.remaining_poss)
)
n_remaining_poss = remaining_poss_enc.classes_.size
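As a quick check of the time quantization, the sketch below applies the same `floordiv(25)` and `add(1)` steps to a few values of seconds remaining. The function name and the default argument name are hypothetical; the 25-second divisor is taken from the code above.

```python
def remaining_possessions(seconds_left, seconds_per_possession=25):
    """Number of possessions remaining, counting the current one."""
    return int(seconds_left // seconds_per_possession) + 1


for seconds in (0, 24, 25, 60, 120):
    print(seconds, remaining_possessions(seconds))
```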
Below we plot the foul call rate across trailing possession/remaining possession pairs. Note that we always calculate trailing possessions (trailing_poss) from the perspective of the committing team. For instance, trailing_poss = 1 indicates that the committing team is trailing by 1-3 points, whereas trailing_poss = -1 indicates that the committing team is leading by 3-5 points (because np.ceil rounds negative values toward zero, a lead of one or two points is encoded as trailing_poss = 0).
ax = sns.heatmap(
df.pivot_table('foul_called', 'trailing_poss', 'remaining_poss')
.rename_axis(
"Trailing possessions\n(committing team)", axis=0
)
.rename_axis("Remaining possessions", axis=1),
cmap='seismic', cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
The heatmap above shows that the foul call rate increases significantly when the committing team is trailing by more than the number of possessions remaining in the game. That is, teams resort to intentional fouls only when the opposing team can run out the clock and guarantee a win. (Since we have quantized the score difference and time into possessions, this conclusion is not entirely correct; it is, however, correct enough for our purposes.)
def plot_foul_diff_heatmap(*_, data=None, **kwargs):
    ax = plt.gca()
    sns.heatmap(
        data.pivot_table(
            'diff',
            'trailing_poss',
            'remaining_poss'
        ),
        cmap='seismic', robust=True,
        cbar_kws={'format': pct_formatter}
    )
    ax.invert_yaxis()
    ax.set_title("Observed foul call rate")
call_name_df = df.assign(
call_type=lambda df: call_type_enc.inverse_transform(
df.call_type.values
)
)
diff_df = (pd.merge(
call_name_df,
call_name_df.groupby('call_type')
.foul_called.mean()
.rename('avg_foul_called')
.reset_index()
)
.assign(diff=lambda df: df.foul_called - df.avg_foul_called))
The heatmaps below are broken out by call type, and show the difference between the foul call rate for each trailing/remaining possession combination and the overall foul call rate for the call type in question.
(sns.FacetGrid(diff_df, col='call_type', col_wrap=3, aspect=1.5)
.map_dataframe(plot_foul_diff_heatmap)
.set_axis_labels(
"Remaining possessions",
"Trailing possessions\n(committing team)"
)
.set_titles("{col_name}"));
These plots confirm that most intentional fouls are personal fouls. They also show that the three-way interaction between trailing possessions, remaining possessions, and call type is important for modeling foul call rates.
$$ \begin{align*} \sigma_{\textrm{poss}, c} & \sim \operatorname{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{t, r, c} & \sim \operatorname{HierarchicalNormal}(0, \sigma_{\textrm{poss}, c}^2) \end{align*} $$
with poss_model:
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )
$$\eta^{\textrm{game}}_k = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{t(k),r(k),c(k)}$$
call_type = shared(df.call_type.values)
with poss_model:
    η_game = β_season[season] \
             + β_call[call_type] \
             + β_poss[
                 trailing_poss, remaining_poss, call_type
             ]
$$ \begin{align*} p_k & = \operatorname{sigm}\left(\eta^{\textrm{game}}_k\right) \end{align*} $$
with poss_model:
    p = pm.Deterministic('p', pm.math.sigmoid(η_game))
    y = pm.Bernoulli('y', p, observed=df.foul_called)
with poss_model:
    poss_trace = pm.sample(**SAMPLE_KWARGS)
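To show how the pieces of the linear predictor combine for a single play, here is a small sketch with plain Python lists standing in for the parameter arrays. Every parameter value below is invented for illustration and is not an estimate from the model; the function name is hypothetical.

```python
import math

# Hypothetical parameter values, invented for illustration only.
beta_season = [-1.0, -1.3]   # indexed by season
beta_call = [0.2, -0.4]      # indexed by call type
# Indexed by [trailing_poss][remaining_poss][call_type].
beta_poss = [
    [[0.0, 0.1], [0.3, -0.2]],
    [[1.5, 0.0], [0.4, 0.1]],
]


def foul_probability(season, call_type, trailing_poss, remaining_poss):
    """Sum the three effects for one play and apply the sigmoid."""
    eta_game = (beta_season[season]
                + beta_call[call_type]
                + beta_poss[trailing_poss][remaining_poss][call_type])
    return 1.0 / (1.0 + math.exp(-eta_game))


# Same season and call type, different possession situations.
print(foul_probability(0, 0, 0, 0))
print(foul_probability(0, 0, 1, 0))
```

Indexing the three-dimensional array by the play's encoded features is what lets each call type have its own response to the possession situation.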
The BFMI and Gelman-Rubin statistics for this model indicate no problems with HMC sampling and good convergence.
bfmi = pm.bfmi(poss_trace)
max_gr = max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(poss_trace).values())
(pm.energyplot(poss_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
resid_df = (df.assign(p_hat=poss_trace['p'].mean(axis=0))
              .assign(resid=lambda df: df.foul_called - df.p_hat))
The following plots show that, grouped various ways, the residuals for this model are relatively well-distributed.
ax = sns.heatmap(
resid_df.pivot_table('resid', 'trailing_poss', 'remaining_poss')
.rename_axis("Trailing possessions\n(committing team)", axis=0)
.rename_axis("Remaining possessions", axis=1)
.loc[-3:3],
cmap='seismic', cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
N_BIN = 20
bin_ix, bins = pd.qcut(
resid_df.p_hat, N_BIN,
labels=np.arange(N_BIN),
retbins=True
)
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
Now that we have two models, we can engage in model selection. We use the widely applicable information criterion (WAIC) for model selection.
MODEL_NAME_MAP = {
0: "Base",
1: "Possession"
}
comp_df = (pm.compare(
(base_trace, poss_trace),
(base_model, poss_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
Since smaller WAICs are better, the possession model clearly outperforms the base model.
comp_df
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)), comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
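For readers who want to see what goes into a WAIC value, the sketch below computes it from a matrix of pointwise log-likelihoods using the usual definition on the deviance scale: minus two times the log pointwise predictive density less the effective number of parameters. It is an illustration of the formula, not a description of the internals of `pm.compare`; the function name and the toy inputs are hypothetical.

```python
import math
import statistics


def waic_from_log_lik(log_lik):
    """log_lik[s][i]: log-likelihood of observation i under posterior draw s."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior-mean likelihood of observation i.
        lppd += math.log(statistics.mean([math.exp(ll) for ll in column]))
        # Penalty: posterior variance of the log-likelihood of observation i.
        p_waic += statistics.variance(column)
    return -2.0 * (lppd - p_waic)


# Toy inputs: two posterior draws, two observations each.
well_fit = [[math.log(0.9), math.log(0.8)],
            [math.log(0.9), math.log(0.8)]]
poorly_fit = [[math.log(0.5), math.log(0.5)],
              [math.log(0.5), math.log(0.5)]]

print(waic_from_log_lik(well_fit), waic_from_log_lik(poorly_fit))
```

The model that assigns higher probability to the observed outcomes gets the smaller WAIC, consistent with the convention that smaller is better.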
We now turn to the question of whether or not committing and/or drawing fouls is a measurable skill. We use an item-response theory (IRT) model to study this question.
Unfortunately there is not enough time in this talk to do Bayesian item-response theory justice. For the curious:
fig, ax = plt.subplots()
ax.set_aspect('equal');
x = y = np.linspace(-3, 3, 100)
C = sp.special.expit(
np.subtract.outer(x, y)
)
poly = ax.pcolor(x, y, C, cmap='bwr')
ax.text(
4., 3.5, "Liberal",
fontdict={'size': LABELSIZE}
);
ax.text(
3.1, 3.5, "Conservative",
fontdict={'size': LABELSIZE}
);
ax.text(
4.85, 3.1, "Conservative",
fontdict={'size': LABELSIZE}
);
cbar = fig.colorbar(poly, ax=ax)
cbar.ax.yaxis.set_ticks(np.linspace(0, 1, 5));
cbar.ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Case ideal point");
ax.set_xlabel("Justice ideal point");
ax.set_title("Probability justice issues\nconservative opinion on case");
fig
with pm.Model() as irt_model:
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    β_call = hierarchical_normal('β_call', n_call_type)
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )
    η_game = β_season[season] \
             + β_call[call_type] \
             + β_poss[trailing_poss, remaining_poss, call_type]
player_committing = shared(df.player_committing.values)
player_disadvantaged = shared(df.player_disadvantaged.values)
n_player = player_enc.classes_.size
$$ \begin{align*} \sigma_{\theta} & \sim \operatorname{HalfNormal}(5) \\ \theta^{\textrm{player}}_{i, s} & \sim \operatorname{HierarchicalNormal}(0, \sigma_{\theta}^2) \end{align*} $$
with irt_model:
    θ_player = hierarchical_normal(
        'θ_player', (n_player, n_season)
    )
    θ = θ_player[player_disadvantaged, season]
$$ \begin{align*} \sigma_{b} & \sim \operatorname{HalfNormal}(5) \\ b^{\textrm{player}}_{j, s} & \sim \operatorname{HierarchicalNormal}(0, \sigma_{b}^2) \end{align*} $$
with irt_model:
    b_player = hierarchical_normal(
        'b_player', (n_player, n_season)
    )
    b = b_player[player_committing, season]
$$\eta^{\textrm{player}}_k = \theta_k - b_k$$
with irt_model:
    η_player = θ - b
$$\eta_k = \eta^{\textrm{game}}_k + \eta^{\textrm{player}}_k$$
with irt_model:
    η = η_game + η_player
with irt_model:
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli(
        'y', p,
        observed=df.foul_called
    )
with irt_model:
    irt_trace = pm.sample(**SAMPLE_KWARGS)
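The role of the two player parameters can be illustrated for a single play: the disadvantaged player's skill θ raises the foul call probability and the committing player's skill b lowers it, on top of the game-situation effect. The numbers and function names below are hypothetical.

```python
import math


def sigmoid(eta):
    """Inverse-logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def irt_foul_probability(eta_game, theta_disadvantaged, b_committing):
    """Foul call probability for one play: sigmoid(eta_game + theta - b)."""
    return sigmoid(eta_game + theta_disadvantaged - b_committing)


eta_game = -1.0  # hypothetical game-situation effect
baseline = irt_foul_probability(eta_game, 0.0, 0.0)
draws_fouls = irt_foul_probability(eta_game, 0.2, 0.0)
avoids_fouls = irt_foul_probability(eta_game, 0.0, 0.2)

print(baseline, draws_fouls, avoids_fouls)
```

When the two skills are equal they cancel, and the probability reduces to the one given by the game situation alone.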
bfmi = pm.bfmi(irt_trace)
max_gr = max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(irt_trace).values())
(pm.energyplot(irt_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
resid_df = (df.assign(p_hat=irt_trace['p'].mean(axis=0))
              .assign(resid=lambda df: df.foul_called - df.p_hat))
The binned residuals for this model are more asymmetric than for the previous models, but still not too bad.
N_BIN = 50
bin_ix, bins = pd.qcut(
resid_df.p_hat, N_BIN,
labels=np.arange(N_BIN),
retbins=True
)
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
The IRT model represents a marginal improvement over the possession model in terms of WAIC.
MODEL_NAME_MAP[2] = "IRT"
comp_df = (pm.compare(
(base_trace, poss_trace, irt_trace),
(base_model, poss_model, irt_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
comp_df
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)), comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
# column names are assumed to look like 'θ__12_0' (parameter, player index, season index)
def varname_to_param(varname):
    return varname[0]
def varname_to_player(varname):
    return int(varname[3:-2])
def varname_to_season(varname):
    return int(varname[-1])
irt_df = (pm.trace_to_dataframe(
irt_trace, varnames=['θ_player', 'b_player']
)
.rename(columns=lambda col: col.replace('_player', ''))
.T
.apply(
lambda s: pd.Series.describe(
s, percentiles=[0.055, 0.945]
),
axis=1
)
[['mean', '5.5%', '94.5%']]
.rename(columns={
'5.5%': 'low',
'94.5%': 'high'
})
.rename_axis('varname')
.reset_index()
.assign(
param=lambda df: df.varname.apply(varname_to_param),
player=lambda df: df.varname.apply(varname_to_player),
season=lambda df: df.varname.apply(varname_to_season)
)
.drop('varname', axis=1))
irt_df.head()
player_irt_df = irt_df.pivot_table(
index='player',
columns=['param', 'season'],
values='mean'
)
player_irt_df.head()
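The varname helpers used to build `irt_df` rely on fixed character positions. As an alternative sketch, the same three fields can be pulled out with a regular expression, which fails loudly on unexpected input. The column-name pattern (for example `θ__12_0` once the `_player` suffix is removed) is an assumption about how the trace columns are named, and `parse_varname` is a hypothetical helper.

```python
import re

# Assumed column-name pattern, e.g. 'θ__12_0' for player 12 in season 0.
VARNAME_RE = re.compile(r'^(?P<param>.+?)__(?P<player>\d+)_(?P<season>\d+)$')


def parse_varname(varname):
    """Split a column name into (parameter, player index, season index)."""
    match = VARNAME_RE.match(varname)
    if match is None:
        raise ValueError(f"unexpected column name: {varname!r}")
    return match['param'], int(match['player']), int(match['season'])


print(parse_varname('θ__12_0'))
print(parse_varname('b__345_1'))
```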
The following plot shows that the committing skill appears to be somewhat larger than the disadvantaged skill. This difference seems reasonable because most fouls are committed by the player on defense; committing skill is quite likely to be correlated with defensive ability.
def plot_latent_params(df):
    fig, ax = plt.subplots()
    n, _ = df.shape
    y = np.arange(n)
    ax.errorbar(
        df['mean'], y,
        xerr=(df[['high', 'low']]
                .sub(df['mean'], axis=0)
                .abs()
                .values.T),
        fmt='o'
    )
    ax.set_yticks(y)
    ax.set_yticklabels(
        player_enc.inverse_transform(df.player)
    )
    ax.set_ylabel("Player")
    return fig, ax
fig, axes = plt.subplots(
ncols=2, nrows=2, sharex=True,
figsize=(16, 8)
)
(θ0_ax, θ1_ax), (b0_ax, b1_ax) = axes
bins = np.linspace(
0.9 * irt_df['mean'].min(),
1.1 * irt_df['mean'].max(),
75
)
θ0_ax.hist(
player_irt_df['θ', 0],
bins=bins, normed=True
);
θ1_ax.hist(
player_irt_df['θ', 1],
bins=bins, normed=True
);
θ0_ax.set_yticks([]);
θ0_ax.set_title(
r"$\hat{\theta}$ (" + season_enc.inverse_transform(0) + ")"
);
θ1_ax.set_yticks([]);
θ1_ax.set_title(
r"$\hat{\theta}$ (" + season_enc.inverse_transform(1) + ")"
);
b0_ax.hist(
player_irt_df['b', 0],
bins=bins, normed=True, color=green
);
b1_ax.hist(
player_irt_df['b', 1],
bins=bins, normed=True, color=green
);
b0_ax.set_xlabel(
r"$\hat{b}$ (" + season_enc.inverse_transform(0) + ")"
);
b0_ax.invert_yaxis();
b0_ax.xaxis.tick_top();
b0_ax.set_yticks([]);
b1_ax.set_xlabel(
r"$\hat{b}$ (" + season_enc.inverse_transform(1) + ")"
);
b1_ax.invert_yaxis();
b1_ax.xaxis.tick_top();
b1_ax.set_yticks([]);
fig.suptitle("Disadvantaged skill", size=18);
fig.text(0.45, 0.02, "Committing skill", size=18)
fig.tight_layout();
The latent ability parameters tend to lie in the interval $[-0.2, 0.2]$, so these skills are small, if they exist.
fig
top_bot_irt_df = (irt_df.groupby('param')
.apply(
lambda df: pd.concat((
df.nlargest(10, 'mean'),
df.nsmallest(10, 'mean')
),
axis=0, ignore_index=True
)
)
.reset_index(drop=True))
top_bot_irt_df.head()
We now examine the top and bottom ten players in each ability, across both seasons.
The top players in terms of disadvantaged ability tend to be good scorers (Jimmy Butler, Ricky Rubio, John Wall, Andre Iguodala). The presence of DeAndre Jordan in the top ten seems to be due to the hack-a-Shaq phenomenon. In future work, it would be interesting to control for the disadvantaged player's free throw percentage in order to mitigate the influence of the hack-a-Shaq effect on the measurement of latent skill.
Interestingly, the bottom players (in terms of disadvantaged ability) include many stars (Pau Gasol, Carmelo Anthony, Kevin Durant, Kawhi Leonard). The presence of these stars in the bottom may somewhat counteract the pervasive narrative that referees favor stars in their foul calls.
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df.param == 'θ']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{\theta}$");
ax.set_title("Top and bottom ten");