NBA Free throw analysis

Now let's see some of these methods in action on real world data. I'm not a basketball guru by any means, but I thought it would be fun to see whether we can find players that perform differently in free throws when playing at home versus away. Basketballvalue.com has some nice play by play data on season and playoff data between 2007 and 2012, which we will use for this analysis. It's not perfect, for example it only records player's last names, but it will do for the purpose of demonstration.

Getting data:

  • Download and extract play by play data from 2007 - 2012 data from http://basketballvalue.com/downloads.php
  • Concatenate all text files into file called raw.data
  • Run following to extract free throw data into free_throws.csv
    cat raw.data | ack Free Throw | sed -E 's/[0-9]+([A-Z]{3})([A-Z]{3})[[:space:]][0-9]*[[:space:]].?[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]*\[([A-z]{3}).*\][[:space:]](.*)[[:space:]]Free Throw.*(d|\))/\1,\2,\3,\4,\5/ ; s/(.*)d$/\10/ ; s/(.*)\)$/\11/' > free_throws.csv
In [1]:
from __future__ import division

import pandas as pd
import numpy as np
import scipy as sp

import scipy.stats

import toyplot as tp

Data munging

Because only last name is included, we analyze "player-team" combinations to avoid duplicates. This could mean that the same player has multiple rows if he changed teams.

In [2]:
df = pd.read_csv('free_throws.csv', names=["away", "home", "team", "player", "score"])

df["at_home"] = df["home"] == df["team"]

df.head()
Out[2]:
away home team player score at_home
0 CHI MIA CHI Gordon 1 False
1 CHI MIA CHI Gordon 1 False
2 CHI MIA CHI Deng 1 False
3 CHI MIA CHI Deng 0 False
4 CHI MIA MIA Wade 1 True

Overall free throw%

We note that at home the ft% is slightly higher, but there is not much difference

In [3]:
df.groupby("at_home").mean()
Out[3]:
score
at_home
False 0.757928
True 0.759924

Aggregating to player level

We use a pivot table to get statistics on every player.

In [4]:
sdf = pd.pivot_table(df, index=["player", "team"], columns="at_home", values=["score"], 
               aggfunc=[len, sum], fill_value=0).reset_index()
sdf.columns = ['player', 'team', 'atm_away', 'atm_home', 'score_away', 'score_home']

sdf['atm_total'] = sdf['atm_away'] + sdf['atm_home']
sdf['score_total'] = sdf['score_away'] + sdf['score_home']

sdf.sample(10)
Out[4]:
player team atm_away atm_home score_away score_home atm_total score_total
543 Freije ATL 6 0 5 0 6 5
1187 Owens IND 11 23 8 17 34 25
980 M. Williams ATL 322 351 270 276 673 546
38 Allen DAL 4 8 3 8 12 11
222 Brezec DET 3 11 2 5 14 7
397 Davis MIN 143 133 124 107 276 231
1071 Miles MEM 17 12 12 10 29 22
388 Davidson CHA 25 3 17 1 28 18
1544 Udoh GSW 67 61 49 39 128 88
683 Hayward UTA 164 158 132 125 322 257

Individual tests

For each player, we assume each free throw is an independent draw from a Bernoulli distribution with probability $p_{ij}$ of succeeding where $i$ denotes the player and $j=\{a, h\}$ denoting away or home, respectively.

Our null hypotheses are that there is no difference between playing at home and away, versus the alternative that there is a difference. While you could argue a one-sided test for home advantage is also appropriate, I am sticking with a two-sided test.

$$\begin{aligned} H_{0, i}&: p_{i, a} = p_{i, h},\\ H_{1, i}&: p_{i, a} \neq p_{i, h}. \end{aligned}$$

To get test statistics, we conduct a simple two-sample proportions test, where our test statistic is:

$$Z = \frac{\hat p_h - \hat p_a}{\sqrt{\hat p (1-\hat p) (\frac{1}{n_h} + \frac{1}{n_a})}}$$

where

  • $n_h$ and $n_a$ are the number of attempts at home and away, respectively
  • $X_h$ and $X_a$ are the number of free throws made at home and away
  • $\hat p_h = X_h / n_h$ is the MLE for the free throw percentage at home
  • likewise, $\hat p_a = X_a / n_a$ for away ft%
  • $\hat p = \frac{X_h + X_a}{n_h + n_a}$ is the MLE for overall ft%, used for the pooled variance estimator

Then we know from Stats 101 that $Z \sim N(0, 1)$ under the null hypothesis that there is no difference in free throw percentages.

For a normal approximation to hold, we need $np > 5$ and $n(1-p) > 5$, since $p \approx 0.75$, let's be a little conservative and say we need at least 50 samples for a player to get a good normal approximation.

This leads to data on 936 players, and for each one we compute Z, and the corresponding p-value.

In [5]:
data = sdf.query('atm_total > 50').copy()
len(data)
Out[5]:
936
In [6]:
data['p_home'] = data['score_home'] / data['atm_home']
data['p_away'] = data['score_away'] / data['atm_away']
data['p_ovr'] = (data['score_total']) / (data['atm_total'])

# two-sided
data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))
data['pval'] = 2*(1-sp.stats.norm.cdf(np.abs(data['zval'])))

# one-sided testing home advantage
# data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))
# data['pval'] = (1-sp.stats.norm.cdf(data['zval']))
In [7]:
data.sample(10)
Out[7]:
player team atm_away atm_home score_away score_home atm_total score_total p_home p_away p_ovr zval pval
1640 Wilcox SEA 302 293 199 197 595 396 0.672355 0.658940 0.665546 0.346735 0.728790
1078 Miller HOU 60 34 49 29 94 78 0.852941 0.816667 0.829787 0.449649 0.652964
626 Green SAS 46 54 32 47 100 79 0.870370 0.695652 0.790000 2.137917 0.032524
683 Hayward UTA 164 158 132 125 322 257 0.791139 0.804878 0.798137 -0.307047 0.758808
476 Elson SAS 59 65 47 49 124 96 0.753846 0.796610 0.774194 -0.568797 0.569494
1212 Paul NOK 202 153 166 124 355 290 0.810458 0.821782 0.816901 -0.273215 0.784688
1059 McLeod GSW 31 22 25 22 53 47 1.000000 0.806452 0.886792 2.191266 0.028433
554 Gaines NJN 54 62 27 43 116 70 0.693548 0.500000 0.603448 2.125609 0.033536
921 Krstic BOS 38 45 29 33 83 62 0.733333 0.763158 0.746988 -0.311391 0.755504
1018 Martin SAC 912 916 792 773 1828 1565 0.843886 0.868421 0.856127 -1.494435 0.135062
In [8]:
canvas = tp.Canvas(800, 300)
ax1 = canvas.axes(grid=(1, 2, 0), label="Histogram p-values")
hist_p = ax1.bars(np.histogram(data["pval"], bins=50, normed=True), color="steelblue")
hisp_p_density = ax1.plot([0, 1], [1, 1], color="red")
ax2 = canvas.axes(grid=(1, 2, 1), label="Histogram z-values")
hist_z = ax2.bars(np.histogram(data["zval"], bins=50, normed=True), color="orange")
x = np.linspace(-3, 3, 200)
hisp_z_density = ax2.plot(x, sp.stats.norm.pdf(x), color="red")
0.00.51.00.00.51.01.52.0Histogram p-values-2020.00.10.20.30.4Histogram z-values

Global tests

We can test the global null hypothesis, that is, there is no difference in free throw % between playing at home and away for any player using both Fisher's Combination Test and the Bonferroni method. Which one is preferred in this case? I would expect to see many small difference in effects rather than a few players showing huge effects, so Fisher's Combination Test probably has much better power.

Fisher's combination test

We expect this test to have good power: if there is a difference between playing at home and away we would expect to see a lot of little effects.

In [9]:
T = -2 * np.sum(np.log(data["pval"]))
print 'p-value for Fisher Combination Test: {:.3e}'.format(1 - sp.stats.chi2.cdf(T, 2*len(data)))
p-value for Fisher Combination Test: 6.280e-04

Bonferroni's method

The theory would suggest this test has a lot less power, it's unlikely to have a few players where the difference is relatively huge.

In [10]:
print '"p-value" Bonferroni: {:.3e}'.format(min(1, data["pval"].min() * len(data)))
"p-value" Bonferroni: 1.000e+00

Conclusion

Indeed, we find a small p-value for Fisher's Combination Test, while Bonferroni's method does not reject the null hypothesis. In fact, if we multiply the smallest p-value by the number of hypotheses, we get a number larger than 1, so we aren't even remotely close to any significance.

Multiple tests

So there definitely seems some evidence that there is a difference in performance. If you tell a sport's analyst that there is evidence that at least some players that perform differently away versus at home, their first question will be: "So who is?" Let's see if we can properly answer that question.

Naive method

Let's first test each null hypothesis ignoring the fact that we are dealing with many hypotheses. Please don't do this at home!

In [11]:
alpha = 0.05
data["reject_naive"] = 1*(data["pval"] < alpha)

print 'Number of rejections: {}'.format(data["reject_naive"].sum())
Number of rejections: 65

If we don't correct for multiple comparisons, there are actually 65 "significant" results (at $\alpha = 0.05$), which corresponds to about 7% of the players. We expect around 46 rejections by chance, so it's a bit more than expected, but this is a bogus method so no matter what, we should discard the results.

Bonferroni correction

Let's do it the proper way though, first using Bonferroni correction. Since this method is basically the same as the Bonferroni global test, we expect no rejections:

In [12]:
from statsmodels.sandbox.stats.multicomp import multipletests
In [13]:
data["reject_bc"] = 1*(data["pval"] < alpha / len(data))
print 'Number of rejections: {}'.format(data["reject_bc"].sum())
Number of rejections: 0

Indeed, no rejections.

Benjamini-Hochberg

Let's also try the BHq procedure, which has a bit more power than Bonferonni.

In [14]:
is_reject, corrected_pvals, _, _ = multipletests(data["pval"], alpha=0.1, method='fdr_bh')
In [15]:
data["reject_fdr"] = 1*is_reject
data["pval_fdr"] = corrected_pvals
print 'Number of rejections: {}'.format(data["reject_fdr"].sum())
Number of rejections: 0

Even though the BHq procedure has more power, we can't reject any of the individual hypothesis, hence we don't find sufficient evidence for any of the players that free throw performance is affected by location.

Taking a step back

If we take a step back and take another look at our data, we quickly find that we shouldn't be surprised with our results. In particular, our tests are clearly underpowered. That is, the probability of rejecting the null hypothesis when there is a true effect is very small given the effect sizes that are reasonable.

While there are definitely sophisticated approaches to power analysis, we can use a simple tool to get a rough estimate. The free throw% is around 75% percent, and at that level it takes almost 2500 total attempts to detect a difference in ft% of 5% ($\alpha = 0.05$, power = $0.8$), and 5% is a pretty remarkable difference when only looking at home and away difference. For most players, the observed difference is not even close to 5%, and we have only 11 players in our dataset with more than 2500 free throws.

To have any hope to detect effects for those few players that have plenty of data, the worst thing one can do is throw in a bunch of powerless tests. It would have been much better to restrict our analysis to players where we have a lot of data. Don't worry, I've already done that and again we cannot reject a single hypothesis.

So unfortunately it seems we won't be impressing our friends with cool results, more likely we will be the annoying person pointing out the fancy stats during a game don't really mean anything.

There is one cool take-away though: Fisher's combination test did reject the global null hypothesis even though each single test had almost no power, combined they did yield a significant result. If we aggregate the data across all players first and then conduct a single test of proportions, it turns out we cannot reject that hypothesis.

In [16]:
len(data.query("atm_total > 2500"))
Out[16]:
11
In [17]:
reduced_data = data.query("atm_total > 2500").copy()

is_reject2, corrected_pvals2, _, _ = multipletests(reduced_data["pval"], alpha=0.1, method='fdr_bh')
reduced_data["reject_fdr2"] = 1*is_reject2
reduced_data["pval_fdr2"] = corrected_pvals2


print 'Number of rejections: {}'.format(reduced_data["reject_fdr2"].sum())
Number of rejections: 0