Now let's see some of these methods in action on real world data. I'm not a basketball guru by any means, but I thought it would be fun to see whether we can find players that perform differently in free throws when playing at home versus away. Basketballvalue.com has some nice play by play data on season and playoff data between 2007 and 2012, which we will use for this analysis. It's not perfect, for example it only records player's last names, but it will do for the purpose of demonstration.

- Download and extract play by play data from 2007 - 2012 data from http://basketballvalue.com/downloads.php
- Concatenate all text files into file called
`raw.data`

- Run following to extract free throw data into
`free_throws.csv`

`cat raw.data | ack Free Throw | sed -E 's/[0-9]+([A-Z]{3})([A-Z]{3})[[:space:]][0-9]*[[:space:]].?[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]*\[([A-z]{3}).*\][[:space:]](.*)[[:space:]]Free Throw.*(d|\))/\1,\2,\3,\4,\5/ ; s/(.*)d$/\10/ ; s/(.*)\)$/\11/' > free_throws.csv`

In [1]:

```
from __future__ import division
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats
import toyplot as tp
```

Because only last name is included, we analyze "player-team" combinations to avoid duplicates. This could mean that the same player has multiple rows if he changed teams.

In [2]:

```
df = pd.read_csv('free_throws.csv', names=["away", "home", "team", "player", "score"])
df["at_home"] = df["home"] == df["team"]
df.head()
```

Out[2]:

We note that at home the ft% is slightly higher, but there is not much difference

In [3]:

```
df.groupby("at_home").mean()
```

Out[3]:

We use a pivot table to get statistics on every player.

In [4]:

```
sdf = pd.pivot_table(df, index=["player", "team"], columns="at_home", values=["score"],
aggfunc=[len, sum], fill_value=0).reset_index()
sdf.columns = ['player', 'team', 'atm_away', 'atm_home', 'score_away', 'score_home']
sdf['atm_total'] = sdf['atm_away'] + sdf['atm_home']
sdf['score_total'] = sdf['score_away'] + sdf['score_home']
sdf.sample(10)
```

Out[4]:

For each player, we assume each free throw is an independent draw from a Bernoulli distribution with probability $p_{ij}$ of succeeding where $i$ denotes the player and $j=\{a, h\}$ denoting away or home, respectively.

Our null hypotheses are that there is no difference between playing at home and away, versus the alternative that there is a difference. While you could argue a one-sided test for home advantage is also appropriate, I am sticking with a two-sided test.

$$\begin{aligned} H_{0, i}&: p_{i, a} = p_{i, h},\\ H_{1, i}&: p_{i, a} \neq p_{i, h}. \end{aligned}$$To get test statistics, we conduct a simple two-sample proportions test, where our test statistic is:

$$Z = \frac{\hat p_h - \hat p_a}{\sqrt{\hat p (1-\hat p) (\frac{1}{n_h} + \frac{1}{n_a})}}$$where

- $n_h$ and $n_a$ are the number of attempts at home and away, respectively
- $X_h$ and $X_a$ are the number of free throws made at home and away
- $\hat p_h = X_h / n_h$ is the MLE for the free throw percentage at home
- likewise, $\hat p_a = X_a / n_a$ for away ft%
- $\hat p = \frac{X_h + X_a}{n_h + n_a}$ is the MLE for overall ft%, used for the pooled variance estimator

Then we know from Stats 101 that $Z \sim N(0, 1)$ under the null hypothesis that there is no difference in free throw percentages.

For a normal approximation to hold, we need $np > 5$ and $n(1-p) > 5$, since $p \approx 0.75$, let's be a little conservative and say we need at least 50 samples for a player to get a good normal approximation.

This leads to data on 936 players, and for each one we compute Z, and the corresponding p-value.

In [5]:

```
data = sdf.query('atm_total > 50').copy()
len(data)
```

Out[5]:

In [6]:

```
data['p_home'] = data['score_home'] / data['atm_home']
data['p_away'] = data['score_away'] / data['atm_away']
data['p_ovr'] = (data['score_total']) / (data['atm_total'])
# two-sided
data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))
data['pval'] = 2*(1-sp.stats.norm.cdf(np.abs(data['zval'])))
# one-sided testing home advantage
# data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))
# data['pval'] = (1-sp.stats.norm.cdf(data['zval']))
```

In [7]:

```
data.sample(10)
```

Out[7]:

In [8]:

```
canvas = tp.Canvas(800, 300)
ax1 = canvas.axes(grid=(1, 2, 0), label="Histogram p-values")
hist_p = ax1.bars(np.histogram(data["pval"], bins=50, normed=True), color="steelblue")
hisp_p_density = ax1.plot([0, 1], [1, 1], color="red")
ax2 = canvas.axes(grid=(1, 2, 1), label="Histogram z-values")
hist_z = ax2.bars(np.histogram(data["zval"], bins=50, normed=True), color="orange")
x = np.linspace(-3, 3, 200)
hisp_z_density = ax2.plot(x, sp.stats.norm.pdf(x), color="red")
```

We can test the global null hypothesis, that is, there is no difference in free throw % between playing at home and away for any player using both Fisher's Combination Test and the Bonferroni method. Which one is preferred in this case? I would expect to see many small difference in effects rather than a few players showing huge effects, so Fisher's Combination Test probably has much better power.

We expect this test to have good power: if there is a difference between playing at home and away we would expect to see a lot of little effects.

In [9]:

```
T = -2 * np.sum(np.log(data["pval"]))
print 'p-value for Fisher Combination Test: {:.3e}'.format(1 - sp.stats.chi2.cdf(T, 2*len(data)))
```

The theory would suggest this test has a lot less power, it's unlikely to have a few players where the difference is relatively huge.

In [10]:

```
print '"p-value" Bonferroni: {:.3e}'.format(min(1, data["pval"].min() * len(data)))
```

Indeed, we find a small p-value for Fisher's Combination Test, while Bonferroni's method does not reject the null hypothesis. In fact, if we multiply the smallest p-value by the number of hypotheses, we get a number larger than 1, so we aren't even remotely close to any significance.

So there definitely seems some evidence that there is a difference in performance. If you tell a sport's analyst that there is evidence that at least some players that perform differently away versus at home, their first question will be: "So who is?" Let's see if we can properly answer that question.

Let's first test each null hypothesis ignoring the fact that we are dealing with many hypotheses. Please don't do this at home!

In [11]:

```
alpha = 0.05
data["reject_naive"] = 1*(data["pval"] < alpha)
print 'Number of rejections: {}'.format(data["reject_naive"].sum())
```

If we don't correct for multiple comparisons, there are actually 65 "significant" results (at $\alpha = 0.05$), which corresponds to about 7% of the players. We expect around 46 rejections by chance, so it's a bit more than expected, but this is a bogus method so no matter what, we should discard the results.

Let's do it the proper way though, first using Bonferroni correction. Since this method is basically the same as the Bonferroni global test, we expect no rejections:

In [12]:

```
from statsmodels.sandbox.stats.multicomp import multipletests
```

In [13]:

```
data["reject_bc"] = 1*(data["pval"] < alpha / len(data))
print 'Number of rejections: {}'.format(data["reject_bc"].sum())
```

Indeed, no rejections.

Let's also try the BHq procedure, which has a bit more power than Bonferonni.

In [14]:

```
is_reject, corrected_pvals, _, _ = multipletests(data["pval"], alpha=0.1, method='fdr_bh')
```

In [15]:

```
data["reject_fdr"] = 1*is_reject
data["pval_fdr"] = corrected_pvals
print 'Number of rejections: {}'.format(data["reject_fdr"].sum())
```

Even though the BHq procedure has more power, we can't reject any of the individual hypothesis, hence we don't find sufficient evidence for any of the players that free throw performance is affected by location.

If we take a step back and take another look at our data, we quickly find that we shouldn't be surprised with our results. In particular, our tests are clearly underpowered. That is, the probability of rejecting the null hypothesis when there is a true effect is very small given the effect sizes that are reasonable.

While there are definitely sophisticated approaches to power analysis, we can use a simple tool to get a rough estimate. The free throw% is around 75% percent, and at that level it takes almost 2500 total attempts to detect a difference in ft% of 5% ($\alpha = 0.05$, power = $0.8$), and 5% is a pretty remarkable difference when only looking at home and away difference. For most players, the observed difference is not even close to 5%, and we have only 11 players in our dataset with more than 2500 free throws.

To have any hope to detect effects for those few players that have plenty of data, the worst thing one can do is throw in a bunch of powerless tests. It would have been much better to restrict our analysis to players where we have a lot of data. Don't worry, I've already done that and again we cannot reject a single hypothesis.

So unfortunately it seems we won't be impressing our friends with cool results, more likely we will be the annoying person pointing out the fancy stats during a game don't really mean anything.

There is one cool take-away though: Fisher's combination test did reject the global null hypothesis even though each single test had almost no power, combined they did yield a significant result. If we aggregate the data across all players first and then conduct a single test of proportions, it turns out we cannot reject that hypothesis.

In [16]:

```
len(data.query("atm_total > 2500"))
```

Out[16]:

In [17]:

```
reduced_data = data.query("atm_total > 2500").copy()
is_reject2, corrected_pvals2, _, _ = multipletests(reduced_data["pval"], alpha=0.1, method='fdr_bh')
reduced_data["reject_fdr2"] = 1*is_reject2
reduced_data["pval_fdr2"] = corrected_pvals2
print 'Number of rejections: {}'.format(reduced_data["reject_fdr2"].sum())
```