We fit the Poisson NBA scoring model to the 2002-2003 NBA season. See 11.6.2 of Lange for model details.
We split the data into training (dat) and test (dat_test). The training data is used to fit the model and test is used to test performance. The model predicts the score for any pair of teams, so we use as our model evaluation criteria mean absolute error (MAE) $$MAE = \sum_{\text{all games}} |\text{score} - \text{prediction}|$$ Note that for each game we make two score predictions (one for each team).
We compare the Poisson model to two simpler models based on MAE. These simpler models are:
Neither of these methods accounts for the opponents a team is playing.
Note: In order to exactly recover Table 11.1 of Lange, do not separate test data from the training data.
library(data.table)
set.seed(1234)
dat <- as.data.frame(fread("http://longjp.github.io/statcomp/data/nba_cleaned.txt"))
head(dat)
team1_name | team1_score | team2_name | team2_score | location | nOT | game_length |
---|---|---|---|---|---|---|
San Antonio Spurs | 87 | Los Angeles Lakers | 82 | Los Angeles | 0 | 48 |
Orlando Magic | 95 | Philadelphia 76ers | 88 | Orlando | 0 | 48 |
Sacramento Kings | 94 | Cleveland Cavaliers | 67 | Sacramento | 0 | 48 |
Chicago Bulls | 99 | Boston Celtics | 96 | Boston | 0 | 48 |
Detroit Pistons | 86 | New York Knicks | 77 | Detroit | 0 | 48 |
Golden State Warriors | 106 | San Antonio Spurs | 98 | Golden State | 0 | 48 |
team_names <- unique(c(dat$team1_name,dat$team2_name))
n_team <- length(team_names)
## divide data into training and test
trainix <- sample(c(TRUE,FALSE),size=nrow(dat),replace=TRUE,prob=c(0.8,0.2))
dat_test <- dat[!trainix,]
dat <- dat[trainix,]
Before fitting the model, we convert the information in the data frame into matrices. This leads to much faster code than for loops. There is also less code.
##### STEP 1: PUT DATA IN NICE MATRICES
## T = t_ij = total time (across all games)
## team i played team j (symmetric)
## P = p_ij = total points (across all games)
## i scores on j (not symmetric)
T <- matrix(0,nrow=n_team,ncol=n_team,dimnames=list(team_names,team_names))
P <- matrix(0,nrow=n_team,ncol=n_team,dimnames=list(team_names,team_names))
for(ii in 1:nrow(T)){
## find opponent and game length of opponents for particular row
out <- subset(dat,team1_name==rownames(T)[ii],select=c("team2_name","team1_score","team2_score","game_length"))
## calculate total minutes, points scored, points allowed
mins <- tapply(out[,4],out[,1],sum)
ps <- tapply(out[,2],out[,1],sum)
pg <- tapply(out[,3],out[,1],sum)
## record data in matrices
T[ii,names(mins)] <- mins
P[ii,names(ps)] <- P[ii,names(ps)] + ps
P[names(ps),ii] <- P[names(ps),ii] + pg
}
## right now t_ij is time team i played j when team i won
## thus t_ji is time team i played j when team j won
## by adding transpose we get total time i played j
T <- T + t(T)
Prs <- rowSums(P) ## \sum_j p_ij
Pcs <- colSums(P) ## \sum_j p_j
Calculate the number of points scored in an average game for a team. Predict this number of points regardless of who is playing who.
ppg <- mean(c(dat$team1_score,dat$team2_score))
ppg ## our prediction for every team, game
## use ppg to predict every game
pred_errors <- c(abs(dat_test$team1_score - ppg),
abs(dat_test$team2_score - ppg))
mean(pred_errors)
sd(pred_errors) / sqrt(length(pred_errors))
length(pred_errors)
For each team, compute mean points scored per game. Predict that the team will score this amount, regardless of opponent.
dat1 <- dat[,c(1,2)]
colnames(dat1) <- c("n","s")
dat2 <- dat[,c(3,4)]
colnames(dat2) <- c("n","s")
temp <- rbind(dat1,dat2)
temp <- aggregate(temp[,2],
by=list(team=as.factor(temp[,1])),
FUN=mean)
ppg_by_team <- temp[,2]
names(ppg_by_team) <- temp[,1]
dat1_test <- dat_test[,c(1,2)]
colnames(dat1_test) <- c("n","s")
dat2_test <- dat_test[,c(3,4)]
colnames(dat2_test) <- c("n","s")
temp <- rbind(dat1_test,dat2_test)
temp <- cbind(temp,predict=ppg_by_team[temp[,1]])
##compute mean absolute error
mean(abs(temp$s - temp$predict)) ## mean absolute error
sd(abs(temp$s - temp$predict)) / sqrt(nrow(temp))
nrow(temp)
We fit the Poisson Sports Model to the NBA 2002-2003 season. See Section 11.6, Example 11.6.2 of Lange. We reproduce Table 11.1
## od is matrix of parameter fits for o and d
od <- matrix(0,nrow=n_team,ncol=2)
colnames(od) <- c("o","d")
rownames(od) <- colnames(T)
## N iterations
N <- 100
for(jj in 1:N){
od[,2] <- -log(Pcs / colSums(T*exp(od[,1])))
od[,2] <- od[,2] - od[1,2] ## force d_1 = 0 for identifiability
od[,1] <- log(Prs / colSums(t(T)*exp(-od[,2])))
}
## normalize od to 0
od <- od - mean(od)
## what is average game length
mean_length <- mean(dat$game_length)
## team rankings
out <- data.frame(o=od[,1],d=od[,2],rank=rowSums(od))
out[order(out$rank,decreasing=TRUE),]
o | d | rank | |
---|---|---|---|
Dallas Mavericks | 0.4203792 | -0.3369189 | 0.0834603290 |
San Antonio Spurs | 0.3499070 | -0.2796610 | 0.0702460117 |
Sacramento Kings | 0.4028982 | -0.3454079 | 0.0574902447 |
New Jersey Nets | 0.3457148 | -0.2916224 | 0.0540924582 |
Utah Jazz | 0.3360948 | -0.3010176 | 0.0350771474 |
Minnesota Timberwolves | 0.3747282 | -0.3404620 | 0.0342661805 |
Indiana Pacers | 0.3477113 | -0.3195195 | 0.0281918629 |
Portland Trail Blazers | 0.3274647 | -0.3012558 | 0.0262088864 |
Philadelphia 76ers | 0.3526018 | -0.3299275 | 0.0226742582 |
Los Angeles Lakers | 0.3805922 | -0.3608234 | 0.0197688899 |
Phoenix Suns | 0.3413935 | -0.3217348 | 0.0196587562 |
Houston Rockets | 0.3222202 | -0.3040308 | 0.0181893462 |
Detroit Pistons | 0.2875024 | -0.2712430 | 0.0162593223 |
New Orleans Hornets | 0.3089437 | -0.2944405 | 0.0145031812 |
Seattle Supersonics | 0.3145234 | -0.3067996 | 0.0077237814 |
Boston Celtics | 0.3147406 | -0.3152214 | -0.0004807289 |
Golden State Warriors | 0.4153412 | -0.4187090 | -0.0033677598 |
Milwaukee Bucks | 0.3861412 | -0.3928648 | -0.0067235874 |
Orlando Magic | 0.3699500 | -0.3808526 | -0.0109025395 |
Washington Wizards | 0.2922052 | -0.3150529 | -0.0228476446 |
Memphis Grizzlies | 0.3565122 | -0.3828170 | -0.0263047739 |
New York Knicks | 0.3402700 | -0.3677800 | -0.0275099938 |
Atlanta Hawks | 0.3160172 | -0.3540329 | -0.0380156551 |
Los Angeles Clippers | 0.3367175 | -0.3759553 | -0.0392377842 |
Chicago Bulls | 0.3410588 | -0.3812009 | -0.0401420836 |
Toronto Raptors | 0.3056668 | -0.3595021 | -0.0538352445 |
Miami Heat | 0.2273253 | -0.2905265 | -0.0632012499 |
Denver Nuggets | 0.2255051 | -0.3053182 | -0.0798130765 |
Cleveland Cavaliers | 0.3010436 | -0.3964722 | -0.0954285347 |
## MAKE PROBABILISTIC PREDICTIONS: DALLAS versus HOUSTON
min <- 48 ## game length
dallas_rate <- min*exp(od["Dallas Mavericks",1] - od["Houston Rockets",2])
houston_rate <- min*exp(od["Houston Rockets",1] - od["Dallas Mavericks",2])
## random draw of scores
n <- 1000
d_score <- rpois(n,lambda=dallas_rate)
h_score <- rpois(n,lambda=houston_rate)
scores <- data.frame(c(rep("dallas",n),rep("houston",n)),c(d_score,h_score))
colnames(scores) <- c("team","points")
## plot
library(ggplot2)
options(repr.plot.width=6,repr.plot.height=4)
g <- ggplot(scores, aes(points))
g + geom_density(aes(fill=factor(team)), alpha=0.8) +
labs(title="Density plot",
subtitle="Team Scores",
caption="Source: scores",
x="Points Scored",
fill="Team")
## distributions appear normal because normal approximation to poisson
## fraction of time dallas wins in head to head
## could also compute analytically
mean(d_score > h_score)
Is this model any good? It is important to have
For example:
Test: Compare the simple model to the Poisson model using a set of games which has not been used to fit the models (i.e. a test set).
The Signal and the Noise has some nice discussion on this.
head(dat_test)
team1_name | team1_score | team2_name | team2_score | location | nOT | game_length | |
---|---|---|---|---|---|---|---|
5 | Detroit Pistons | 86 | New York Knicks | 77 | Detroit | 0 | 48 |
14 | Philadelphia 76ers | 95 | Milwaukee Bucks | 93 | Philadelphia | 0 | 48 |
16 | Seattle Supersonics | 86 | Phoenix Suns | 73 | Seattle | 0 | 48 |
26 | Minnesota Timberwolves | 111 | Orlando Magic | 105 | Minnesota | 0 | 48 |
28 | Philadelphia 76ers | 98 | New York Knicks | 86 | Philadelphia | 0 | 48 |
29 | Phoenix Suns | 78 | Cleveland Cavaliers | 74 | Phoenix | 0 | 48 |
t1od <- od[dat_test$team1_name,1] - od[dat_test$team2_name,2]
t1_pred <- mean_length*exp(t1od)
t2od <- od[dat_test$team2_name,1] - od[dat_test$team1_name,2]
t2_pred <- mean_length*exp(t2od)
pred_errors <- abs(c(dat_test$team1_score-t1_pred,dat_test$team2_score-t2_pred))
mean(pred_errors)
sd(pred_errors) / sqrt(length(pred_errors))
Model provides marginal improvement over simple prediction strategies as measured by mean absolute error (MAE),
The standard errors are about $0.3$, so the differences are borderline due to chance.