NBA Scoring Model fit with Block Relaxation¶

We fit the Poisson NBA scoring model to the 2002-2003 NBA season. See 11.6.2 of Lange for model details.

We split the data into training (dat) and test (dat_test). The training data is used to fit the model and test is used to test performance. The model predicts the score for any pair of teams, so we use as our model evaluation criteria mean absolute error (MAE) $$MAE = \sum_{\text{all games}} |\text{score} - \text{prediction}|$$ Note that for each game we make two score predictions (one for each team).

We compare the Poisson model to two simpler models based on MAE. These simpler models are:

Compute the average game score on the training data and then use this average to predict all games for all teams.
Compute the average game score by team. Then predict this score every time the team plays.

Neither of these methods accounts for the opponents a team is playing.

Note: In order to exactly recover Table 11.1 of Lange, do not separate test data from the training data.

In [1]:

library(data.table)
set.seed(1234)
dat <- as.data.frame(fread("http://longjp.github.io/statcomp/data/nba_cleaned.txt"))

In [208]:

head(dat)

team1_name	team1_score	team2_name	team2_score	location	game_length
San Antonio Spurs	87	Los Angeles Lakers	82	Los Angeles	48
Orlando Magic	95	Philadelphia 76ers	88	Orlando	48
Sacramento Kings	94	Cleveland Cavaliers	67	Sacramento	48
Chicago Bulls	99	Boston Celtics	96	Boston	48
Detroit Pistons	86	New York Knicks	77	Detroit	48
Golden State Warriors	106	San Antonio Spurs	98	Golden State	48

In [209]:

team_names <- unique(c(dat$team1_name,dat$team2_name))
n_team <- length(team_names)

In [210]:

## divide data into training and test
trainix <- sample(c(TRUE,FALSE),size=nrow(dat),replace=TRUE,prob=c(0.8,0.2))
dat_test <- dat[!trainix,]
dat <- dat[trainix,]

Data Prep¶

Before fitting the model, we convert the information in the data frame into matrices. This leads to much faster code than for loops. There is also less code.

In [211]:

##### STEP 1: PUT DATA IN NICE MATRICES
## T = t_ij = total time (across all games) 
##            team i played team j (symmetric)
## P = p_ij = total points (across all games)
##            i scores on j (not symmetric)
T <- matrix(0,nrow=n_team,ncol=n_team,dimnames=list(team_names,team_names))
P <- matrix(0,nrow=n_team,ncol=n_team,dimnames=list(team_names,team_names))
for(ii in 1:nrow(T)){
    ## find opponent and game length of opponents for particular row
    out <- subset(dat,team1_name==rownames(T)[ii],select=c("team2_name","team1_score","team2_score","game_length"))
    ## calculate total minutes, points scored, points allowed
    mins <- tapply(out[,4],out[,1],sum)
    ps <- tapply(out[,2],out[,1],sum)
    pg <- tapply(out[,3],out[,1],sum)
    ## record data in matrices
    T[ii,names(mins)] <- mins
    P[ii,names(ps)] <- P[ii,names(ps)] + ps
    P[names(ps),ii] <- P[names(ps),ii] + pg
}
## right now t_ij is time team i played j when team i won
##      thus t_ji is time team i played j when team j won
## by adding transpose we get total time i played j
T <- T + t(T) 


Prs <- rowSums(P) ## \sum_j p_ij
Pcs <- colSums(P) ## \sum_j p_j

Simple Model 1¶

Calculate the number of points scored in an average game for a team. Predict this number of points regardless of who is playing who.

In [212]:

ppg <- mean(c(dat$team1_score,dat$team2_score)) 
ppg ## our prediction for every team, game

95.0858798735511

In [213]:

## use ppg to predict every game
pred_errors <- c(abs(dat_test$team1_score - ppg),
                 abs(dat_test$team2_score - ppg))
mean(pred_errors)
sd(pred_errors) / sqrt(length(pred_errors))
length(pred_errors)

10.0136371619248

0.348763555116867

480

Simple Model 2¶

For each team, compute mean points scored per game. Predict that the team will score this amount, regardless of opponent.

In [214]:

dat1 <- dat[,c(1,2)]
colnames(dat1) <- c("n","s")
dat2 <- dat[,c(3,4)]
colnames(dat2) <- c("n","s")
temp <- rbind(dat1,dat2)
temp <- aggregate(temp[,2],
                  by=list(team=as.factor(temp[,1])),
                  FUN=mean)

ppg_by_team <- temp[,2]
names(ppg_by_team) <- temp[,1]

In [225]:

dat1_test <- dat_test[,c(1,2)]
colnames(dat1_test) <- c("n","s")
dat2_test <- dat_test[,c(3,4)]
colnames(dat2_test) <- c("n","s")
temp <- rbind(dat1_test,dat2_test)
temp <- cbind(temp,predict=ppg_by_team[temp[,1]])

##compute mean absolute error
mean(abs(temp$s - temp$predict)) ## mean absolute error
sd(abs(temp$s - temp$predict)) / sqrt(nrow(temp))
nrow(temp)

9.58994359094837

0.318856626154474

480

OD Model¶

We fit the Poisson Sports Model to the NBA 2002-2003 season. See Section 11.6, Example 11.6.2 of Lange. We reproduce Table 11.1

In [226]:

## od is matrix of parameter fits for o and d
od <- matrix(0,nrow=n_team,ncol=2)
colnames(od) <- c("o","d")
rownames(od) <- colnames(T)

## N iterations
N <- 100
for(jj in 1:N){
    od[,2] <- -log(Pcs / colSums(T*exp(od[,1])))
    od[,2] <- od[,2] - od[1,2] ## force d_1 = 0 for identifiability
    od[,1] <- log(Prs / colSums(t(T)*exp(-od[,2])))
}
## normalize od to 0
od <- od - mean(od)

## what is average game length
mean_length <- mean(dat$game_length)

Explore Model Output¶

In [227]:

## team rankings
out <- data.frame(o=od[,1],d=od[,2],rank=rowSums(od))
out[order(out$rank,decreasing=TRUE),]

	o	d	rank
Dallas Mavericks	0.4203792	-0.3369189	0.0834603290
San Antonio Spurs	0.3499070	-0.2796610	0.0702460117
Sacramento Kings	0.4028982	-0.3454079	0.0574902447
New Jersey Nets	0.3457148	-0.2916224	0.0540924582
Utah Jazz	0.3360948	-0.3010176	0.0350771474
Minnesota Timberwolves	0.3747282	-0.3404620	0.0342661805
Indiana Pacers	0.3477113	-0.3195195	0.0281918629
Portland Trail Blazers	0.3274647	-0.3012558	0.0262088864
Philadelphia 76ers	0.3526018	-0.3299275	0.0226742582
Los Angeles Lakers	0.3805922	-0.3608234	0.0197688899
Phoenix Suns	0.3413935	-0.3217348	0.0196587562
Houston Rockets	0.3222202	-0.3040308	0.0181893462
Detroit Pistons	0.2875024	-0.2712430	0.0162593223
New Orleans Hornets	0.3089437	-0.2944405	0.0145031812
Seattle Supersonics	0.3145234	-0.3067996	0.0077237814
Boston Celtics	0.3147406	-0.3152214	-0.0004807289
Golden State Warriors	0.4153412	-0.4187090	-0.0033677598
Milwaukee Bucks	0.3861412	-0.3928648	-0.0067235874
Orlando Magic	0.3699500	-0.3808526	-0.0109025395
Washington Wizards	0.2922052	-0.3150529	-0.0228476446
Memphis Grizzlies	0.3565122	-0.3828170	-0.0263047739
New York Knicks	0.3402700	-0.3677800	-0.0275099938
Atlanta Hawks	0.3160172	-0.3540329	-0.0380156551
Los Angeles Clippers	0.3367175	-0.3759553	-0.0392377842
Chicago Bulls	0.3410588	-0.3812009	-0.0401420836
Toronto Raptors	0.3056668	-0.3595021	-0.0538352445
Miami Heat	0.2273253	-0.2905265	-0.0632012499
Denver Nuggets	0.2255051	-0.3053182	-0.0798130765
Cleveland Cavaliers	0.3010436	-0.3964722	-0.0954285347

In [228]:

## MAKE PROBABILISTIC PREDICTIONS: DALLAS versus HOUSTON
min <- 48 ## game length
dallas_rate <- min*exp(od["Dallas Mavericks",1] - od["Houston Rockets",2])
houston_rate <- min*exp(od["Houston Rockets",1] - od["Dallas Mavericks",2])
## random draw of scores
n <- 1000
d_score <- rpois(n,lambda=dallas_rate)
h_score <- rpois(n,lambda=houston_rate)


scores <- data.frame(c(rep("dallas",n),rep("houston",n)),c(d_score,h_score))
colnames(scores) <- c("team","points")

In [229]:

## plot
library(ggplot2)
options(repr.plot.width=6,repr.plot.height=4)
g <- ggplot(scores, aes(points))
g + geom_density(aes(fill=factor(team)), alpha=0.8) + 
    labs(title="Density plot", 
         subtitle="Team Scores",
         caption="Source: scores",
         x="Points Scored",
         fill="Team")
## distributions appear normal because normal approximation to poisson


## fraction of time dallas wins in head to head
## could also compute analytically
mean(d_score > h_score)

0.673

Model Performance¶

Is this model any good? It is important to have

criteria by which a model can be judged
models (in addition to your own) on which you can make the comparison

For example:

criteria: absolute error in game score prediction
simple model: for the next game, predict team X will score whatever its average score has been for all previous games.

Test: Compare the simple model to the Poisson model using a set of games which has not been used to fit the models (i.e. a test set).

The Signal and the Noise has some nice discussion on this.

Evaluate Poisson model¶

In [230]:

head(dat_test)

	team1_name	team1_score	team2_name	team2_score	location	game_length
5	Detroit Pistons	86	New York Knicks	77	Detroit	48
14	Philadelphia 76ers	95	Milwaukee Bucks	93	Philadelphia	48
16	Seattle Supersonics	86	Phoenix Suns	73	Seattle	48
26	Minnesota Timberwolves	111	Orlando Magic	105	Minnesota	48
28	Philadelphia 76ers	98	New York Knicks	86	Philadelphia	48
29	Phoenix Suns	78	Cleveland Cavaliers	74	Phoenix	48

In [231]:

t1od <- od[dat_test$team1_name,1] - od[dat_test$team2_name,2]
t1_pred <- mean_length*exp(t1od)
t2od <- od[dat_test$team2_name,1] - od[dat_test$team1_name,2]
t2_pred <- mean_length*exp(t2od)

In [232]:

pred_errors <- abs(c(dat_test$team1_score-t1_pred,dat_test$team2_score-t2_pred))

In [233]:

mean(pred_errors)

9.15091397865782

In [234]:

sd(pred_errors) / sqrt(length(pred_errors))

0.295727732591393

Conclusion¶

Model provides marginal improvement over simple prediction strategies as measured by mean absolute error (MAE),

9.2 MAE with Poisson model
9.6 MAE with team specific average score model
10.0 MAE with predicting same mean game score for every game / team

The standard errors are about $0.3$, so the differences are borderline due to chance.

In [ ]: