Awhile back I read a really cool blog post about bias in decisions about the line of scrimmage in American Football. The post reported that NFL officials tend to place the line of scrimmage at tidy yard numbers (i.e., 10s and 5s) more often than at less-tidy numbers (e.g., 4s and 6s). The technical term for this is statistical heaping, and its observed commonly in survey data.
Recently, I have been thinking about whether a similar principle is at work when reviewers assign a score to an album. My guess is that authors usually have a general sense of the album's score, but they need to pick a specific value (Is this a 7.8? 7.9? 7.7?). In this notebook, I'll just quickly establish the presence of statistical heaping in the review data.
rm(list=ls())
library(sqldf)
library(ggplot2)
con = dbConnect(SQLite(), dbname="../pitchfork.db")
scores = dbGetQuery( con, 'SELECT score FROM reviews')
dbDisconnect(con)
print('SCORES SUMMARY')
summary(scores)
print('SCORES STANDARD DEV')
sd(scores$score)
[1] "SCORES SUMMARY"
score Min. : 0.000 1st Qu.: 6.400 Median : 7.200 Mean : 7.006 3rd Qu.: 7.800 Max. :10.000
[1] "SCORES STANDARD DEV"
If there is no statistical heaping in the data, then "round" scores, such as 7.0 and 8.0, should not be more frequently chosen than alternatives (such as 7.1 or 7.9). I'll define two models which characterize what ought to happen, assuming no bias: a Uniform model and a Normal model.
This model proposes that reviewers draw each score at random, uniformly between 0.0 and 10.0:
$$ score = uniform(0.0, 10.0) $$This is not a very plausible model because we know scores are not uniformly distributed. But it's still worth considering for this issue (I think). Under this model, each score is equally probable, and so each decimal place ( *.0, *.1) is also equally likely. The only kink is that *.0 is a little more probable because 10.0 is a possibility but 10.1-10.9 are not.
This model is based on the observation that scores are (roughly) normally distributed, with $\mu=7.006$ and $\sigma=1.294$:
$$ score = normal(\mu=7.006, \sigma=1.294) $$Obviously, under this model, scores around 7.0 are the most probable.
I'm basically going to count the number of reviews with scores ending at each decimal place, and compare that to what we'd expect under both models. Simple? Let's go!
# ----- Get observed counts
# the rounding is due to a weird issue with the mod operator,
# some scores have a mod of like 0.00000002003 or something nutty.
value = round(as.vector(scores$score) %% 1.0, 1)
observed = as.data.frame(table(value))
colnames(observed) <- c('decimal','count')
observed
decimal | count |
---|---|
0 | 2776 |
0.1 | 1392 |
0.2 | 1749 |
0.3 | 1602 |
0.4 | 1723 |
0.5 | 2049 |
0.6 | 1571 |
0.7 | 1662 |
0.8 | 2151 |
0.9 | 1718 |
# ----- Compute expected values assuming uniform choices
uniform = as.data.frame(table(round(seq(0,10,0.1) %% 1.0,1)))
colnames(uniform) = c('decimal','count')
uniform$proportion = uniform$count / sum(uniform$count)
uniform$count = uniform$proportion * sum(observed$count)
uniform
decimal | count | proportion |
---|---|---|
0 | 2003.198 | 0.1089109 |
0.1 | 1821.089 | 0.0990099 |
0.2 | 1821.089 | 0.0990099 |
0.3 | 1821.089 | 0.0990099 |
0.4 | 1821.089 | 0.0990099 |
0.5 | 1821.089 | 0.0990099 |
0.6 | 1821.089 | 0.0990099 |
0.7 | 1821.089 | 0.0990099 |
0.8 | 1821.089 | 0.0990099 |
0.9 | 1821.089 | 0.0990099 |
# ----- compute expected values assuming normal choices
x = seq(0,10,0.1)
m = mean(scores$score)
s = sd(scores$score)
density = dnorm(x, mean = m, sd = s)
p = density / sum(density)
idx = round(x %% 1.0,1)
normal = data.frame(list(score = x, decimal = idx, prob = p))
normal = aggregate(normal$prob, by=list(normal$decimal), FUN=sum)
colnames(normal) = c('decimal','proportion')
normal$count = normal$proportion * sum(observed$count)
normal
decimal | proportion | count |
---|---|---|
0.0 | 0.10065429 | 1851.334 |
0.1 | 0.09893412 | 1819.695 |
0.2 | 0.09928866 | 1826.216 |
0.3 | 0.09958802 | 1831.722 |
0.4 | 0.09983921 | 1836.343 |
0.5 | 0.10004867 | 1840.195 |
0.6 | 0.10022226 | 1843.388 |
0.7 | 0.10036521 | 1846.017 |
0.8 | 0.10048221 | 1848.169 |
0.9 | 0.10057736 | 1849.919 |
# get x ticks !
xticklabs = character()
for (i in 0:9) {
xticklabs = c(xticklabs, paste0('*.',toString(i)))
}
plot = ggplot(NULL, aes(x = decimal, y = count)) +
geom_bar(data = observed, stat = 'identity', colour = '#004D40' , fill = '#004D40') +
geom_point(data = uniform, aes(colour = 'Uniform Sampling'),
stroke = 1.25, size = 4, shape = 23) +
geom_point(data = normal, aes(x = 1:10, colour = 'Normal Sampling '),
stroke = 1.25, size = 6, shape = 23) +
ylab("Number of Reviews") +
scale_y_continuous(breaks = seq(from=0,to=3000,by=500)) +
scale_x_discrete(breaks=observed$decimal, labels=xticklabs) +
theme_bw(base_size = 10) +
theme(
axis.title.x=element_blank(),
panel.grid.major.x = element_blank(),
legend.direction="horizontal",
legend.title = element_blank(),
legend.position = c(0.5, 0.9),
panel.grid.minor.y = element_blank(),
axis.text.x = element_text(size=10),
) +
scale_colour_manual(values=c("#009688", "#FFB300"))
print(plot)
Obviously, *.0 is selected far more often than you'd expect if scoring was unbiased. I think this is pretty strong evidence of statistical heaping!