Handicapping pub trivia

Allen B. Downey

MIT License

In [1]:
# If we're running in Colab, install empiricaldist

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

Introduction

This notebook is inspired by this question on Reddit's statistics forum:

If there is a quiz of x questions with varying results between teams of different sizes, how could you logically handicap the larger teams to bring some sort of equivalence in performance measure?

[Suppose there are] 25 questions and a team of two scores 11/25. A team of 4 scores 17/25. Who did better in terms of average performance of each member?

One respondent suggested a binomial model, in which every player has the same probability of answering any question correctly.

I suggested a model based on item response theory, in which each question has a different level of difficulty, d, each player has a different level of efficacy, e, and the probability that a player answers a question correctly is

expit(e-d+c)

where c is a constant offset for all players and questions, and expit is the inverse of the logit function.
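
As a quick reminder, expit(x) = 1 / (1 + exp(-x)). Here is a one-line check (my addition, not part of the analysis) that expit undoes logit:

from scipy.special import expit, logit

expit(logit(0.3))   # returns 0.3, confirming that expit inverts logit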

Another respondent pointed out that group dynamics will come into play. On a given team, it is not enough if one player knows the answer; they also have to persuade their teammates.

So let's explore these models and see how far we get. Among other things, this will be a good exercise in using NumPy n-dimensional arrays.

The binomial model

I'll start with an array with dimensions for k players, n questions, and m=10000 iterations.

In [2]:
import numpy as np

k = 4      # number of players
n = 25     # number of questions
m = 10000   # number of iterations

a = np.random.random((k, n, m))
a.shape
Out[2]:
(4, 25, 10000)

Now suppose each player has a 30% chance of answering each question correctly. We can compute a Boolean array that indicates which questions each player got right.

In [3]:
p = 0.3

b = (a < p)
b.shape
Out[3]:
(4, 25, 10000)

Let's assume for now that a team gets the question right if any of the players gets it.

If you like, you can think of "gets it" as a combination of "knows the answer" and "successfully convinces teammates".

With this assumption, we can use the logical OR operator to reduce the answers along the player axis.

In [4]:
c = np.logical_or.reduce(b, axis=0)
c.shape
Out[4]:
(25, 10000)

The result is an array with one dimension for questions and one for iterations.
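
As an aside, the same reduction can be written with the any method, which is equivalent:

c = b.any(axis=0)   # same as np.logical_or.reduce(b, axis=0)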

Now we can compute the sum along the questions.

In [5]:
d = np.sum(c, axis=0)
d.shape
Out[5]:
(10000,)

The result is a one-dimensional array of iterations that approximates the distribution of scores for a team of k=4 players.
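
As a sanity check (my addition, not part of the original analysis), under this model the team's score is binomial with n trials and success probability q = 1 - (1-p)**k, so the simulated mean should be close to n * q:

q = 1 - (1 - p)**k     # probability that at least one of k players gets a question
d.mean(), n * q        # both should be close to 19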

Here's what the distribution looks like.

In [6]:
from empiricaldist import Cdf
In [7]:
import matplotlib.pyplot as plt
In [8]:
cdf = Cdf.from_seq(d)
cdf.plot()

plt.xlabel('Number of correct responses')
plt.ylabel('CDF');

The following function encapsulates the code we have so far, so we can run it with different values of k:

In [9]:
def simulate(p, k, n=25, m=10000):
    """Simulate m quiz results for a team of k players.

    p can be a scalar or an array of per-player, per-question probabilities.
    Returns an array of m scores.
    """
    a = np.random.random((k, n, m))        # uniform draws for each player/question/iteration
    b = (a < p)                            # True where a player gets a question
    c = np.logical_or.reduce(b, axis=0)    # the team gets it if any player does
    d = np.sum(c, axis=0)                  # number of correct answers per iteration
    return d

Here's what the distribution of correct responses looks like for a range of values of k.

In [10]:
p = 0.3

for k in range(1, 6):
    d = simulate(p, k)
    cdf = Cdf.from_seq(d)
    cdf.plot(label=k)
    print(cdf.median())

plt.xlabel('Number of correct responses')
plt.ylabel('CDF');
plt.legend();
7.0
13.0
16.0
19.0
21.0

We can use these CDFs to compare scores between teams with different sizes.

A team of two that scores 11/25 is in the 31st percentile.

In [11]:
d = simulate(p=0.3, k=2)
cdf = Cdf.from_seq(d)
cdf(11)
Out[11]:
array(0.309)

A team of 4 that scores 17/25 is in the 24th percentile.

In [12]:
d = simulate(p=0.3, k=4)
cdf = Cdf.from_seq(d)
cdf(17)
Out[12]:
array(0.2415)

So I would say the team of two outperformed the team of 4.
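
To make these comparisons easier to repeat, here's a small wrapper (a convenience I'm adding; the name percentile_rank is mine) that converts a raw score to a percentile rank under the binomial model:

def percentile_rank(score, k, p=0.3):
    """Percentile rank of score for a team of k players."""
    d = simulate(p, k)
    cdf = Cdf.from_seq(d)
    return cdf(score)

percentile_rank(11, k=2), percentile_rank(17, k=4)

The team with the higher percentile rank did better, adjusted for team size.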

Item response theory

Now suppose we have players with different levels of efficacy, drawn from a standard normal distribution. (Note that k is 5 here, left over from the loop above.)

In [13]:
players = np.random.normal(size=(k, 1, m))
players.shape
Out[13]:
(5, 1, 10000)

And questions with different levels of difficulty, also drawn from a standard normal distribution.

In [14]:
questions = np.random.normal(size=(1, n, m))
questions.shape
Out[14]:
(1, 25, 10000)
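
The singleton dimensions are what make the next step work: broadcasting a (k, 1, m) array against a (1, n, m) array pairs every player with every question.

(players - questions).shape   # broadcasts to (k, n, m) = (5, 25, 10000)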

We can use the expit function to compute the probability that each player answers each question correctly.

In [15]:
from scipy.special import expit

c = -1
p = expit(players - questions + c)
p.shape
Out[15]:
(5, 25, 10000)

I chose c=-1 so that the average probability is about the same as in the binomial model.

In [16]:
np.mean(p)
Out[16]:
0.3237610597780847
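
We can check that value analytically (this aside is mine): players - questions + c is normal with mean c and variance 2, and a standard logistic-normal approximation gives E[expit(X)] ≈ expit(mu / sqrt(1 + pi * sigma**2 / 8)).

mu, sigma2 = c, 2                              # mean c = -1, variance 1 + 1
expit(mu / np.sqrt(1 + np.pi * sigma2 / 8))    # about 0.32, close to the simulated mean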

We can use the same simulate function with the new model; the difference is that p is an array now, rather than a constant.

In [17]:
d = simulate(p, k)
cdf = Cdf.from_seq(d)
cdf.plot()

Here are the distributions with a range of values for k:

In [18]:
questions = np.random.normal(size=(1, n, m))

for k in range(1, 6):
    players = np.random.normal(size=(k, 1, m))
    p = expit(players - questions + c)
    d = simulate(p, k)
    cdf = Cdf.from_seq(d)
    cdf.plot(label=k)
    print(cdf.median())
    
plt.legend();
8.0
13.0
16.0
18.0
20.0

Again, we can use these CDFs to compare scores between teams with different sizes.

A team of two that scores 11/25 is in the 39th percentile.

In [19]:
k = 2
players = np.random.normal(size=(k, 1, m))
p = expit(players - questions + c)
d = simulate(p, k)
cdf = Cdf.from_seq(d)
cdf(11)
Out[19]:
array(0.3921)

A team of 4 that scores 17/25 is in the 41st percentile.

In [20]:
k = 4
players = np.random.normal(size=(k, 1, m))
p = expit(players - questions + c)
d = simulate(p, k)
cdf = Cdf.from_seq(d)
cdf(17)
Out[20]:
array(0.4097)

So under this model we would say the team of 4 outperformed the team of 2, which is the opposite of our conclusion under the binomial model! The flip makes sense: when questions vary in difficulty, players' successes on the same question are correlated, so adding players helps less than it does under the binomial model, and team scores are more variable.
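
To check the point about variability (my addition), we can compare the standard deviations of the simulated scores for a team of 4 under the two models:

d_binom = simulate(p=0.3, k=4)

players = np.random.normal(size=(4, 1, m))
p_irt = expit(players - questions + c)
d_irt = simulate(p_irt, 4)

d_binom.std(), d_irt.std()   # the IRT spread should be larger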
