In [1]:

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

Out[1]:

In [2]:

import numpy as np
import pandas as pd

In [3]:

import scipy.interpolate
import scipy.stats 

In [4]:

import bokeh.plotting
import bokeh.models  # LinearAxis 
from bokeh.palettes import Category10_10 as palette

In [5]:

bokeh.plotting.output_notebook()

Loading BokehJS ...

Question:¶

Given two normal distributions of different mean and standard deviation, consider a measurement with a value between the two means, what is the probability for it belonging to each of the two distributions?

As an example, consider two groups of runners. The first group of runners trains regularly and is in good shape. The second group of runners does not train but still enjoy running occaisionally. You measure how fast everyone can run the mile then calculate the mean and standard deviation for each group. Then you find you have the time for one person but forgot to label what group they came from. Based on their time for running the mile, what is the probability they are in the group that trains regularly?

To put some numbers to this example, lets say the group that trains regularly have a mean time of 6.5 minutes with a standard deviation of 0.75 minutes and that the occasional runners have a mean time of 10 minutes with a standard devaition of 2 minutes. Additionally, lets say there are 20 people in the faster group and 80 people in the slower group.

Mean times:
$ m_1 = 6.5\,\text{min}$
$ m_2 = 12.0\,\text{min}$

Standard deviations:
$ s_1 = 0.75\,\text{min}$
$ s_2 = 3.0\,\text{min}$

Group size:
$ n_1 = 20\,\text{people}$
$ n_2 = 80\,\text{people}$

Caveat:¶

Note that often a Gaussian distribution is a simplified view of a more complicated genuine distribution. For this situation, it is much more likely that both groups of runners would have an asymmetric distribution. We could safely expect both groups of runners, but especially the group that does not train regularly, would exhibit a longer tail for slower times. In the following analysis, we proceed under the optimized assumption that each group can be described by a single Gaussian distribution.

In [6]:

m1 = 6.5
m2 = 12.0

s1 = .75
s2 = 3.0

n1 = 20
n2 = 80
n = n1 + n2

In [7]:

x_min = 2.0
x_max = 25.0
x = np.linspace(x_min, x_max, 200)
y1 = scipy.stats.norm.pdf(x, loc=m1, scale=s1) * n1 / n

y2 = scipy.stats.norm.pdf(x, loc=m2, scale=s2) * n2 / n

y = y1 + y2

p = bokeh.plotting.figure(width=400, height=300, 
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])

p.legend.location = 'top_right'

bokeh.plotting.show(p)

If we calculate the area under the curves, we should get the percent of people in each group.

In [8]:

dx = x[1] - x[0]

In [9]:

a = dx * y.sum()
print(a)

0.999674635352

In [10]:

a1 = dx * y1.sum()
print(a1)

0.199999999882

In [11]:

a2 = dx * y2.sum()
print(a2)

0.79967463547

Now lets consider the runner who was not labelled.

My initial approach was to compare his time, $t$ to the means and standard distributions, calculating the difference between the means as $$ \begin{align} d_1 &= | t - m_1|\\ d_2 &= | t - m_2| \end{align} $$ then calculate the probability of being in the fast group as $$ p_\text{fast} =\left\{ \begin{array}{cl} 1 &\mbox{ if $t < m_1$} \\ 1 - \dfrac{s_1 d_1}{s_1 d_1 + s_2 d_2} &\mbox{ if $m_1 <= t < m_2$}\\ 0 &\mbox{ if $m_2 <= t$} \end{array} \right. $$

This probability will have the following distribution.

In [12]:

p1_v1 = np.zeros_like(x)
p1_v1[x < m1] = 1

d1 = np.abs(x - m1)
d2 = np.abs(x - m2)

mask = np.logical_and(m1 < x, x < m2)
p1_v1[mask] = 1 - s1 * d1[mask] / (s1 * d1[mask] + s2 * d2[mask])

In [13]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, 1-p1_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

This is clearly wrong as the crossover from being more likely to be in fast group is much too close to the slow group. Lets see what happens when we switch the standard deviations to use $$ p_\text{fast} =\left\{ \begin{array}{cl} 1 &\mbox{ if $t < m_1$} \\ 1 - \dfrac{s_2 d_1}{s_2 d_1 + s_1 d_2} &\mbox{ if $m_1 <= t < m_2$}\\ 0 &\mbox{ if $m_2 <= t$} \end{array} \right. $$

With this change, the probability will have the following distribution.

In [14]:

p1_v1[mask] = 1 - s2 * d1[mask] / (s2 * d1[mask] + s1 * d2[mask])
p2_v1 = 1 - p1_v1

In [15]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

This seems more reasonable but I did not take into account the relative difference in the size of the groups. Lets add a scale factor related to this difference.

$$ p_\text{fast} =\left\{ \begin{array}{cl} 1 &\mbox{ if $t < m_1$} \\ 1 - \dfrac{s_2 d_1 \frac{n_1}{n}}{s_2 d_1 \frac{n_1}{n} + s_1 d_2 \frac{n_2}{n}} &\mbox{ if $m_1 <= t < m_2$}\\ 0 &\mbox{ if $m_2 <= t$} \end{array} \right. $$

In [16]:

p1_v1[mask] = 1 - s2 * d1[mask] * n1 / n / (s2 * d1[mask] * n1 / n + s1 * d2[mask] * n2 / n)
p2_v1 = 1 - p1_v1

In [17]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

This looks like a linear switch between the two means. How does this change if we change one of the mean values? Let make $m_2=15.0$ minutes.

In [18]:

m2 = 15.0

y1 = scipy.stats.norm.pdf(x, loc=m1, scale=s1) * n1 / n
y2 = scipy.stats.norm.pdf(x, loc=m2, scale=s2) * n2 / n
y = y1 + y2

p1_v1 = np.zeros_like(x)
p1_v1[x < m1] = 1

d1 = np.abs(x - m1)
d2 = np.abs(x - m2)

mask = np.logical_and(m1 < x, x < m2)
p1_v1[mask] = 1 - s1 * d1[mask] / (s1 * d1[mask] + s2 * d2[mask])
p1_v1[mask] = 1 - s2 * d1[mask] * n1 / n / (s2 * d1[mask] * n1 / n + s1 * d2[mask] * n2 / n)
p2_v1 = 1 - p1_v1

In [19]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

This still looks like a linear switch between the two means. This seem a littles strange. Lets see how this changes if we change the relative size of the distributions? Let change the mean of the slow group back to the original value, $m_2 = 12.0$, and adjust the size of the groups so $n_1=40$ and $n_2=60$ people.

In [20]:

m2 = 12.0
n1 = 40.0
n2 = 60.0
n = n1 + n2

y1 = scipy.stats.norm.pdf(x, loc=m1, scale=s1) * n1 / n
y2 = scipy.stats.norm.pdf(x, loc=m2, scale=s2) * n2 / n
y = y1 + y2

p1_v1 = np.zeros_like(x)
p1_v1[x < m1] = 1

d1 = np.abs(x - m1)
d2 = np.abs(x - m2)

mask = np.logical_and(m1 < x, x < m2)
p1_v1[mask] = 1 - s1 * d1[mask] / (s1 * d1[mask] + s2 * d2[mask])
p1_v1[mask] = 1 - s2 * d1[mask] * n1 / n / (s2 * d1[mask] * n1 / n + s1 * d2[mask] * n2 / n)
p2_v1 = 1 - p1_v1

In [21]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

This shows that adjusting the verifies that the linear switch was a coincidence of our selection. I am still considering if this implementation of the relative size is appropriate.

In a more general sense, while the approach considered so far provides a switch between 0 and 1 for the times that occur between the two distributions, it does not apply what is know about a normal distribution and the probability of being a certain distance from the mean. As a more concrete example, there is a finite probability, small as it may be in this case, that a runner from the slow group would run faster than the mean for the fast group. We can apply what know about a Gaussian distribution to account for this.

Lets reset the group sizes to the original $n_1 = 20$ and $n_2 = 80$.

In [22]:

n1 = 20.0
n2 = 80.0
n = n1 + n2

y1 = scipy.stats.norm.pdf(x, loc=m1, scale=s1) * n1 / n
y2 = scipy.stats.norm.pdf(x, loc=m2, scale=s2) * n2 / n
y = y1 + y2

p1_v1 = np.zeros_like(x)
p1_v1[x < m1] = 1

d1 = np.abs(x - m1)
d2 = np.abs(x - m2)

mask = np.logical_and(m1 < x, x < m2)
p1_v1[mask] = 1 - s1 * d1[mask] / (s1 * d1[mask] + s2 * d2[mask])
p1_v1[mask] = 1 - s2 * d1[mask] * n1 / n / (s2 * d1[mask] * n1 / n + s1 * d2[mask] * n2 / n)
p2_v1 = 1 - p1_v1

In [23]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v1, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v1, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

Now lets calculate the probability a runner would have a given time for each distribution. To do this, we will use scipy.stats.cdf, the cumulative distribution function, taking the difference of probabilities between a small $\delta t$.

In [24]:

x_lower = x - dx / 2
x_upper = x + dx / 2
pd1 = (scipy.stats.norm.cdf(x_upper, loc=m1, scale=s1) - 
       scipy.stats.norm.cdf(x_lower, loc=m1, scale=s1)) * n1 / n
pd2 = (scipy.stats.norm.cdf(x_upper, loc=m2, scale=s2) - 
       scipy.stats.norm.cdf(x_lower, loc=m2, scale=s2)) * n2 / n

Summing each of these probability distributions should give us the percent of the total population in each group.

In [25]:

atol = 0.001  # absolute tolerance
assert np.allclose(pd1.sum(), n1 / n, atol=atol)
assert np.allclose(pd2.sum(), n2 / n, atol=atol)

Now we can consider the unlabelled runner, comparing the likelihoods he is in each group. This will take the form $$ p_\text{fast} = \dfrac{pd_1(t)}{pd_1(t) + pd_2(t)} $$ where $pd_1(t)$ and $pd_2(t)$ are the probabilities a member of the fast and slow group would respectively have a time, $t$.

In [26]:

p1_v2 = pd1 / (pd1 + pd2)
p2_v2 = 1 - p1_v2

In [27]:

p = bokeh.plotting.figure(width=400, height=300, toolbar_location='above',
                          x_axis_label='Time (m)',
                          y_axis_label='% of People')
p.extra_y_ranges = {'prob': bokeh.models.Range1d(start=0, end=1.02)}
p.add_layout(bokeh.models.LinearAxis(y_range_name='prob', 
                                     axis_label='Probability'), 'right')

p.y_range = bokeh.models.Range1d(start=0, end=0.15)
p.line(x, y, legend='combined', color=palette[0], line_width=2)
p.line(x, y1, legend='fast group', color=palette[1])
p.line(x, y2, legend='slow group', color=palette[2])
p.line(x, p1_v2, legend='p fast', color=palette[3], y_range_name='prob')
p.line(x, p2_v2, legend='p slow', color=palette[4], y_range_name='prob')

p.legend.location = 'top_right'

bokeh.plotting.show(p)

While at first this seems a bit strange (the oscillatory nature of the probability of being in the slow group), on closer inspection we see this does make sense. A time close to mean values of the distributions would get grouped with that distribution. The reason the probability of the slow group goes back up for the quickest times, $t < $ 4.18 min, is that group has such a wide standard deviation that the probability decays much slower than for the narrow distribution of the fast group.

While the larger standard deviation makes it more likely the fastest runner did not train, we could impose a constraint to have "p fast" smoothly transition to 1 at the quickest times. This would coorespond to the slow group having a more assymmetric distribution, more likely exhibiting a longer tail toward the slower times.

Lets compare my original probability and this corrected one.

In [28]:

p_per = bokeh.plotting.figure(width=400, height=300,
                              y_axis_label='% of People')
p_per.line(x, y, legend='combined', color=palette[0], line_width=2)
p_per.line(x, y1, legend='fast group', color=palette[1])
p_per.line(x, y2, legend='slow group', color=palette[2])

p_prob = bokeh.plotting.figure(width=400, height=300,
                               y_axis_label='Probability',
                               x_range=p_per.x_range)
p_prob.line(x, p1_v1, legend='p_v1 fast', color=palette[3])
p_prob.line(x, p2_v1, legend='p_v1 slow', color=palette[4])
p_prob.line(x, p1_v2, legend='p_v2 fast', color=palette[3], line_dash='dashed')
p_prob.line(x, p2_v2, legend='p_v2 slow', color=palette[4], line_dash='dashed')

p_diff = bokeh.plotting.figure(width=400, height=100,
                               x_axis_label='Time (m)',
                               y_axis_label='Residuals',
                               x_range=p_per.x_range)
p_diff.line(x, p1_v2-p1_v1, color=palette[3])
p_diff.line(x, p2_v2-p2_v1, color=palette[4])

p = bokeh.layouts.gridplot([[p_per], [p_prob], [p_diff]])
bokeh.plotting.show(p)

This shows a general agreement though there are some important differences. In particular, the time at which the probability switches from one group to the other is around 7.5 minutes rather than just over 9 minutes. This agrees much better with what we would expect looking at the two distributions.

In [ ]: