In this post, I'll try to replicate the reasoning in Chris Fonnesbeck's talk "Statistical Thinking for Data Science".
Let's start with the code. We want to build a sample of random values.
import numpy as np
def make_sample(size):
    """Generates a random sample from a standard normal distribution."""
    return np.random.normal(size=size)
make_sample(10)
array([-0.46661601, 0.41063078, 0.30758075, 1.60073702, -0.86336121, -0.23355055, -1.32449623, 0.11956636, 2.62381423, -1.26601074])
make_sample(20)
array([ 0.66381504, -0.28121257, 0.2762167 , -0.68387184, -2.0854523 , 0.54882485, -1.21564105, -2.59720332, -0.61130765, -0.78036254, -1.07960788, -1.03531623, -0.45372958, -0.14735239, -0.904162 , -0.34943136, -0.55128214, 0.82399238, 1.194479 , 0.12733662])
He then builds a bias into the sample. What sort of bias? Two kinds come up here: missing values and censoring.
Let's tackle these two separately. First, missing values.
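def missing(true_sample, prob):
    """Drops each value independently with probability prob."""
    # Draw one 0/1 flag per value; True marks a value as missing.
    # The builtin bool replaces np.bool, which NumPy deprecated as a dtype alias.
    missing_indices = np.random.binomial(1, prob, size=true_sample.size).astype(bool)
    return true_sample[~missing_indices]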
This function works as follows: it first draws a vector of 0/1 binomial values, one per element, and casts it to booleans. It then drops the elements of the initial sample where the drawn value is True. Let's check the output of the binomial draw:
np.random.binomial(1, 0.1, size=20).astype(bool)
array([False, False, False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False], dtype=bool)
np.random.binomial(1, 0.5, size=20).astype(bool)
array([False, True, True, False, True, True, False, True, True, False, True, True, True, True, False, True, False, False, False, True], dtype=bool)
Indexing the sample with the negated mask (~ is the element-wise logical not) keeps only the values that were not flagged as missing:
sample = make_sample(10)
missing_sample = sample[~np.random.binomial(1, 0.5, size=10).astype(bool)]
print("original sample: {}".format(sample))
print("original sample with missing values: {}".format(missing_sample))
original sample: [-0.2340218 -0.72057151 -0.04013565 -1.30307987 -0.6110323 1.61743412 -0.84212281 0.15583878 -1.79058383 1.58297096]
original sample with missing values: [-0.84212281 -1.79058383 1.58297096]
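As a side note, the same masking idea can be sketched with NumPy's newer Generator API, which makes the draw reproducible through a seed. This is only an illustration: the helper name `missing_with_rng`, the `rng` parameter, and the seed value are assumptions of mine, not part of the talk or of the code above.

```python
import numpy as np

def missing_with_rng(true_sample, prob, rng):
    """Hypothetical variant of missing() that takes an explicit random generator."""
    # rng.random draws uniforms in [0, 1); comparing them with prob
    # flags each value as missing with probability prob.
    missing_mask = rng.random(true_sample.size) < prob
    return true_sample[~missing_mask]

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample = rng.normal(size=10)
kept = missing_with_rng(sample, 0.5, rng)
print("kept {} of {} values".format(kept.size, sample.size))
```

With prob set to 0 nothing is dropped, and with prob set to 1 everything is, which is a quick way to sanity-check the mask logic.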
Let's plot the output from this procedure using the seaborn package.
import seaborn as sns
sns.set()
%matplotlib inline
sns.violinplot(data=[missing(make_sample(100), 0.5),
                     missing(make_sample(500), 0.5),
                     missing(make_sample(5000), 0.5)])
[Figure: violin plots of the three samples with missing values]
We see that values missing at random don't really distort the distribution, whatever the sample size: the violins stay roughly centered on zero and keep the same shape. What about censoring samples? Or, as it is called in the video, bias?
def censored(sample):
    """Returns the sample with its negative values removed."""
    negative_values = sample < 0
    return sample[~negative_values]
sns.violinplot(data=[censored(make_sample(100)),
                     censored(make_sample(500)),
                     censored(make_sample(50000))])
[Figure: violin plots of the three censored samples]
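The violin plots can be complemented with a quick numerical check. The sketch below is my own addition, not something taken from the talk: it compares the mean of a full sample, of the same sample with values dropped at random, and of the same sample with its negative values removed. The helper names, the seed, and the sample size are arbitrary choices for illustration. For a standard normal, the mean of the non-negative half is sqrt(2/pi), roughly 0.8, so censoring should shift the mean by about that much while random removal should leave it near zero.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

def drop_at_random(sample, prob, rng):
    """Hypothetical helper: drop each value independently with probability prob."""
    keep = rng.random(sample.size) >= prob
    return sample[keep]

def keep_non_negative(sample):
    """Hypothetical helper: same effect as censored(), keeps values >= 0."""
    return sample[sample >= 0]

full = rng.normal(size=100_000)
with_missing = drop_at_random(full, 0.5, rng)
censored_sample = keep_non_negative(full)

print("mean of full sample:           {:.3f}".format(full.mean()))
print("mean with random missingness:  {:.3f}".format(with_missing.mean()))
print("mean after censoring:          {:.3f}".format(censored_sample.mean()))
print("theoretical half-normal mean:  {:.3f}".format(np.sqrt(2 / np.pi)))
```

Dropping values at random only shrinks the sample, so its mean stays close to that of the full sample. Censoring removes values depending on what they are, which is why no amount of extra data brings the mean back to zero.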