%matplotlib inline
import matplotlib.pyplot as plt
from SDSPSM import get_metrics, drawBarGraph
from ci_helpers import create_bernoulli_population
T = 4000 # total size of population
p = 0.6 # 60% of the balls are yellow
# create population
population, population_freq = create_bernoulli_population(T,p)
# population metrics
mu, var, sigma = get_metrics(population)
# visualize
fig, (ax1) = plt.subplots(1,1, figsize=(5,3))
drawBarGraph(population_freq, ax1, [T, mu, var, sigma], 'Population Distribution','Gumballs', 'Counts',xmin=0)
plt.show()
Deriving and visualizing the probability mass function (the intermediate density function, where the total area of the bars is 1, is just for fitting a continuous normal approximation later)
from ci_helpers import mini_plot_SDSP
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSP(population, ax1,ax2,ax3, norm_off=True)
plt.show()
Let us sample from the population N times, each time with a sample set of size n. If $np \geq 10$ and $nq \geq 10$, the resulting sampling distribution should be approximately normal.
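As a quick sanity check of this rule of thumb, using the values from this notebook (n = 50, p = 0.6):

```python
# Rule-of-thumb check for the normal approximation (values from this notebook)
n, p = 50, 0.6
q = 1 - p
np_, nq_ = n * p, n * q   # expected successes and failures per sample
print(np_, nq_)           # 30.0 and 20.0, both >= 10
```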
Remember: for a population described by the random variable Y, we describe the sampling distribution by
\begin{equation} \color {blue}{ \begin{aligned} \text{Random Variable} \ \ \widehat{p} = \overline{\widehat{Y}} \\ \\ \mu_{\widehat{p}} = \mu(\overline{\widehat{Y}}) \\ \\ \sigma_{\widehat{p}} = \sigma(\overline{\widehat{Y}}) \end{aligned} } \end{equation}
where the $\widehat{}$ indicates the statistical outcome. And statistically, by the CLT,
\begin{equation} \color {blue} { \begin{aligned} \mu_{\widehat{p}} \approx 0.6 = \mu = p \\ \\ \sigma_{\widehat{p}} \approx 0.0693 \approx \dfrac{0.4899}{\sqrt{50}} = \dfrac {\sigma}{\sqrt{n}} = \sqrt{\dfrac {pq}{n}} \end{aligned} } \end{equation}
from ci_helpers import sample_with_CI
from random import seed
N = 100
n = 50
#seed(0)
# sample from population
Y_mean_list, CI_list = sample_with_CI(N, n, population, sigma=sigma, mode='z')
# sample metrics
mu, var, sigma = get_metrics(Y_mean_list)
# visualize
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSP(Y_mean_list,ax1,ax2,ax3,width=0.05, norm_off=True)
from IPython.display import display, Math
display(Math(r'\mu_{{\hat{{p}}}}:{} \ \ \ \ \sigma_{{\hat{{p}}}}:{}'.format(mu, sigma)))
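The empirical values displayed above can be compared against the CLT prediction. A minimal check, using the population parameters from this notebook (p = 0.6, n = 50):

```python
from math import sqrt

p, n = 0.6, 50
q = 1 - p
sigma_pop = sqrt(p * q)          # population SD of a Bernoulli(p) variable
se_theory = sigma_pop / sqrt(n)  # CLT: sigma / sqrt(n) = sqrt(pq/n)
print(round(sigma_pop, 4), round(se_theory, 4))  # 0.4899 and 0.0693
```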
If we know population SD
For each of the above sample sets of size n, let us calculate the margin of error using the population SD $\sigma$ as below. The value 1.96 comes from the Z transformation, as we saw earlier in the theoretical section.
\begin{equation} \color{blue}{M = 1.96 \dfrac{\sigma}{\sqrt{n}}} \end{equation}
and the confidence interval
$$ CI = Y \pm M $$
from ci_helpers import plot_ci_accuracy_1
fig, ax = plt.subplots(1,1, figsize=(20,5))
plot_ci_accuracy_1(ax, CI_list, mu)
plt.show()
CI containing pop.mean:100.0%
Almost all of our CIs contained the real population mean. This is great, but typically we will not know the population parameters, which forces us to make the next best choice.
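To make the formula concrete, here is a minimal sketch of a single z-based interval; the observed proportion 0.58 is a made-up value for illustration, not taken from the run above.

```python
from math import sqrt

p, n = 0.6, 50
sigma = sqrt(p * (1 - p))   # known population SD
y_bar = 0.58                # hypothetical observed sample proportion
m = 1.96 * sigma / sqrt(n)  # margin of error M = 1.96 * sigma / sqrt(n)
ci = (y_bar - m, y_bar + m)
print(ci)                   # this interval happens to contain p = 0.6
```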
When we do not know population SD
For each sample mean $\overline{Y_k}$, the margin of error $M_k$ is calculated as below. Note that the constant $t_{n-1}$ depends on the degrees of freedom $(n-1)$.
\begin{equation} \color{blue}{M_k = \pm t_{n-1} \dfrac{S_k}{\sqrt{n}}} \end{equation}
and the confidence interval
$$ CI = Y \pm M $$
Note that this time, for each sample set, we also calculate the unbiased sample variance of that set (that is, dividing by $n-1$) and use it to calculate $M_k$.
Also note that instead of the Z distribution we use the t distribution: t has heavier tails, so it better compensates for small sample sizes.
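A minimal sketch of computing $M_k$ for one made-up sample of size 10 (the data values here are assumptions for illustration):

```python
import numpy as np
from scipy import stats

sample = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical Bernoulli draws
n = len(sample)
s_k = sample.std(ddof=1)               # unbiased sample SD (divides by n-1)
t_val = stats.t.ppf(1 - 0.025, n - 1)  # t for a 95% CI, df = 9
m_k = t_val * s_k / np.sqrt(n)         # margin of error for this one sample
print(round(float(m_k), 4))
```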
Sampling and plotting again. We sample again because this time the CI for each sample is calculated using the t distribution. As far as the sample proportion $\widehat{p}$ is concerned, this is the only difference from our earlier sampling; only the CI is calculated differently.
t value for 95% CI:
Degrees of freedom is a huge topic, but usually it simply means one less than the sample size, $df = n-1$. For a 95% confidence level, the significance level $\alpha$ (which we will learn about in hypothesis testing) is $\alpha = 1 - 0.95 = 0.05$.
To calculate t for a two-tailed interval, we pass $(1-\alpha/2, df)$ to stats.t.ppf. A sample calculation is shown below for sample size n = 10.
from scipy import stats
# Student's t, two-tailed, alpha = 0.05, df = 9
# equivalent to Excel TINV(0.05, 9)
print(stats.t.ppf(1-0.025, 10-1))
2.2621571627409915
from ci_helpers import sample_with_CI
N = 100
n = 50
#seed(0)
# sample from population, this time in 't' mode, so CIs are calculated with the t value for df = n-1 (about 2.01 for n = 50)
Y_mean_list, CI_list = sample_with_CI(N, n, population, sigma=sigma, mode='t')
# sample metrics
mu, var, sigma = get_metrics(Y_mean_list)
# visualize
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSP(Y_mean_list,ax1,ax2,ax3,width=0.05, norm_off=True)
from IPython.display import display, Math
display(Math(r'\mu_{{\hat{{p}}}}:{} \ \ \ \ \sigma_{{\hat{{p}}}}:{}'.format(mu, sigma)))
plt.show()
from ci_helpers import plot_ci_accuracy_1
fig, ax = plt.subplots(1,1, figsize=(20,5))
plot_ci_accuracy_1(ax, CI_list, mu)
plt.show()
CI containing pop.mean:96.0%
Generally we get more than 95%, as above; only a few times did our CI fail to contain the population mean. This is good, given that we managed it with just the unbiased sample SD. The result above means: if we take samples of size n, calculate a CI each time, and repeat 100 times, about 95 of those CIs should contain the population mean; our run gave 96. We can expect at least 95% most of the time.
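As an independent check that does not use this notebook's helpers, here is a sketch of the same coverage experiment with NumPy and SciPy. The seed and the population construction are assumptions, so the exact percentage will differ from the run above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, T, n, N = 0.6, 4000, 50, 1000
population = (rng.random(T) < p).astype(float)  # Bernoulli(0.6) population
mu = population.mean()

t_val = stats.t.ppf(0.975, n - 1)
hits = 0
for _ in range(N):
    sample = rng.choice(population, size=n, replace=False)
    m = t_val * sample.std(ddof=1) / np.sqrt(n)  # t-based margin of error
    hits += sample.mean() - m <= mu <= sample.mean() + m
coverage = hits / N
print(coverage)  # typically close to 0.95
```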
But can we get an idea of how that "success" rate, catching the population mean in our CI 95% of the time, depends on the sample size? Intuitively, the greater the sample size the better, but how exactly? Let us take our simulation to the next scale, trying various experiment and sample sizes.
Each time the above sampling was run, the results varied, and were not always above 95% as expected. So, let us dig deeper.
We shall try various combinations: the z or t distribution, with the population SD or the unbiased or biased sample SD. We also vary the sample size and the experiment size to see their effect.
Environment:
Applied methods:
from ci_helpers import repeated_experiments_with_CI, plot_ci_accuracy_2, get_dist_label, get_mode_label
fig, axarr = plt.subplots(2,3, figsize=(15,10))
m , _ , s = get_metrics(population)
dist = 0
max_sample_size = int(T/4) # 25% of total population
N_list = range(5,500,20)
n_list = range(5,max_sample_size,50) # different sample sizes
for i in range(1,4): # modes 1,2,3
    for each_N in N_list:
        for each_n in n_list:
            dist = 0 # z distribution
            ax = axarr[0,i-1]
            accuracy_list = repeated_experiments_with_CI(population,m,s,[each_N],[each_n],i,dist,'b')
            plot_ci_accuracy_2(ax, accuracy_list)
            ax.set_title('Using {} and {}'.format(get_dist_label(dist), get_mode_label(i)))
            dist = 1 # t distribution
            ax = axarr[1,i-1]
            accuracy_list = repeated_experiments_with_CI(population,m,s,[each_N],[each_n],i,dist,'b')
            plot_ci_accuracy_2(ax, accuracy_list)
            ax.set_title('Using {} and {}'.format(get_dist_label(dist), get_mode_label(i)))
plt.tight_layout()
plt.show()
One thing is clear: using the population SD gives much better results.
However, we are unable to make any further inferences until the pending doubts in the bottom section of this chapter are clarified.
Let Y be the random variable indicating temperature over a distribution of certain values.
If the limiting values are, say, 0 deg C to 40 deg C, our population would thus look like this: $[23,13,35,30,10,2,5,0,33, \cdots ,21]$
Unlike sample proportions, we do not know or designate any proportion of temperatures in this example, but we can obtain the mean and variance by simply calculating over all values in the distribution. These are our population parameters.
Population mean $\mu = \mu_y$
Population variance $\sigma^2 = \sigma_y^2$
%matplotlib inline
from math import floor
import matplotlib.pyplot as plt
from random import random, seed, shuffle
from SDSPSM import get_metrics, drawBarGraph, getPopulationStatistics
from ci_helpers import createRandomPopulation
seed(0)
popMin = 1 # Min population
popMax = 40 # Max population
freqMax = 200 # max frequency of any population value (e.g., the number of occurrences of 25 deg C)
population, population_freq = createRandomPopulation(popMax - popMin + 1, freqMax)
#mu, var, sigma = get_metrics(population) # just to cross check..
#print(mu, var, sigma)
N, mu, var, sigma = getPopulationStatistics(population_freq, popMin)
#visualize
fig, (ax1) = plt.subplots(1,1, figsize=(16,3))
drawBarGraph(population_freq, ax1, [N, mu, var, sigma], 'Population Distribution','Temperature', 'Counts')
plt.show()
Let us visualize the density function and PMF as usual..
from ci_helpers import mini_plot_SDSM
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSM(population, ax1, ax2, ax3, popMax, width=1)
plt.show()
Let us sample from the population N times, each time with a sample set of size n. If $n > 30$, the resulting sampling distribution should be approximately normal (and exactly normal if the population itself is normally distributed).
Remember: for a population described by the random variable Y, we describe the sampling distribution of sample means by
\begin{equation} \color {blue}{ \begin{aligned} \mu_{\overline{Y}} = \mu(\overline{\widehat{Y}}) \\ \\ \sigma_{\overline{Y}} = \sigma(\overline{\widehat{Y}}) \end{aligned} } \end{equation}
where the $\widehat{}$ indicates the statistical outcome. And statistically, by the CLT,
\begin{equation} \color {blue} { \begin{aligned} \mu_{\overline{Y}} = 19.4 \approx 20 = \mu \\ \\ \sigma_{\overline{Y}} \approx 1.52 \approx \dfrac{11.32}{\sqrt{50}} = \dfrac {\sigma}{\sqrt{n}} \end{aligned} } \end{equation}
$\overline{Y}$ is called the sample mean, which is itself a random variable.
from ci_helpers import sample_with_CI
from random import seed
N = 100
n = 50
#seed(0)
# sample from population
Y_mean_list, CI_list = sample_with_CI(N, n, population, sigma=sigma, mode='z')
# sample metrics
mu, var, sigma = get_metrics(Y_mean_list)
# visualize
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSM(Y_mean_list, ax1, ax2, ax3, popMax, width=0.1)
from IPython.display import display, Math
display(Math(r'\mu_{{\hat{{p}}}}:{} \ \ \ \ \sigma_{{\hat{{p}}}}:{}'.format(mu, sigma)))
OK, I get it: the resulting distribution and density functions look rough (only slightly normal). Try increasing the experiment size N and you will see a much better approximation of the normal distribution. We stuck with N=100 because we need to see how the CI from each sample mean performs, so bear with me here.
If we know population SD
Just like in case of Sample proportions, we calculate the CI as below.
\begin{equation} \color{blue}{M = 1.96 \dfrac{\sigma}{\sqrt{n}}} \end{equation}and Confidence interval
$$ CI = Y \pm M $$
from ci_helpers import plot_ci_accuracy_1
fig, ax = plt.subplots(1,1, figsize=(20,5))
plot_ci_accuracy_1(ax, CI_list, mu)
plt.show()
CI containing pop.mean:100.0%
Almost all of our CIs contained the real population mean, just as with sample proportions. This is great, but typically we will not know the population parameters, which forces us to make the next best choice. Also note that if our population were smaller, performance would drop here too.
When we do not know population SD
Just like in case of Sample proportions, we calculate the CI as below.
\begin{equation} \color{blue}{M_k = \pm t_{n-1} \dfrac{S_k}{\sqrt{n}}} \end{equation}and confidence interval, $$ CI = Y \pm M $$
Just as before, we sample again because our CI values now differ, as given by the equations above.
from ci_helpers import sample_with_CI
N = 100
n = 50
#seed(0)
# sample from population, this time in 't' mode, so CIs are calculated with the t value for df = n-1 (about 2.01 for n = 50)
Y_mean_list, CI_list = sample_with_CI(N, n, population, sigma=sigma, mode='t')
# sample metrics
mu, var, sigma = get_metrics(Y_mean_list)
# visualize
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,4))
mini_plot_SDSM(Y_mean_list, ax1, ax2, ax3, popMax, width=0.1)
from IPython.display import display, Math
display(Math(r'\mu_{{\hat{{p}}}}:{} \ \ \ \ \sigma_{{\hat{{p}}}}:{}'.format(mu, sigma)))
plt.show()
from ci_helpers import plot_ci_accuracy_1
fig, ax = plt.subplots(1,1, figsize=(20,5))
plot_ci_accuracy_1(ax, CI_list, mu)
plt.show()
CI containing pop.mean:97.0%
Each time the above sampling was run, the results varied, and were not always above 95% as expected. So, let us dig deeper.
We shall try various combinations: the z or t distribution, with the population SD or the unbiased or biased sample SD. We also vary the sample size and the experiment size to see their effect.
Environment:
Applied methods:
from ci_helpers import repeated_experiments_with_CI, plot_ci_accuracy_2, get_dist_label, get_mode_label
fig, axarr = plt.subplots(2,3, figsize=(15,10))
m , _ , s = get_metrics(population)
dist = 0
max_sample_size = int(len(population)/4) # 25% of this population's size
N_list = range(5,500,20)
n_list = range(5,max_sample_size,50) # different sample sizes
for i in range(1,4): # modes 1,2,3
    for each_N in N_list:
        for each_n in n_list:
            dist = 0 # z distribution
            ax = axarr[0,i-1]
            accuracy_list = repeated_experiments_with_CI(population,m,s,[each_N],[each_n],i,dist,'b')
            plot_ci_accuracy_2(ax, accuracy_list)
            ax.set_title('Using {} and {}'.format(get_dist_label(dist), get_mode_label(i)))
            dist = 1 # t distribution
            ax = axarr[1,i-1]
            accuracy_list = repeated_experiments_with_CI(population,m,s,[each_N],[each_n],i,dist,'b')
            plot_ci_accuracy_2(ax, accuracy_list)
            ax.set_title('Using {} and {}'.format(get_dist_label(dist), get_mode_label(i)))
plt.tight_layout()
plt.show()
One thing is clear: using the population SD gives much better results.
However, we are unable to make any further inferences until the pending doubts in the bottom section of this chapter are clarified.
These snippets are useful when solving related problems.
Useful when you have a significance level $\alpha$ and degrees of freedom $n-1$ and need the corresponding t value:
from scipy import stats
conf_level = 0.9
sample_size = 5
alpha = round((1 - conf_level)/2,3) # alpha/2 in each tail for a two-sided interval
stats.t.ppf(1-alpha, sample_size-1)
2.13184678133629
Useful when you have a significance level $\alpha$ and need the corresponding Z score:
from scipy import stats
conf_level = 0.99
alpha = round((1 - conf_level)/2,3) # alpha/2 in each tail for a two-sided interval
stats.norm.ppf(alpha)
-2.575829303548901
Plotting a t distribution can be useful when investigating how it differs from the normal distribution.
from scipy.stats import t, norm
import numpy as np
fig,ax = plt.subplots(1,1)
df = 2
x = np.linspace(t.ppf(0.01,df), t.ppf(0.99,df),100)
ax.plot(x, t.pdf(x,df), color='C0')  # t distribution (blue)
ax.plot(x, norm.pdf(x), color='C1')  # normal distribution (orange)
plt.show()
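To quantify the "heavier tails" claim numerically, compare the two-sided tail probability beyond 2 for both distributions, using df = 2 as in the plot above:

```python
from scipy import stats

df = 2
tail_t = 2 * stats.t.sf(2, df)    # P(|T| > 2) for t with 2 degrees of freedom
tail_norm = 2 * stats.norm.sf(2)  # P(|Z| > 2) for the standard normal
print(round(tail_t, 4), round(tail_norm, 4))  # t puts much more mass in the tails
```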
As seen in the "Digging deeper" sections 1 and 2 above,
when the population SD is not known, using the unbiased or biased sample SD does not make much difference in CI performance, and the same holds for using the Z or t distribution. A clearer investigation of why the unbiased SD is preferred over the biased one, and why t is preferred over z, from a CI perspective is still needed. SE question raised.
Also, these results were obtained via sampling without replacement. If we sample with replacement, the CI performance (in the case of unknown population SD) is very poor. Why? SE question raised.
To be listed after the pending doubts are cleared. A general consensus exists, though I have not yet adequately verified it empirically via simulation.