Credit
- http://www.cyclismo.org/tutorial/R/confidence.html

In [1]:

import pandas as pd, numpy as np
%matplotlib inline
%pylab inline
import seaborn  as sns 
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Populating the interactive namespace from numpy and matplotlib

1-1) Calculating a Confidence Interval From a Normal Distribution

In [2]:

from scipy.stats import norm

In [3]:

# quantile function in python
# R 
# https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Normal.html
# PYTHON 
# https://stackoverflow.com/questions/24695174/python-equivalent-of-qnorm-qf-and-qchi2-of-r
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html

# qnorm(0.975) IN R 
norm.ppf(.975)

Out[3]:

1.959963984540054
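As a quick sanity check on what `norm.ppf` returns, the sketch below confirms that it inverts `norm.cdf` and that the lower 2.5% quantile mirrors the upper one. The variable names are illustrative only.

```python
from scipy.stats import norm

# ppf is the inverse of cdf: it maps a probability back to a z-value.
z_upper = norm.ppf(0.975)      # upper critical value for a 95% interval
p_back = norm.cdf(z_upper)     # should recover 0.975

# The standard normal is symmetric, so the lower critical value is -z_upper.
z_lower = norm.ppf(0.025)

print('z_upper: {:.4f}, z_lower: {:.4f}, cdf(z_upper): {:.3f}'.format(z_upper, z_lower, p_back))
```

Together the two tails leave 95% of the probability mass between `z_lower` and `z_upper`, which is why 0.975 (not 0.95) is passed to `ppf` for a two-sided 95% interval.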

In [4]:

# assume 
# sample mean = 5,
# standard deviation = 2
# sample size = 20. 
# use a 95% confidence level 

a = 5    # sample mean
s = 2    # standard deviation
n = 20   # sample size
error = norm.ppf(.975)*s/np.sqrt(n)
left = a-error
right = a+error

print('left: {}'.format(left))
print('right: {}'.format(right))

left: 4.123477459423419
right: 5.876522540576581

The 95% confidence interval for the mean runs from 4.12 to 5.88, assuming that the data are normally distributed and the samples are independent.
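The same interval can be cross-checked with `norm.interval`, which takes the confidence level directly. This is a minimal sketch using the assumed values from this section (mean 5, standard deviation 2, sample size 20); the confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy.stats import norm

a, s, n = 5, 2, 20        # sample mean, standard deviation, sample size
sem = s / np.sqrt(n)      # standard error of the mean

# loc and scale shift and stretch the standard normal to the sampling distribution.
left, right = norm.interval(0.95, loc=a, scale=sem)

print('left: {:.4f}'.format(left))
print('right: {:.4f}'.format(right))
```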

1-2) Calculating a Confidence Interval From a t Distribution

In [5]:

from scipy.stats import t

In [6]:

# quantile function of the Student t distribution
# R 
# https://stat.ethz.ch/R-manual/R-devel/library/stats/html/TDist.html
# PYTHON
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.t.html
# https://stackoverflow.com/questions/19339305/python-function-to-get-the-t-statistic

# qt(0.975,df=n-1) IN R 
t.ppf(.975, df=n-1)

Out[6]:

2.093024054408263

In [7]:

# assume 
# sample mean = 5,
# standard deviation = 2
# sample size = 20. 
# use a 95% confidence level 

a = 5    # sample mean
s = 2    # standard deviation
n = 20   # sample size
error = t.ppf(.975, df=n-1)*s/np.sqrt(n)
left = a-error
right = a+error

print('left: {}'.format(left))
print('right: {}'.format(right))

left: 4.063971187160181
right: 5.936028812839819

We are 95% confident that the true mean lies between 4.06 and 5.94, assuming that the original random variable is normally distributed and the samples are independent.
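The t interval is wider than the normal one for the same inputs because the t critical value exceeds the normal critical value at small sample sizes. The sketch below compares the two for a few sample sizes; the sizes are chosen arbitrarily for illustration.

```python
from scipy.stats import norm, t

sizes = [5, 20, 100, 1000]            # illustrative sample sizes
z_crit = norm.ppf(0.975)              # normal critical value, independent of n
t_crits = [t.ppf(0.975, df=n - 1) for n in sizes]

for n, t_crit in zip(sizes, t_crits):
    print('n = {:4d}: t = {:.4f}, normal = {:.4f}'.format(n, t_crit, z_crit))
```

As the sample size grows, the t critical value shrinks toward the normal one, so the two intervals become nearly identical for large samples.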

1-3) Calculating a Confidence Interval From a DataFrame

In [8]:

df = pd.read_csv('http://www.cyclismo.org/tutorial/R/_static/w1.dat')

In [9]:

df.head(3)

Out[9]:

	vals
0	0.43
1	0.40
2	0.45

In [10]:

df.describe().T

Out[10]:

	count	mean	std	min	25%	50%	75%	max
vals	54.0	0.765	0.378122	0.13	0.48	0.72	1.0075	1.76

In [15]:

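# NOTE: ddof=0 is the population standard deviation (NumPy's default).
# R's sd() divides by n-1; df.std() with its default ddof=1 would match R
# and give a slightly wider interval.
error = t.ppf(.975, df=len(df)-1)*df.std(ddof=0)/np.sqrt(len(df))

print('error : {}'.format(error))

# R 
# error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))

error : vals    0.102247
dtype: float64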

In [12]:

left = df.mean()-error
right = df.mean()+error

print('left: {}'.format(left))
print('right: {}'.format(right))

left: vals    0.662753
dtype: float64
right: vals    0.867247
dtype: float64

We are 95% confident that the true mean is between 0.66 and 0.87, assuming that the original random variable is normally distributed and the samples are independent.
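The calculation above can be wrapped in a small reusable helper. This is a sketch, not part of the original tutorial: the function name `mean_confidence_interval` and the sample values are hypothetical, and it uses `scipy.stats.sem`, which defaults to the sample standard deviation (`ddof=1`), matching R's `sd()`.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, left, right) for a t-based confidence interval of the mean."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = stats.sem(values)   # standard error of the mean, ddof=1 by default
    margin = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, mean - margin, mean + margin

# Hypothetical data, made up purely for illustration.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, left, right = mean_confidence_interval(sample)
print('mean: {:.3f}, left: {:.3f}, right: {:.3f}'.format(mean, left, right))
```

For a DataFrame column, the same helper could be called as `mean_confidence_interval(df['vals'])`.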

1-4) Calculating Many Confidence Intervals From a t Distribution

Suppose we have the following test results:

In [67]:

# df1 
df_Comparison1 = pd.DataFrame({'Mean':[10,10.5],
                               'Std. Dev.':[3,2.5],
                               'Number':[300,230]})
df_Comparison1 = df_Comparison1.rename({0: 'Group1', 1: 'Group2'})  

# df2 
df_Comparison2 = pd.DataFrame({'Mean':[12,13],
                               'Std. Dev.':[4,5.3],
                               'Number':[210,340]})
df_Comparison2 = df_Comparison2.rename({0: 'Group1', 1: 'Group2'})  


# df3
df_Comparison3 = pd.DataFrame({'Mean':[30,28.5],
                               'Std. Dev.':[4.5,3],
                               'Number':[420,400]})
df_Comparison3 = df_Comparison3.rename({0: 'Group1', 1: 'Group2'})  

In [64]:

df_Comparison1

Out[64]:

	Mean	Number	Std. Dev.
Group1	10.0	300	3.0
Group2	10.5	230	2.5

In [68]:

df_Comparison2

Out[68]:

	Mean	Number	Std. Dev.
Group1	12	210	4.0
Group2	13	340	5.3

In [69]:

df_Comparison3

Out[69]:

	Mean	Number	Std. Dev.
Group1	30.0	420	4.5
Group2	28.5	400	3.0

In [32]:

# R 
# pmin() function returns the parallel minima vector of multiple vectors or matrix.
# http://www.endmemo.com/program/R/pmin.php
# e.g.
#> x <- c(3, 26, 122, 6)
#> y <- c(43,2,54,8)
#> z <- c(9,32,1,9)
#> pmax(x,y,z)
#[1]  43  32 122   9
#> pmin(x,y,z)
#[1] 3 2 1 6

# PYTHON 
# np.minimum()
# https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.minimum.html

In [35]:

m1 =  np.array([10,12,30])
m2 = np.array([10.5,13,28.5])
sd1 = np.array([3,4,4.5])
sd2 = np.array([2.5,5.3,3])
num1 = np.array([300,210,420])
num2 = np.array([230,340,400])
se = np.sqrt(sd1*sd1/num1+sd2*sd2/num2)
#error = qt(0.975,df=pmin(num1,num2)-1)*se
error = t.ppf(.975, df=np.minimum(num1,num2) -1)*se

In [37]:

se

Out[37]:

array([ 0.23911067,  0.39850737,  0.26592158])

In [36]:

error

Out[36]:

array([ 0.47113823,  0.78560924,  0.52278249])

In [38]:

left = (m1-m2)-error
right = (m1-m2)+error

print ('left: {}'.format(left))
print ('right: {}'.format(right))

left: [-0.97113823 -1.78560924  0.97721751]
right: [-0.02886177 -0.21439076  2.02278249]

This gives the confidence interval for the difference in means (Group 1 minus Group 2) in each of the three comparisons. For example, in the first comparison the 95% confidence interval for the difference is between -0.97 and -0.03, assuming that the random variables are normally distributed and the samples are independent.
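The vectorised calculation can be collected into one function that also reports which intervals exclude zero. This is a sketch under the same conservative choice of degrees of freedom used above, min(n1, n2) - 1; the function name `diff_mean_ci` is hypothetical.

```python
import numpy as np
from scipy.stats import t

def diff_mean_ci(m1, sd1, n1, m2, sd2, n2, confidence=0.95):
    """Confidence intervals for m1 - m2, element-wise over arrays of comparisons."""
    m1, sd1, n1, m2, sd2, n2 = (np.asarray(x, dtype=float)
                                for x in (m1, sd1, n1, m2, sd2, n2))
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of the difference
    dof = np.minimum(n1, n2) - 1              # conservative degrees of freedom
    margin = t.ppf((1 + confidence) / 2, df=dof) * se
    diff = m1 - m2
    return diff - margin, diff + margin

left, right = diff_mean_ci([10, 12, 30], [3, 4, 4.5], [300, 210, 420],
                           [10.5, 13, 28.5], [2.5, 5.3, 3], [230, 340, 400])

# An interval that does not contain zero suggests the two group means differ.
excludes_zero = (left > 0) | (right < 0)

print('left: {}'.format(left))
print('right: {}'.format(right))
print('excludes zero: {}'.format(excludes_zero))
```

With the values from this section, none of the three intervals contains zero.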