We shall work on several concepts of statistics in this notebook, always in the context of data coming from independent Bernoulli trials with an unknown parameter $p$.
Assume that we have a coin flipping experiment with a coin, chosen from a pool of coins, whose success probability is $p$. Let us consider $1$ as success and $0$ a fail. The law which determines $p$ has a density $\pi_0(p)$. Let us then flip the coin $k$ times with result $x_1,\ldots,x_k$. What is the probability law for $p$ given $x_1,\ldots,x_k$.
import numpy
p0 = 0.8 #numpy.random.uniform(0,1,1)
def pi0(p):
return 1
k=100
x = numpy.random.binomial(1,p0,k)
y = numpy.cumsum(x)
print(p0)
print(x)
#print(y)
0.8 [1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1]
This a typical question for conditional distribution. We consider the random variables $X_1,\ldots,X_k$ which are Bernoulli distributed with success parameter $p$. Then $$ P(X_1=x_1,\ldots,X_k=x_k|p) = p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))} $$ holds true. Bayes' theorem says $$ P(p|x_1,\ldots,x_k)=\frac{p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))}\pi_0(p)}{\int_0^1p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))}\pi_0(p)dp} \, . $$
import matplotlib.pyplot as plt
for i in range(k):
p = numpy.linspace(0,1,1000)
l = p**y[i]*(1-p)**(i+1-y[i])*pi0(p)
l = l/numpy.sum(l)*1000
plt.plot(p,l)
plt.scatter(numpy.array([p0]),numpy.array([0]))
plt.show()
print(numpy.mean(l*p))
0.6670003336670003
0.750375187593797
0.8004001333331997
0.8337500693400401
0.8575714283567853
0.8754374267547519
0.8893331847402168
0.7999987975959133
0.8181804518136747
0.7499999999875754
0.7692307692137785
0.7857142856917244
0.7333333333422666
0.75000000001142
0.7647058823672719
0.777777777795477
0.7368421052631586
0.7500000000000011
0.7619047619047634
0.7727272727272746
0.7826086956521764
0.7499999999999992
0.7599999999999992
0.7692307692307683
0.7777777777777767
0.7857142857142841
0.7586206896551725
0.7666666666666666
0.7741935483870968
0.7812499999999999
0.787878787878788
0.7647058823529412
0.7714285714285715
0.7777777777777778
0.7837837837837837
0.7894736842105262
0.7948717948717948
0.7750000000000001
0.7804878048780487
0.7857142857142857
0.7906976744186047
0.7954545454545455
0.7777777777777777
0.7826086956521741
0.7872340425531914
0.7916666666666666
0.7959183673469389
0.7999999999999999
0.803921568627451
0.8076923076923077
0.8113207547169813
0.8148148148148149
0.8181818181818181
0.8214285714285715
0.824561403508772
0.8275862068965518
0.8305084745762713
0.8333333333333333
0.8360655737704917
0.8387096774193548
0.8412698412698412
0.8437499999999999
0.8307692307692307
0.8333333333333334
0.835820895522388
0.8235294117647058
0.8115942028985509
0.8142857142857145
0.8169014084507041
0.8194444444444444
0.8219178082191781
0.8243243243243243
0.8266666666666667
0.8157894736842106
0.8181818181818182
0.8076923076923075
0.810126582278481
0.8000000000000003
0.8024691358024693
0.8048780487804879
0.8072289156626503
0.8095238095238095
0.8117647058823532
0.813953488372093
0.8045977011494253
0.806818181818182
0.8089887640449437
0.8111111111111111
0.8131868131868134
0.815217391304348
0.8172043010752688
0.8191489361702126
0.8210526315789475
0.8125
0.8144329896907216
0.8061224489795918
0.808080808080808
0.8099999999999998
0.811881188118812
0.8137254901960784
We are writing now the sequence of maximum likelihood estimators as red point denoting the argument where the maximum is taken of the likelihood function.
for i in range(k):
p = numpy.linspace(0,1,1000)
l = p**y[i]*(1-p)**(i+1-y[i])
plt.plot(p,l)
plt.scatter(numpy.array([y[i]/(i+1)]),numpy.array([0]),color='r')
plt.show()
We are testing in the sequel whether rather the null hypothesis $p_0$ holds true or the alternative $p_1$: we first plot the likelihood for different values of $p$ given the data (Bayesian approach!), and we also plot the likelihood ratio function (actually its logarithm) depending on data given null and alternative hypothesis.
p0 = 0.5
p1 = 0.7
for i in range(k):
p = numpy.linspace(0,1,1000)
l = p**y[i]*(1-p)**(i+1-y[i])
l = l / numpy.mean(l)
plt.plot(p,l)
plt.scatter(numpy.array([p0]),numpy.array([0]),color='r')
plt.scatter(numpy.array([p1]),numpy.array([0]),color='g')
print('Likelihood-Ratio: ',numpy.log(p0**y[i]*(1-p0)**(i+1-y[i])/(p1**y[i]*(1-p1)**(i+1-y[i]))))
plt.show()
Likelihood-Ratio: -0.3364722366212129
Likelihood-Ratio: -0.6729444732424256
Likelihood-Ratio: -1.0094167098636386
Likelihood-Ratio: -1.3458889464848514
Likelihood-Ratio: -1.6823611831060643
Likelihood-Ratio: -2.0188334197272773
Likelihood-Ratio: -2.35530565634849
Likelihood-Ratio: -1.8444800325824995
Likelihood-Ratio: -2.180952269203712
Likelihood-Ratio: -1.6701266454377217
Likelihood-Ratio: -2.0065988820589347
Likelihood-Ratio: -2.3430711186801476
Likelihood-Ratio: -1.8322454949141571
Likelihood-Ratio: -2.16871773153537
Likelihood-Ratio: -2.505189968156583
Likelihood-Ratio: -2.8416622047777955
Likelihood-Ratio: -2.330836581011805
Likelihood-Ratio: -2.667308817633018
Likelihood-Ratio: -3.003781054254231
Likelihood-Ratio: -3.3402532908754434
Likelihood-Ratio: -3.6767255274966564
Likelihood-Ratio: -3.165899903730666
Likelihood-Ratio: -3.502372140351879
Likelihood-Ratio: -3.8388443769730918
Likelihood-Ratio: -4.175316613594305
Likelihood-Ratio: -4.511788850215518
Likelihood-Ratio: -4.000963226449527
Likelihood-Ratio: -4.33743546307074
Likelihood-Ratio: -4.673907699691953
Likelihood-Ratio: -5.010379936313166
Likelihood-Ratio: -5.346852172934379
Likelihood-Ratio: -4.836026549168388
Likelihood-Ratio: -5.1724987857896005
Likelihood-Ratio: -5.5089710224108135
Likelihood-Ratio: -5.845443259032026
Likelihood-Ratio: -6.181915495653239
Likelihood-Ratio: -6.518387732274452
Likelihood-Ratio: -6.007562108508462
Likelihood-Ratio: -6.344034345129675
Likelihood-Ratio: -6.680506581750888
Likelihood-Ratio: -7.016978818372101
Likelihood-Ratio: -7.353451054993313
Likelihood-Ratio: -6.842625431227322
Likelihood-Ratio: -7.179097667848535
Likelihood-Ratio: -7.515569904469748
Likelihood-Ratio: -7.852042141090961
Likelihood-Ratio: -8.188514377712174
Likelihood-Ratio: -8.524986614333386
Likelihood-Ratio: -8.8614588509546
Likelihood-Ratio: -9.197931087575812
Likelihood-Ratio: -9.534403324197026
Likelihood-Ratio: -9.870875560818238
Likelihood-Ratio: -10.207347797439452
Likelihood-Ratio: -10.543820034060664
Likelihood-Ratio: -10.880292270681878
Likelihood-Ratio: -11.21676450730309
Likelihood-Ratio: -11.553236743924304
Likelihood-Ratio: -11.889708980545516
Likelihood-Ratio: -12.226181217166728
Likelihood-Ratio: -12.562653453787942
Likelihood-Ratio: -12.899125690409154
Likelihood-Ratio: -13.235597927030367
Likelihood-Ratio: -12.724772303264377
Likelihood-Ratio: -13.061244539885589
Likelihood-Ratio: -13.397716776506803
Likelihood-Ratio: -12.886891152740812
Likelihood-Ratio: -12.376065528974822
Likelihood-Ratio: -12.712537765596034
Likelihood-Ratio: -13.049010002217248
Likelihood-Ratio: -13.38548223883846
Likelihood-Ratio: -13.721954475459674
Likelihood-Ratio: -14.058426712080886
Likelihood-Ratio: -14.394898948702098
Likelihood-Ratio: -13.884073324936107
Likelihood-Ratio: -14.220545561557321
Likelihood-Ratio: -13.70971993779133
Likelihood-Ratio: -14.046192174412543
Likelihood-Ratio: -13.535366550646552
Likelihood-Ratio: -13.871838787267766
Likelihood-Ratio: -14.208311023888978
Likelihood-Ratio: -14.544783260510192
Likelihood-Ratio: -14.881255497131404
Likelihood-Ratio: -15.217727733752616
Likelihood-Ratio: -15.55419997037383
Likelihood-Ratio: -15.04337434660784
Likelihood-Ratio: -15.379846583229051
Likelihood-Ratio: -15.716318819850265
Likelihood-Ratio: -16.052791056471477
Likelihood-Ratio: -16.38926329309269
Likelihood-Ratio: -16.725735529713905
Likelihood-Ratio: -17.062207766335117
Likelihood-Ratio: -17.39868000295633
Likelihood-Ratio: -17.73515223957754
Likelihood-Ratio: -17.224326615811552
Likelihood-Ratio: -17.560798852432765
Likelihood-Ratio: -17.049973228666776
Likelihood-Ratio: -17.386445465287988
Likelihood-Ratio: -17.7229177019092
Likelihood-Ratio: -18.059389938530412
Likelihood-Ratio: -18.395862175151624
from scipy import special
p0 = 0.5
p1 = 0.7
k=100
for i in range(1,k):
cs = numpy.linspace(0,i,i+1)
test = numpy.ones(i+1)
cdf = numpy.ones(i+1)
for l in range(i+1):
test[l] = numpy.log(p0**cs[l]*(1-p0)**(i-cs[l])/(p1**cs[l]*(1-p1)**(i-cs[l])))
helper[l] = scipy.special.binom(i+1,l)*p0**cs[l]*(1-p0)**(i-cs[l])
cdf = numpy.cumsum(helper)
treshhold = test[numpy.argmax(cdf>0.95)]
plt.plot(cs,test,color='blue')
plt.plot(cs,numpy.ones(i+1)*treshhold)
plt.scatter(numpy.array([y[i-1]]),numpy.array([0]),color='r')
plt.show()
We shall also calculate the $p$-value, i.e. the probability with which values larger or equal than the actual observation would appear under null hypothesis. It measures how 'extreme' an observation is, given the null hypthesis is true.
from scipy import special
p0 = 0.5
p1 = 0.7
k=100
for i in range(1,k):
cs = numpy.linspace(0,i,i+1)
c0 = numpy.zeros(i+1)
for l in range(i):
c0[l] = scipy.special.binom(i,l)*p0**(cs[l])*(1-p0)**(i-cs[l])
plt.plot(cs,c0,color='blue')
plt.scatter(numpy.array([y[i-1]]),numpy.array([0]),color='r')
print(1-numpy.cumsum(c0)[y[i-1]])
plt.show()
0.5
0.25
0.125
0.0625
0.03125
0.015625
0.0078125
0.00390625
0.001953125
0.0107421875
0.005859375
0.003173828125
0.01123046875
0.0064697265625
0.003692626953125
0.0020904541015625
0.0063629150390625
0.0037689208984375
0.0022125244140625
0.0012884140014648438
0.0007448196411132812
0.002171754837036133
0.001299738883972168
0.000771939754486084
0.0004552602767944336
0.00026676058769226074
0.0007568597793579102
0.00045611709356307983
0.0002730563282966614
0.00016245711594820023
9.609758853912354e-05
0.0002675263676792383
0.0001620317343622446
9.756279177963734e-05
5.842093378305435e-05
3.480084706097841e-05
2.062879502773285e-05
5.808350397273898e-05
3.5127392038702965e-05
2.1138511328899767e-05
1.266040180780692e-05
7.548600478912704e-06
2.0967078967260022e-05
1.2724299324418098e-05
7.687045098236922e-06
4.623849960694315e-06
2.769810798497474e-06
1.652633354609634e-06
9.823268882769298e-07
5.81777902297631e-07
3.433558868337627e-07
2.0196608696565477e-07
1.184175689727951e-07
6.921677486726452e-08
4.0338048012955596e-08
2.3440920648987174e-08
1.3584262936738867e-08
7.851308825301828e-09
4.52619564050849e-09
2.602845716737079e-09
1.493220769255288e-09
8.54663007032741e-10
3.054140051972354e-09
1.771111590720409e-09
1.0246228310251126e-09
3.450712493524577e-09
1.072898014875534e-08
6.375016403659117e-09
3.777915091873751e-09
2.2330873816756025e-09
1.316664333117501e-09
7.744472840798267e-10
4.5445025520507443e-10
1.427774343198962e-09
8.469198675697953e-10
2.522461794995934e-09
1.5118173379846667e-09
4.291089927832559e-09
2.597470794007961e-09
1.568271734697646e-09
9.445142445940746e-10
5.674669623090267e-10
3.4013003524790975e-10
2.033978541149395e-10
5.862472729489809e-10
3.5380276486307594e-10
2.130270404521184e-10
1.2797496395933194e-10
7.671019375266042e-11
4.588229796098631e-11
2.738609339303366e-11
1.6311951789305112e-11
9.696576874773655e-12
2.892275308141734e-11
1.7338130930966145e-11
4.970746037002982e-11
3.0039970511097636e-11
1.8115842159716067e-11
1.0903056235633812e-11
numpy.mean(x)
0.53
We collect two samples of normally distributed data in an unpaired way. We test whether the difference of expectations is $0$ (null hypothesis)
## Import the packages
import numpy
from scipy import stats
## Define 2 random distributions
#Sample Size
N = 10
#Gaussian distributed data with mean = 1 and var = 1
a = np.random.randn(N) + 1
#Gaussian distributed data with with mean = 0 and var = 1
b = np.random.randn(N)
# calculate the empirical standard deviations with unbiased estimators
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
#std deviation
s = numpy.sqrt((var_a + var_b)/2)
# Calculate the t-statistics
t = (a.mean() - b.mean())/(s*numpy.sqrt(2/N))
# Compare with the critical t-value
# Degrees of freedom
df = 2*N - 2
#p-value after comparison with the t
p = 1 - stats.t.cdf(t,df=df)
print("t = " + str(t))
print("p = " + str(2*p))
# You can see that after comparing the t statistic with the critical t value (computed internally) we get a good p value of 0.0005 and thus we reject the null hypothesis and thus it proves that the mean of the two distributions are different and statistically significant.
## Cross Checking with the internal scipy function
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))
x = numpy.linspace(-5,5,1000)
plt.plot(x,stats.t.pdf(x,df=df))
plt.scatter(numpy.array([t]),numpy.array([0]),color='red')
plt.show()
t = 2.4141807955588708 p = 0.026639355757870353 t = 2.4141807955588717 p = 0.02663935575787036
import matplotlib.pyplot as plt
import numpy
mu = 1
sigmaX = 1
sigmaZ = 1
n = 100
X = numpy.random.normal(mu,sigmaX,1)*numpy.ones(n)
Z = numpy.random.normal(0,sigmaZ,n)
Y = X + Z
Xhat = numpy.ones(n)
for i in range(n):
x = numpy.linspace(-10,10,1000)
l = numpy.exp(-0.5*(Y[i]-x)**2/sigmaZ**2)*numpy.exp(-0.5*(x-mu)**2/sigmaX**2)
l = l/numpy.sum(l)*1000/20
plt.plot(x,l)
plt.scatter(numpy.array([X[0]]),numpy.array([0]))
plt.show()
Xhat[i] = numpy.sum(l*x)*20/1000
helper=(mu/sigmaX**2+Y[i]/sigmaZ**2)
sigmaX = numpy.sqrt(1/(1/sigmaX**2+1/sigmaZ**2))
mu = sigmaX**2*helper
plt.plot(X)
plt.plot(Y)
plt.plot(Xhat)
plt.show()