We shall work on several concepts of statistics in this notebook, always in the context of data coming from independent Bernoulli trials with an unknown parameter $p$.

Assume that we have a coin flipping experiment with a coin, chosen from a pool of coins, whose success probability is $p$. Let us consider $1$ as success and $0$ a fail. The law which determines $p$ has a density $\pi_0(p)$. Let us then flip the coin $k$ times with result $x_1,\ldots,x_k$. What is the probability law for $p$ given $x_1,\ldots,x_k$.

In [122]:

import numpy

p0 = 0.8 #numpy.random.uniform(0,1,1)
def pi0(p):
    return 1

k=100
x = numpy.random.binomial(1,p0,k)
y = numpy.cumsum(x)
print(p0)
print(x)
#print(y)

0.8
[1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1
 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0
 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1]

Bayesian Approach¶

This a typical question for conditional distribution. We consider the random variables $X_1,\ldots,X_k$ which are Bernoulli distributed with success parameter $p$. Then $$ P(X_1=x_1,\ldots,X_k=x_k|p) = p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))} $$ holds true. Bayes' theorem says $$ P(p|x_1,\ldots,x_k)=\frac{p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))}\pi_0(p)}{\int_0^1p^{(x_1+\ldots+x_k)}(1-p)^{(k-(x_1+\ldots+x_k))}\pi_0(p)dp} \, . $$

In [123]:

import matplotlib.pyplot as plt

for i in range(k):
    p = numpy.linspace(0,1,1000)
    l = p**y[i]*(1-p)**(i+1-y[i])*pi0(p)
    l = l/numpy.sum(l)*1000
    plt.plot(p,l)
    plt.scatter(numpy.array([p0]),numpy.array([0]))
    plt.show()
    print(numpy.mean(l*p))

0.6670003336670003

0.750375187593797

0.8004001333331997

0.8337500693400401

0.8575714283567853

0.8754374267547519

0.8893331847402168

0.7999987975959133

0.8181804518136747

0.7499999999875754

0.7692307692137785

0.7857142856917244

0.7333333333422666

0.75000000001142

0.7647058823672719

0.777777777795477

0.7368421052631586

0.7500000000000011

0.7619047619047634

0.7727272727272746

0.7826086956521764

0.7499999999999992

0.7599999999999992

0.7692307692307683

0.7777777777777767

0.7857142857142841

0.7586206896551725

0.7666666666666666

0.7741935483870968

0.7812499999999999

0.787878787878788

0.7647058823529412

0.7714285714285715

0.7777777777777778

0.7837837837837837

0.7894736842105262

0.7948717948717948

0.7750000000000001

0.7804878048780487

0.7857142857142857

0.7906976744186047

0.7954545454545455

0.7777777777777777

0.7826086956521741

0.7872340425531914

0.7916666666666666

0.7959183673469389

0.7999999999999999

0.803921568627451

0.8076923076923077

0.8113207547169813

0.8148148148148149

0.8181818181818181

0.8214285714285715

0.824561403508772

0.8275862068965518

0.8305084745762713

0.8333333333333333

0.8360655737704917

0.8387096774193548

0.8412698412698412

0.8437499999999999

0.8307692307692307

0.8333333333333334

0.835820895522388

0.8235294117647058

0.8115942028985509

0.8142857142857145

0.8169014084507041

0.8194444444444444

0.8219178082191781

0.8243243243243243

0.8266666666666667

0.8157894736842106

0.8181818181818182

0.8076923076923075

0.810126582278481

0.8000000000000003

0.8024691358024693

0.8048780487804879

0.8072289156626503

0.8095238095238095

0.8117647058823532

0.813953488372093

0.8045977011494253

0.806818181818182

0.8089887640449437

0.8111111111111111

0.8131868131868134

0.815217391304348

0.8172043010752688

0.8191489361702126

0.8210526315789475

0.8125

0.8144329896907216

0.8061224489795918

0.808080808080808

0.8099999999999998

0.811881188118812

0.8137254901960784

Maximum Likelihood Estimator¶

We are writing now the sequence of maximum likelihood estimators as red point denoting the argument where the maximum is taken of the likelihood function.

In [124]:

for i in range(k):
    p = numpy.linspace(0,1,1000)
    l = p**y[i]*(1-p)**(i+1-y[i])
    plt.plot(p,l)
    plt.scatter(numpy.array([y[i]/(i+1)]),numpy.array([0]),color='r')
    plt.show()

Tests¶

We are testing in the sequel whether rather the null hypothesis $p_0$ holds true or the alternative $p_1$: we first plot the likelihood for different values of $p$ given the data (Bayesian approach!), and we also plot the likelihood ratio function (actually its logarithm) depending on data given null and alternative hypothesis.

In [125]:

p0 = 0.5
p1 = 0.7

for i in range(k):
    p = numpy.linspace(0,1,1000)
    l = p**y[i]*(1-p)**(i+1-y[i])
    l = l / numpy.mean(l)
    plt.plot(p,l)
    plt.scatter(numpy.array([p0]),numpy.array([0]),color='r')
    plt.scatter(numpy.array([p1]),numpy.array([0]),color='g')
    print('Likelihood-Ratio: ',numpy.log(p0**y[i]*(1-p0)**(i+1-y[i])/(p1**y[i]*(1-p1)**(i+1-y[i]))))
    plt.show()

Likelihood-Ratio:  -0.3364722366212129

Likelihood-Ratio:  -0.6729444732424256

Likelihood-Ratio:  -1.0094167098636386

Likelihood-Ratio:  -1.3458889464848514

Likelihood-Ratio:  -1.6823611831060643

Likelihood-Ratio:  -2.0188334197272773

Likelihood-Ratio:  -2.35530565634849

Likelihood-Ratio:  -1.8444800325824995

Likelihood-Ratio:  -2.180952269203712

Likelihood-Ratio:  -1.6701266454377217

Likelihood-Ratio:  -2.0065988820589347

Likelihood-Ratio:  -2.3430711186801476

Likelihood-Ratio:  -1.8322454949141571

Likelihood-Ratio:  -2.16871773153537

Likelihood-Ratio:  -2.505189968156583

Likelihood-Ratio:  -2.8416622047777955

Likelihood-Ratio:  -2.330836581011805

Likelihood-Ratio:  -2.667308817633018

Likelihood-Ratio:  -3.003781054254231

Likelihood-Ratio:  -3.3402532908754434

Likelihood-Ratio:  -3.6767255274966564

Likelihood-Ratio:  -3.165899903730666

Likelihood-Ratio:  -3.502372140351879

Likelihood-Ratio:  -3.8388443769730918

Likelihood-Ratio:  -4.175316613594305

Likelihood-Ratio:  -4.511788850215518

Likelihood-Ratio:  -4.000963226449527

Likelihood-Ratio:  -4.33743546307074

Likelihood-Ratio:  -4.673907699691953

Likelihood-Ratio:  -5.010379936313166

Likelihood-Ratio:  -5.346852172934379

Likelihood-Ratio:  -4.836026549168388

Likelihood-Ratio:  -5.1724987857896005

Likelihood-Ratio:  -5.5089710224108135

Likelihood-Ratio:  -5.845443259032026

Likelihood-Ratio:  -6.181915495653239

Likelihood-Ratio:  -6.518387732274452

Likelihood-Ratio:  -6.007562108508462

Likelihood-Ratio:  -6.344034345129675

Likelihood-Ratio:  -6.680506581750888

Likelihood-Ratio:  -7.016978818372101

Likelihood-Ratio:  -7.353451054993313

Likelihood-Ratio:  -6.842625431227322

Likelihood-Ratio:  -7.179097667848535

Likelihood-Ratio:  -7.515569904469748

Likelihood-Ratio:  -7.852042141090961

Likelihood-Ratio:  -8.188514377712174

Likelihood-Ratio:  -8.524986614333386

Likelihood-Ratio:  -8.8614588509546

Likelihood-Ratio:  -9.197931087575812

Likelihood-Ratio:  -9.534403324197026

Likelihood-Ratio:  -9.870875560818238

Likelihood-Ratio:  -10.207347797439452

Likelihood-Ratio:  -10.543820034060664

Likelihood-Ratio:  -10.880292270681878

Likelihood-Ratio:  -11.21676450730309

Likelihood-Ratio:  -11.553236743924304

Likelihood-Ratio:  -11.889708980545516

Likelihood-Ratio:  -12.226181217166728

Likelihood-Ratio:  -12.562653453787942

Likelihood-Ratio:  -12.899125690409154

Likelihood-Ratio:  -13.235597927030367

Likelihood-Ratio:  -12.724772303264377

Likelihood-Ratio:  -13.061244539885589

Likelihood-Ratio:  -13.397716776506803

Likelihood-Ratio:  -12.886891152740812

Likelihood-Ratio:  -12.376065528974822

Likelihood-Ratio:  -12.712537765596034

Likelihood-Ratio:  -13.049010002217248

Likelihood-Ratio:  -13.38548223883846

Likelihood-Ratio:  -13.721954475459674

Likelihood-Ratio:  -14.058426712080886

Likelihood-Ratio:  -14.394898948702098

Likelihood-Ratio:  -13.884073324936107

Likelihood-Ratio:  -14.220545561557321

Likelihood-Ratio:  -13.70971993779133

Likelihood-Ratio:  -14.046192174412543

Likelihood-Ratio:  -13.535366550646552

Likelihood-Ratio:  -13.871838787267766

Likelihood-Ratio:  -14.208311023888978

Likelihood-Ratio:  -14.544783260510192

Likelihood-Ratio:  -14.881255497131404

Likelihood-Ratio:  -15.217727733752616

Likelihood-Ratio:  -15.55419997037383

Likelihood-Ratio:  -15.04337434660784

Likelihood-Ratio:  -15.379846583229051

Likelihood-Ratio:  -15.716318819850265

Likelihood-Ratio:  -16.052791056471477

Likelihood-Ratio:  -16.38926329309269

Likelihood-Ratio:  -16.725735529713905

Likelihood-Ratio:  -17.062207766335117

Likelihood-Ratio:  -17.39868000295633

Likelihood-Ratio:  -17.73515223957754

Likelihood-Ratio:  -17.224326615811552

Likelihood-Ratio:  -17.560798852432765

Likelihood-Ratio:  -17.049973228666776

Likelihood-Ratio:  -17.386445465287988

Likelihood-Ratio:  -17.7229177019092

Likelihood-Ratio:  -18.059389938530412

Likelihood-Ratio:  -18.395862175151624

In [126]:

from scipy import special

p0 = 0.5
p1 = 0.7
k=100

for i in range(1,k):
    cs = numpy.linspace(0,i,i+1)
    test = numpy.ones(i+1)
    cdf = numpy.ones(i+1)
    for l in range(i+1):
        test[l] = numpy.log(p0**cs[l]*(1-p0)**(i-cs[l])/(p1**cs[l]*(1-p1)**(i-cs[l])))
        helper[l] = scipy.special.binom(i+1,l)*p0**cs[l]*(1-p0)**(i-cs[l])
    cdf = numpy.cumsum(helper)
    treshhold = test[numpy.argmax(cdf>0.95)]
    plt.plot(cs,test,color='blue')
    plt.plot(cs,numpy.ones(i+1)*treshhold)
    plt.scatter(numpy.array([y[i-1]]),numpy.array([0]),color='r')
    plt.show()

$p$-value¶

We shall also calculate the $p$-value, i.e. the probability with which values larger or equal than the actual observation would appear under null hypothesis. It measures how 'extreme' an observation is, given the null hypthesis is true.

In [127]:

from scipy import special

p0 = 0.5
p1 = 0.7
k=100

for i in range(1,k):
    cs = numpy.linspace(0,i,i+1)
    c0 = numpy.zeros(i+1)
    for l in range(i):
        c0[l] = scipy.special.binom(i,l)*p0**(cs[l])*(1-p0)**(i-cs[l])
    plt.plot(cs,c0,color='blue')
    plt.scatter(numpy.array([y[i-1]]),numpy.array([0]),color='r')
    print(1-numpy.cumsum(c0)[y[i-1]])
    plt.show()

0.5

0.25

0.125

0.0625

0.03125

0.015625

0.0078125

0.00390625

0.001953125

0.0107421875

0.005859375

0.003173828125

0.01123046875

0.0064697265625

0.003692626953125

0.0020904541015625

0.0063629150390625

0.0037689208984375

0.0022125244140625

0.0012884140014648438

0.0007448196411132812

0.002171754837036133

0.001299738883972168

0.000771939754486084

0.0004552602767944336

0.00026676058769226074

0.0007568597793579102

0.00045611709356307983

0.0002730563282966614

0.00016245711594820023

9.609758853912354e-05

0.0002675263676792383

0.0001620317343622446

9.756279177963734e-05

5.842093378305435e-05

3.480084706097841e-05

2.062879502773285e-05

5.808350397273898e-05

3.5127392038702965e-05

2.1138511328899767e-05

1.266040180780692e-05

7.548600478912704e-06

2.0967078967260022e-05

1.2724299324418098e-05

7.687045098236922e-06

4.623849960694315e-06

2.769810798497474e-06

1.652633354609634e-06

9.823268882769298e-07

5.81777902297631e-07

3.433558868337627e-07

2.0196608696565477e-07

1.184175689727951e-07

6.921677486726452e-08

4.0338048012955596e-08

2.3440920648987174e-08

1.3584262936738867e-08

7.851308825301828e-09

4.52619564050849e-09

2.602845716737079e-09

1.493220769255288e-09

8.54663007032741e-10

3.054140051972354e-09

1.771111590720409e-09

1.0246228310251126e-09

3.450712493524577e-09

1.072898014875534e-08

6.375016403659117e-09

3.777915091873751e-09

2.2330873816756025e-09

1.316664333117501e-09

7.744472840798267e-10

4.5445025520507443e-10

1.427774343198962e-09

8.469198675697953e-10

2.522461794995934e-09

1.5118173379846667e-09

4.291089927832559e-09

2.597470794007961e-09

1.568271734697646e-09

9.445142445940746e-10

5.674669623090267e-10

3.4013003524790975e-10

2.033978541149395e-10

5.862472729489809e-10

3.5380276486307594e-10

2.130270404521184e-10

1.2797496395933194e-10

7.671019375266042e-11

4.588229796098631e-11

2.738609339303366e-11

1.6311951789305112e-11

9.696576874773655e-12

2.892275308141734e-11

1.7338130930966145e-11

4.970746037002982e-11

3.0039970511097636e-11

1.8115842159716067e-11

1.0903056235633812e-11

In [117]:

numpy.mean(x)

Out[117]:

0.53

$t$-test¶

We collect two samples of normally distributed data in an unpaired way. We test whether the difference of expectations is $0$ (null hypothesis)

In [128]:

## Import the packages
import numpy
from scipy import stats


## Define 2 random distributions
#Sample Size
N = 10
#Gaussian distributed data with mean = 1 and var = 1
a = np.random.randn(N) + 1
#Gaussian distributed data with with mean = 0 and var = 1
b = np.random.randn(N)

In [129]:

# calculate the empirical standard deviations with unbiased estimators

var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

#std deviation
s = numpy.sqrt((var_a + var_b)/2)

# Calculate the t-statistics
t = (a.mean() - b.mean())/(s*numpy.sqrt(2/N))

# Compare with the critical t-value
# Degrees of freedom
df = 2*N - 2

#p-value after comparison with the t 
p = 1 - stats.t.cdf(t,df=df)

print("t = " + str(t))
print("p = " + str(2*p))
# You can see that after comparing the t statistic with the critical t value (computed internally) we get a good p value of 0.0005 and thus we reject the null hypothesis and thus it proves that the mean of the two distributions are different and statistically significant.

## Cross Checking with the internal scipy function
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))

x = numpy.linspace(-5,5,1000)

plt.plot(x,stats.t.pdf(x,df=df))
plt.scatter(numpy.array([t]),numpy.array([0]),color='red')
plt.show()

t = 2.4141807955588708
p = 0.026639355757870353
t = 2.4141807955588717
p = 0.02663935575787036

Bayesian Approach in a Gaussian world -- the Kalman filter¶

In [10]:

import matplotlib.pyplot as plt
import numpy

mu = 1
sigmaX = 1
sigmaZ = 1
n = 100

X = numpy.random.normal(mu,sigmaX,1)*numpy.ones(n)
Z = numpy.random.normal(0,sigmaZ,n)
Y = X + Z
Xhat = numpy.ones(n)

for i in range(n):
    x = numpy.linspace(-10,10,1000)
    l = numpy.exp(-0.5*(Y[i]-x)**2/sigmaZ**2)*numpy.exp(-0.5*(x-mu)**2/sigmaX**2)
    l = l/numpy.sum(l)*1000/20
    plt.plot(x,l)
    plt.scatter(numpy.array([X[0]]),numpy.array([0]))
    plt.show()
    Xhat[i] = numpy.sum(l*x)*20/1000
    helper=(mu/sigmaX**2+Y[i]/sigmaZ**2)
    sigmaX = numpy.sqrt(1/(1/sigmaX**2+1/sigmaZ**2))
    mu = sigmaX**2*helper

In [11]:

plt.plot(X)
plt.plot(Y)
plt.plot(Xhat)
plt.show()

In [ ]: