The marginal distributions are obtained by summing the probabilities over all values of the variable being marginalized. Thus, to obtain $p(x)$ we sum over the columns (i.e., over the values corresponding to different $y$):

$$
\begin{aligned}
p(x_1) &= P(X=x_1) = P(X=x_1,Y=y_1)+P(X=x_1,Y=y_2)+P(X=x_1,Y=y_3) = 0.01+0.05+0.1 = 0.16,\\
p(x_2) &= P(X=x_2) = P(X=x_2,Y=y_1)+P(X=x_2,Y=y_2)+P(X=x_2,Y=y_3) = 0.02+0.1+0.05 = 0.17,\\
p(x_3) &= P(X=x_3) = P(X=x_3,Y=y_1)+P(X=x_3,Y=y_2)+P(X=x_3,Y=y_3) = 0.03+0.05+0.03 = 0.11,\\
p(x_4) &= P(X=x_4) = P(X=x_4,Y=y_1)+P(X=x_4,Y=y_2)+P(X=x_4,Y=y_3) = 0.1+0.07+0.05 = 0.22,\\
p(x_5) &= P(X=x_5) = P(X=x_5,Y=y_1)+P(X=x_5,Y=y_2)+P(X=x_5,Y=y_3) = 0.1+0.2+0.04 = 0.34.
\end{aligned}
$$
As a correctness check, note that this distribution satisfies the normalization condition, i.e., the probabilities sum to 1:

$$\sum_{i=1}^{5} p(x_i) = 1.$$
The marginal distribution $p(y)$ can be obtained in a similar way, by summing the matrix rows:

$$
\begin{aligned}
p(y_1) &= P(Y=y_1) = \sum_{i=1}^{5} P(X=x_i, Y=y_1) = 0.01+0.02+0.03+0.1+0.1 = 0.26,\\
p(y_2) &= P(Y=y_2) = \sum_{i=1}^{5} P(X=x_i, Y=y_2) = 0.05+0.1+0.05+0.07+0.2 = 0.47,\\
p(y_3) &= P(Y=y_3) = \sum_{i=1}^{5} P(X=x_i, Y=y_3) = 0.1+0.05+0.03+0.05+0.04 = 0.27.
\end{aligned}
$$
We can again check that the normalization condition is satisfied:
$$\sum_{i=1}^{3} p(y_i) = 1.$$
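As a quick numerical sanity check, the two marginals and their normalization can be reproduced with a few lines of NumPy. This is a minimal sketch; the assumed layout (rows indexed by $y_1,\ldots,y_3$, columns by $x_1,\ldots,x_5$) matches the sums above.

```python
import numpy as np

# Joint table P(X = x_j, Y = y_i): rows are y1..y3, columns are x1..x5
# (layout assumed to match the worked sums above).
P = np.array([
    [0.01, 0.02, 0.03, 0.10, 0.10],   # y1
    [0.05, 0.10, 0.05, 0.07, 0.20],   # y2
    [0.10, 0.05, 0.03, 0.05, 0.04],   # y3
])

p_x = P.sum(axis=0)   # marginal p(x): sum each column (over y)
p_y = P.sum(axis=1)   # marginal p(y): sum each row (over x)

print(p_x)                   # [0.16 0.17 0.11 0.22 0.34]
print(p_y)                   # [0.26 0.47 0.27]
print(p_x.sum(), p_y.sum())  # both ~1 (normalization check)
```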
To determine the conditional distributions we use the definition of conditional probability:

$$P(X=x, Y=y_1) = P(X=x \mid Y=y_1)\,P(Y=y_1) = p(x \mid Y=y_1)\,p(y_1).$$
Thus,
$$
\begin{aligned}
p(x_1 \mid Y=y_1) &= \frac{P(X=x_1, Y=y_1)}{p(y_1)} = \frac{0.01}{0.26} \approx 0.038,\\
p(x_2 \mid Y=y_1) &= \frac{P(X=x_2, Y=y_1)}{p(y_1)} = \frac{0.02}{0.26} \approx 0.077,\\
p(x_3 \mid Y=y_1) &= \frac{P(X=x_3, Y=y_1)}{p(y_1)} = \frac{0.03}{0.26} \approx 0.115,\\
p(x_4 \mid Y=y_1) &= \frac{P(X=x_4, Y=y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385,\\
p(x_5 \mid Y=y_1) &= \frac{P(X=x_5, Y=y_1)}{p(y_1)} = \frac{0.1}{0.26} \approx 0.385.
\end{aligned}
$$
Likewise, the conditional distribution $p(y \mid X=x_3)$ is given by

$$
\begin{aligned}
p(y_1 \mid X=x_3) &= \frac{P(X=x_3, Y=y_1)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273,\\
p(y_2 \mid X=x_3) &= \frac{P(X=x_3, Y=y_2)}{p(x_3)} = \frac{0.05}{0.11} \approx 0.455,\\
p(y_3 \mid X=x_3) &= \frac{P(X=x_3, Y=y_3)}{p(x_3)} = \frac{0.03}{0.11} \approx 0.273.
\end{aligned}
$$
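The conditionals can be read off the same table by renormalizing a single row or column; a minimal sketch under the same assumed layout:

```python
import numpy as np

P = np.array([
    [0.01, 0.02, 0.03, 0.10, 0.10],   # y1
    [0.05, 0.10, 0.05, 0.07, 0.20],   # y2
    [0.10, 0.05, 0.03, 0.05, 0.04],   # y3
])

# p(x | Y = y1): take the y1 row of the joint and divide by p(y1)
p_x_given_y1 = P[0] / P[0].sum()
print(p_x_given_y1)   # ~[0.038 0.077 0.115 0.385 0.385]

# p(y | X = x3): take the x3 column and divide by p(x3)
p_y_given_x3 = P[:, 2] / P[:, 2].sum()
print(p_y_given_x3)   # ~[0.273 0.455 0.273]
```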
We can write the probability density of the two-dimensional distribution as
$$
p(x,y) = 0.4\,\mathcal{N}\!\left(x,y \,\middle|\, \begin{bmatrix}10\\2\end{bmatrix}, \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right) + 0.6\,\mathcal{N}\!\left(x,y \,\middle|\, \begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}8.4 & 2.0\\ 2.0 & 1.7\end{bmatrix}\right).
$$
The marginal distribution of a weighted sum of distributions is the weighted sum of the marginals, and the marginals of a bivariate normal distribution $\mathcal{N}(x, y \mid \mu, \Sigma)$ are obtained according to the rule

$$\int \mathcal{N}(x, y \mid \mu, \Sigma)\,dy = \mathcal{N}(x \mid \mu_x, \Sigma_{xx}), \qquad \int \mathcal{N}(x, y \mid \mu, \Sigma)\,dx = \mathcal{N}(y \mid \mu_y, \Sigma_{yy}).$$
Thus, the marginals of the distribution of interest are
$$p(x) = 0.4\,\mathcal{N}(x \mid 10, 1) + 0.6\,\mathcal{N}(x \mid 0, 8.4), \qquad p(y) = 0.4\,\mathcal{N}(y \mid 2, 1) + 0.6\,\mathcal{N}(y \mid 0, 1.7).$$
The mean of a weighted sum of two distributions is the weighted sum of their means:

$$\mathbb{E}_X[x] = 0.4 \cdot 10 + 0.6 \cdot 0 = 4, \qquad \mathbb{E}_Y[y] = 0.4 \cdot 2 + 0.6 \cdot 0 = 0.8.$$
The mode of a continuous distribution is a point where this distribution has a peak. It can be determined by solving the extremum condition for each of the marginal distributions:
$$\frac{dp(x)}{dx} = 0, \qquad \frac{dp(y)}{dy} = 0.$$
In the case of a mixture of normal distributions these equations are nonlinear and can be solved only numerically. After finding all solutions of these equations, one has to verify for every solution that it is a peak rather than a minimum or an inflection point, i.e., that at this point

$$\frac{d^2 p(x)}{dx^2} < 0 \qquad \text{or} \qquad \frac{d^2 p(y)}{dy^2} < 0$$

for the $x$- and $y$-marginal, respectively.
The medians $m_x, m_y$ can be determined from the conditions

$$\int_{-\infty}^{m_x} p(x)\,dx = \int_{m_x}^{+\infty} p(x)\,dx, \qquad \int_{-\infty}^{m_y} p(y)\,dy = \int_{m_y}^{+\infty} p(y)\,dy.$$
Again, these equations can only be solved numerically.
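For the marginal $p(x)$ derived above, this numerical solution takes only a few lines of SciPy. A minimal sketch; the search bounds and root bracket below are illustrative choices, not part of the exercise.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar, brentq

# Marginal p(x) = 0.4 N(x | 10, 1) + 0.6 N(x | 0, 8.4); scipy's norm is
# parameterized by the standard deviation, hence the square roots.
def p_x(x):
    return 0.4 * norm.pdf(x, 10, 1) + 0.6 * norm.pdf(x, 0, np.sqrt(8.4))

def cdf_x(x):
    return 0.4 * norm.cdf(x, 10, 1) + 0.6 * norm.cdf(x, 0, np.sqrt(8.4))

# Modes: run one bounded maximization of the density near each component
# mean, since each component may contribute a local peak.
modes = [minimize_scalar(lambda x: -p_x(x), bounds=(m - 3, m + 3),
                         method="bounded").x for m in (10.0, 0.0)]
print(modes)          # two local maxima, near 10 and near 0

# Median: root of F(x) - 1/2 on a wide bracket.
median = brentq(lambda x: cdf_x(x) - 0.5, -20.0, 20.0)
print(median)
```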
The mean of a two-dimensional distribution is the vector of the means of the marginal distributions:

$$\mu = \begin{bmatrix} 4 \\ 0.8 \end{bmatrix}.$$
The mode of a two-dimensional distribution is obtained by first solving the extremum conditions

$$\frac{\partial p(x,y)}{\partial x} = 0, \qquad \frac{\partial p(x,y)}{\partial y} = 0,$$
and then verifying for every solution that it is indeed a peak, i.e.
$$\frac{\partial^2 p(x,y)}{\partial x^2} < 0, \qquad \frac{\partial^2 p(x,y)}{\partial y^2} < 0, \qquad \det\begin{bmatrix} \dfrac{\partial^2 p(x,y)}{\partial x^2} & \dfrac{\partial^2 p(x,y)}{\partial x\,\partial y} \\[6pt] \dfrac{\partial^2 p(x,y)}{\partial x\,\partial y} & \dfrac{\partial^2 p(x,y)}{\partial y^2} \end{bmatrix} > 0.$$
Again, these equations can be solved only numerically.
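A sketch of the two-dimensional case, maximizing the mixture density with scipy.optimize.minimize. Starting one local search from each component mean is a heuristic choice and is not guaranteed to enumerate all critical points.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import minimize

# The two components of p(x, y) from the problem statement.
comp1 = multivariate_normal([10, 2], [[1.0, 0.0], [0.0, 1.0]])
comp2 = multivariate_normal([0, 0], [[8.4, 2.0], [2.0, 1.7]])

def neg_p(v):
    return -(0.4 * comp1.pdf(v) + 0.6 * comp2.pdf(v))

# One local search per component mean; each converges to a nearby peak.
for start in ([10.0, 2.0], [0.0, 0.0]):
    res = minimize(neg_p, start)
    print(res.x, -res.fun)    # location and height of the local maximum
```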
The conjugate prior to the Bernoulli distribution is the Beta distribution
$$p(\mu \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\mu^{\alpha-1}(1-\mu)^{\beta-1} \propto \mu^{\alpha-1}(1-\mu)^{\beta-1},$$

where $\alpha, \beta$ are not necessarily integers and the normalization coefficient is the Beta function, defined as

$$B(\alpha, \beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\,dt.$$
The likelihood of observing the data $\{x_1, x_2, \ldots, x_N\}$ is

$$p(x_1, \ldots, x_N \mid \mu) = \prod_{i=1}^{N} p(x_i \mid \mu) = \prod_{i=1}^{N} \mu^{x_i}(1-\mu)^{1-x_i} = \mu^{\sum_{i=1}^{N} x_i}\,(1-\mu)^{N - \sum_{i=1}^{N} x_i}.$$
The posterior distribution is proportional to the product of this likelihood and the prior distribution (Bayes' theorem):

$$p(\mu \mid x_1, \ldots, x_N) \propto p(x_1, \ldots, x_N \mid \mu)\, p(\mu \mid \alpha, \beta) \propto \mu^{\sum_{i=1}^{N} x_i + \alpha - 1}\,(1-\mu)^{N - \sum_{i=1}^{N} x_i + \beta - 1}.$$
This is also a Beta distribution, i.e., our choice of the conjugate prior was correct. The normalization constant is readily determined: the posterior is $\text{Beta}\!\left(\sum_{i=1}^{N} x_i + \alpha,\; N - \sum_{i=1}^{N} x_i + \beta\right)$, so

$$p(\mu \mid x_1, \ldots, x_N) = \frac{1}{B\!\left(\sum_{i=1}^{N} x_i + \alpha,\; N - \sum_{i=1}^{N} x_i + \beta\right)}\,\mu^{\sum_{i=1}^{N} x_i + \alpha - 1}\,(1-\mu)^{N - \sum_{i=1}^{N} x_i + \beta - 1}.$$
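In code, the conjugate update amounts to two additions. A minimal sketch with hypothetical prior parameters and data:

```python
from scipy.stats import beta

# Hypothetical prior Beta(a, b) and Bernoulli observations x_1..x_N.
a, b = 2.0, 3.0
xs = [1, 0, 1, 1, 0, 1]

a_post = a + sum(xs)             # alpha + number of ones
b_post = b + len(xs) - sum(xs)   # beta + number of zeros
posterior = beta(a_post, b_post)
print(a_post, b_post, posterior.mean())   # posterior mean of mu
```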
The probabilities of picking a mango or an apple from the first bag are given by

$$p(\text{mango} \mid 1) = \frac{4}{6} = \frac{2}{3}, \qquad p(\text{apple} \mid 1) = \frac{2}{6} = \frac{1}{3}.$$

The probabilities of picking a mango or an apple from the second bag are

$$p(\text{mango} \mid 2) = \frac{4}{8} = \frac{1}{2}, \qquad p(\text{apple} \mid 2) = \frac{4}{8} = \frac{1}{2}.$$
The probabilities of picking the first or the second bag are equal to the probabilities of heads and tails, respectively:

$$p(1) = 0.6, \qquad p(2) = 0.4.$$
We can now obtain the probability that the mango was picked from the second bag using Bayes' theorem:

$$p(2 \mid \text{mango}) = \frac{p(\text{mango} \mid 2)\,p(2)}{p(\text{mango})} = \frac{p(\text{mango} \mid 2)\,p(2)}{p(\text{mango} \mid 1)\,p(1) + p(\text{mango} \mid 2)\,p(2)} = \frac{\frac{1}{2} \cdot 0.4}{\frac{2}{3} \cdot 0.6 + \frac{1}{2} \cdot 0.4} = \frac{1}{3}.$$
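The same arithmetic can be verified with exact fractions:

```python
from fractions import Fraction

p_mango_1, p_mango_2 = Fraction(2, 3), Fraction(1, 2)  # p(mango | bag)
p_1, p_2 = Fraction(3, 5), Fraction(2, 5)              # p(bag) = 0.6, 0.4

p_mango = p_mango_1 * p_1 + p_mango_2 * p_2            # total probability
print(p_mango_2 * p_2 / p_mango)                       # 1/3
```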
The state $x_{t+1}$ is obtained from $x_t$ by a linear transformation, $Ax_t$, followed by the addition of a Gaussian random variable $w$. The initial distribution of $x_0$ is Gaussian, a linear transformation of a Gaussian random variable is again a Gaussian random variable, and a sum of Gaussian random variables is a Gaussian random variable. Thus, the joint distribution $p(x_0, x_1, \ldots, x_T)$ is also Gaussian.

Let $z = Ax_t$. Since this is a linear transformation of a Gaussian random variable $x_t \sim \mathcal{N}(\mu_t, \Sigma)$, $z$ is distributed as (see Eq. (6.88))

$$z \sim \mathcal{N}(A\mu_t,\, A\Sigma A^T),$$

whereas the mean and the covariance of a sum of two independent Gaussian random variables are given by the sums of the individual means and covariances, i.e.,

$$x_{t+1} = z + w \sim \mathcal{N}(A\mu_t,\, A\Sigma A^T + Q).$$
That is
$$p(x_{t+1} \mid y_1, \ldots, y_t) = \mathcal{N}(x_{t+1} \mid A\mu_t,\, A\Sigma A^T + Q).$$

If we assume that $x_{t+1}$ is fixed, then $y_{t+1} = Cx_{t+1} + v$ follows the same distribution as $v$, but with the mean shifted by $Cx_{t+1}$, i.e.,

$$p(y_{t+1} \mid x_{t+1}, y_1, \ldots, y_t) = \mathcal{N}(y_{t+1} \mid Cx_{t+1}, R).$$
The joint probability is then obtained as

$$p(y_{t+1}, x_{t+1} \mid y_1, \ldots, y_t) = p(y_{t+1} \mid x_{t+1}, y_1, \ldots, y_t)\, p(x_{t+1} \mid y_1, \ldots, y_t) = \mathcal{N}(y_{t+1} \mid Cx_{t+1}, R)\,\mathcal{N}(x_{t+1} \mid A\mu_t,\, A\Sigma A^T + Q).$$
Let us introduce the temporary notation

$$\mu_{t+1} = A\mu_t, \qquad \Sigma_{t+1} = A\Sigma A^T + Q, \qquad p(x_{t+1} \mid y_1, \ldots, y_t) = \mathcal{N}(\mu_{t+1}, \Sigma_{t+1}).$$
Then the distribution of $y_{t+1}$ is obtained in terms of the parameters of $p(x_{t+1} \mid y_1, \ldots, y_t)$ by following the same steps as in question 1), with the result

$$p(y_{t+1} \mid y_1, \ldots, y_t) = \mathcal{N}(y_{t+1} \mid C\mu_{t+1},\, C\Sigma_{t+1}C^T + R) = \mathcal{N}(y_{t+1} \mid CA\mu_t,\, C(A\Sigma A^T + Q)C^T + R).$$
The required conditional distribution is then obtained as
$$p(x_{t+1} \mid y_1, \ldots, y_t, y_{t+1}) = \frac{p(y_{t+1}, x_{t+1} \mid y_1, \ldots, y_t)}{p(y_{t+1} \mid y_1, \ldots, y_t)} = \frac{\mathcal{N}(y_{t+1} \mid Cx_{t+1}, R)\,\mathcal{N}(x_{t+1} \mid A\mu_t,\, A\Sigma A^T + Q)}{\mathcal{N}(y_{t+1} \mid CA\mu_t,\, C(A\Sigma A^T + Q)C^T + R)}.$$
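These are exactly the predict-step and marginal-likelihood formulas of the Kalman filter. A minimal NumPy sketch; the matrices below are hypothetical two-dimensional choices for illustration, not values from the exercise.

```python
import numpy as np

# Hypothetical model matrices and current filtered moments (mu_t, Sigma).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.1 * np.eye(2)            # process noise covariance
R = np.array([[0.5]])          # measurement noise covariance
mu_t = np.array([0.0, 1.0])
Sigma = np.eye(2)

# p(x_{t+1} | y_1..y_t) = N(A mu_t, A Sigma A^T + Q)
mu_pred = A @ mu_t
Sigma_pred = A @ Sigma @ A.T + Q

# p(y_{t+1} | y_1..y_t) = N(C mu_pred, C Sigma_pred C^T + R)
y_mean = C @ mu_pred
y_cov = C @ Sigma_pred @ C.T + R
print(mu_pred, Sigma_pred, y_mean, y_cov)
```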
The standard definition of variance is
$$\mathbb{V}_X[x] = \mathbb{E}_X[(x - \mu)^2],$$
where
$$\mu = \mathbb{E}_X[x].$$
Using the linearity of the expectation we can write

$$
\begin{aligned}
\mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] &= \mathbb{E}_X[x^2 - 2x\mu + \mu^2] = \mathbb{E}_X[x^2] - \mathbb{E}_X[2x\mu] + \mathbb{E}_X[\mu^2]\\
&= \mathbb{E}_X[x^2] - 2\mu\,\mathbb{E}_X[x] + \mu^2 = \mathbb{E}_X[x^2] - 2\mu^2 + \mu^2 = \mathbb{E}_X[x^2] - \mu^2.
\end{aligned}
$$

Substituting the definition of $\mu$ into this equation, we obtain the desired relation

$$\mathbb{V}_X[x] = \mathbb{E}_X[(x-\mu)^2] = \mathbb{E}_X[x^2] - \left(\mathbb{E}_X[x]\right)^2.$$
Let us expand the square on the left-hand side of (6.45):

$$\frac{1}{N^2}\sum_{i,j=1}^{N}(x_i - x_j)^2 = \frac{1}{N^2}\sum_{i,j=1}^{N}\left(x_i^2 - 2x_ix_j + x_j^2\right) = \frac{1}{N^2}\sum_{i,j=1}^{N}x_i^2 - \frac{2}{N^2}\sum_{i,j=1}^{N}x_ix_j + \frac{1}{N^2}\sum_{i,j=1}^{N}x_j^2.$$

We see that the first and the last terms differ only by the summation index, i.e., they are identical:

$$\frac{1}{N^2}\sum_{i,j=1}^{N}x_i^2 + \frac{1}{N^2}\sum_{i,j=1}^{N}x_j^2 = \frac{2}{N^2}\sum_{i,j=1}^{N}x_i^2 = \frac{2}{N}\sum_{i=1}^{N}x_i^2,$$

since the summation over $j$ gives a factor of $N$.
The remaining term can be written as
$$\frac{2}{N^2}\sum_{i,j=1}^{N}x_ix_j = \frac{2}{N^2}\sum_{i=1}^{N}x_i\sum_{j=1}^{N}x_j = 2\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2,$$

where we again used the fact that the sum does not depend on the name of the summation index. We have thus proved the required relation:

$$\frac{1}{N^2}\sum_{i,j=1}^{N}(x_i - x_j)^2 = \frac{2}{N}\sum_{i=1}^{N}x_i^2 - 2\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right)^2.$$
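A quick numerical verification of this identity on an arbitrary sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)       # any sample works
N = len(x)

lhs = ((x[:, None] - x[None, :]) ** 2).sum() / N**2
rhs = 2 * (x**2).mean() - 2 * x.mean() ** 2
print(np.isclose(lhs, rhs))    # True
```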
The Bernoulli distribution is given by

$$p(x \mid \mu) = \mu^x(1-\mu)^{1-x}.$$
We can use the relation

$$a^x = e^{x\log a}$$
to write the Bernoulli distribution as
$$p(x \mid \mu) = e^{x\log\mu + (1-x)\log(1-\mu)} = e^{x\log\left(\frac{\mu}{1-\mu}\right) + \log(1-\mu)} = h(x)\,e^{\theta x - A(\theta)},$$

where the last equality is the definition of a single-parameter distribution from the exponential family, in which

$$h(x) = 1, \qquad \theta = \log\left(\frac{\mu}{1-\mu}\right) \;\Leftrightarrow\; \mu = \frac{e^{\theta}}{1+e^{\theta}}, \qquad A(\theta) = -\log(1-\mu) = \log\left(1+e^{\theta}\right).$$
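A small numerical check that the two parameterizations agree (the helper names are ours, for illustration only):

```python
import numpy as np

def bernoulli_pmf(x, mu):
    return mu**x * (1 - mu) ** (1 - x)

def exp_family_pmf(x, theta):
    # h(x) = 1, sufficient statistic x, A(theta) = log(1 + e^theta)
    return np.exp(theta * x - np.log1p(np.exp(theta)))

mu = 0.3
theta = np.log(mu / (1 - mu))   # natural parameter
for x in (0, 1):
    print(bernoulli_pmf(x, mu), exp_family_pmf(x, theta))  # pairs agree
```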
The binomial distribution can be transformed as
$$p(x \mid N, \mu) = \binom{N}{x}\mu^x(1-\mu)^{N-x} = \binom{N}{x}e^{x\log\mu + (N-x)\log(1-\mu)} = \binom{N}{x}e^{x\log\left(\frac{\mu}{1-\mu}\right) + N\log(1-\mu)} = h(x)\,e^{x\theta - A(\theta)},$$
where
$$h(x) = \binom{N}{x}, \qquad \theta = \log\left(\frac{\mu}{1-\mu}\right), \qquad A(\theta) = -N\log(1-\mu) = N\log\left(1+e^{\theta}\right),$$

i.e., the binomial distribution can be represented as an exponential family distribution (only $\mu$ is treated as a parameter here, since the number of trials $N$ is fixed).
Similarly, the beta distribution can be transformed as

$$p(x \mid \alpha, \beta) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1} = e^{(\alpha-1)\log x + (\beta-1)\log(1-x) - \log B(\alpha,\beta)} = h(x)\,e^{\theta_1\phi_1(x) + \theta_2\phi_2(x) - A(\theta_1,\theta_2)},$$
where
$$h(x) = 1, \qquad \theta_1 = \alpha - 1, \qquad \theta_2 = \beta - 1, \qquad \phi_1(x) = \log x, \qquad \phi_2(x) = \log(1-x),$$

$$A(\theta_1, \theta_2) = \log B(\alpha, \beta) = \log B(1+\theta_1,\, 1+\theta_2),$$

i.e., this is a distribution from the exponential family.
The product of the two distributions is then given by
$$p(x \mid N, \mu)\,p(x \mid \alpha, \beta) = \binom{N}{x}e^{x\log\left(\frac{\mu}{1-\mu}\right) + (\alpha-1)\log x + (\beta-1)\log(1-x) + N\log(1-\mu) - \log B(\alpha,\beta)} = h(x)\,e^{\theta_1\phi_1(x) + \theta_2\phi_2(x) + \theta_3\phi_3(x) - A(\theta_1,\theta_2,\theta_3)},$$
where
$$h(x) = \binom{N}{x}, \qquad \theta_1 = \alpha - 1, \qquad \theta_2 = \beta - 1, \qquad \theta_3 = \log\left(\frac{\mu}{1-\mu}\right),$$

$$\phi_1(x) = \log x, \qquad \phi_2(x) = \log(1-x), \qquad \phi_3(x) = x,$$

$$A(\theta_1,\theta_2,\theta_3) = \log B(\alpha,\beta) - N\log(1-\mu) = \log B(1+\theta_1,\, 1+\theta_2) + N\log\left(1+e^{\theta_3}\right).$$
The two normal distributions are given by
$$\mathcal{N}(x \mid a, A) = (2\pi)^{-\frac{D}{2}}|A|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(x-a)^T A^{-1}(x-a)\right], \qquad \mathcal{N}(x \mid b, B) = (2\pi)^{-\frac{D}{2}}|B|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(x-b)^T B^{-1}(x-b)\right];$$
their product is
$$\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B) = (2\pi)^{-D}\left(|A||B|\right)^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}\left[(x-a)^T A^{-1}(x-a) + (x-b)^T B^{-1}(x-b)\right]\right\}.$$
The expression in the exponent can be written as
$$
\begin{aligned}
\Phi &= (x-a)^T A^{-1}(x-a) + (x-b)^T B^{-1}(x-b)\\
&= x^T A^{-1}x - a^T A^{-1}x - x^T A^{-1}a + a^T A^{-1}a + x^T B^{-1}x - b^T B^{-1}x - x^T B^{-1}b + b^T B^{-1}b\\
&= x^T(A^{-1}+B^{-1})x - (a^T A^{-1} + b^T B^{-1})x - x^T(A^{-1}a + B^{-1}b) + a^T A^{-1}a + b^T B^{-1}b.
\end{aligned}
$$

We now introduce the notation

$$C^{-1} = A^{-1} + B^{-1}, \qquad c = C(A^{-1}a + B^{-1}b), \qquad c^T = (a^T A^{-1} + b^T B^{-1})C$$

(the last equality can be checked by transposing the second one, using the symmetry of $A$, $B$, and $C$). The expression in the exponent now takes the form

$$
\begin{aligned}
\Phi &= x^T C^{-1}x - c^T C^{-1}x - x^T C^{-1}c + a^T A^{-1}a + b^T B^{-1}b\\
&= x^T C^{-1}x - c^T C^{-1}x - x^T C^{-1}c + c^T C^{-1}c + a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c\\
&= (x-c)^T C^{-1}(x-c) + a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c,
\end{aligned}
$$
where we have completed the square.
The product of the two probability distributions can now be written as

$$
\begin{aligned}
\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B) &= (2\pi)^{-D}\left(|A||B|\right)^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}\left[(x-c)^T C^{-1}(x-c) + a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c\right]\right\}\\
&= (2\pi)^{-\frac{D}{2}}|C|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(x-c)^T C^{-1}(x-c)\right]\\
&\quad\times(2\pi)^{-\frac{D}{2}}\left(\frac{|A||B|}{|C|}\right)^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}\left[a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c\right]\right\}\\
&= c\,\mathcal{N}(x \mid c, C),
\end{aligned}
$$

where we defined the scalar prefactor (with a slight abuse of notation, $c$ denotes both this constant and the mean vector above)

$$c = (2\pi)^{-\frac{D}{2}}\left(\frac{|A||B|}{|C|}\right)^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}\left[a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c\right]\right\}.$$
We can now use the properties that (a) the determinant of a product of matrices is the product of their determinants, and (b) the determinant of a matrix inverse is the inverse of the determinant, and write

$$\frac{|A||B|}{|C|} = |A||C^{-1}||B| = |AC^{-1}B| = |A(A^{-1}+B^{-1})B| = |B + A| = |A+B|.$$
For the expression in the exponent we can write
$$
\begin{aligned}
&a^T A^{-1}a + b^T B^{-1}b - c^T C^{-1}c\\
&\quad= a^T A^{-1}a + b^T B^{-1}b - (a^T A^{-1} + b^T B^{-1})(A^{-1}+B^{-1})^{-1}(A^{-1}a + B^{-1}b)\\
&\quad= a^T\left[A^{-1} - A^{-1}(A^{-1}+B^{-1})^{-1}A^{-1}\right]a + b^T\left[B^{-1} - B^{-1}(A^{-1}+B^{-1})^{-1}B^{-1}\right]b\\
&\qquad- a^T A^{-1}(A^{-1}+B^{-1})^{-1}B^{-1}b - b^T B^{-1}(A^{-1}+B^{-1})^{-1}A^{-1}a.
\end{aligned}
$$

Using the property $(AB)^{-1} = B^{-1}A^{-1}$ we obtain

$$A^{-1}(A^{-1}+B^{-1})^{-1}B^{-1} = \left[B(A^{-1}+B^{-1})A\right]^{-1} = (B + A)^{-1} = (A+B)^{-1}$$
and
$$
\begin{aligned}
A^{-1} - A^{-1}(A^{-1}+B^{-1})^{-1}A^{-1} &= A^{-1}\left[I - (A^{-1}+B^{-1})^{-1}A^{-1}\right] = A^{-1}\left[I - B(A+B)^{-1}AA^{-1}\right]\\
&= A^{-1}\left[I - B(A+B)^{-1}\right] = A^{-1}\left[(A+B) - B\right](A+B)^{-1} = (A+B)^{-1},
\end{aligned}
$$

where we used $(A^{-1}+B^{-1})^{-1} = B(A+B)^{-1}A$, which follows from the previous identity. Collecting the quadratic and cross terms into a complete square, we thus conclude that

$$c = (2\pi)^{-\frac{D}{2}}|A+B|^{-\frac{1}{2}}\exp\left\{-\tfrac{1}{2}(a-b)^T(A+B)^{-1}(a-b)\right\} = \mathcal{N}(a \mid b,\, A+B) = \mathcal{N}(b \mid a,\, A+B).$$
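A small NumPy/SciPy check of this identity at a random test point, with arbitrary (hypothetical) means and covariances:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
D = 2
a, b = rng.normal(size=D), rng.normal(size=D)
A = 2.0 * np.eye(D)                       # hypothetical covariance
B = np.array([[1.0, 0.3], [0.3, 0.5]])    # hypothetical covariance

C = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
c_mean = C @ (np.linalg.inv(A) @ a + np.linalg.inv(B) @ b)
c_scale = mvn(b, A + B).pdf(a)            # c = N(a | b, A + B)

x = rng.normal(size=D)                    # any test point
lhs = mvn(a, A).pdf(x) * mvn(b, B).pdf(x)
rhs = c_scale * mvn(c_mean, C).pdf(x)
print(np.isclose(lhs, rhs))               # True
```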
The multivariate normal distribution $\mathcal{N}(x \mid a, A)$ can be represented as a distribution from the exponential family:

$$
\begin{aligned}
\mathcal{N}(x \mid a, A) &= (2\pi)^{-\frac{D}{2}}|A|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}(x-a)^T A^{-1}(x-a)\right]\\
&= (2\pi)^{-\frac{D}{2}}\exp\left[-\tfrac{1}{2}\operatorname{tr}\!\left(A^{-1}xx^T\right) + a^T A^{-1}x - \tfrac{1}{2}a^T A^{-1}a - \tfrac{1}{2}\log|A|\right],
\end{aligned}
$$
where we used that $a^T A^{-1}x = x^T A^{-1}a$, and also wrote the quadratic term as

$$x^T A^{-1}x = \sum_{i,j}x_i(A^{-1})_{ij}x_j = \sum_{i,j}(A^{-1})_{ij}x_jx_i = \sum_{i,j}(A^{-1})_{ij}(xx^T)_{ji} = \operatorname{tr}\!\left(A^{-1}xx^T\right).$$
Representing $\mathcal{N}(x \mid b, B)$ in a similar way and multiplying the two distributions, we readily obtain

$$
\begin{aligned}
\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B) = (2\pi)^{-D}\exp\Big\{&-\tfrac{1}{2}\operatorname{tr}\!\left[(A^{-1}+B^{-1})xx^T\right] + (a^T A^{-1} + b^T B^{-1})x\\
&-\tfrac{1}{2}a^T A^{-1}a - \tfrac{1}{2}\log|A| - \tfrac{1}{2}b^T B^{-1}b - \tfrac{1}{2}\log|B|\Big\} = c\,\mathcal{N}(x \mid c, C),
\end{aligned}
$$
where we defined
$$C^{-1} = A^{-1} + B^{-1}, \qquad c^T C^{-1} = a^T A^{-1} + b^T B^{-1},$$

$$c = (2\pi)^{-\frac{D}{2}}\exp\left\{\tfrac{1}{2}c^T C^{-1}c + \tfrac{1}{2}\log|C| - \tfrac{1}{2}a^T A^{-1}a - \tfrac{1}{2}\log|A| - \tfrac{1}{2}b^T B^{-1}b - \tfrac{1}{2}\log|B|\right\}.$$
The coefficient $c$ can now be reduced to the required form using the matrix transformations described in part a).
The expectation value and the conditional expectation value are given by
$$\mathbb{E}_X[x] = \int x\,p(x)\,dx, \qquad \mathbb{E}_Y[f(y)] = \int f(y)\,p(y)\,dy, \qquad \mathbb{E}_X[x \mid y] = \int x\,p(x \mid y)\,dx.$$
We then have
$$
\begin{aligned}
\mathbb{E}_Y\!\left[\mathbb{E}_X[x \mid y]\right] &= \int \mathbb{E}_X[x \mid y]\,p(y)\,dy = \int\left[\int x\,p(x \mid y)\,dx\right]p(y)\,dy = \iint x\,p(x \mid y)\,p(y)\,dx\,dy\\
&= \iint x\,p(x,y)\,dx\,dy = \int x\left[\int p(x,y)\,dy\right]dx = \int x\,p(x)\,dx = \mathbb{E}_X[x],
\end{aligned}
$$
where we used the definition of the conditional probability density,

$$p(x \mid y)\,p(y) = p(x,y).$$
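A Monte Carlo sketch of this law of total expectation for a hypothetical linear-Gaussian pair, where $\mathbb{E}_X[x \mid y]$ is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical hierarchical model: y ~ N(0, 1), x | y ~ N(2y + 1, 1),
# so E_X[x | y] = 2y + 1 exactly.
y = rng.normal(size=500_000)
x = rng.normal(loc=2 * y + 1)

print((2 * y + 1).mean(), x.mean())   # both ~1: E_Y[E_X[x|y]] = E_X[x]
```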
If $x$ is fixed, then $y$ has the same distribution as $w$, but with the mean shifted by $Ax + b$, that is,

$$p(y \mid x) = \mathcal{N}(y \mid Ax + b,\, Q).$$
Let us consider the random variable $u = Ax$; it is distributed according to

$$p(u) = \mathcal{N}(u \mid A\mu_x,\, A\Sigma_x A^T).$$

Then $y$ is a sum of two independent Gaussian random variables $u$ and $w$, with its mean additionally shifted by $b$, that is,

$$p(y) = \mathcal{N}(y \mid A\mu_x + b,\, A\Sigma_x A^T + Q),$$

so that

$$\mu_y = A\mu_x + b, \qquad \Sigma_y = A\Sigma_x A^T + Q.$$
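A Monte Carlo check of these moments under hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical parameters for y = A x + b + w, w ~ N(0, Q).
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, -1.0])
mu_x, Sigma_x = np.array([1.0, 0.0]), np.eye(2)
Q = 0.2 * np.eye(2)

xs = rng.multivariate_normal(mu_x, Sigma_x, size=100_000)
ws = rng.multivariate_normal(np.zeros(2), Q, size=100_000)
ys = xs @ A.T + b + ws

print(ys.mean(axis=0))   # ~ A mu_x + b
print(np.cov(ys.T))      # ~ A Sigma_x A^T + Q
```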
As in b), assuming that $y$ is fixed we obtain the conditional distribution

$$p(z \mid y) = \mathcal{N}(z \mid Cy,\, R).$$

Since $Cy$ is a Gaussian random variable with distribution $\mathcal{N}(C\mu_y,\, C\Sigma_y C^T)$, we obtain the distribution of $z$ as that of a sum of two independent Gaussian random variables:

$$p(z) = \mathcal{N}(z \mid C\mu_y,\, C\Sigma_y C^T + R) = \mathcal{N}\!\left(z \mid C(A\mu_x + b),\, C(A\Sigma_x A^T + Q)C^T + R\right).$$
The posterior distribution $p(x \mid y)$ can be obtained by applying Bayes' theorem:

$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)} = \frac{\mathcal{N}(y \mid Ax + b,\, Q)\,\mathcal{N}(x \mid \mu_x,\, \Sigma_x)}{\mathcal{N}(y \mid A\mu_x + b,\, A\Sigma_x A^T + Q)}.$$
The cdf is related to the pdf as

$$F_X(x) = \int_{-\infty}^{x} f_X(x')\,dx', \qquad \frac{d}{dx}F_X(x) = f_X(x),$$

and takes values in the interval $[0,1]$.
The pdf of the variable $y = F_X(x)$ can then be obtained from the change-of-variables formula:

$$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right| = \frac{f_X(x)}{\left|dy/dx\right|} = \frac{f_X(x)}{\left|dF_X(x)/dx\right|} = \frac{f_X(x)}{f_X(x)} = 1,$$

i.e., $y$ is uniformly distributed on the interval $[0,1]$.
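A quick empirical check of this probability integral transform, using a standard normal for $F_X$:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)   # any continuous distribution works
y = norm.cdf(x)               # y = F_X(x)
print(kstest(y, "uniform"))   # large p-value: consistent with U[0, 1]
```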