#!/usr/bin/env python
# coding: utf-8
# # Solutions for Chapter 2 Probability
# + date: 2017-03-08
# + tags: mlapp, statistics
# + author: Peijun Zhu
# + slug: mlapp-ch2-sol
#
# $\def\bra#1{\mathinner{\langle{#1}|}}
# \def\ket#1{\mathinner{|{#1}\rangle}}
# \def\braket#1{\mathinner{\langle{#1}\rangle}}
# \newcommand{\mdef}{\overset{\mathrm{def}}{=}}
# \newcommand{\bm}{\mathbf}
# \DeclareMathOperator{\Var}{Var}
# \DeclareMathOperator{\det}{det}
# \DeclareMathOperator{\tr}{tr}
# \DeclareMathOperator{\E}{E}
# \DeclareMathOperator{\Cov}{Cov}
# \DeclareMathOperator{\Beta}{B}
# \DeclareMathOperator{\Bdist}{Beta}
# \DeclareMathOperator{\sgn}{sgn}
# \DeclareMathOperator{\adj}{adj}
# \DeclareMathOperator{\ii}{i}
# \DeclareMathOperator{\dd}{d}
# \newcommand{\pp}{\partial}
# \DeclareMathOperator{\rhs}{RHS}
# \DeclareMathOperator{\lhs}{LHS}
# \DeclareMathOperator{\Beta}{B}$
#
# ## 2.1
# 1. $P=1/3$
# 1. $P=1/2$, the difference is that you happen to see a child.
# ## 2.2 Obvious
# ## 2.3
# \begin{align}
# \Var[x+y]&=\E[(x+y)^2]-\E[x+y]^2\\
# &=\E[x^2]+\E[y^2]+2\E[xy]-\E[x]^2-\E[y]^2-2\E[x]\E[y]\\
# &=\Var[x]+\Var[y]+2\Cov[x,y]
# \end{align}
# ## 2.4
# $I, N$ means infected and not infected, $P$ means positive
# $$\Pr(I|P)=\frac{\Pr(IP)}{\Pr(P)}=\frac{\Pr(IP)}{\Pr(IP)+\Pr(NP)}\approx\frac{10^{-4}}{10^{-4}+0.01}\approx 0.01$$
# ## 2.5
# The door we chose has probability $1/3$ for the prize. After the host has revealed another invalid door, the chance of the last door is $1-1/3=2/3$. Note the host is **not randomly** removing a door, so the chance of the two unchosen doors will fall to the last door.
# ## 2.6
# 1. $$P(H|e_1,e_2)=\frac{P(H,e_1,e_2)}{P(e_1,e_2}=\frac{P(e_1,e_2|H)P(H)}{P(e_1,e_2)}$$
# + Only ii is sufficient
# 2. Given $P(e_1,e_2|H)=P(e_1|H)P(e_2|H)$, we have $$P(H|e_1,e_2)=\frac{P(e_1,H)P(e_2,H)}{P(H)P(e_1,e_2)}$$
# + Use $P(e_1,e_2)=\sum_h P(e_1,e_2|H=h)P(H=h)$, i is equivalent to iii
# + Both i, iii are sufficient now
# ## 2.7
# Pairwise independence means
# $$P(X_iX_j)=P(X_i)P(X_j), \forall i\neq j\in \{1,\ldots ,n\}$$
#
# Mutually independent means
# $$P(\prod_{X\in S}X)=\prod_{X\in S}, \forall S\subseteq \{1,\ldots ,n\}$$
#
# Thus for $n\geq 3$, the two group of identities are not necessarily equivalent.
# ### Counter example
# $$X, Y\sim \operatorname{Ber}(0.5), \quad Z=X\oplus Y\sim \operatorname{Ber}(0.5)$$
# ## 2.8
# We want to prove $$P(x,y|z)=P(x|z)P(y|z)\Leftrightarrow P(x,y|z)=g(x,z)h(y,z)$$
#
# The LHS to RHS is obvious. We are now going from RHS to LHS.
# First of all, $p(x,y|z)$ must satisfy unitary condition
# $$\iint p(x,y|z)dxdy=\left(\int g(x,z)dx\right)\left(\int h(y,z)dy\right)=1$$
# and $$P(x|z)= g(x,z)\int h(y,z)dy,\qquad P(y|z)= h(y,z)\int h(x,z)dx$$
# And the deduction to LHS is trivial now.
# ## 2.9
# $$X\perp W|Z,Y\Rightarrow P(XWZY)=P(XZY)P(WZY)/P(ZY)$$
# $$X\perp Y|Z\Rightarrow P(XYZ)=P(XZ)P(YZ)/P(Z)$$
# $$X\perp Y|W\Rightarrow P(XYW)=P(XW)P(YW)/P(W)$$
# ### a. True
# We need to prove $$P(XYWZ)=P(XZ)P(YWZ)/P(Z),$$
# which is the result of plugging 2nd equation into 1st one.
# ### b. False
# We need to prove $$P(XYZW)=P(XZW)P(YZW)/P(ZW),$$
# which is not reachable.
# ## 2.10
# The pdf of $y=1/x$ is
#
# \begin{align}
# g(y)&=f(x(y))\left|\frac{dx}{dy}\right|=f(1/y)/y^2\\
# &=\frac{b^a}{\Gamma(a)}(1/y)^{a-1}e^{-b/y}/y^2\\
# &=\frac{b^a}{\Gamma(a)}y^{-(a+1)}e^{-b/y}
# \end{align}
# ## 2.11
# \begin{align}
# Z^2&=\int_0^{2\pi}\int_0^\infty r\exp\left(-\frac{r^2}{2\sigma^2}\right)drd\theta\\
# &=2\pi\int_0^\infty\exp\left(-\frac{t}{\sigma^2}\right)dt,\quad t=r^2/2\\
# &=2\pi\sigma^2
# \end{align}
# ## 2.12
# Notice $-\sum_x\sum_y p(x,y)\log p(x)=-\sum_x p(x)\log p(x)=H(X)$
#
# \begin{align}
# I(X,Y)&=\sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\\
# &=H(X)+H(Y)-H(X,Y)
# \end{align}
#
# \begin{align}
# H(Y|X)&=-\sum_y \log p(x,y)\log\frac{p(x,y)}{p(x)}\\
# &=H(X,Y)-H(X)
# \end{align}
#
# So $I(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$
# ## 2.13
# Suppose we have $$X_{\pm}=\frac{X_1\pm X_2}{\sqrt{2}}$$
# then $$\begin{bmatrix}
# X_+\\
# X_-
# \end{bmatrix}\sim N\left(\mathbf{0},\begin{bmatrix}\sigma_+^2&\\&\sigma_-^2\end{bmatrix}\right)=N(0,\sigma_+^2)N(0,\sigma_-^2)$$
#
# And $$\sigma_+^2\frac{(X_1+X_2)^2}{2}+\sigma_-^2\frac{(X_1-X_2)^2}{2}=\sigma^2 (X_1^2+X_2^2)-2\rho X_1X_2$$
#
# Thus $\sigma_\pm=(1\pm\rho)\sigma^2$. And $X_+$ is independent on $X_-$. From $X_\pm$, we can get the distribution of $X_1$ and $X_2$.
#
# If $\rho=1$, then $X_1=X_2=X/\sqrt 2$, the mutual information $I(X_1,X_2)=H(X_1)$
#
# If $\rho=-1$, similiar.
#
# If $\rho=0$, they are independent. So $I(X_1,X_2)=0$
# ## 2.14
# 1. $r=1-H(Y|X)/H(X)=(H(X)-H(Y|X))/H(X)=I(Y, X)/H(x)$
# 1. It is obvious that $H(X)>0$.
# 1. $r\geq 0$ because we have $I(X,Y)\geq 0$ using Jensen inequality: $$I(X,Y)=-\sum_{i,j} \Pr(i,j)\log \frac{p_ip_j}{\Pr(i,j)}\geq - \log \sum_{i,j}p_ip_j=0$$
# 1. We need to prove $H(Y|X)\geq 0$. And $H(Y|X)=H(X,Y)-H(Y)\geq 0$.
#
# \begin{align}
# H(X,Y)&=-\sum_i\sum_j \Pr(i,j)\log \Pr(i,j)\\
# &\geq -\sum_i p_i\log \frac{\sum_j p^2(i,j)}{p_i}\\
# &\geq -\sum_i p_i\log \frac{p_i^2}{p_i}=H(Y)
# \end{align}
#
# As a result, $r\leq 1$
# 1. $r=0$ holds iff $I(X,Y)=0$. In this case, $\dfrac{p_ip_j}{\Pr(i,j)}$ is a constant, so $X$ is independent on $Y$
# 1. $r=0$ holds iff $I(X,Y)=H(X)$. i.e. $X=Y$
# ## 2.15
# MLE(Maximum Likehood Estimation) is to maximize the likehood
# $$\prod_i q(x=i|\theta)^{n_i}=\exp(n_i\log q_i)\propto\exp(\sum_ip_i\log q_i)$$
#
# It is equivalent to minimize $-\sum_ip_i\log q_i\equiv -\sum_ip_i\log(q_i/p_i)$
# ## 2.16
# $f=x^{a-1}(1-x)^{b-1}/\Beta(a,b)\propto x^{a-1}(1-x)^{b-1}$
# $$E(x)=\int xf dx=\Beta(a+1,b)/\Beta(a,b)=\frac{a}{a+b}$$
# We can prove the last step as follows:
#
# \begin{align}
# \Beta(p, q+1)&=\int_0^1 x^{p-1}(1-x)^{q}dx\\
# &=\int_0^1 x^{p-1}(1-x)^{q-1}dx-\int_0^1 x^{p}(1-x)^{q-1}dx\\
# &=\int_0^1 x^{p-1}(1-x)^{q-1}dx-\int_0^1 x^{p}(1-x)^{q-1}dx\\
# &=\int_0^1 x^{p-1}(1-x)^{q-1}dx-\frac{p}{q}\int_0^1 x^{p-1}(1-x)^{q}dx\\
# &=\Beta(p,q)-\frac{p}{q}\Beta(p, q+1)
# \end{align}
#
# As a result, $\Beta(p, q+1)=\dfrac{q}{p+q}\Beta(p,q)$. We can also use the relationship betweeen Beta and Gamma function, and $\Gamma(p+1)=p\Gamma(p)$
# $$\Beta(p,q+1)=\frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q+1)}=\frac{q\Gamma(p)\Gamma(q)}{(p+q)\Gamma(p+q)}=\frac{q}{p+q}\Beta(p,q)$$
#
# $$E(x^2)=\int x^2f dx=\Beta(a+2,b)/\Beta(a,b)=\frac{a}{a+b}\frac{a+1}{a+b+1}$$
#
# $$\frac{f'}{f}=\frac{a-1}{x}+\frac{b-1}{1-x}=0\Rightarrow x_m=\frac{a-1}{a-b}$$
#
# $$Var(x)=E(x^2)-E(x)^2=\frac{ab}{(a+b)^2(a+b+1)}$$
# ## 2.17
# The cdf(cumulative distribution function) of each variable is
#
# $$F_1(x)=\Pr(X