$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\vx}{\vec{x}} \newcommand{\I}{\mathbb{I}} $
Georgia Tech, CS 4540
Jacob Abernethy
Date: Thursday, November 21, 2019
Many problems require the ability to generate random samples from various probability distributions.
We have recently seen the birth of Generative Adversarial Networks, a deep learning technique for producing random samples that resemble the images in some training set.
from IPython.display import YouTubeVideo
YouTubeVideo('XOxxPcy5Gr4')
Let's say I have the ability to sample a fair coin toss $X \in \{0,1\}$ as many times as I want (equal probability of 0 and 1).
Let's try a bigger class of distributions.
Here are some canonical examples of tricky (and sometimes #P-hard!) sampling problems:
Assume we have an $n \times n$ binary matrix $M$. Recall that the permanent of a matrix is defined as $$\text{perm}(M) = \sum_{\sigma \in S_n} \prod_{i=1}^n M_{i\sigma(i)}$$ where $S_n$ is the set of all $n!$ permutations.
Note: Computing the determinant of a matrix is easy! But computing the permanent is #P-hard!!!
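To make the definition concrete, here is a brute-force sketch of the permanent (my own helper, not a library function; it enumerates all $n!$ permutations, so it is only feasible for tiny $n$):

```python
import itertools
import numpy as np

def permanent(M):
    """Brute-force permanent: sum over all n! permutations.

    Only feasible for very small n -- this is exactly why the permanent
    is hard, in contrast to the determinant.
    """
    n = M.shape[0]
    return sum(
        np.prod([M[i, sigma[i]] for i in range(n)])
        for sigma in itertools.permutations(range(n))
    )

# For a binary matrix, perm(M) counts the perfect matchings of the
# bipartite graph whose biadjacency matrix is M.
M = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(permanent(M))   # this graph has exactly 2 perfect matchings
```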
Convince yourself that computing the permanent of binary matrix $M$ is equivalent to counting the number of perfect matchings in an undirected bipartite graph with $n$ nodes in each partition.
Assume you had some algorithm that can sample perfect matchings uniformly at random. How can you use that to estimate the permanent of a matrix?
Hint: The trick is to construct a sequence of graphs, where each successive graph has one edge deleted. If $M'$ has one fewer edge than $M$, can you think of a way to estimate $\frac{\text{perm}(M)}{\text{perm}(M')}$?
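Here is one way the hint can be sketched in code. The assumed uniform sampler is stood in for by brute-force enumeration plus a uniform draw (fine for small $n$; the whole point of the real algorithms is to sample without enumerating). The names `perfect_matchings` and `estimate_ratio` are my own:

```python
import itertools
import random
import numpy as np

def perfect_matchings(M):
    """All perfect matchings of the bipartite graph with biadjacency matrix M.

    Stands in for the assumed uniform sampler; only feasible for small n.
    """
    n = M.shape[0]
    return [s for s in itertools.permutations(range(n))
            if all(M[i, s[i]] for i in range(n))]

def estimate_ratio(M, edge, n_samples=2000, rng=random):
    """Estimate perm(M') / perm(M), where M' is M with `edge` = (i, j) deleted.

    A matching of M is a matching of M' iff it avoids the deleted edge, so
    the ratio is the fraction of uniform matchings of M not using (i, j).
    """
    matchings = perfect_matchings(M)
    i, j = edge
    hits = sum(1 for _ in range(n_samples)
               if rng.choice(matchings)[i] != j)
    return hits / n_samples

M = np.ones((3, 3), dtype=int)                 # complete bipartite graph: perm = 3! = 6
est = estimate_ratio(M, (0, 0), rng=random.Random(0))
# Deleting edge (0,0) removes the 2 matchings that use it, so the true ratio is 4/6.
```

Multiplying such estimated ratios along a sequence of graphs, each with one more edge deleted, telescopes down to a graph whose permanent is trivial, giving an estimate of $\text{perm}(M)$.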
A Markov chain is a series of random variables $X_1, \dots, X_N$ satisfying the Markov property:
The future is independent of the past, given the present.
$$
p(x_n | x_1, \dots, x_{n-1}) = p(x_n | x_{n-1})
$$
A chain is stationary (time-homogeneous) if the transition probabilities do not change with time.
We can represent a Markov chain using a "directed graphical model". The graph specifies how the sequence of random variables relate to each other, and in particular the independence assumptions.
This idea is complex, and I'd recommend a course on graphical models to learn more.
@pgm_render
def pgm_markov_chain():
    pgm = daft.PGM([6, 6], origin=[0, 0])
    # Nodes
    pgm.add_node(daft.Node("x1", r"$X_1$", 2, 2.5))
    pgm.add_node(daft.Node("x2", r"$X_2$", 3, 2.5))
    pgm.add_node(daft.Node("ellipsis", r" . . . ", 3.7, 2.5,
                           offset=(0, 0), plot_params={"ec": "none"}))
    pgm.add_node(daft.Node("ellipsis_end", r"", 3.7, 2.5,
                           offset=(0, 0), plot_params={"ec": "none"}))
    pgm.add_node(daft.Node("xN", r"$X_N$", 4.5, 2.5))
    # Edges
    pgm.add_edge("x1", "x2", head_length=0.08)
    pgm.add_edge("x2", "ellipsis", head_length=0.08)
    pgm.add_edge("ellipsis_end", "xN", head_length=0.08)
    return pgm
%%capture
pgm_markov_chain("images/pgm/markov-chain.png")
If a sequence has the Markov property, then the joint distribution factorizes according to $$ p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^N p(x_n | x_{n-1}) $$
Suppose $X_t \in \{1,\dots,K\}$ is discrete. Then, a stationary chain with $K$ states can be described by a transition matrix, $A \in \R^{K \times K}$ where $$ a_{ij} = p(X_t=j \mid X_{t-1}=i) $$
is the probability of transitioning from state $i$ to $j$.
Each row sums to one, $\sum_{j=1}^K A_{ij} = 1$, so $A$ is a row-stochastic matrix.
Transitions between states can be represented as a graph:
The nodes in this graph represent states that each $X_i$ can take.
Here the transition matrix is $A = \begin{bmatrix} (1-\alpha) & \alpha \\ \beta & (1-\beta) \end{bmatrix}$
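To make this concrete, here is a minimal simulation sketch for the two-state chain above; `sample_chain` and the particular values of $\alpha, \beta$ are illustrative choices of mine, not part of the notes:

```python
import numpy as np

def sample_chain(A, x0_state, T, rng):
    """Simulate T steps of a stationary Markov chain with transition matrix A."""
    states = [x0_state]
    for _ in range(T):
        # Row A[s] is the distribution over the next state given current state s.
        states.append(rng.choice(len(A), p=A[states[-1]]))
    return states

alpha, beta = 0.3, 0.4    # illustrative transition probabilities
A = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
path = sample_chain(A, x0_state=0, T=10, rng=np.random.default_rng(0))
print(path)
```

Over a long run, the fraction of time spent in state 1 approaches $\alpha / (\alpha + \beta)$, which previews the notion of a stationary distribution below.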
Question: why should we care about Markov chains? Answer: they are the ideal model for reasoning about many randomized algorithms. Consider the following algorithm:
state = 0
for i in range(NUMITER):
    cointoss = np.random.somedist()   # sample a random variable
    state = update(state, cointoss)   # update state based on random sample
return state
We can analyze the behavior of this algorithm by viewing the state variable as following a Markov process.
Consider a row vector $x_t \in \R^{1 \times K}$ with entries $x_{tj} = p(X_t = j)$. Then, by the law of total probability, $$ x_{tj} = \sum_{i=1}^K p(X_t = j \mid X_{t-1} = i)\, p(X_{t-1} = i) = \sum_{i=1}^K x_{(t-1)i}\, a_{ij} $$
Therefore, we conclude $x_t = x_{t-1} A$.
(Note $\sum_{j=1}^K x_{tj} = 1$.)
Since $x_t = x_{t-1} A$, this suggests that in general, $$ x_{t} = x_{t-1} A = x_{t-2} A^2 = \cdots = x_{0} A^t $$ If we know the initial state probabilities $x_0$, we can find the probabilities of landing in any state at time $t > 0$.
Suppose the weather is either $S=\text{Sunny}$ or $R=\text{Rainy}$, with states ordered $(S, R)$, $$ A = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} $$
Suppose today is sunny, $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}$. We can predict tomorrow's weather, $$ x_1 = x_0 A = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} $$ The weather over the next several days will be $$ \begin{align} x_2 &= x_1 A = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix} \\ x_3 &= x_2 A = \begin{bmatrix} 0.844 & 0.156 \end{bmatrix} \\ x_4 &= x_3 A = \begin{bmatrix} 0.8376 & 0.1624 \end{bmatrix} \end{align} $$
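The hand computation above can be checked with a few lines of NumPy (assuming states are ordered so that index 0 is Sunny, which makes $x_0 = [1, 0]$ consistent with "today is sunny"):

```python
import numpy as np

# Weather chain from above; index 0 = Sunny, index 1 = Rainy.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
x = np.array([1.0, 0.0])    # today is sunny
for t in range(1, 5):
    x = x @ A               # x_t = x_{t-1} A
    print(f"x_{t} = {x}")   # matches the hand computation above
```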
Question: What happens to $x_0 A^n$ as $n \rightarrow \infty$?
If we ever reach a stage $x$ where $$ x = xA $$ then we have reached the stationary distribution of the chain.
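Since $x = xA$ says $x$ is a left eigenvector of $A$ with eigenvalue $1$, we can compute the stationary distribution directly. A sketch (the helper name `stationary_distribution` is mine; this assumes the chain has a unique stationary distribution):

```python
import numpy as np

def stationary_distribution(A):
    """Solve x = x A: the left eigenvector of A with eigenvalue 1,
    normalized so its entries sum to one."""
    w, v = np.linalg.eig(A.T)         # left eigenvectors of A = eigenvectors of A^T
    i = np.argmin(np.abs(w - 1.0))    # pick the eigenvalue closest to 1
    pi = np.real(v[:, i])
    return pi / pi.sum()

# The weather chain from above: x_0 A^n converges to this distribution
# regardless of the starting distribution x_0.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(A)
print(pi)   # (5/6, 1/6): sunny about 83% of days in the long run
```

This answers the earlier question: $x_0 A^n$ converges to this $\pi$ for this chain, no matter what $x_0$ is.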