Notebook

Graphical Models

Probabilistic graph models provide an intuitive way to express joint probability relationships. The nodes in a graph represent random variables and the edges between nodes represent a probabilistic relationship between variables.

Bayesian Networks

Here we consider **acyclic directed** graphical models that express the joint probability relationship between some set of variables, known as *Bayesian Networks* because of the prominant use of Bayes' Theorem. To begin, consider the joint probability distribution $p\left( x_1, \ldots, x_K \right)$ over $K$ variables. Using repeated application of the product rule, this can be expressed as $$p \left(x_1, \ldots, x_K \right) = p \left(x_K \mid x_1, \ldots, x_{K-1} \right) \ldots p\left(x_2 \mid x_1 \right) p\left(x_1\right) $$

This can be represented graphically as a fully connected directed graph having $K$ nodes, with each node having incoming links from all lower numbered nodes.

Now consider the general case where the graph is not necessarily fully connected. Let the a given node $x_k$ have incoming links from the set of nodes $pa_k$ known as the parent nodes of $x_k$. Then the joint distribution defined by the directed acyclic graph over all nodes is given by the product of a conditional distribution for each node conditioned on its parent nodes. A graph with $K$ nodes has the joint distribution defined as

$$ p\left(\mathbf{x}\right) = \prod_{k=1}^K p\left(x_k \mid pa_k \right) $$

Graphical Notation

There are several conventions used to convey information in graphical models. They are briefly summarized here - see Bishop 363-365 for more detail.

Represent multiple nodes of the form $t_1, \ldots, t_N$ as a single node with a box, known as a plate, drawn around it and annotated with $N$, the number of repeated nodes.
Deterministic model parameters, e.g. the mean of a distribution, are drawn as solid circles, and random variables as open circles
Observed variables are drawn as shaded open circles

Conditional Independence

Consider three random variables $a,b,c$, we use the following notation to indicate that $a$ is *conditionally idependent* of $b$ given $c$ $$a \perp\\!\\!\\!\perp b \mid c$$

When this condition holds, using the product rule, we may write the joint distribution of $a$ conditioned on $b$ and $c$ as

$$p\left(a, b \mid c \right) = p\left(a\mid b,c\right)p\left(b\mid c\right) = p\left(a\mid c\right) p\left(b \mid c\right)$$

D-separation

Given the description of conditional independence above, we now consider a general directed acyclic graph. Let $A$, $B$, and $C$ be arbitrary sets of *nonintersecting* nodes. We wish to determine if the graph structure implies the conditional independence $A \perp\\!\\!\\!\perp B \mid C$. It turns out this conidition is satisfied if **all** possible paths from **any** node in $A$ to **any** node in $B$ are *blocked*, in which case $A$ is said to be *d-separated* from $B$. A path is considered blocked if either of the following hold:

(a) There is a node, $n \in C$, in the path such that the arrows on the path meet head-to-tail or tail-to-tail at n

(b) There is a node, $n \notin C$ with none of its descendants in $C$, in the path such that the arrows meet head-to-head at the node. A node $d$ is considered to a descendant of a node $p$ if there is a path from $p$ to $d$ in which each path step is in the direction of the arrows connecting the nodes on th path.

Note that model parameters (e.g. the mean of a distribution) will always be tail-to-tail and therefore play no role in determining d-separation.

Training a Bayes Net

MLE for training a Bayes Net, estimating the theta parameters for the nodes, is the equivalent to maximizing the likelihood of the data. So for each node, probability of the node being equal to $s$ be conditioned on say $f$ and $a$, then we have - TODO - generalize this to N conditional quantities $$\theta_{s|ij} = \frac{\sum_{k=1}^K \delta \left(f_k = i, a_k =j, s_k =1 \right)}{\sum_{k=1}^K \delta \left(f_k=i, a_k=j \right)}$$

where $\delta(x) = 1$

if $x$ is true, 0 otherwise

Partially Observed Data

Can't use the MLE approach above due to lack of data Let $X$ be all *observed* variables values Let $Z$ be all *unobserved* variables Assume we always observe/unobserve the same variables - it is possible to generalize this

Approach:

$ argmax_{\theta} E_{P(Z|X,\theta)}\left[log P(x,z|\theta) \right]$

using some probability distribution on $Z$, namely, $P(Z|X,\theta)$ - do we need a model for this distribution?

Expectation Maximization (EM) Algorithim

guaranteed to find local maximum of expected log likelihood above

$ argmax_{\theta} E_{P(Z|X,\theta)}\left[log P(x,z|\theta) \right]$

see slide at 29:59

EM - general procedure for learning from partly observed data. Given observed varialbes, $X$, and unobserved variables, $Z$, define

$Q\left(\theta' | \theta \right) = E_{E_{P(Z|X,\theta)}} \left[\log P(X,Z | \theta') \right]$

Iterate until convergence:

E Step: Use $X$ and current $\theta$ to calculate $P(Z|X,\theta)$ - done for every variable in $Z$ for each training example,
M Step: Replace current $\theta$ by

$\theta \leftarrow arg max_{\theta'} Q $

using step 1, plug into into last equation on slide 29:59 and pick max theta

Example: see slide 38:59, 50:59

More generally: Given observed variables $X$ and unobserved variables $Z$ all of which are boolean:

E Step: Calculate for each training example, $k$, the expected value of each unobserved variable in $Z$
M Step: Calculate estimates similar to MLE, but replacing each count (i.e. observed data proportions) by its expected count

$\delta(Y=1) \rightarrow E_{Z|X,\theta}[Y]$

$\delta(Y=0) \rightarrow 1 - \delta(Y=1)$

Example: Linear Gaussian Model

*TODO* A useful example may be a Linear-Gaussian model - see Bishop pages 370-372 -