Unsupervised Learning : problems where the algorithm is given a data set without any “right answers”. The objective is to find some underlying structure in the data, e.g. clustering
Reinforcement Learning : problems where a sequence of decisions is made, as opposed to a single decision (or prediction)
Learning Theory : study of how and why (mathematically) a learning algorithm works
The following definition of a compound distribution will also be useful. Let $t$ be a random variable with distribution $F$ parameterized by $\mathbf{w}$, and let $\mathbf{w}$ itself be a random variable distributed according to $G$,
parameterized by $\mathbf{t}$. Then the compound distribution $H$, parameterized by $\mathbf{t}$, for the random variable $t$ is defined by:
$p_H(t|\mathbf{t}) = \int_{\mathbf{w}} p_F(t|\mathbf{w}) \, p_G(\mathbf{w}|\mathbf{t}) \, d\mathbf{w}$
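As a quick sanity check of this definition, the integral can be approximated by Monte Carlo: draw $\mathbf{w}$ from $G$, then draw $t$ from $F$ given that $\mathbf{w}$. The Gaussian example below (all parameter values are illustrative, not from the text) uses the standard conjugacy result that if $t\,|\,w \sim N(w, \sigma^2)$ and $w \sim N(\mu_0, \tau^2)$, the compound distribution is $N(\mu_0, \sigma^2 + \tau^2)$:

```python
import numpy as np

# Monte Carlo sketch of the compound-distribution identity for a
# Gaussian example (parameter values are illustrative):
#   t | w ~ N(w, sigma2),  w ~ N(mu0, tau2)
#   =>  t ~ N(mu0, sigma2 + tau2)
rng = np.random.default_rng(0)
mu0, tau2, sigma2 = 1.0, 0.5, 2.0

w = rng.normal(mu0, np.sqrt(tau2), size=200_000)  # draw w ~ G
t = rng.normal(w, np.sqrt(sigma2))                # draw t | w ~ F

print(t.mean())  # close to mu0 = 1.0
print(t.var())   # close to sigma2 + tau2 = 2.5
```

The variance of the compound distribution exceeds that of $F$ alone because the uncertainty in $\mathbf{w}$ contributes additional spread.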
$p(\mathbf{w}|D) = \frac{p(D|\mathbf{w})p(\mathbf{w})}{p(D)}$
In order to apply a fully Bayesian approach, we must formulate models for both the prior, $p(\mathbf{w})$, and the likelihood function, $p(D|\mathbf{w})$. Given these models and a set of data we can compute appropriate values for our free parameter vector $\mathbf{w}$ by maximizing $p(\mathbf{w}|D) \propto p(D|\mathbf{w})p(\mathbf{w})$. How does this differ from frequentist modeling? The frequentist approach, or maximum likelihood approach, ignores the formulation of a prior and goes directly to maximizing the likelihood function to find the model parameters. Thus, the frequentist approach can be described as maximizing the probability of the data given the parameters. Under certain conditions the results of Bayesian and frequentist modeling will coincide, but this is not true in general.
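A hypothetical coin-flip example (the numbers here are assumptions for illustration, not from the text) makes the contrast concrete. With a $\mathrm{Beta}(a, b)$ prior on the heads probability $w$, the posterior mode has a closed form, so both estimates can be compared directly:

```python
# Hypothetical illustration: 7 heads in 10 flips, Beta(2, 2) prior
# on the heads probability w (all values chosen for illustration).
heads, flips = 7, 10
a, b = 2.0, 2.0  # Beta prior hyperparameters

# Frequentist / maximum likelihood: maximize p(D | w) directly.
w_mle = heads / flips  # = 0.7

# Bayesian point estimate (posterior mode): maximize p(D | w) p(w);
# for a Beta prior this is (heads + a - 1) / (flips + a + b - 2).
w_map = (heads + a - 1) / (flips + a + b - 2)  # = 8/12 ≈ 0.667

print(w_mle, w_map)  # the prior pulls the Bayesian estimate toward 0.5
```

As the amount of data grows, the influence of the prior shrinks and the two estimates converge, which is one instance of the "certain conditions" under which the approaches coincide.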
One could obtain a point estimate for $\mathbf{w}$ by maximizing the posterior probability model, but this is not typical. Instead, a predictive distribution of the value of the target variable, $t$, is formed based on the compound distribution definition provided above. Taking the mean of this distribution provides a point estimate of $t$, while the distribution itself provides a measure of the uncertainty in the estimate, say by considering the standard deviation.
TODO: Add a simple example illustrating the difference. For now, a good illustration is available here
We assume we have specified a probability density model, $p_{\mathbf{w}}(d)$, for the observed data elements, $d \in D$, that is parameterized by $\mathbf{w}$, i.e. $p$ is a parametric model for the distribution of $D$. As
an example, if $D$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, then
$\mathbf{w} = (\mu, \sigma^2)$
and
$p_{\mathbf{w}}(d) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-(d-\mu)^2/2\sigma^2}$
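This density is straightforward to implement directly from the formula; a small sketch (the function name is mine) with a sanity check at $d = \mu$, where the exponent vanishes:

```python
import math

def gaussian_pdf(d, mu, sigma2):
    """Density p_w(d) for w = (mu, sigma2), matching the formula above."""
    return math.exp(-(d - mu) ** 2 / (2 * sigma2)) / (math.sqrt(2 * math.pi) * math.sqrt(sigma2))

# At d = mu the exponential factor is 1, so the density reduces to
# 1 / (sqrt(2*pi) * sigma):
print(gaussian_pdf(0.0, 0.0, 1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```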
The likelihood function, regardless of our choice of model $p$, is defined by
$L(\mathbf{w}; D) = \prod_{i=1}^N p_{\mathbf{w}}(d_i)$
where $N$ is the number of elements in $D$. Thus the likelihood function is simply the product of the probability of all the individual data points, $d_i \in D$, under the probability model, $p_{\mathbf{w}}$. Note that this
definition implicitly assumes these data points are independent events.
Out of mathematical convenience, we will most often work with the log-likelihood function (which turns the product into a sum by properties of the log function), i.e. the logarithm of $L(\mathbf{w}; D)$, defined as
$l(\mathbf{w};D) = \sum_{i=1}^N l(\mathbf{w};d_i) = \sum_{i=1}^N \log p_{\mathbf{w}}(d_i)$
where we recall that $\log(ab) = \log(a) + \log(b)$.
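The identity $\log \prod_i p_{\mathbf{w}}(d_i) = \sum_i \log p_{\mathbf{w}}(d_i)$ can be checked numerically for the Gaussian model; the data values below are illustrative:

```python
import math

# Numerical check that the log of the product likelihood L(w; D)
# equals the sum-of-logs form l(w; D). Data values are illustrative.
def gaussian_pdf(d, mu, sigma2):
    return math.exp(-(d - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

D = [0.5, -1.2, 0.3, 2.0]
mu, sigma2 = 0.0, 1.0

L = math.prod(gaussian_pdf(d, mu, sigma2) for d in D)      # likelihood
l = sum(math.log(gaussian_pdf(d, mu, sigma2)) for d in D)  # log-likelihood

print(math.isclose(math.log(L), l))  # True
```

Beyond algebraic convenience, the sum form is also numerically safer: for large $N$ the raw product of densities underflows to zero in floating point, while the sum of logs does not.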
The method of maximum likelihood chooses the value $\mathbf{w} = \widehat{\mathbf{w}}$ that maximizes the log-likelihood function. We will also often work with an error function, $E(\mathbf{w})$, defined as the
negative of the log-likelihood function
$E(\mathbf{w}) = -l(\mathbf{w};D)$
where we note $-\log(a) = \log(1/a)$
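For the Gaussian model above, minimizing $E(\mathbf{w})$ recovers the familiar closed-form estimates $\widehat{\mu} = \frac{1}{N}\sum_i d_i$ and $\widehat{\sigma}^2 = \frac{1}{N}\sum_i (d_i - \widehat{\mu})^2$. A quick numerical check that perturbing either parameter only increases the error (data values illustrative):

```python
import math

# Error function E(w) = -l(w; D) for the Gaussian model, written out
# term by term from -log p_w(d). Data values are illustrative.
def error(D, mu, sigma2):
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (d - mu) ** 2 / (2 * sigma2)
        for d in D
    )

D = [1.0, 2.0, 4.0, 5.0]
mu_hat = sum(D) / len(D)                                   # = 3.0
sigma2_hat = sum((d - mu_hat) ** 2 for d in D) / len(D)    # = 2.5

E_hat = error(D, mu_hat, sigma2_hat)
# Perturbing either parameter should never decrease the error:
print(all(error(D, mu_hat + dm, sigma2_hat + ds) >= E_hat
          for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)))  # True
```

Note that the maximum likelihood variance estimate divides by $N$, not $N-1$, so it is biased; this is a standard property of the MLE for the Gaussian.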