Title: Review of Bayesian Decision Theory Author: Thomas Breuel Institution: UniKL
(Joint Density)
In decision problems, we have items characterized by a class (state of nature) $\omega$ and a feature vector (observation) $x$.
These are described by a joint density $P(\omega,x)$.
(Conditional Probabilities)
The class conditional density is $p(x|\omega) = P(\omega,x) / P(\omega)$.
The posterior distribution is $P(\omega|x) = P(\omega,x) / p(x)$.
(Bayes Rule)
These two combine into Bayes Rule:
$$p(x|\omega) P(\omega) = P(\omega|x) p(x)$$
or
$$P(\omega|x) = \frac{p(x|\omega) P(\omega)}{p(x)}$$
(Optimal Decision Rule)
The optimal decision rule under a zero-one loss function is
$D(x) = \arg\max_\omega P(\omega|x)$
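To make this concrete, here is a small numerical sketch (the prior and likelihood values are made up for illustration): compute the posteriors via Bayes rule at a single $x$, then take the argmax.

```python
import numpy as np

# illustrative numbers, not from the lecture:
priors = np.array([0.8, 0.2])        # P(omega)
likelihoods = np.array([0.3, 0.9])   # p(x|omega) at some fixed x

evidence = np.dot(likelihoods, priors)        # p(x)
posterior = likelihoods * priors / evidence   # P(omega|x), Bayes rule
decision = np.argmax(posterior)               # optimal class under zero-one loss
print(posterior, decision)
```

Note that the strong prior on class 0 outweighs the larger likelihood of class 1 here.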
(Justification for Optimal Decision Rule)
The reason for this is that the probability of error for each $x$ is
$$P(\hbox{error}|x) = 1 - \max_\omega P(\omega|x)$$
and the total error is given by:
$$P(\hbox{error}) = \int P(\hbox{error}|x) p(x) dx$$
from scipy import stats
from scipy.stats import norm
(probability density functions)
We previously computed with distributions by sampling from them. Let's now compute directly with the probability density functions, like $p(x)$.
Remember:
$$ p(x) \geq 0 $$
$$ \int p(x) dx = 1 $$
# normal density
x = linspace(-8.0,8.0,1000)
plot(x,norm.pdf(x))
# Class Conditional Densities
x = linspace(-8.0,8.0,1000)
p_x_given_1 = norm.pdf(x,loc=-1.0)
p_x_given_2 = norm.pdf(x,loc=1.0,scale=2.0)
plot(x,p_x_given_1); plot(x,p_x_given_2)
# Priors and Sample Distribution
p_1 = 0.8
p_2 = 0.2
p_x = p_x_given_1 * p_1 + p_x_given_2 * p_2
plot(x,p_x); plot(x,p_x_given_1*p_1); plot(x,p_x_given_2*p_2)
# Conditional Distributions
p_1_given_x = p_x_given_1 * p_1 / p_x
p_2_given_x = p_x_given_2 * p_2 / p_x
plot(x,p_1_given_x); plot(x,p_2_given_x)
(Density vs Conditional Distributions)
A density is a non-negative function $f:R^n\rightarrow R$ such that $\int f(x) dx = 1$.
A conditional distribution is a function $f:R^n\rightarrow R^c$ such that $\sum_i f_i(x) = 1$ for all $x$.
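These two properties can be checked numerically for the two-class example above (a self-contained sketch, using the same priors and class conditionals as the earlier cells; the integral is approximated by a Riemann sum on the plotting grid):

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8.0, 8.0, 1000)
dx = x[1] - x[0]
p_x_given_1 = norm.pdf(x, loc=-1.0)
p_x_given_2 = norm.pdf(x, loc=1.0, scale=2.0)
p_1, p_2 = 0.8, 0.2
p_x = p_x_given_1 * p_1 + p_x_given_2 * p_2

# density: non-negative, integrates to (approximately) 1
print((p_x >= 0).all(), np.sum(p_x) * dx)   # tails beyond +-8 are truncated

# conditional distribution: sums to 1 at every x
p_1_given_x = p_x_given_1 * p_1 / p_x
p_2_given_x = p_x_given_2 * p_2 / p_x
print(np.abs(p_1_given_x + p_2_given_x - 1).max())
```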
# Error at each Point
p_error_given_x = 1-maximum(p_1_given_x,p_2_given_x)
plot(x,p_error_given_x)
print(sum(p_error_given_x*p_x)/sum(p_x))
0.117715838421
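As a cross-check on that number, we can estimate the error rate by Monte Carlo instead of numerical integration: sample classes from the priors, sample $x$ from the corresponding class conditional, apply the Bayes rule, and count mistakes (a sketch, assuming the same priors and class conditionals as above).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200000

# sample the class from the priors (True -> class 2, prior 0.2),
# then x from the corresponding class conditional
cls = rng.random(n) < 0.2
x = np.where(cls, rng.normal(1.0, 2.0, n), rng.normal(-1.0, 1.0, n))

# Bayes decision: pick the class with larger p(x|omega) P(omega)
score_1 = norm.pdf(x, loc=-1.0) * 0.8
score_2 = norm.pdf(x, loc=1.0, scale=2.0) * 0.2
decide_2 = score_2 > score_1
print(np.mean(decide_2 != cls))   # should be close to the 0.1177 above
```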
Some preliminaries...
xs,ys = meshgrid(linspace(-4.0,4.0,200),linspace(-4.0,4.0,200))
xys = c_[xs.ravel(),ys.ravel()]
def mvpdf(x,mu=zeros(2),sigma=eye(2)):
    return ((2*pi)**len(mu)*det(sigma))**-0.5 * exp(-0.5*dot(x-mu,dot(inv(sigma),x-mu)))
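`mvpdf` implements the multivariate normal density by hand; for reference, recent scipy versions provide `scipy.stats.multivariate_normal`, which should agree to machine precision (a sketch, assuming that class is available in your scipy):

```python
import numpy as np
from numpy.linalg import det, inv
from scipy.stats import multivariate_normal

def mvpdf(x, mu=np.zeros(2), sigma=np.eye(2)):
    # multivariate normal density, written out explicitly
    return ((2*np.pi)**len(mu)*det(sigma))**-0.5 * \
        np.exp(-0.5*np.dot(x-mu, np.dot(inv(sigma), x-mu)))

x = np.array([0.3, -0.7])
mu = np.array([1.0, 1.0])
sigma = np.diag([2.0, 1.0])
print(mvpdf(x, mu, sigma), multivariate_normal.pdf(x, mean=mu, cov=sigma))
```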
# class conditional densities
p_x_given_1 = array([mvpdf(x,mu=array([1,1.0])) for x in xys]).reshape(xs.shape)
p_x_given_2 = array([mvpdf(x,sigma=diag([2.0,1])) for x in xys]).reshape(xs.shape)
subplot(121); imshow(p_x_given_1,cmap=cm.gray)
subplot(122); imshow(p_x_given_2,cmap=cm.gray)
# Priors and Sample Distribution
p_1 = 0.4
p_2 = 0.6
p_x = p_x_given_1 * p_1 + p_x_given_2 * p_2
imshow(p_x,cmap=cm.gray)
# sample distribution in 3D
from mpl_toolkits.mplot3d import Axes3D
subplots(1,1,figsize=(8,8))
ax = gcf().add_subplot(111,projection='3d')
ax.plot_wireframe(xs[::10,::10],ys[::10,::10],p_x[::10,::10])
ax.set_zlim3d(0.0,0.1)
# conditional distributions
p_1_given_x = p_x_given_1 * p_1 / p_x
p_2_given_x = p_x_given_2 * p_2 / p_x
imshow(p_1_given_x,cmap=cm.gray)
# conditional distribution in 3D
from mpl_toolkits.mplot3d import Axes3D
subplots(1,1,figsize=(8,8))
ax = gcf().add_subplot(111,projection='3d')
ax.plot_wireframe(xs[::10,::10],ys[::10,::10],p_1_given_x[::10,::10])
ax.set_zlim3d(0.0,1.0)
(zero-one loss)
Above, we used a zero-one loss. That is, we used a cost of 1 for an error and a cost of 0 when there was no error.
In the general case, we have a 2D table of costs, saying how much of a penalty we pay if the decision is one thing and the state of nature is another thing.
             state 0   state 1
action 0        0         1
action 1        1         0
(other losses)
These costs need not look like this at all. In fact, costs are frequently asymmetric.
             state 0   state 1
action 0        0        10
action 1      1000        0
We call this matrix the loss matrix and write the elements as $\lambda_{ij}$, or a function $\Lambda(\alpha,\omega)$, which is the cost of taking action $\alpha$ when the state of nature is $\omega$.
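Given posteriors at some $x$, each conditional risk is then just a row of the loss matrix dotted with the posterior vector, and we pick the action minimizing it. A minimal sketch using the asymmetric loss above (the posterior values are made up for illustration):

```python
import numpy as np

# loss matrix Lambda[action, state]: the asymmetric example above
Lam = np.array([[   0.0, 10.0],
                [1000.0,  0.0]])

def best_action(posterior):
    """Minimize conditional risk R(alpha|x) = sum_w Lambda[alpha,w] P(w|x)."""
    risk = Lam @ posterior
    return np.argmin(risk), risk

a, risk = best_action(np.array([0.9, 0.1]))
print(a, risk)   # risks: 0.9*0 + 0.1*10 = 1.0 vs 0.9*1000 + 0.1*0 = 900.0
```

With this loss, action 1 is chosen only when we are extremely confident the state is 1, since a mistake there costs 1000.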
(relationship to game theory)
We will see these kinds of matrices much more later on when we talk about game theory.
(risk = expected loss)
In general, we want to minimize expected loss. We call expected loss the risk of a decision.
Just like there are conditional probabilities, there are conditional risks.
$$R(\alpha_1|x) = \lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x)$$
(minimizing expected loss)
We want to minimize overall risk, and for that we minimize risk at each point $x$. To do that, we choose action $\alpha_1$ if the risk of that action is lower than action $\alpha_2$ and vice versa.
\begin{eqnarray}
R(\alpha_1|x) & \leq & R(\alpha_2|x) \\
\lambda_{11} P(\omega_1|x) + \lambda_{12} P(\omega_2|x) & \leq & \lambda_{21} P(\omega_1|x) + \lambda_{22} P(\omega_2|x) \\
(\lambda_{12}-\lambda_{22}) P(\omega_2|x) & \leq & (\lambda_{21}-\lambda_{11}) P(\omega_1|x)
\end{eqnarray}
(likelihood ratio vs risk)
With Bayes Rule, we can transform this into a decision rule based on the likelihood ratio:
\begin{equation} \frac{p(x|\omega_1)}{p(x|\omega_2)} \geq \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \frac{P(\omega_2)}{P(\omega_1)} \end{equation}
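Plugging numbers into this threshold shows how a lopsided loss shifts the decision boundary. A sketch using the asymmetric loss from above ($\lambda_{12}=10$, $\lambda_{21}=1000$, $\lambda_{11}=\lambda_{22}=0$) with equal priors assumed for illustration, and the same two class conditionals as the 1D example:

```python
import numpy as np
from scipy.stats import norm

# loss entries and priors (priors assumed equal for illustration)
lam11, lam12, lam21, lam22 = 0.0, 10.0, 1000.0, 0.0
P1, P2 = 0.5, 0.5
threshold = (lam12 - lam22) / (lam21 - lam11) * P2 / P1   # = 0.01

# decide omega_1 wherever the likelihood ratio exceeds the threshold
x = np.linspace(-8.0, 8.0, 1000)
ratio = norm.pdf(x, loc=-1.0) / norm.pdf(x, loc=1.0, scale=2.0)
decide_1 = ratio >= threshold
print(threshold, decide_1.mean())
```

Because mistaking state 0 for state 1 costs 100 times more than the reverse, the threshold drops far below 1, enlarging the region where we choose $\omega_1$ compared to the zero-one-loss rule.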