Bayes Theorem

Rohan L. Fernando

November 2015

Motivation

  • In whole-genome analyses, the number $k$ of marker covariates typically exceeds the number $n$ of observations
  • Least squares methods therefore cannot be used to simultaneously estimate the effects of all $k$ covariates
  • Bayesian inference is widely used to overcome this problem by combining prior information with the observed data
  • In the Bayesian approach, inferences are based on conditional probabilities, and Bayes theorem is a statement about conditional probability

Conditional Probability of $X$ Given $Y$

Suppose $X$ and $Y$ are two random variables with joint probability distribution $\Pr(X,Y)$. Then, the conditional probability of $X$ given $Y$ is given by Bayes theorem as

$$\Pr(X|Y) = \frac{\Pr(X,Y)}{\Pr(Y)} \tag{1}$$

where $\Pr(Y)$ is the probability distribution of $Y$.

Similarly, the conditional probability of $Y$ given $X$ is

$$\Pr(Y|X)=\frac{\Pr(X,Y)}{\Pr(X)},$$

which upon rearranging gives

$$\Pr(X,Y)=\Pr(Y|X)\Pr(X). \tag{2}$$

Then, substituting (2) in (1) gives

$$\begin{aligned} \Pr(X|Y) &= \frac{\Pr(X,Y)}{\Pr(Y)}\\ &= \frac{\Pr(Y|X)\Pr(X)}{\Pr(Y)}, \end{aligned}$$

which is the form of the formula that is used for inference of $X$ given $Y.$
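The substitution above can be checked numerically. The following is a minimal sketch in Python, assuming a small made-up joint distribution over two binary variables; the probabilities and the helper names (`marginal_x`, `cond_x_given_y`, and so on) are hypothetical and chosen only for illustration. It confirms that formula (1) and the rearranged form give the same conditional probability for every pair of values.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(X, Y) over two binary variables.
# The values are illustrative only; they are not taken from the text.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(x):
    """Pr(X = x), obtained by summing the joint distribution over Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Pr(Y = y), obtained by summing the joint distribution over X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    """Formula (1): Pr(X | Y) = Pr(X, Y) / Pr(Y)."""
    return joint[(x, y)] / marginal_y(y)


def cond_y_given_x(y, x):
    """Pr(Y | X) = Pr(X, Y) / Pr(X)."""
    return joint[(x, y)] / marginal_x(x)


def bayes_x_given_y(x, y):
    """Rearranged form: Pr(X | Y) = Pr(Y | X) Pr(X) / Pr(Y)."""
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)


# Both forms agree exactly for every combination of values.
for x in (0, 1):
    for y in (0, 1):
        assert cond_x_given_y(x, y) == bayes_x_given_y(x, y)

print(cond_x_given_y(1, 1))  # 4/5
```

Exact fractions are used here so that the two forms can be compared with `==` rather than within a floating-point tolerance.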

Bayes Theorem by Example

Here we consider a simple example to justify the formula (1). The following table gives the joint distribution of smoking and lung cancer in a hypothetical population of 1,000,000 individuals.

$$ \begin{array}{l|rr|r}
 & \text{Smoking: Yes} & \text{Smoking: No} & \text{Total} \\ \hline
\text{Cancer: Yes} & 42{,}500 & 7{,}500 & 50{,}000 \\
\text{Cancer: No} & 207{,}500 & 742{,}500 & 950{,}000 \\ \hline
\text{Total} & 250{,}000 & 750{,}000 & 1{,}000{,}000
\end{array} $$

  • Given these numbers, consider how you would compute the relative frequency of lung cancer among smokers
  • There are a total of 250,000 smokers in this population, and among these 250,000 individuals, 42,500 have lung cancer
  • So the relative frequency of lung cancer among smokers is $\frac{42{,}500}{250{,}000}$
  • As we reason next, this relative frequency is also the conditional probability of lung cancer given the individual is a smoker
  • The frequentist definition of probability of an event is the limiting value of its relative frequency in a “large” number of trials.
  • Suppose we sample individuals with replacement from the 250,000 smokers and compute the relative frequency of lung cancer in the sample.

  • It can be shown that as the sample size goes to infinity, this relative frequency will approach $\frac{42{,}500}{250{,}000}=0.17$.

  • This ratio can also be written as $$\frac{42{,}500/1{,}000{,}000}{250{,}000/1{,}000{,}000}=0.17.$$

  • The ratio in the numerator is the joint probability of smoking and lung cancer, and the ratio in the denominator is the marginal probability of smoking.

  • So, letting $X$ denote lung cancer and $Y$ denote smoking, $$\Pr(X|Y) = \frac{\Pr(X,Y)}{\Pr(Y)},$$ which is formula (1).
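The arithmetic in this example can be reproduced with a short Python sketch. The counts are those in the table above; the variable names, the `simulate` helper, and the seed are assumptions made for illustration. The first part computes the conditional probability from the joint and marginal probabilities, and the second part mimics sampling with replacement from the 250,000 smokers to show the relative frequency settling near the same value.

```python
import random

# Counts from the table (hypothetical population of 1,000,000 individuals).
counts = {
    ("cancer", "smoker"): 42_500,
    ("cancer", "nonsmoker"): 7_500,
    ("no_cancer", "smoker"): 207_500,
    ("no_cancer", "nonsmoker"): 742_500,
}
total = sum(counts.values())

# Joint and marginal probabilities as relative frequencies in the population.
p_cancer_and_smoker = counts[("cancer", "smoker")] / total
p_smoker = (counts[("cancer", "smoker")] + counts[("no_cancer", "smoker")]) / total
p_cancer = (counts[("cancer", "smoker")] + counts[("cancer", "nonsmoker")]) / total

# Formula (1): Pr(cancer | smoker) = Pr(cancer, smoker) / Pr(smoker).
p_cancer_given_smoker = p_cancer_and_smoker / p_smoker
print(round(p_cancer_given_smoker, 4))  # 0.17

# The same value via Pr(smoker | cancer) Pr(cancer) / Pr(smoker).
p_smoker_given_cancer = p_cancer_and_smoker / p_cancer
p_via_bayes = p_smoker_given_cancer * p_cancer / p_smoker


def simulate(n_draws, seed=0):
    """Relative frequency of lung cancer in n_draws smokers sampled with replacement."""
    rng = random.Random(seed)
    n_smokers = 250_000
    n_smokers_with_cancer = 42_500
    # Label smokers 0..249,999 and treat the first 42,500 labels as cancer cases.
    hits = sum(rng.randrange(n_smokers) < n_smokers_with_cancer for _ in range(n_draws))
    return hits / n_draws


# The relative frequency fluctuates for small samples and stabilizes as the sample grows.
for n in (100, 10_000, 200_000):
    print(n, simulate(n))
```

The simulated values depend on the seed, so no particular output is claimed for them; only their tendency to approach 0.17 as the number of draws grows matters here.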
