Bayes Theorem

Motivation

In whole-genome analyses, the number $k$ of marker covariates typically exceeds the number $n$ of observations. In this situation, least squares methods cannot be used to simultaneously estimate the effects of all $k$ marker covariates. One of the most widely used methods to overcome this problem is Bayesian inference, where prior information about marker effects is combined with the data to make inferences about the marker effects. In Bayesian inference, inferences are based on conditional probabilities, and Bayes theorem is a statement about conditional probability.

Conditional Probability of $X$ Given $Y$

Suppose $X$ and $Y$ are two random variables with joint probability distribution $\Pr(X,Y)$. Then, the conditional probability of $X$ given $Y$ is given by Bayes theorem as

$$\Pr(X|Y) = \frac{\Pr(X,Y)}{\Pr(Y)} \tag{1}$$

where $\Pr(Y)$ is the probability distribution of $Y$. Similarly, the conditional probability of $Y$ given $X$ is

$$\Pr(Y|X)=\frac{\Pr(X,Y)}{\Pr(X)},$$

which upon rearranging gives

$$\Pr(X,Y)=\Pr(Y|X)\Pr(X). \tag{2}$$

Then, substituting (2) in (1) gives

$$\begin{aligned} \Pr(X|Y) &= \frac{\Pr(X,Y)}{\Pr(Y)}\\ &= \frac{\Pr(Y|X)\Pr(X)}{\Pr(Y)}, \end{aligned}$$

which is the form of the formula that is used for inference of $X$ given $Y.$
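The derivation above can be checked numerically. The following is a minimal sketch, assuming a hypothetical joint distribution for two binary variables; the probabilities in `joint` and the helper function names are made up for illustration and are not part of the text. It computes $\Pr(X|Y)$ directly from (1) and again from the rearranged form, and confirms that the two agree.

```python
from fractions import Fraction

# Hypothetical joint distribution Pr(X, Y) for binary X and Y.
# Keys are (x, y); the values are illustrative only and sum to 1.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal_x(x):
    """Pr(X = x), obtained by summing the joint over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Pr(Y = y), obtained by summing the joint over x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    """Pr(X = x | Y = y) from equation (1): joint over marginal of Y."""
    return joint[(x, y)] / marginal_y(y)


def cond_y_given_x(y, x):
    """Pr(Y = y | X = x): joint over marginal of X."""
    return joint[(x, y)] / marginal_x(x)


def bayes_x_given_y(x, y):
    """Pr(X = x | Y = y) from the rearranged form Pr(Y|X) Pr(X) / Pr(Y)."""
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)


# Both routes give the same answer for every (x, y) pair.
for x in (0, 1):
    for y in (0, 1):
        assert cond_x_given_y(x, y) == bayes_x_given_y(x, y)

print(cond_x_given_y(1, 1))  # 4/5
```

Exact fractions are used here rather than floating-point numbers so that the equality check is not affected by rounding.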

Bayes Theorem by Example

Here we consider a simple example to justify the formula (1). The following table gives the joint distribution of smoking and lung cancer in a hypothetical population of 1,000,000 individuals.

$$
\begin{array}{l|rr|r}
 & \text{Smoker} & \text{Non-smoker} & \text{Total} \\ \hline
\text{Cancer} & 42{,}500 & 7{,}500 & 50{,}000 \\
\text{No cancer} & 207{,}500 & 742{,}500 & 950{,}000 \\ \hline
\text{Total} & 250{,}000 & 750{,}000 & 1{,}000{,}000
\end{array}
$$

Given these numbers, consider how you would compute the relative frequency of lung cancer among smokers. There are a total of 250,000 smokers in this population, and among these 250,000 individuals, 42,500 have lung cancer. So, the relative frequency of lung cancer among smokers is $\frac{42{,}500}{250{,}000}$. As we reason below, this relative frequency is also the conditional probability of lung cancer given that the individual is a smoker.

  1. The frequentist definition of probability of an event is the limiting value of its relative frequency in a “large” number of trials.

  2. Suppose we sample with replacement individuals from the 250,000 smokers and compute the relative frequency of the incidence of lung cancer.

  3. It can be shown that as the sample size goes to infinity, this relative frequency will approach $\frac{42{,}500}{250{,}000}=0.17$.

  4. This ratio can also be written as $$\frac{42{,}500/1{,}000{,}000}{250{,}000/1{,}000{,}000}=0.17.$$

  5. The ratio in the numerator is the joint probability of smoking and lung cancer, and the ratio in the denominator is the marginal probability of smoking, so this expression has the form of (1) with $X$ denoting lung cancer and $Y$ denoting smoking.
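The steps above can be sketched in code. The following is a minimal illustration, assuming the counts from the table; the names `counts`, `simulate`, and the seed value are choices made here for illustration and do not come from the text. It computes the conditional probability exactly as a ratio of the joint and marginal probabilities, and then approximates it by sampling smokers with replacement.

```python
import random
from fractions import Fraction

# Counts from the table, keyed by (cancer status, smoking status).
counts = {
    ("cancer", "smoker"): 42_500,
    ("cancer", "non-smoker"): 7_500,
    ("no cancer", "smoker"): 207_500,
    ("no cancer", "non-smoker"): 742_500,
}

population = sum(counts.values())
smokers = counts[("cancer", "smoker")] + counts[("no cancer", "smoker")]

# Joint probability of smoking and lung cancer, and marginal probability of smoking.
pr_cancer_and_smoker = Fraction(counts[("cancer", "smoker")], population)
pr_smoker = Fraction(smokers, population)

# Conditional probability of lung cancer given smoking: joint over marginal.
pr_cancer_given_smoker = pr_cancer_and_smoker / pr_smoker
print(pr_cancer_given_smoker)  # 17/100


def simulate(n_draws, seed=0):
    """Sample smokers with replacement and return the relative frequency
    of lung cancer among the sampled individuals."""
    rng = random.Random(seed)
    cancer_among_smokers = counts[("cancer", "smoker")]
    # Label smokers 0..smokers-1; the first 42,500 labels are those with cancer.
    hits = sum(rng.randrange(smokers) < cancer_among_smokers for _ in range(n_draws))
    return hits / n_draws


# The relative frequency should settle near 0.17 as the number of draws grows.
for n in (100, 10_000, 200_000):
    print(n, simulate(n))
```

The simulated values depend on the seed and on the number of draws, so only their tendency toward 0.17 is meaningful, not any particular printed value.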