Bayesian Inference by Application to Simple Linear Regression

Simple linear regression is used here to illustrate Bayesian inference with the Gibbs sampler. The Gibbs sampler is used to draw samples from the posterior distribution of the intercept, the slope and the residual variance.

The Model

Consider the linear model:

$$y_{i}=\beta_{0}+x_{i}\beta_{1}+e_{i}. \tag{35}$$

where for observation $i$, $y_{i}$ is the value of the dependent variable, $\beta_{0}$ is the intercept, $x_{i}$ is the value of the independent variable and $e_{i}$ is a residual. Flat priors are used for the intercept and slope, and the residuals are assumed to be independently and identically distributed normal random variables with mean zero and variance $\sigma_{e}^{2}.$ A scaled inverted chi-square prior is used for $\sigma_{e}^{2}.$

Simulation of Data

In [1]:
using Distributions
using StatsBase
using Printf   # provides @printf, used in the Gibbs sampler cell below
In [20]:
n = 20 #number of observations
k = 1  #number of covariates

x = sample([0,1,2],(n,k))   # covariate values drawn from {0, 1, 2}
X = hcat(ones(Int64,n),x)   # design matrix with a column of ones for the intercept

betaTrue = [1,2]            # true intercept and slope
y = X*betaTrue + randn(n);  # residuals ~ N(0,1), so the true σe² = 1

Least Squares Estimation

In matrix notation, the model (35) is

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{e},$$ where $$\mathbf{X}=\left[\begin{array}{cc} 1 & x_{1}\\ 1 & x_{2}\\ \vdots & \vdots\\ 1 & x_{n} \end{array}\right].$$ Then, the least-squares estimator of $\boldsymbol{\beta}$ is $$\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$ and the variance of this estimator is $$Var(\hat{\boldsymbol{\beta}})=(\mathbf{X}'\mathbf{X})^{-1}\sigma_{e}^{2}.$$

Calculations in Julia:

In [3]:
XPX = X'X
rhs = X'y
XPXi= inv(XPX)
println(XPXi)
[0.16363636363636364 -0.09090909090909091
 -0.09090909090909091 0.07272727272727274]
In [4]:
betaHat = XPXi*rhs
println(betaHat)
[0.6986138506616033,2.293983905821345]
In [5]:
eHat = y - X*betaHat
resVar = eHat'eHat/(n-2)
println(resVar)
[0.45974834730130465]

Bayesian Inference

Consider making inferences about $\boldsymbol{\beta}$ from $f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2}).$ By Bayes' theorem, this conditional density can be written as

$$\begin{eqnarray} f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2}) & = & \frac{f(\mathbf{y}|\boldsymbol{\beta},\sigma_{e}^{2})f(\boldsymbol{\beta})f(\sigma_{e}^{2})}{f(\mathbf{y},\sigma_{e}^{2})}\nonumber \\ & \propto & f(\mathbf{y}|\boldsymbol{\beta},\sigma_{e}^{2})f(\boldsymbol{\beta})f(\sigma_{e}^{2})\nonumber \\ & \propto & f(\mathbf{y}|\boldsymbol{\beta},\sigma_{e}^{2})\nonumber \\ & = & (2\pi\sigma_{e}^{2})^{-n/2}\exp\left\{ -\frac{1}{2}\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})}{\sigma_{e}^{2}}\right\}\tag{36} \end{eqnarray}$$

which looks like the $n$-dimensional normal density of $\mathbf{y}$ with mean $\mathbf{X}\boldsymbol{\beta}$ and covariance matrix $\mathbf{I}\sigma_{e}^{2}.$ But, $f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2})$ should be a two-dimensional density. So, the quadratic $Q=(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})$ in the exponent of (36) is rearranged as

$$\begin{eqnarray} Q & = & (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\\ & = & \mathbf{y}'\mathbf{y}-2\mathbf{y}'\mbox{X}\boldsymbol{\beta}+\boldsymbol{\beta}'(\mathbf{X}'\mathbf{X})\boldsymbol{\beta}\\ & = & \mathbf{y}'\mathbf{y}+(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})'(\mathbf{X}'\mathbf{X})(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})-\hat{\boldsymbol{\beta}}'(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}}, \end{eqnarray}$$

where $\hat{\boldsymbol{\beta}}$ is the solution to $(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}}=\mathbf{X}'\mathbf{y}$, which is the least-squares estimator of $\boldsymbol{\beta}$. In this expression, only the second term depends on $\boldsymbol{\beta}$. Thus, $f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2})$ can be written as

$$ f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2})\propto\exp\left\{ -\frac{1}{2}\frac{(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})'(\mathbf{X}'\mathbf{X})(\boldsymbol{\beta}-\hat{\boldsymbol{\beta}})}{\sigma_{e}^{2}}\right\}, $$

which can be recognized as proportional to the density of a two-dimensional normal distribution with mean $\hat{\boldsymbol{\beta}}$ and variance $(\mathbf{X}'\mathbf{X})^{-1}\sigma_{e}^{2}$. Thus, in this simple setting, the posterior mean of $\boldsymbol{\beta}$ is given by the least-squares estimate, and drawing samples from the posterior is not needed. But, to illustrate the Gibbs sampler, we will apply it to this simple example.
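Before turning to the Gibbs sampler, this result can be checked numerically by drawing directly from the bivariate normal posterior and comparing the sample moments with the least-squares results above. This is a minimal sketch, assuming the variables X, y, betaHat and XPXi from the cells above and the simulated residual variance $\sigma_{e}^{2}=1$; the names postDist and directSamples are introduced here only for illustration.

using Distributions, LinearAlgebra, Statistics

postDist      = MvNormal(betaHat, Matrix(Symmetric(XPXi)))  # N(βhat, (X'X)⁻¹σe²) with σe² = 1
directSamples = rand(postDist, 10000)                       # 2 × 10000 matrix of draws
println(mean(directSamples, dims=2))                        # ≈ betaHat
println(cov(directSamples, dims=2))                         # ≈ (X'X)⁻¹σe²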

Gibbs Sampler for $\boldsymbol{\beta}$

The simple regression model can be written as $$\mathbf{y}=\mathbf{1}\beta_{0}+\mathbf{x}\beta_{1}+\mathbf{e}.$$

In the Gibbs sampler, $\beta_{0}$ is sampled from its full-conditional posterior $f(\beta_{0}|\mathbf{y},\beta_{1},\sigma_{e}^{2})$, which is computed for the current values of $\beta_{1}$ and $\sigma_{e}^{2}.$ So, we can write the model as $$\mathbf{w}_{0}=\mathbf{1}\beta_{0}+\mathbf{e},$$ where $\mathbf{w}_{0}=\mathbf{y}-\mathbf{x}\beta_{1}.$ Then, the least-squares estimator of $\beta_{0}$ is $$\hat{\beta}_{0}=\frac{\mathbf{1}'\mathbf{w}_{0}}{\mathbf{1}'\mathbf{1}},$$ and the variance of this estimator is $$Var(\hat{\beta}_{0})=\frac{\sigma_{e}^{2}}{\mathbf{1}'\mathbf{1}}.$$ By applying the strategy used to derive $f(\boldsymbol{\beta}|\mathbf{y},\sigma_{e}^{2})$ above, the full-conditional posterior for $\beta_{0}$ can be shown to be a normal distribution with mean $\hat{\beta}_{0}$ and variance $\frac{\sigma_{e}^{2}}{\mathbf{1}'\mathbf{1}}.$

Similarly, the full-conditional posterior for $\beta_{1}$ is a normal distribution with mean $$\hat{\beta}_{1}=\frac{\mathbf{x}'\mathbf{w}_{1}}{\mathbf{x}'\mathbf{x}}$$ and variance $\frac{\sigma_{e}^{2}}{\mathbf{x}'\mathbf{x}}$, where $\mathbf{w}_{1}=\mathbf{y}-\mathbf{1}\beta_{0}.$ In the calculations below, we will use the true value of $\sigma_{e}^{2}.$

Calculations in Julia:

In [9]:
# loop for Gibbs sampler
niter = 10000 # number of samples
b     = [0.0, 0.0]   # current samples of [β0, β1]
meanB = [0.0, 0.0]   # running sums used to compute posterior means
a     = Float64[]    # stored samples of the slope β1


for iter = 1:niter
    
    # sampling intercept
    w = y - X[:,2] * b[2]
    x = X[:,1]
    xpxi = 1/(x'x)[1,1]
    bHat = (xpxi*x'w)[1,1] 
    b[1] = rand(Normal(bHat, sqrt(xpxi))) # using residual var = 1 
    
    # sampling slope
    w = y - X[:,1]*b[1]
    x = X[:,2]
    xpxi = 1/(x'x)[1,1]
    bHat = (xpxi*x'w)[1,1]
    b[2] = rand(Normal(bHat, sqrt(xpxi))) # using residual var = 1 
    meanB = meanB + b   
    push!(a,b[2])


    if ((iter%1000) == 0)
        @printf("Intercept = %6.3f \n", meanB[1]/iter)
        @printf("Slope     = %6.3f \n", meanB[2]/iter)
    end
end
Intercept =  0.725 
Slope     =  2.283 
Intercept =  0.695 
Slope     =  2.301 
Intercept =  0.700 
Slope     =  2.297 
Intercept =  0.702 
Slope     =  2.294 
Intercept =  0.700 
Slope     =  2.294 
Intercept =  0.696 
Slope     =  2.296 
Intercept =  0.699 
Slope     =  2.294 
Intercept =  0.709 
Slope     =  2.287 
Intercept =  0.714 
Slope     =  2.283 
Intercept =  0.712 
Slope     =  2.285 
In [11]:
using Gadfly
In [15]:
plot(x=a, Geom.histogram, 
Guide.title("Posterior distribution of β1"),
Guide.ylabel("Frequency"),
Guide.xlabel("β1"))
Out[15]:
[Histogram of the sampled values. Title: "Posterior distribution of β1"; x-axis: β1; y-axis: Frequency.]

Full-conditional Posterior for $\sigma_{e}^{2}$

Recall that we assumed a scaled inverted chi-square prior for $\sigma_{e}^{2}$. The density function for this is:

$$f(\sigma_{e}^{2})=\frac{(S_{e}^{2}\nu_{e}/2)^{\nu_{e}/2}}{\Gamma(\nu_{e}/2)}(\sigma_{e}^{2})^{-(2+\nu_{e})/2}\exp\left\{ -\frac{\nu_{e}S_{e}^{2}}{2\sigma_{e}^{2}}\right\},\tag{37}$$

where $S_{e}^{2}$ and $\nu_{e}$ are the scale and the degrees of freedom parameters for this distribution. Applying Bayes' theorem to combine this prior with the “likelihood” (given in (36)), the full-conditional posterior for the residual variance can be written as

$$\begin{aligned} f(\sigma_{e}^{2}|\mathbf{y},\boldsymbol{\beta}) & = \frac{f(\mathbf{y}|\boldsymbol{\beta},\sigma_{e}^{2})f(\boldsymbol{\beta})f(\sigma_{e}^{2})}{f(\mathbf{y},\boldsymbol{\beta})}\nonumber \\ & \propto f(\mathbf{y}|\boldsymbol{\beta},\sigma_{e}^{2})f(\boldsymbol{\beta})f(\sigma_{e}^{2})\nonumber \\ & \propto (\sigma_{e}^{2})^{-n/2}\exp\left\{ -\frac{1}{2}\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})}{\sigma_{e}^{2}}\right\} \nonumber \\ & \times (\sigma_{e}^{2})^{-(2+\nu_{e})/2}\exp\left\{ -\frac{\nu_{e}S_{e}^{2}}{2\sigma_{e}^{2}}\right\} \nonumber \\ & = (\sigma_{e}^{2})^{-(n+2+\nu_{e})/2}\exp\left\{ -\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\nu_{e}S_{e}^{2}}{2\sigma_{e}^{2}}\right\}. \end{aligned}\tag{38}$$

Comparing (38) with (37), we can see that (38) is proportional to a scaled inverse chi-squared density with $\tilde{\nu}_{e}=n+\nu_{e}$ degrees of freedom and scale parameter $\tilde{S}_{e}^{2}=\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\nu_{e}S_{e}^{2}}{\tilde{\nu}_{e}}$. A sample from this density can be obtained as $\frac{(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})+\nu_{e}S_{e}^{2}}{\chi_{\tilde{\nu}_{e}}^{2}}$, where $\chi_{\tilde{\nu}_{e}}^{2}$ is a chi-squared random variable with $\tilde{\nu}_{e}$ degrees of freedom.
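For example, one such draw could be computed in Julia as follows. This is a minimal sketch, not the exercise solution linked below; the hyperparameter values nuE and S2e are illustrative assumptions, and b is the current sample of $\boldsymbol{\beta}$ from the Gibbs loop.

nuE  = 4                                       # prior degrees of freedom (assumed value)
S2e  = 1.0                                     # prior scale (assumed value)
eHat = y - X*b                                 # residuals at the current β
SSE  = (eHat'eHat)[1,1]                        # (y - Xβ)'(y - Xβ)
varE = (SSE + nuE*S2e)/rand(Chisq(n + nuE))    # one draw of σe² from its full-conditional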

Exercise

In the Julia script given here, the simulated value of the residual variance was used in the sampling of $\boldsymbol{\beta}$. Extend this script to also sample $\sigma_{e}^{2}$ from its full-conditional posterior given above. In Julia, rand(Chisq($\nu$)) gives a chi-squared random variable with $\nu$ degrees of freedom, and rand(Chisq($\nu$),1) gives a one-element vector containing such a variable. Solutions can be found here, where a flat prior for $\sigma_{e}^{2}$ is used.

Model with Normal Prior for Slope

Consider the simple regression model that can be written as $$\mathbf{y}=\mathbf{1}\beta_{0}+\mathbf{x}\beta_{1}+\mathbf{e}.$$ Here we consider a model with a flat prior for $\beta_{0}$ and a normal prior for the slope:

$$\beta_{1}\sim N(0,\sigma_{\beta}^{2}),$$

where $\sigma_{\beta}^{2}$ is assumed to be known.

Then, the joint posterior for $\boldsymbol{\theta}'=[\boldsymbol{\beta}',\sigma_{e}^{2}]$ is

$$\begin{aligned} f(\boldsymbol{\theta}|\mathbf{y}) & \propto f(\mathbf{y}|\boldsymbol{\theta})f(\boldsymbol{\theta})\\ & \propto \left(\sigma_{e}^{2}\right)^{-n/2}\exp\left\{ -\frac{(\mathbf{y}-\mathbf{1}\beta_{0}-\mathbf{x}\beta_{1})'(\mathbf{y}-\mathbf{1}\beta_{0}-\mathbf{x}\beta_{1})}{2\sigma_{e}^{2}}\right\} \\ & \times \left(\sigma_{\beta}^{2}\right)^{-1/2}\exp\left\{ -\frac{\beta_{1}^{2}}{2\sigma_{\beta}^{2}}\right\} \\ & \times (\sigma_{e}^{2})^{-(2+\nu_{e})/2}\exp\left\{ -\frac{\nu_{e}S_{e}^{2}}{2\sigma_{e}^{2}}\right\} .\\ \end{aligned}$$

Full-conditional for $\beta_{1}$:

The full-conditional for $\beta_{1}$ is obtained by dropping all terms and factors that do not involve $\beta_{1}$:

$$\begin{aligned} f(\beta_{1}|\text{ELSE}) & \propto \exp\left\{ -\frac{(\mathbf{y}-\mathbf{1}\beta_{0}-\mathbf{x}\beta_{1})'(\mathbf{y}-\mathbf{1}\beta_{0}-\mathbf{x}\beta_{1})}{2\sigma_{e}^{2}}\right\} \times \exp\left\{ -\frac{\beta_{1}^{2}}{2\sigma_{\beta}^{2}}\right\} \\ & \propto \exp\left\{ -\frac{\mathbf{w}'\mathbf{w}-2\mathbf{w}'\mathbf{x}\beta_{1}+\beta_{1}^{2}(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})}{2\sigma_{e}^{2}}\right\} \\ & \propto \exp\left\{ -\frac{\mathbf{w}'\mathbf{w}+(\beta_{1}-\hat{\beta}_{1})^{2}(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})-\hat{\beta}_{1}^{2}(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})}{2\sigma_{e}^{2}}\right\} \\ & \propto \exp\left\{ -\frac{(\beta_{1}-\hat{\beta}_{1})^{2}}{\frac{2\sigma_{e}^{2}}{(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})}}\right\} , \end{aligned}$$

where $$\hat{\beta}_{1}=\frac{\mathbf{x}'\mathbf{w}}{(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})},$$ and $\mathbf{w}=\mathbf{y}-\mathbf{1}\beta_{0}.$ So, the full-conditional posterior for $\beta_{1}$ is a normal distribution with mean $\hat{\beta}_{1}$ and variance $\frac{\sigma_{e}^{2}}{(\mathbf{x}'\mathbf{x}+\sigma_{e}^{2}/\sigma_{\beta}^{2})}.$
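Relative to the flat-prior sampler above, the only change is therefore the sampling step for $\beta_{1}$. A minimal sketch of that step in Julia, assuming varE ($\sigma_{e}^{2}$) and varB ($\sigma_{\beta}^{2}$) hold known values and b holds the current samples of the intercept and slope, as in the Gibbs loop above:

x    = X[:,2]                               # covariate vector
w    = y - X[:,1]*b[1]                      # adjust y for the current intercept
lhs  = (x'x)[1,1] + varE/varB               # x'x + σe²/σβ²
bHat = (x'w)[1,1]/lhs                       # full-conditional mean of β1
b[2] = rand(Normal(bHat, sqrt(varE/lhs)))   # draw β1 from its full-conditional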

Exercise

  1. Use Julia to simulate a vector of 1000 values for $\beta_{1}$ from a normal distribution with mean zero and variance 3. Plot a histogram of these values.

  2. Use $\beta_{0}=1$, $\beta_{1}=2$ and $\sigma_{e}^{2}=5$, to generate a vector of observations, $\mathbf{y},$ that follows a simple linear regression model.

  3. Use the Gibbs sampler to draw 10,000 samples for $\beta_{1}$ from its posterior distribution.

    1. Compute the mean and variance of the sampled values.

    2. Draw a histogram of the sampled values and compare it with the prior.