Transformations

  • We have been working with linear regression models so far in the course.

  • Some models are nonlinear, but can be transformed to a linear model.

  • We will also see that transformations can sometimes stabilize the variance, making constant variance a more reasonable assumption.

  • Finally, we will see how to correct for unequal variance using a technique called weighted least squares (WLS).

Bacterial colony decay

Here is a simple dataset showing the number of bacteria alive in a colony, $N_t$, as a function of time $t$. A simple linear regression model is clearly not a very good fit.

In [ ]:
bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table', header=T)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
bacteria.lm = lm(N_t ~ t, bacteria.table)
abline(bacteria.lm$coef, lwd=2, col='red')
In [ ]:
par(mfrow=c(2,2))
plot(bacteria.lm, pch=23, bg='orange')

Exponential decay model

  • Suppose the expected number of cells grows (or decays, if $\beta_1 < 0$) like $$E(n_t) = n_0 e^{\beta_1t}, \qquad t=1, 2, 3, \dots$$

  • If we take logs of both sides $$\log E(n_t) = \log n_0 + \beta_1 t.$$

  • A reasonable (?) model: $$\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0,\sigma^2).$$

In [ ]:
bacteria.log.lm <- lm(log(N_t) ~ t, bacteria.table)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
lines(bacteria.table$t, fitted(bacteria.lm), lwd=2, col='red')
lines(bacteria.table$t, exp(fitted(bacteria.log.lm)), lwd=2, col='green')
In [ ]:
par(mfrow=c(2,2))
plot(bacteria.log.lm, pch=23, bg='orange')

Logarithmic transformation

  • This model is slightly different from the original model: $$E(\log n_t) \leq \log E(n_t)$$ (Jensen's inequality), but the two may be approximately equal. (A small simulation of this gap follows the list.)

  • If $\varepsilon_t \sim N(0,\sigma^2)$ then $$n_t = n_0 \cdot \epsilon_t \cdot e^{\beta_1 t}.$$

  • $\epsilon_t=e^{\varepsilon_t}$ is called a log-normal $(0,\sigma^2)$ random variable.
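
To get a feel for the size of the gap above: if $\varepsilon \sim N(0,\sigma^2)$, then $E(e^{\varepsilon}) = e^{\sigma^2/2}$, so $\log E(n_t)$ exceeds $E(\log n_t)$ by exactly $\sigma^2/2$. A minimal simulation check (the value of $\sigma$ below is an arbitrary choice for illustration):

In [ ]:
# Jensen gap for log-normal errors: E(exp(eps)) = exp(sigma^2/2) > 1 when eps ~ N(0, sigma^2),
# so log E(n_t) exceeds E(log n_t) by sigma^2 / 2.
sigma = 0.5                                  # arbitrary illustrative value
eps = rnorm(100000, 0, sigma)
c(simulated=mean(exp(eps)), theoretical=exp(sigma^2/2))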

Linearizing regression function

We see that an exponential growth or decay model can be made (approximately) linear. Here are a few other models that can be linearized (a simulated example follows the list):

  • $y=\alpha x^{\beta}$, use $\tilde{y}=\log(y), \tilde{x}=\log(x)$;

  • $y=\alpha e^{\beta x}$, use $\tilde{y}=\log(y)$;

  • $y=x/(\alpha x - \beta)$, use $\tilde{y}=1/y, \tilde{x}=1/x$.

  • More in textbook.
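
For instance, the power-law model $y=\alpha x^{\beta}$ becomes a simple linear regression after taking logs of both variables. A minimal sketch on simulated data (all parameter values below are made up for illustration):

In [ ]:
# Simulated power-law data with multiplicative log-normal noise: y = alpha * x^beta * exp(eps)
set.seed(1)
alpha = 2; beta = 0.7
x = runif(50, 1, 10)
y = alpha * x^beta * exp(rnorm(50, sd=0.1))
# Log-log transformation linearizes the model: log(y) = log(alpha) + beta * log(x)
power.lm = lm(log(y) ~ log(x))
coef(power.lm)    # intercept estimates log(alpha), slope estimates beta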

Caveats

  • Just because the expected value linearizes doesn't mean that the errors behave correctly.

  • In some cases, this can be corrected using weighted least squares (more later).

  • Constant variance, normality assumptions should still be checked.

Stabilizing variance

  • Sometimes, a transformation can turn non-constant variance errors to "close to" constant variance. This is another situation in which we might consider a transformation.

  • Example: by the "delta rule", if $$\text{Var}(Y) = \sigma^2 E(Y)$$ then $$\text{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$$

  • In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation (a sketch follows this list).
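
As a sketch (assuming the MASS package is installed): boxcox() plots the profile log-likelihood of the Box-Cox parameter $\lambda$ for a fitted model; the maximizing $\lambda$ suggests a transformation, with $\lambda=0$ corresponding to the log.

In [ ]:
# Profile likelihood of the Box-Cox parameter for the bacteria model fit earlier;
# a peak near lambda = 0 would support the log transformation.
library(MASS)
boxcox(bacteria.lm, lambda=seq(-1, 1, 0.05))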

Delta rule

The following approximation is ubiquitous in statistics.

  • Taylor series expansion: $$f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \dots$$

  • Taking the variance of both sides yields: $$\text{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \text{Var}(Y)$$

Delta rule

  • So, for our previous example: $$\text{Var}(\sqrt{Y}) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)}$$

  • Another example: $$\text{Var}(\log(Y)) \simeq \frac{\text{Var}(Y)}{E(Y)^2}.$$ (A quick simulation check of the square-root case follows.)
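
Here is a rough simulation check of the square-root example under an assumed Poisson model, for which $\text{Var}(Y) = E(Y)$ (so $\sigma^2 = 1$ in the earlier notation); the choice of $\lambda$ below is arbitrary:

In [ ]:
# Delta-rule check: for Y ~ Poisson(lambda), Var(Y) = E(Y), so Var(sqrt(Y)) should be about 1/4.
lambda = 20                                  # arbitrary illustrative value
Y = rpois(100000, lambda)
c(simulated=var(sqrt(Y)), delta.rule=1/4)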

Caveats

  • Just because a transformation makes the variance constant doesn't mean the regression function is still linear (or even that it was linear to begin with)!

  • The models are approximations, and once a model is selected our standard diagnostics should be used to assess adequacy of fit.

  • It is possible to have non-constant variance and yet find that the variance-stabilizing transformation destroys linearity of the regression function.

    • Solution: try weighted least squares (WLS).

Correcting for unequal variance

  • We will now see an example in which there seems to be strong evidence for variance that changes based on Region.

  • After observing this, we will create a new model that attempts to correct for this and come up with better estimates.

  • Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities.

Correcting for unequal variance

Variable   Description
$Y$        Per capita education expenditure by state
$X_1$      Per capita income in 1973 by state
$X_2$      Proportion of population under 18
$X_3$      Proportion in urban areas
`Region`   Which region of the country the state is located in
In [ ]:
education.table = read.table('http://stats191.stanford.edu/data/education1975.table', header=T)
education.table$Region = factor(education.table$Region)
education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)
summary(education.lm)
In [ ]:
par(mfrow=c(2,2))
plot(education.lm)
In [ ]:
boxplot(rstandard(education.lm) ~ education.table$Region, 
        col=c('red', 'green', 'blue', 'yellow'))
In [ ]:
keep.subset = (education.table$STATE != 'AK')
education.noAK.lm = lm(Y ~ X1 + X2 + X3, subset=keep.subset, 
                       data=education.table)
summary(education.noAK.lm)
In [ ]:
par(mfrow=c(2,2))
plot(education.noAK.lm)
In [ ]:
par(mfrow=c(1,1))
boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset], 
        col=c('red', 'green', 'blue', 'yellow'))

Reweighting observations

  • If you have a reasonable guess of variance as a function of the predictors, you can use this to reweight the data.

  • Hypothetical example $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2 X_i^2).$$

  • Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1 / X_i$, the model becomes $$\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \tilde{\varepsilon}_i, \qquad \tilde{\varepsilon}_i = \varepsilon_i / X_i \sim N(0,\sigma^2).$$

Weighted Least Squares

  • Fitting this model is equivalent to minimizing $$\sum_{i=1}^n \frac{1}{X_i^2} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

  • Weighted Least Squares $$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, \qquad w_i = \frac{1}{X_i^2}.$$

  • In general, the weights should be inversely proportional to the error variance: $$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$ (A numerical check of the $w_i = 1/X_i^2$ example follows this list.)

  • Our education expenditure example assumes $$ w_i = W_{\tt Region[i]} $$
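
As a numerical check of the equivalence above, on simulated data (all values below are made up for illustration): weighted least squares with $w_i = 1/X_i^2$ gives the same estimates as ordinary least squares on the transformed variables $\tilde{Y}_i = Y_i/X_i$, $\tilde{X}_i = 1/X_i$, with the roles of slope and intercept swapped.

In [ ]:
# WLS with weights 1/X^2 matches OLS on the transformed variables (Y/X ~ 1/X).
set.seed(2)
X = runif(100, 1, 5)
Y = 1 + 2 * X + rnorm(100, sd=0.5 * X)       # Var(eps_i) proportional to X_i^2
wls.fit = lm(Y ~ X, weights=1/X^2)
transformed.fit = lm(I(Y/X) ~ I(1/X))        # intercept estimates beta1, slope estimates beta0
coef(wls.fit)
coef(transformed.fit)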

Common weighting schemes

  • If you have a qualitative variable, then it is easy to estimate weight within groups (our example today).

  • "Often" $$\text{Var}(\varepsilon_i) = \text{Var}(Y_i) = V(E(Y_i))$$

  • Many non-Gaussian (non-Normal) models behave like this: logistic, Poisson regression.

Two stage procedure

  • Suppose we have a hypothesis about the weights, e.g. they are constant within Region, or they follow a model like $$w_i^{-1} = \text{Var}(\varepsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$$ (A sketch of this second case follows the steps below.)

  • We pre-whiten:

    1. Fit model using OLS (Ordinary Least Squares) to get initial estimate $\widehat{\beta}_{OLS}$

    2. Use the residuals (or fitted values) from this model to estimate $w_i$.

    3. Refit model using WLS (Weighted Least Squares).

    4. If needed, iterate previous two steps.
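
Here is a rough sketch of this two-stage procedure under the hypothetical variance model $w_i^{-1} = \alpha_0 + \alpha_1 X_{i1}^2$, on simulated data (the variable names and values are made up for illustration; the Region-based version for the education data is implemented in the next cell):

In [ ]:
# Sketch of pre-whitening with a variance model Var(eps_i) = alpha0 + alpha1 * x1^2,
# using simulated data.
set.seed(3)
x1 = runif(200, 0, 3)
y = 1 + 2 * x1 + rnorm(200, sd=sqrt(0.5 + 1.5 * x1^2))
ols.fit = lm(y ~ x1)                                   # step 1: initial OLS fit
var.fit = lm(resid(ols.fit)^2 ~ I(x1^2))               # step 2: estimate the variance model
w = 1 / pmax(fitted(var.fit), 1e-8)                    # estimated weights, kept positive
wls.fit = lm(y ~ x1, weights=w)                        # step 3: refit using WLS
summary(wls.fit)                                       # step 4: iterate steps 2-3 if needed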

In [ ]:
# Estimate the error variance within each Region from the no-AK fit, then give every
# observation the weight 1 / (estimated variance of its Region). The weight vector is
# aligned with the full data frame so that it can be used together with subset= below.
region.noAK = education.table$Region[keep.subset]
educ.weights = rep(0, nrow(education.table))
for (region in levels(education.table$Region)) {
  in.region = (region.noAK == region)
  region.var = sum(resid(education.noAK.lm)[in.region]^2) / sum(in.region)
  educ.weights[education.table$Region == region] = 1.0 / region.var
}
In [ ]:
unique(educ.weights)

Here is our new model. Note that the scale of the estimates is unchanged. Numerically the estimates are similar. What changes most is the Std. Error column.

In [ ]:
education.noAK.weight.lm <- lm(Y ~ X1 + X2 + X3, 
                               weights=educ.weights, 
                               subset=keep.subset, 
                               data=education.table)
summary(education.noAK.weight.lm)
In [ ]:
summary(education.noAK.lm)
In [ ]:
par(mfrow=c(2,2))
plot(education.noAK.weight.lm)

Let's look at the boxplot again. It looks better, but perhaps not perfect.

In [ ]:
par(mfrow=c(1,1))
boxplot(resid(education.noAK.weight.lm, type='pearson') ~ education.table$Region[keep.subset],
        col=c('red', 'green', 'blue', 'yellow'))

Unequal variance: effects on inference

  • So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.

  • In general, if we ignore unequal variance, our estimates of variance are no longer unbiased. The covariance of the OLS estimator has the "sandwich form" $$\text{Cov}(\widehat{\beta}_{OLS}) = (X'X)^{-1}(X'W^{-1}X)(X'X)^{-1}$$ with $W=\text{diag}(1/\sigma^2_i)$. (A sketch of estimating this follows the list.)

  • If our Std. Error is incorrect, so are our conclusions based on $t$-statistics!

  • In this example, correcting for weights seemed to make the $t$-statistics larger. This will not always be the case!
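
A sketch of estimating the sandwich covariance for the unweighted no-AK fit, plugging squared residuals in for the unknown $\sigma_i^2$ (this is the HC0 estimator; the reference to the sandwich package in the last comment is an assumption about your setup):

In [ ]:
# Heteroskedasticity-consistent ("sandwich") covariance for the OLS fit, using
# squared residuals in place of the unknown sigma_i^2.
X = model.matrix(education.noAK.lm)
e = resid(education.noAK.lm)
XtX.inv = solve(t(X) %*% X)
sandwich.cov = XtX.inv %*% (t(X) %*% diag(e^2) %*% X) %*% XtX.inv
sqrt(diag(sandwich.cov))     # compare with the Std. Error column of summary(education.noAK.lm)
# With the 'sandwich' package installed, sandwich::vcovHC(education.noAK.lm, type='HC0') is equivalent.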

Efficiency

  • The efficiency of an unbiased estimator of $\beta$ is 1 / variance.

  • Estimators can be compared by their efficiency: the more efficient, the better.

  • The other reason to correct for unequal variance (besides obtaining valid inference) is efficiency.

Illustrative example

  • Suppose $$Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), 1 \leq i \leq n.$$

  • Three unbiased estimators of $\mu$: $$\begin{aligned} \widehat{\mu}_1 &= \frac{1}{n}\sum_{i=1}^n Z_i \\ \widehat{\mu}_2 &= \frac{1}{\sum_{i=1}^n i^{-2}}\sum_{i=1}^n i^{-2}Z_i \\ \widehat{\mu}_3 &= \frac{1}{\sum_{i=1}^n i^{-1}}\sum_{i=1}^n i^{-1}Z_i \\ \end{aligned}$$

Illustrative example

  • The estimator $\widehat{\mu}_2$ will always have the lowest variance of the three, hence the tightest confidence intervals.

  • The estimator $\widehat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights which assume each observation has equal variance.

Let's simulate a number of experiments and compare the three estimates.

In [ ]:
# Assumed simulation setup (not shown in the original cell): n observations
# Z_i = mu + eps_i with sd_i = i * sigma; the values of n, mu, sigma are illustrative.
n = 20; mu = 2; sigma = 1
sds = (1:n) * sigma                          # named 'sds' so base R's sd() is not masked
get.sample = function() rnorm(n, mean=mu, sd=sds)

ntrial = 1000
unweighted.estimate = numeric(ntrial)
weighted.estimate = numeric(ntrial)
suboptimal.estimate = numeric(ntrial)

for (i in 1:ntrial) {
  cur.sample = get.sample()
  unweighted.estimate[i] = mean(cur.sample)                       # mu.hat_1: equal weights
  weighted.estimate[i] = sum(cur.sample/sds^2) / sum(1/sds^2)     # mu.hat_2: weights 1/i^2
  suboptimal.estimate[i] = sum(cur.sample/sds) / sum(1/sds)       # mu.hat_3: weights 1/i
}

data.frame(mean(unweighted.estimate),
           sd(unweighted.estimate))
data.frame(mean(weighted.estimate),
           sd(weighted.estimate))
data.frame(mean(suboptimal.estimate),
           sd(suboptimal.estimate))
In [ ]:
# densX, densY (the plot ranges) were set in an earlier, unshown cell; reconstruct
# them here from the three estimated densities.
dens.list = lapply(list(weighted.estimate, unweighted.estimate, suboptimal.estimate), density)
densX = range(sapply(dens.list, function(d) range(d$x)))
densY = range(sapply(dens.list, function(d) range(d$y)))
plot(densX, densY, type='n', main='Comparison of densities of the estimators', xlab='estimate', ylab='density', cex=0.8)
lines(dens.list[[1]], col='red', lwd=4)
lines(dens.list[[2]], col='blue', lwd=4)
lines(dens.list[[3]], col='purple', lwd=4)
legend('topright', c('optimal', 'suboptimal', 'mean'), col=c('red', 'purple', 'blue'), lwd=rep(4,3), cex=0.8)