Transformations to achieve linearity

We have been working with

*linear*regression models so far in the course.Some models are nonlinear, but can be

*transformed*to a linear model.We will also see that transformations can sometimes

*stabilize*the variance making constant variance a more reasonable assumption.Finally, we will see how to correct for unequal variance using a technique weighted least squares (WLS).

Here is a simple dataset showing the number of bacteria alive in a colony, $N_t$ as a function of time $t$. A simple linear regression model is clearly not a very good fit.

In [1]:

```
bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table', header=T)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
bacteria.lm = lm(N_t ~ t, bacteria.table)
abline(bacteria.lm$coef)
```

In [2]:

```
par(mfrow=c(2,2))
plot(bacteria.lm, pch=23, bg='orange')
```

Suppose the expected number of cells grows like $$E(n_t) = n_0 e^{\beta_1t}, \qquad t=1, 2, 3, \dots$$

If we take logs of both sides $$\log E(n_t) = \log n_0 + \beta_1 t.$$

A reasonable (?) model: $$\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0,\sigma^2).$$

In [3]:

```
bacteria.log.lm <- lm(log(N_t) ~ t, bacteria.table)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
lines(bacteria.table$t, fitted(bacteria.lm), lwd=2, col='red')
lines(bacteria.table$t, exp(fitted(bacteria.log.lm)), lwd=2, col='green')
```

In [4]:

```
par(mfrow=c(2,2))
plot(bacteria.log.lm, pch=23, bg='orange')
```

This model slightly different than original model: $$E(\log n_t) \leq \log E(n_t)$$ but may be approximately true.

If $\varepsilon_t \sim N(0,\sigma^2)$ then $$n_t = n_0 \cdot \epsilon_t \cdot e^{\beta_1 t}.$$

$\epsilon_t=e^{\varepsilon_t}$ is called a log-normal random $(0,\sigma^2)$ random variable.

We see that an exponential growth or decay model can be made (approximately) linear. Here are a few other models that can be linearized:

$y=\alpha x^{\beta}$, use $\tilde{y}=\log(y), \tilde{x}=\log(x)$;

$y=\alpha e^{\beta x}$, use $\tilde{y}=\log(y)$;

$y=x/(\alpha x - \beta)$, use $\tilde{y}=1/y, \tilde{x}=1/x$.

More in textbook.

Just because expected value linearizes, doesn’t mean that the errors behave correctly.

In some cases, this can be corrected using weighted least squares (more later).

Constant variance, normality assumptions should still be checked.

Sometimes, a transformation can turn non-constant variance errors to "close to" constant variance. This is another situation in which we might consider a transformation.

Example: by the "delta rule", if $$Var(Y) = \sigma^2 E(Y)$$ then $$\text{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$$

In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation.

The following approximation is ubiquitous in statistics.

Taylor series expansion: $$f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \dots$$

Taking expectations of both sides yields: $$\text{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \text{Var}(Y)$$

So, for our previous example: $$\text{Var}(\sqrt{Y}) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)}$$

Another example $$\text{Var}(\log(Y)) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)^2}.$$

Just because a transformation makes variance constant doesn’t mean regression function is still linear (or even that it was linear)!

The models are approximations, and once a model is selected our standard "diagnostics" should be used to assess adequacy of fit.

It is possible to have non-constant variance but the variance stabilizing transformation may destroy linearity of the regression function.

*Solution:*try weighted least squares (WLS).

We will now see an example in which there seems to be strong evidence
for variance that changes based on `Region`

.

After observing this, we will create a new model that
attempts to *correct* for this and come up with better estimates.

*Correcting* for unequal variance, as we describe it here, generally
requires a model for how the variance depends on observable quantities.

Variable | Description |

$Y$ | Per capita education expenditure by state |

$X_1$ | Per capita income in 1973 by state |

$X_2$ | Proportion of population under 18 |

$X_3$ | Proportion in urban areas |

`Region` | Which region of the country are the states located in |

In [5]:

```
education.table = read.table('http://stats191.stanford.edu/data/education1975.table', header=T)
education.table$Region = factor(education.table$Region)
education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)
summary(education.lm)
```

In [6]:

```
par(mfrow=c(2,2))
plot(education.lm)
```

In [7]:

```
boxplot(rstandard(education.lm) ~ education.table$Region,
col=c('red', 'green', 'blue', 'yellow'))
```

In [8]:

```
keep.subset = (education.table$STATE != 'AK')
education.noAK.lm = lm(Y ~ X1 + X2 + X3, subset=keep.subset, data=education.table)
summary(education.noAK.lm)
```

In [9]:

```
par(mfrow=c(2,2))
plot(education.noAK.lm)
```

In [10]:

```
par(mfrow=c(1,1))
boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))
```

If you have a reasonable guess of variance as a function of the predictors, you can use this to

*reweight*the data.Hypothetical example $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2 X_i^2).$$

Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1 / X_i$, model becomes $$\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \epsilon_i, \epsilon_i \sim N(0,\sigma^2).$$

Fitting this model is equivalent to minimizing $$\sum_{i=1}^n \frac{1}{X_i^2} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

Weighted Least Squares $$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, \qquad w_i = \frac{1}{X_i^2}.$$

In general, weights should be like: $$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$

Our education expenditure example assumes $$ w_i = W_{\tt Region[i]} $$

If you have a qualitative variable, then it is easy to estimate weight within groups (our example today).

"Often" $$\text{Var}(\varepsilon_i) = \text{Var}(Y_i) = V(E(Y_i))$$

Many non-Gaussian (non-Normal) models behave like this: logistic, Poisson regression.

Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like $$w_i^{-1} = \text{Var}(\epsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$$

We pre-whiten:

Fit model using OLS (Ordinary Least Squares) to get initial estimate $\widehat{\beta}_{OLS}$

Use predicted values from this model to estimate $w_i$.

Refit model using WLS (Weighted Least Squares).

If needed, iterate previous two steps.

In [11]:

```
educ.weights = 0 * education.table$Y
for (region in levels(education.table$Region)) {
subset.region = (education.table$Region[keep.subset] == region)
educ.weights[subset.region] <- 1.0 / (sum(resid(education.noAK.lm)[subset.region]^2) / sum(subset.region))
}
```

In [12]:

```
unique(educ.weights)
```

*unchanged*. Numerically
the estimates are similar. What changes most is the `Std. Error`

column.

In [13]:

```
education.noAK.weight.lm <- lm(Y ~ X1 + X2 + X3, weights=educ.weights, subset=keep.subset, data=education.table)
summary(education.noAK.weight.lm)
```

In [14]:

```
summary(education.noAK.lm)
```

In [15]:

```
par(mfrow=c(2,2))
plot(education.noAK.weight.lm)
```

Let's look at the boxplot again. It looks better, but perhaps not perfect.

In [16]:

```
par(mfrow=c(1,1))
boxplot(resid(education.noAK.weight.lm, type='pearson') ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))
```

So far, we have just mentioned that things

*may*have unequal variance, but not thought about how it affects inference.In general, if we ignore unequal variance, our estimates of variance are no longer unbiased. The covariance has the “sandwich form” $$\text{Cov}(\widehat{\beta}_{OLS}) = (X'X)^{-1}(XW^{-1}X)(X'X)^{-1}.$$ with $W=\text{diag}(1/\sigma^2_i)$.

**If our**`Std. Error`

is incorrect, so are our conclusions based on $t$-statistics!In this example, correcting for weights seemed to make the $t$-statistics larger.

**This will not always be the case!**

The efficiency of an unbiased estimator of $\beta$ is 1 / variance.

Estimators can be compared by their efficiency: the more efficient, the better.

The other reason to correct for unequal variance (besides so that we get valid inference) is for efficiency.

Suppose $$Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), 1 \leq i \leq n.$$

Two unbiased estimators of $\mu$: $$\begin{aligned} \widehat{\mu}_1 &= \frac{1}{n}\sum_{i=1}^n Z_i \\ \widehat{\mu}_2 &= \frac{1}{\sum_{i=1}^n i^{-2}}\sum_{i=1}^n i^{-2}Z_i \\ \widehat{\mu}_3 &= \frac{1}{\sum_{i=1}^n i^{-1}}\sum_{i=1}^n i^{-1}Z_i \\ \end{aligned}$$

The estimator $\widehat{\mu}_2$ will always have lower variance, hence tighter confidence intervals.

The estimator $\widehat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights which assume each observation has equal variance.

In [17]:

```
ntrial = 10000 # how many trials will we be doing?
nsample = 20 # how many points in each trial
sd = c(1:20) # how does the variance change
mu = 2.0
get.sample <- function() {
return(rnorm(nsample)*sd + mu)
}
unweighted.estimate <- function(cur.sample) {
return(mean(cur.sample))
}
unweighted.estimate <- numeric(ntrial)
weighted.estimate <- numeric(ntrial)
suboptimal.estimate <- numeric(ntrial)
```

Let's simulate a number of experiments and compare the three estimates.

In [18]:

```
set.seed(0)
for (i in 1:ntrial) {
cur.sample = get.sample()
unweighted.estimate[i] = mean(cur.sample)
weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
}
print(data.frame(mean(unweighted.estimate),
sd(unweighted.estimate)))
print(data.frame(mean(weighted.estimate),
sd(weighted.estimate)))
print(data.frame(mean(suboptimal.estimate),
sd(suboptimal.estimate)))
```

In [19]:

```
Y = c(density(unweighted.estimate)$y, density(weighted.estimate)$y, density(suboptimal.estimate)$y)
X = c(density(unweighted.estimate)$x, density(weighted.estimate)$x, density(suboptimal.estimate)$x)
```

In [20]:

```
plot(X, Y, type='n', main='Comparison of densities of the estimators')
lines(density(weighted.estimate), col='red', lwd=4)
lines(density(unweighted.estimate), col='blue', lwd=4)
lines(density(suboptimal.estimate), col='purple', lwd=4)
legend(6,0.3, c('optimal', 'suboptimal', 'mean'), col=c('red', 'purple', 'blue'), lwd=rep(4,3))
```