# Transformations¶

• We have been working with linear regression models so far in the course.

• Some models are nonlinear, but can be transformed to a linear model.

• We will also see that transformations can sometimes stabilize the variance making constant variance a more reasonable assumption.

• Finally, we will see how to correct for unequal variance using a technique weighted least squares (WLS).

In [ ]:
options(repr.plot.width=4, repr.plot.height=4)


## Bacterial colony decay¶

Here is a simple dataset showing the number of bacteria alive in a colony, $N_t$ as a function of time $t$. A simple linear regression model is clearly not a very good fit.

In [ ]:
bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table', header=T)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
bacteria.lm = lm(N_t ~ t, bacteria.table)
abline(bacteria.lm$coef, lwd=2, col='red')  In [ ]: par(mfrow=c(2,2)) plot(bacteria.lm, pch=23, bg='orange')  ## Exponential decay model¶ • Suppose the expected number of cells grows like $$E(n_t) = n_0 e^{\beta_1t}, \qquad t=1, 2, 3, \dots$$ • If we take logs of both sides $$\log E(n_t) = \log n_0 + \beta_1 t.$$ • A reasonable (?) model: $$\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0,\sigma^2).$$ In [ ]: bacteria.log.lm <- lm(log(N_t) ~ t, bacteria.table) plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange') lines(bacteria.table$t, fitted(bacteria.lm), lwd=2, col='red')
lines(bacteria.table$t, exp(fitted(bacteria.log.lm)), lwd=2, col='green')  In [ ]: par(mfrow=c(2,2)) plot(bacteria.log.lm, pch=23, bg='orange')  ### Logarithmic transformation¶ • This model slightly different than original model: $$E(\log n_t) \leq \log E(n_t)$$ but may be approximately true. • If$\varepsilon_t \sim N(0,\sigma^2)$then $$n_t = n_0 \cdot \epsilon_t \cdot e^{\beta_1 t}.$$ •$\epsilon_t=e^{\varepsilon_t}$is called a log-normal random$(0,\sigma^2)$random variable. ## Linearizing regression function¶ We see that an exponential growth or decay model can be made (approximately) linear. Here are a few other models that can be linearized: •$y=\alpha x^{\beta}$, use$\tilde{y}=\log(y), \tilde{x}=\log(x)$; •$y=\alpha e^{\beta x}$, use$\tilde{y}=\log(y)$; •$y=x/(\alpha x - \beta)$, use$\tilde{y}=1/y, \tilde{x}=1/x$. • More in textbook. ### Caveats¶ • Just because expected value linearizes, doesn’t mean that the errors behave correctly. • In some cases, this can be corrected using weighted least squares (more later). • Constant variance, normality assumptions should still be checked. ## Stabilizing variance¶ • Sometimes, a transformation can turn non-constant variance errors to "close to" constant variance. This is another situation in which we might consider a transformation. • Example: by the "delta rule", if $$\text{Var}(Y) = \sigma^2 E(Y)$$ then $$\text{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$$ • In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation. ## Delta rule¶ The following approximation is ubiquitous in statistics. • Taylor series expansion: $$f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \dots$$ • Taking expectations of both sides yields: $$\text{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \text{Var}(Y)$$ ## Delta rule¶ • So, for our previous example: $$\text{Var}(\sqrt{Y}) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)}$$ • Another example $$\text{Var}(\log(Y)) \simeq \frac{\text{Var}(Y)}{E(Y)^2}.$$ ### Caveats¶ • Just because a transformation makes variance constant doesn’t mean regression function is still linear (or even that it was linear)! • The models are approximations, and once a model is selected our standard diagnostics should be used to assess adequacy of fit. • It is possible to have non-constant variance but the variance stabilizing transformation may destroy linearity of the regression function. • Solution: try weighted least squares (WLS). ## Correcting for unequal variance¶ • We will now see an example in which there seems to be strong evidence for variance that changes based on Region. • After observing this, we will create a new model that attempts to correct for this and come up with better estimates. • Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities. ## Correcting for unequal variance¶  Variable Description$Y$Per capita education expenditure by state$X_1$Per capita income in 1973 by state$X_2$Proportion of population under 18$X_3$Proportion in urban areas Region Which region of the country are the states located in In [ ]: education.table = read.table('http://stats191.stanford.edu/data/education1975.table', header=T) education.table$Region = factor(education.table$Region) education.lm = lm(Y ~ X1 + X2 + X3, data=education.table) summary(education.lm)  In [ ]: par(mfrow=c(2,2)) plot(education.lm)  In [ ]: boxplot(rstandard(education.lm) ~ education.table$Region,
col=c('red', 'green', 'blue', 'yellow'))

In [ ]:
keep.subset = (education.table$STATE != 'AK') education.noAK.lm = lm(Y ~ X1 + X2 + X3, subset=keep.subset, data=education.table) summary(education.noAK.lm)  In [ ]: par(mfrow=c(2,2)) plot(education.noAK.lm)  In [ ]: par(mfrow=c(1,1)) boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))


## Reweighting observations¶

• If you have a reasonable guess of variance as a function of the predictors, you can use this to reweight the data.

• Hypothetical example $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2 X_i^2).$$

• Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1 / X_i$, model becomes $$\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \epsilon_i, \epsilon_i \sim N(0,\sigma^2).$$

## Weighted Least Squares¶

• Fitting this model is equivalent to minimizing $$\sum_{i=1}^n \frac{1}{X_i^2} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

• Weighted Least Squares $$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, \qquad w_i = \frac{1}{X_i^2}.$$

• In general, weights should be like: $$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$

• Our education expenditure example assumes $$w_i = W_{\tt Region[i]}$$

## Common weighting schemes¶

• If you have a qualitative variable, then it is easy to estimate weight within groups (our example today).

• "Often" $$\text{Var}(\varepsilon_i) = \text{Var}(Y_i) = V(E(Y_i))$$

• Many non-Gaussian (non-Normal) models behave like this: logistic, Poisson regression.

## Two stage procedure¶

• Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like $$w_i^{-1} = \text{Var}(\epsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$$

• We pre-whiten:

1. Fit model using OLS (Ordinary Least Squares) to get initial estimate $\widehat{\beta}_{OLS}$

2. Use predicted values from this model to estimate $w_i$.

3. Refit model using WLS (Weighted Least Squares).

4. If needed, iterate previous two steps.

In [ ]:
educ.weights = 0 * education.table$Y for (region in levels(education.table$Region)) {
subset.region = (education.table$Region[keep.subset] == region) educ.weights[subset.region] <- 1.0 / (sum(resid(education.noAK.lm) [subset.region]^2) / sum(subset.region)) }  In [ ]: unique(educ.weights)  Here is our new model. Note that the scale of the estimates is unchanged. Numerically the estimates are similar. What changes most is the Std. Error column. In [ ]: education.noAK.weight.lm <- lm(Y ~ X1 + X2 + X3, weights=educ.weights, subset=keep.subset, data=education.table) summary(education.noAK.weight.lm)  In [ ]: summary(education.noAK.lm)  In [ ]: par(mfrow=c(2,2)) plot(education.noAK.weight.lm)  Let's look at the boxplot again. It looks better, but perhaps not perfect. In [ ]: par(mfrow=c(1,1)) boxplot(resid(education.noAK.weight.lm, type='pearson') ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))


## Unequal variance: effects on inference¶

• So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.

• In general, if we ignore unequal variance, our estimates of variance are no longer unbiased. The covariance has the “sandwich form” $$\text{Cov}(\widehat{\beta}_{OLS}) = (X'X)^{-1}(X'W^{-1}X)(X'X)^{-1}.$$ with $W=\text{diag}(1/\sigma^2_i)$.

• If our Std. Error is incorrect, so are our conclusions based on $t$-statistics!

• In this example, correcting for weights seemed to make the $t$-statistics larger. This will not always be the case!

## Efficiency¶

• The efficiency of an unbiased estimator of $\beta$ is 1 / variance.

• Estimators can be compared by their efficiency: the more efficient, the better.

• The other reason to correct for unequal variance (besides so that we get valid inference) is for efficiency.

## Illustrative example¶

• Suppose $$Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), 1 \leq i \leq n.$$

• Two unbiased estimators of $\mu$: \begin{aligned} \widehat{\mu}_1 &= \frac{1}{n}\sum_{i=1}^n Z_i \\ \widehat{\mu}_2 &= \frac{1}{\sum_{i=1}^n i^{-2}}\sum_{i=1}^n i^{-2}Z_i \\ \widehat{\mu}_3 &= \frac{1}{\sum_{i=1}^n i^{-1}}\sum_{i=1}^n i^{-1}Z_i \\ \end{aligned}

## Illustrative example¶

• The estimator $\widehat{\mu}_2$ will always have lower variance, hence tighter confidence intervals.

• The estimator $\widehat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights which assume each observation has equal variance.

In [ ]:
ntrial = 10000   # how many trials will we be doing?
nsample = 20   # how many points in each trial
sd = c(1:20)   # how does the variance change
mu = 2.0

get.sample <- function() {
return(rnorm(nsample)*sd + mu)
}

unweighted.estimate <- function(cur.sample) {
return(mean(cur.sample))
}

unweighted.estimate <- numeric(ntrial)
weighted.estimate <- numeric(ntrial)
suboptimal.estimate <- numeric(ntrial)


Let's simulate a number of experiments and compare the three estimates.

In [ ]:
ntrial = 1000
for (i in 1:ntrial) {
cur.sample = get.sample()
unweighted.estimate[i] = mean(cur.sample)
weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
}

data.frame(mean(unweighted.estimate),
sd(unweighted.estimate))
data.frame(mean(weighted.estimate),
sd(weighted.estimate))
data.frame(mean(suboptimal.estimate),
sd(suboptimal.estimate))

In [ ]:
densY = c(density(unweighted.estimate)$y, density(weighted.estimate)$y, density(suboptimal.estimate)$y) densX = c(density(unweighted.estimate)$x, density(weighted.estimate)$x, density(suboptimal.estimate)$x)
options(repr.plot.width=5, repr.plot.height=5)

In [ ]:
plot(densX, densY, type='n', main='Comparison of densities of the estimators', cex=0.8)
lines(density(weighted.estimate), col='red', lwd=4)
lines(density(unweighted.estimate), col='blue', lwd=4)
lines(density(suboptimal.estimate), col='purple', lwd=4)
legend(6,0.3, c('optimal', 'suboptimal', 'mean'), col=c('red', 'purple', 'blue'), lwd=rep(4,3), cex=0.8)