Ian Dworkin

September 18, 2012

In this tutorial we are going to examine how **R** fits some simple linear models,
and how we can interpret some of the outputs. I generally like to call all of the
libraries that I will use first, and double check that there are no conflicts (in
terms of names of functions).

In [87]:

```
%load_ext rmagic
```

In [88]:

```
%%R
# changes some default display options
options(show.signif.stars=F)
options(digits=3)
```

In [89]:

```
%%R
require(arm)
require(car)
require(lattice)
```

As usual, we set our working directory and read in the data. You will want to change the working directory to your home directory.

In [90]:

```
%%R -nd dll.data
dll.data <- read.csv("/Users/olsonran/Documents/dll.csv", header=TRUE)
```

I always start by looking at the input of the data
rst using `summary(your.data.frame)`

and `str(your.data.frame)`

. I also use `head(your.data.frame)`

and `tail(your.data.frame)`

to con rm it was all imported as expected. For space here, I will not call them

but you should run them yourself `summary(dll.data)`

, `str(dll.data)`

, etc...

If you have used these functions, you will see that some of the variable in

`dll.data`

are treated numerically, when in fact we want them to be treated

as factors. In particular temp and replicate. So we have **R** change these to

factors.

In [91]:

```
%%R
dll.data$temp <- factor(dll.data$temp)
dll.data$replicate <- factor(dll.data$replicate)
```

We also want to make sure that the default level of the factor `genotype=="wt"`

.
By default factors in **R** go alpha-numerically. That is the lowest letter of the
alphabet, or smallest numbers are the default base level. However, we want to
reset this to aid in intepretation. To do this we use the relevel function, which
has the name of the variable that we are changing, and the new default level

In [92]:

```
%%R
dll.data$genotype <- relevel(dll.data$genotype, "wt")
# Setting the wild-type (wt) as reference
```

For the purposes of this tutorial we are also going to remove all rows that have any missing observations. This is not generally advisable, and the smarter approach would be to just remove specic rows that have missingness for the variables used in the model.

In [93]:

```
%%R
dll.data <- na.omit(dll.data)
```

At this point it is wise to call `summary()`

and `str()`

for your data frame
again to make sure the expected changes occurred.

Before we fit the models, we will quickly do some exploratory plots. Please review the exploratory data analysis tutorial for more details.

In [94]:

```
%%R
# The plot
plot(dll.data$SCT ~ dll.data$tarsus,
col=c("red", "blue")[dll.data$genotype], ylim=c(5,25),
pch=c(1,2)[dll.data$temp],cex=0.8, ylab="# of SCT",
xlab="Tarsus length",
main="scatterplot of Sex comb teeth and basi tarsus length")
# Add a legend
legend(0.23,25, legend=c("Dll 25","Dll 30", "wt 25", "wt 30"),
col=c("red", "red", "blue", "blue"), pch=c(1,2,1,2))
# lowess curves
lines(lowess(dll.data$tarsus[dll.data$genotype=="Dll" & dll.data$temp=="25"],
dll.data$SCT[dll.data$genotype=="Dll" & dll.data$temp=="25"]),
lwd=2, type="o", col="red")
lines(lowess(dll.data$tarsus[dll.data$genotype=="Dll" & dll.data$temp=="30"],
dll.data$SCT[dll.data$genotype=="Dll" & dll.data$temp=="30"]),
lwd=2, type="o", col="red", pch=2)
lines(lowess(dll.data$tarsus[dll.data$genotype=="wt" & dll.data$temp=="25"],
dll.data$SCT[dll.data$genotype=="wt" & dll.data$temp=="25"]),
lwd=2, type="o", col="blue")
lines(lowess(dll.data$tarsus[dll.data$genotype=="wt" & dll.data$temp=="30"],
dll.data$SCT[dll.data$genotype=="wt" & dll.data$temp=="30"]),
lwd=2, type="o", col="blue", pch=2)
```

In this case, the lowess lines are not actually very informative on the plot,
but perhaps we can see the dierence in the trend for the relationship between
`SCT`

and `tarsus`

across the two genotypes.

We could do a fair bit of additional plotting, but I leave this as an exercise
(see associated **R** code).

To perform this regression in **R** we use the `lm()`

function. lm stands for Linear
Model. As we will learn linear regression is just a special case of the *general
linear model* in statistics.

In **R** we regress `y`

onto `x`

like:

`lm(y ~ x, data=YourData)`

, where the tilde ('~') means 'is modeled as'.

So you would read this as "y is modeled as a function of x". Using the `lm()`

additionally implies that we are assuming that the residual (unexplained) variation
in the model is normally distributed, and that the response is (to a rst
approximation) a continuous variable. So we are really fitting the model.

$y = f(x) ~ N(\beta_0 + \beta_1x, \sigma)$ , where

$N()$ means normally distributed with mean ($\mu$) equal to $\beta_0 + \beta_1x$

$\beta_0$ is the intercept

$\beta_1$ is the slope of the regression line

$\sigma$ is the distribution for the residual (unexplained) variation. The variation remaining in our response ($y$) after accounting for the model. Here I have it in the same units as $y$, but you will often see it expressed as a variance ($\sigma^2$), and being labeled the unexplained variance of the model.

Using the `lm()`

in **R** assumes that you want to fit an intercept term in your
model (which we almost want to do). So we did not need to explicitly include
it in the function call to `lm()`

. However, we certainly can (and it is probably a
good idea when you are rst starting out). In any case it means that these two
calls are equivalent:

$lm(y ~ 1 + x)$

$lm(y ~ x)$

Now we actually fit the model in **R**.

In [95]:

```
%%R
regression.1 <- lm(SCT ~ tarsus, data=dll.data) # The intercept is implicit
regression.1.Intercept <- lm(SCT ~ 1 + tarsus, data=dll.data)
# With explicit intercept, but the same as above!
```

It is useful to remind ourselves of the design matrix. We can use the
`model.matrix()`

. Remember that it will have as many rows as observations!
We will just look at the first few observations.

In [96]:

```
%%R
print(head(model.matrix(regression.1)))
```

This is just a matrix so we can use indexing like normal in **R**.

In [97]:

```
%%R
print(model.matrix(regression.1)[500:508,])
```

Let us first look at the estimated coecients from this regression. We use the
`coef()`

function.

In [98]:

```
%%R
print(coef(regression.1))
```

These represent the intercept and the slope of the model. However (and
we will return to this), the intercept tells us the expected number of `SCT`

when
`tarsus=0`

, which is clearly not a sensible value. The tarsus could not have a
zero length, and how could sex comb teeth be present on a structure with a zero
length?

We should also look at the confidence intervals for our estimated slope and
intercept. There is a handy function `confint()`

to get the 95% condence
intervals for these parameters.

In [99]:

```
%%R
print(confint(regression.1))
```

One thing that is clear from these values, the 95% CI do not overlap with zero, consistent with a so-called 'statistically signicant' effect.

As we often do with **R**, we will use the `summary()`

function. However before
we do for the linear model, I want to make a slight detour for a discussion
about generic functions in **R**. When you call `summary()`

, the rst thing it does
is look at the class of the object, and then it uses a class specic method. What
this means is that calling `summary()`

actually calls different functions depending
upon the class of the object.

If you look up the help for `summary()`

, `?summary`

under the description it
will say "summary is a generic function used to produce result summaries of
the results of various model tting functions. The function invokes particular
methods which depend on the class of the first argument." What this means
is that if you call `summary()`

on a data frame, i.e. `summary(your.data)`

, it
is actually calling the function `summary.data.frame(your.data)`

. If you call
`summary(your.model.object)`

on a linear model object, `lm(your.model)`

, it
will actually call the function `summary.lm(your.model.object)`

. To get a better
sense of this type `methods(summary)`

into the **R** console, and you will see a
list of different class specific methods for the `summary()`

generic function. This
is why summary seems to do so many different things!!! It also means if you
want to understand more about `summary.lm`

you need to use `?summary.lm`

.

Now back to the linear model.

In [100]:

```
%%R
print(summary(regression.1))
```

There are a couple of useful aspects to this output. In addition to the
estimated coefficients for the slope and intercept, we have their standard errors,
t-values (which is just the parameter estimate divided by the standard error)
and p-values. We can double the standard errors to get approximate condence
intervals but the `confint()`

allows us to get these anyways. We also get the
quantiles for the residuals. We expect that these should be symmetric in value,
with the median close to zero. That is clearly not the case here, so we may want
to look further at the residuals (which we will do in a later tutorial).

$R^2$, also known as the coefficient of determination provides a quantitative
assessment of *How much of the observed variation in our response (SCT) can
be accounted for by our model?* In this case it is approximately 8%. While this
is small, for a single variable such as a measure of length, that is not negligible
by any means.

It is also worth interpreting the Residual Standard Error (RSE). The RSE
of the model is ~ 1:55, which tells us we can predict the number of SCT on
an individual with an accuracy of 1:55 SCT *given this model*. See page 41 of
Gelman and Hill for more discussion.

I also think it is generally wise to take a look at the co-variances for the estimated parameters. I am not speaking of co-variation or correlation of the observed values, but of the estimates we just produced for the slope and intercept. These helps you recognize whether the estimated values depend on each other (and suggests some lack of independence between predictors, or other issues in your model that need to be fixed).

In [101]:

```
%%R
print(vcov(regression.1))
```

Question: How would we get the Standard errors for the estimates from this matrix? I will ask this in class.

To address the degree of co-variation between the estimated slope and intercept,
we look at the off-diagonals. Sometimes these are difficult to interpret
as co-variances, so we convert this to a correlation matrix using the `cov2cor()`

function.

In [102]:

```
%%R
print(cov2cor(vcov(regression.1)))
```

The diagonals are correlations of a parameter estimate on itself, so these have to be one, and are not interesting. We are again focused on the off-diagonal elements. In this case the value is $\sim - .99$ which means that our parameter estimates are almost perfectly negatively correlated. This is clearly not good news. This problem is due almost entirely to the fact that we are estimating an intercept way outside the range of our data.

Let's take a look at the estimated parameters again.

In [103]:

```
%%R
print(coef(regression.1))
```

The intercept 6.08 is interpreted as the number of `SCT`

when `tarsus=0`

. This
is clearly a very silly interpretation, inflates the SE for the intercept, and the
co-variation between parameters. This plot demonstrates how crazy this would
be. The **R** code for this (including confidence intervals) is in the .R file of the
same name with some comments to explain what we have done.

In [104]:

```
%%R
plot(SCT ~ tarsus, data=dll.data, xlim=c(0, 0.258))
abline(regression.1)
# We can look at the confidence region (confidence bands ) as well to see how this influences our uncertainty.
# To make confidence bands, we generally need to generate a set of "new" values for our explanatory variable (tarsus), and then use these to predict fitted values (and Confidence intervals on fitted values)
# We make the range go from 0, since we are fitting the intercept back to zero. Clearly a crazy idea!
# make a new data frame with our "new" values of tarsus length
new_tarsus <- data.frame(tarsus = seq(0, range(dll.data$tarsus)[2], by=0.005) ) # Makes a new data frame
# "predict" values for our new values of tarsus
predicted.model.1 <- predict(regression.1, new_tarsus, interval="confidence")
#Then we add these lines back on
lines(x = new_tarsus[,1], y = predicted.model.1[,1], lwd=4) # this would be the same as abline(model.1)
lines(x = new_tarsus[,1], y = predicted.model.1[,3], lwd=3, lty=2, col="grey")
lines(x = new_tarsus[,1], y = predicted.model.1[,2], lwd=3, lty=2, col = "grey")
```

Clearly we are estimating well outside the range of the observed data, including the confidence bands around the best fit line. We can also look at a plot in parameter space (the previous plot was in the observation space). We do so by looking at the confidence ellipse around the estimated parameters (slope and intercept).

In [105]:

```
%%R
confidenceEllipse(regression.1)
# The estimates are highly correlated with one another.
```

This makes it very clear how much the parameter estimates for slope and intercept depend upon each other. If we decrease (by slight changes in the data or estimation procedure) the estimate of the slope, we get a pretty substantial increase in the intercept. So what do we do?

Proper interpretation of the parameters of a model is the most important part of the inferential process (at least once you have a model to fit). This is what allows us to turn a hum drum statistical analysis into Biological inference!!!! However, often the "default" way we code our data and our analyses does not allow for easy interpretation. In this next part we are going to consider some simple things to make it easier to interpret our data.

Not only is the intercept uninterpretable, but the CI's are really big at the
intercept when `tarsus=0`

. The fact that we are trying to estimate a parameter
WAY OUTSIDE the range of our data is causing the highly inflated SE, and the
covariance between parameter estimates. As we discussed in class, one simple
way of dealing this is to center the covariates (in this case, just tarsus length),
to help interpretability of the intercept. This will (as we will see) also get rid
of the correlation between parameter estimates and the inflated standard error
for the intercept. So let's center the data.

In [106]:

```
%%R
dll.data$cent.tarsus <- dll.data$tarsus - mean(dll.data$tarsus)
```

You can also use the `scale(your.x, center=T, scale=F)`

. If you keep
`scale=T`

, then it will also rescale the variable by its standard deviation (which
we do not want at the moment).

In [107]:

```
%%R
par(mfrow=c(2,1))
hist(dll.data$tarsus, breaks=15)
hist(dll.data$cent.tarsus, breaks=15) #just changes the mean
```

The mean changes as we expect.

In [108]:

```
%%R
print(mean(dll.data$tarsus))
```

In [109]:

```
%%R
print(mean(dll.data$cent.tarsus))
```

However the standard deviation does not change.

In [110]:

```
%%R
print(sd(dll.data$tarsus))
```

In [111]:

```
%%R
print(sd(dll.data$cent.tarsus))
```

We now re-run the model with the centered co-variate.

In [112]:

```
%%R
regression.2 <-lm(SCT ~ cent.tarsus, data=dll.data)
par(mfrow=c(2,1))
plot(SCT~cent.tarsus,data=dll.data, cex=0.5, pch=16)
abline(regression.2)
new_tarsus.2 <- data.frame(cent.tarsus = seq(range(dll.data$cent.tarsus)[1], range(dll.data$cent.tarsus)[2], by=0.005) ) # Makes a new data frame
# "predict" values for our new values of tarsus
predicted.model.2 <- predict(regression.2, new_tarsus.2, interval="confidence")
#Then we add these lines back on
lines(x = new_tarsus.2[,1], y = predicted.model.2[,1], lwd=3) # this would be the same as abline(model.1)
lines(x = new_tarsus.2[,1], y = predicted.model.2[,3], lwd=2, lty=2)
lines(x = new_tarsus.2[,1], y = predicted.model.2[,2], lwd=2, lty=2)
confidenceEllipse(regression.2)
```

We can see that we are estimating within the range of our data now. We can also see that the correlation between the parameter estimates is ~ 0.

In [113]:

```
%%R
print(cov2cor(vcov(regression.2)))
```

So how do we interpret the coefficients with the centered data? First o , you

will notice that only the estimate of the intercept and its standard error have

changed for this model.

In [114]:

```
%%R
print(summary(regression.2))
```

The intercept now tells us what the expected number of `SCT`

is when tarsus
is at its mean, i.e. 11.1. You can also see that the standard error associated
with the intercept is much smaller.

We can use the `coefplot()`

in the arm library to take a quick look at our
estimates. The thick error bars represent one standard error, and the thinner
bars are 2$\times$ standard errors.

In [115]:

```
%%R
coefplot(regression.2, intercept=T, vertical=F, ylim=c(0, 35))
```

One thing to keep in mind is what the estimated slope 26.9 is telling us. It is
saying that for every increase by "1" unit of tarsus (of whatever unit tarsus
is measured in), the number of `SCT`

is predicted to increase by 26:9. However,
let us look at the range of tarsus values.

In [116]:

```
%%R
print(quantile(dll.data$tarsus))
```

The range of tarsus lengths is between 0.10 $-$ 0:258, so a change of unit length (i.e. from 0.10 to 1.1) goes far outside the biologically plausible range for this variable.

What can we do to interpret the slope? One thing you can do is just interpret
carefully. Instead of thinking about what happens if we increase tarsus length by
1, we can instead think about what we would predict if tarsus length increased
by 0.1 (we would predict an increase of 26.9/10 = 2.69 `SCT`

for a 0.1 increase in
tarsus). This is much more reasonable, and easier to explain.

In this instance you can think about how the range of values for tarsus length is

In [117]:

```
%%R
print(range(dll.data$tarsus))
```

or

In [118]:

```
%%R
print(max(dll.data$tarsus) - min(dll.data$tarsus))
```

Which is ~ 0.15mm. So we can just calculate 0.15 * 26.9, the slope times the range of values. So over the whole range of values for tarsus we increase SCT by 4.035. Is this a lot? Let's compare it to the standard deviation for the observed sample.

In [119]:

```
%%R
print(sd(dll.data$SCT))
```

It is ~ 2$\times$ greater than the sd for SCT, suggesting that tarsus length is probably a biologically (as well as statistically) significant covariate for accounting for variation for SCT. In combination with $R^2$, I find I am usually able to think about the biological importance of my variables in this manner. We will go through a more formal analysis for effect sizes in the coming weeks. It is also important to remember that there is uncertainty in your point estimate, so it is worth doing this for the confidence intervals.

In [120]:

```
%%R
print(confint(regression.2))
```

So the lower bounds for the estimate is 0.15 $*$ 23.01 ~ 3.5 SCT. The upper bound is 0.15$*$30.8 ~ 4.62 SCT. Neither of these really changes our interpretation of the biological importance of tarsus length as a covariate.

We could also just rescale our tarsus measures from milli-metres to micro-metres.

In [121]:

```
%%R
dll.data$cent.tarsus.microM <- 1000*dll.data$cent.tarsus
regression.3 <- lm(SCT ~ cent.tarsus.microM, data=dll.data )
print(coef(regression.3))
```

This will not change anything other than the estimate for the slope and its SE (both 1000 times smaller). This may make it easier to interpret, as a unit increase in tarsus length (1 microM), increases expected number of SCT by ~ 0.0269.

You could also normalize the data by dividing all values of tarsus by `sd(tarsus)`

.
If you do this on the centered data, it is called unit standardized data.
You can also accomplish this using `scale(x, center=T, scale=T)`

. Note
that `scale=T`

now.

If you normalize/standardize like this then you are no longer thinking in units of the measured variable, but in changes in units of standard deviation. This can be very useful for covariates that are difficult to compare (say nutrient concentration and water temperature). However it does add some complication to interpretation. I should also point out that several people (David Houle being a strong member of this community) suggest that more often scaling by the mean, not the standard deviation makes sense. I am happy to share references if you are interested. This form of re-scaling of data (either mean or sd) is very commonly done when estimating natural selection, so that estimates can be compared across studies.

Let's take a look how it influences our estimates.

In [122]:

```
%%R
dll.data$tarsus.std <- as.numeric(scale(dll.data$tarsus))
regression.4 <- lm(SCT ~ tarsus.std, data=dll.data )
print(summary(regression.4)$coef[,1:2])
```

The estimate for the slope and its SE have changed. What do they mean?
They are now in units of standard deviations of the `tarsus`

. I find it helps to
think about the range of new values.

In [123]:

```
%%R
print(range(dll.data$tarsus.std))
```

This is now in terms of sd of the original variable (tarsus length) so the min value is -4.5 sd from the mean of tarsus length, and max ~ 3.9 sd from the mean. But how much is a sd unit for tarsus?

In [124]:

```
%%R
print(sd(dll.data$tarsus))
```

So a sd "unit" is ~ 0.018 mm of tarsus length. The slope is now interpreted as saying for every increase of 1 sd in tarsus length, SCT number increases by ~ 0.48

In [125]:

```
%%R
plot(jitter(SCT) ~ tarsus.std , data=dll.data )
abline(regression.4)
new_tarsus.4 <- data.frame(tarsus.std = seq(-4.5, 4, by=0.01) ) # Makes a new data frame
# "predict" values for our new values of tarsus
predicted.model.4 <-predict(regression.4, new_tarsus.4 , interval="confidence")
#Then we add these lines back on
lines(x = new_tarsus.4[,1], y = predicted.model.4[,1], lwd=3) # this would be the same as abline(model.1)
lines(x = new_tarsus.4[,1], y = predicted.model.4[,3], lwd=2, lty=2)
lines(x = new_tarsus.4[,1], y = predicted.model.4[,2], lwd=2, lty=2)
```

As a note. We will start to build complex models. Especially once we build
them in a MLE framework, it is useful to understand formula objects in **R**.
`?formula`

provides a useful example of how to do this with `paste()`

, and I will
use that.