0. Statistical power and errors in hypothesis testing
1. Simple linear regression
2. Non-linear relationship: particle decay
3. Non-linear regression
4. Analysis of variance
5. ANCOVA
Visit the app hosted on ShinyIO by UCSD Biostats here to build your intuition about statistical power, sampling distributions, and hypothesis testing.
(End of Section 0. Return to top.)
Load the data and look at it. This is data on plant growth and tannin content.
reg.data<-read.csv('shinyexample/tannin.csv')
reg.data
| | growth | tannin |
|---|---|---|
| 1 | 12 | 0 |
| 2 | 10 | 1 |
| 3 | 8 | 2 |
| 4 | 11 | 3 |
| 5 | 6 | 4 |
| 6 | 7 | 5 |
| 7 | 2 | 6 |
| 8 | 3 | 7 |
| 9 | 3 | 8 |
Plot the data
plot(reg.data$tannin,reg.data$growth,pch=21,bg="blue",asp=1,xlab="tannin content",ylab="growth")
Make a simple linear model where growth is a linear function of tannin.
R syntax for models looks like response~explanatory. lm() is the function that fits a linear model. In this case, the model is a regression line: $growth_i = a + b \cdot tannin_i$, where $a$ is the intercept, $b$ is the slope, and the subscript $i$ indexes one individual in the dataset.
reg.model=lm(growth~tannin,data=reg.data)
reg.model
Call:
lm(formula = growth ~ tannin, data = reg.data)

Coefficients:
(Intercept)       tannin
     11.756       -1.217
Plot this line
plot(reg.data$tannin,reg.data$growth,pch=21,bg="blue",asp=1,xlab="tannin content",ylab="growth")
abline(lm(growth~tannin,data=reg.data),col="cyan",lwd=2)
Note that the line does not go through all the points. But it does capture the gist of the relationship between the two variables.
Store the fitted values
In R, "<-" and "=" both perform assignment (they behave the same at the top level, and "<-" is the conventional choice). The arrow is literally saying "take this and store it in there." In this case, take the model's predictions (the fitted values) and store them in the R variable named fitted.
fitted <- predict(lm(growth~tannin,data=reg.data))
Plot the error (residuals)
plot(reg.data$tannin,reg.data$growth,pch=21,bg="blue",asp=1,xlab="tannin content",ylab="growth")
abline(lm(growth~tannin,data=reg.data),col="cyan",lwd=2)
for (i in 1:9) {
lines(c(reg.data$tannin[i],reg.data$tannin[i]),c(reg.data$growth[i],fitted[i]),col="magenta")
}
Residuals show the difference between the prediction (the line here) and the actual values. Notice that the residuals are vertical. This is because we are trying to predict the response variable (on the y-axis) given that we know the values of the explanatory variable (on the x-axis).
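As a quick check, the residuals stored in the model object are exactly these vertical differences between observed and fitted values (a minimal sketch reusing reg.model and fitted from above):
all.equal(unname(resid(reg.model)), unname(reg.data$growth - fitted))   # should be TRUE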
Plot squared residuals
Why is this line the "best" line? It is best in a very specific sense: it minimizes the sum of the squared residuals. This is called a least squares fit. In other words, the regression line minimizes the quantity:
$$\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$
where $y_i$ is the actual response value and $\hat{y}_i$ is the predicted value.
Least squares fitting is a very common criterion for fitting a model to data.
plot(reg.data$tannin,reg.data$growth,pch=21,bg="blue",asp=1,xlab="tannin content",ylab="growth")
abline(lm(growth~tannin,data=reg.data),col="cyan",lwd=2)
for (i in 1:9) {
lines(c(reg.data$tannin[i],reg.data$tannin[i]),c(reg.data$growth[i],fitted[i]),col="magenta")
res=abs(reg.data$growth[i]-fitted[i])
if (reg.data$growth[i]<fitted[i]) {#then square goes to left
lines(c(reg.data$tannin[i],reg.data$tannin[i]-res,reg.data$tannin[i]-res,reg.data$tannin[i]),c(reg.data$growth[i],reg.data$growth[i],fitted[i],fitted[i]),col="magenta")
}
else {#then square goes to right
lines(c(reg.data$tannin[i],reg.data$tannin[i]+res,reg.data$tannin[i]+res,reg.data$tannin[i]),c(reg.data$growth[i],reg.data$growth[i],fitted[i],fitted[i]),col="magenta")
}
}
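As a numeric check on the least squares criterion, here is a minimal sketch (reusing reg.model and fitted from above) that computes the sum of squared residuals by hand and compares it with what R extracts from the model object:
sse.byhand <- sum((reg.data$growth - fitted)^2)   # sum of squared residuals by hand
sse.byhand
sum(resid(reg.model)^2)   # same quantity from the stored residuals
deviance(reg.model)       # for an lm, deviance() is also the residual sum of squares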
Estimating the line
This code shows the sum of squared residuals for a range of candidate slopes (for each trial slope b, the intercept a is set so the line passes through the point of means, which is the best intercept for that slope). You can see it is a quadratic function, concave upwards, which means it has a single minimum. (If you were to use summed absolute residuals instead of summed squared residuals, you would not necessarily have a unique minimum.)
b <- seq(-1.43,-1,0.002)
sse <- numeric(length(b))
for (i in 1:length(b)) {
a <- mean(reg.data$growth)-b[i]*mean(reg.data$tannin)
residual <- reg.data$growth - a - b[i]*reg.data$tannin
sse[i] <- sum(residual^2)
}
mlval=b[which(sse==min(sse))]
plot(b,sse,type="l",ylim=c(19,24))
arrows(mlval,min(sse),mlval,min(sse)-1,col="magenta")
abline(h=min(sse),col="cyan",lty=2)
lines(b,sse)
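The grid search is for intuition only; the least-squares slope also has a closed form, $b = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$, which is what lm() computes. A short sketch comparing the closed form, lm(), and the grid minimum:
b.hat <- cov(reg.data$tannin,reg.data$growth)/var(reg.data$tannin)   # closed-form slope
a.hat <- mean(reg.data$growth)-b.hat*mean(reg.data$tannin)           # matching intercept
c(intercept=a.hat,slope=b.hat)
coef(reg.model)   # should agree with lm()
mlval             # and lie near the grid-search minimum (up to the 0.002 grid spacing)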
Sums of squares
The response value $y_i$ can be broken into three parts: the overall mean response ($\bar{y}$), the part explained by the model ($\hat{y}_i-\bar{y}$), and the residual, or unexplained, part ($y_i-\hat{y}_i$): $$y_i = \bar{y} + (\hat{y}_i-\bar{y}) + (y_i-\hat{y}_i)$$
plot(c(0,10),c(0,10),xlab="",ylab="",type="n")
abline(h=5,lty=2)
lines(c(0,10),c(8,2))
text(2,6.2,expression(hat(y) - bar(y)))
text(2,8.45,expression(y - hat(y)))
arrows(7,5,7,9.5,code=3,length=0.1)
arrows(1,5,1,7.4,code=3,length=0.1)
arrows(1,9.5,1,7.4,code=3,length=0.1)
points(1,9.5,pch=16)
text(8,7.4,expression(y - bar(y)))
text(0.2,5,expression(bar(y)))
text(.2,7.4,expression(hat(y)))
text(.2,9.5,"y")
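A minimal numeric check of this decomposition for the tannin regression (reusing fitted from above): the total sum of squares splits exactly into the explained and residual pieces.
y <- reg.data$growth
SSY <- sum((y-mean(y))^2)        # total sum of squares
SSR <- sum((fitted-mean(y))^2)   # part explained by the model
SSE <- sum((y-fitted)^2)         # residual (unexplained) part
c(SSY=SSY, SSR.plus.SSE=SSR+SSE) # the two should be equal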
Regression model
R can do hypothesis tests on the slope and intercept to see if they are significantly different from zero. What would it mean if the slope were zero?
One important number to notice in the output below is $R^2$, the coefficient of determination. This is the fraction of the total variation in the response variable that is explained by variation in the explanatory variable, and it ranges between 0 and 1. If it is 0, variation in the explanatory variable explains none of the variation in the response variable - not a very good model. If it is 1, variation in the response variable is perfectly explained by variation in the explanatory variable - a great model!
model <- lm(growth~tannin,data=reg.data)
summary(model)
#summary.aov(model)
Call:
lm(formula = growth ~ tannin, data = reg.data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4556 -0.8889 -0.2389  0.9778  2.8944

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
tannin       -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,	Adjusted R-squared:  0.7893
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461
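To connect this output to the sums of squares above: $R^2 = SSR/SSY = 1 - SSE/SSY$. A one-line check against the Multiple R-squared reported by summary(), reusing SSY, SSR, and SSE from the decomposition sketch:
c(SSR/SSY, 1-SSE/SSY, summary(model)$r.squared)   # all three should match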
decay<-read.csv('shinyexample/decay.csv')
decay
Plot the data
plot(decay$time,decay$amount,pch=21,col="blue",bg="cyan")
Linear fit
linmodel=lm(amount~time,data=decay)
plot(decay$time,decay$amount,pch=21,col="blue",bg="cyan")
abline(linmodel,col="magenta")
summary(linmodel)
The linear fit is okay, but the data themselves do not appear to follow a straight line.
Transform data to make it more linear
Instead of using the variable itself, work with a function of the variable (in this case, the log of the response variable). If the decay is exponential, $amount = A e^{-k \cdot time}$, then $\log(amount) = \log(A) - k \cdot time$, which is a straight line in time with intercept $\log(A)$ and slope $-k$.
plot(decay$time,log(decay$amount),pch=21,col="blue",bg="magenta")
abline(lm(log(amount)~time,data=decay),col="blue")
model <- lm(log(amount)~time,data=decay)
summary(model)
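To see why this works, here is a minimal simulation sketch; the true values A and k are made up for illustration. Data generated from an exponential decay with multiplicative noise fall on a straight line after the log transform, and the log-linear fit recovers the parameters.
set.seed(1)            # for reproducibility
A <- 100; k <- 0.1     # hypothetical true parameters
tvals <- 0:30
y.sim <- A*exp(-k*tvals)*exp(rnorm(length(tvals),sd=0.1))   # decay with multiplicative noise
sim <- lm(log(y.sim)~tvals)
exp(coef(sim)[1])   # estimate of A (back-transformed intercept)
-coef(sim)[2]       # estimate of k (negated slope)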
Plot the model against the raw data
plot(decay$time,decay$amount,pch=21,col="blue",bg="cyan")
xv <- seq(0,30,0.25)
yv <- exp(model$coefficients[1]) * exp(model$coefficients[2] * xv)   # back-transform: amount = exp(intercept)*exp(slope*time)
lines(xv,yv,col="magenta")
The non-linear model looks like it fits better, but how can you show this quantitatively? There are various statistics that can be used to compare models. Common ones are likelihood ratio tests (for nested models) and information criteria (AIC, BIC).
AIC(linmodel,model)
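One caveat: because model was fit to log(amount), its likelihood (and hence its AIC) is on the log scale and is not directly comparable to that of linmodel without a correction for the transformation. A rough sketch comparing the two fits on the original scale by residual sum of squares instead (ignoring the back-transformation bias):
rss.linear <- sum((decay$amount-fitted(linmodel))^2)   # RSS of the straight-line fit
pred.exp <- exp(fitted(model))                         # back-transformed predictions
rss.exp <- sum((decay$amount-pred.exp)^2)              # RSS of the exponential fit
c(linear=rss.linear, exponential=rss.exp)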
deer<-read.csv('shinyexample/jaws.csv')
deer
Plot the data
plot(deer$age,deer$bone,pch=21,bg="lightgrey")
Sometimes it is not possible to transform the data so that a linear model fits nicely. In that case you can fit a non-linear model directly. In this example, we compare two possible non-linear models: $$(1)\ \ bone_i=a-b\,e^{-c \cdot age_i}$$ $$(2)\ \ bone_i=a-a\,e^{-c \cdot age_i}$$ The first has three parameters (a, b, c). The second has only two (a, c): it is the special case $b=a$, which forces $bone=0$ at $age=0$.
model <- nls(bone~a-b*exp(-c*age),data=deer,start=list(a=120,b=110,c=0.064))
summary(model)
av <- seq(0,50,0.1)
bv <- predict(model,list(age=av))
plot(deer$age,deer$bone,pch=21,bg="lightgrey")
lines(av,bv,col="blue")
model2 <- nls(bone~a*(1-exp(-c*age)),data=deer,start=list(a=120,c=0.064))
av2 <- seq(0,50,0.1)
bv2 <- predict(model2,list(age=av2))
plot(deer$age,deer$bone,pch=21,bg="lightgrey")
lines(av,bv,col="blue")
lines(av2,bv2,col="magenta")
summary(model2)
AIC(model,model2)
#anova(model,model2)   # F-test for the nested pair (model 2 is model 1 with b=a); an alternative to AIC
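As an aside, R provides self-starting versions of common non-linear models, which avoid guessing starting values by hand. A sketch refitting model (2) with SSasympOrig, which parameterizes $a(1-e^{-c \cdot age})$ as Asym*(1-exp(-exp(lrc)*age)):
model2ss <- nls(bone~SSasympOrig(age,Asym,lrc),data=deer)   # self-starting fit of model (2)
coef(model2ss)
exp(coef(model2ss)["lrc"])   # recover c on the original scale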
oneway<-read.csv('shinyexample/oneway.csv')
oneway
| | ozone | garden |
|---|---|---|
| 1 | 3 | A |
| 2 | 5 | B |
| 3 | 4 | A |
| 4 | 5 | B |
| 5 | 4 | A |
| 6 | 6 | B |
| 7 | 3 | A |
| 8 | 7 | B |
| 9 | 2 | A |
| 10 | 4 | B |
| 11 | 3 | A |
| 12 | 4 | B |
| 13 | 1 | A |
| 14 | 3 | B |
| 15 | 3 | A |
| 16 | 5 | B |
| 17 | 5 | A |
| 18 | 6 | B |
| 19 | 2 | A |
| 20 | 5 | B |
Plot the data
plot(1:20,oneway$ozone,ylim=c(0,8),ylab="ozone",xlab="individual datapoints",pch=21,bg="magenta")
abline(h=mean(oneway$ozone),col="blue")
for(i in 1:20) lines(c(i,i),c(mean(oneway$ozone),oneway$ozone[i]),col="cyan")
What if your explanatory variable is categorical? Think about the situation where you measure some quantitative response variable on different groups and want to know how group membership affects the response variable. These groups could be treatments (drug, no drug) or plots (fertilized, watered, shaded, etc.) or whatever. We want to know what effect each treatment has. We express those as deviations from the mean. In the plot above, the overall mean is the blue line and the deviations are marked in cyan.
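A small sketch computing those quantities directly: the grand mean, the per-garden means, and each garden's deviation from the grand mean (the treatment effects).
grand <- mean(oneway$ozone)                              # overall (grand) mean
group.means <- tapply(oneway$ozone,oneway$garden,mean)   # mean ozone per garden
grand
group.means
group.means - grand   # treatment effects: deviations from the grand mean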
plot(1:20,oneway$ozone,ylim=c(0,8),xlim=c(1,21),ylab="ozone",xlab="individual datapoints",
pch=21,bg=as.numeric(factor(oneway$garden)))   # factor() so the colors work even if garden is read as character
abline(h=mean(oneway$ozone[oneway$garden=="A"]))
abline(h=mean(oneway$ozone[oneway$garden=="B"]),col="red")
index <- 1:length(oneway$ozone)
for (i in 1:length(index)){
if (oneway$garden[i] == "A" )
lines(c(index[i],index[i]),c(mean(oneway$ozone[oneway$garden=="A"]),oneway$ozone[i]))
else
lines(c(index[i],index[i]),c(mean(oneway$ozone[oneway$garden=="B"]),oneway$ozone[i]), col="red")
}
arrows(21,mean(oneway$ozone),21,mean(oneway$ozone[oneway$garden=="A"]),col="black",length=.1)
arrows(21,mean(oneway$ozone),21,mean(oneway$ozone[oneway$garden=="B"]),col="red",length=.1)
points(21,mean(oneway$ozone),pch=19,col="blue")
The red line in the plot is the mean ozone for the red group (B); the black line is the mean ozone for the black group (A). Notice that the residuals are smaller: by breaking the data into groups, we have reduced the unexplained variation in ozone levels. The arrows show the deviations of the group means from the overall mean and represent the effect of being in the red or black group.
summary(aov(ozone~garden,data=oneway))
            Df Sum Sq Mean Sq F value  Pr(>F)
garden       1     20  20.000      15 0.00111 **
Residuals   18     24   1.333
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
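Every number in this table can be reproduced by hand, which is a useful sanity check (this reuses grand and group.means from the sketch above):
SSA <- sum(table(oneway$garden)*(group.means-grand)^2)    # between-group sum of squares
SSE <- sum((oneway$ozone-group.means[oneway$garden])^2)   # within-group sum of squares
Fval <- (SSA/1)/(SSE/18)   # ratio of mean squares, with 1 and 18 degrees of freedom
c(SSA=SSA, SSE=SSE, F=Fval)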
ancdata<-read.csv('shinyexample/ipomopsis.csv')
ancdata
| | Root | Fruit | Grazing |
|---|---|---|---|
| 1 | 6.225 | 59.77 | Ungrazed |
| 2 | 6.487 | 60.98 | Ungrazed |
| 3 | 4.919 | 14.73 | Ungrazed |
| 4 | 5.13 | 19.28 | Ungrazed |
| 5 | 5.417 | 34.25 | Ungrazed |
| 6 | 5.359 | 35.53 | Ungrazed |
| 7 | 7.614 | 87.73 | Ungrazed |
| 8 | 6.352 | 63.21 | Ungrazed |
| 9 | 4.975 | 24.25 | Ungrazed |
| 10 | 6.93 | 64.34 | Ungrazed |
| 11 | 6.248 | 52.92 | Ungrazed |
| 12 | 5.451 | 32.35 | Ungrazed |
| 13 | 6.013 | 53.61 | Ungrazed |
| 14 | 5.928 | 54.86 | Ungrazed |
| 15 | 6.264 | 64.81 | Ungrazed |
| 16 | 7.181 | 73.24 | Ungrazed |
| 17 | 7.001 | 80.64 | Ungrazed |
| 18 | 4.426 | 18.89 | Ungrazed |
| 19 | 7.302 | 75.49 | Ungrazed |
| 20 | 5.836 | 46.73 | Ungrazed |
| 21 | 10.253 | 116.05 | Grazed |
| 22 | 6.958 | 38.94 | Grazed |
| 23 | 8.001 | 60.77 | Grazed |
| 24 | 9.039 | 84.37 | Grazed |
| 25 | 8.91 | 70.11 | Grazed |
| 26 | 6.106 | 14.95 | Grazed |
| 27 | 7.691 | 70.7 | Grazed |
| 28 | 8.988 | 80.31 | Grazed |
| 29 | 8.975 | 82.35 | Grazed |
| 30 | 9.844 | 105.07 | Grazed |
| 31 | 8.508 | 73.79 | Grazed |
| 32 | 7.354 | 50.08 | Grazed |
| 33 | 8.643 | 78.28 | Grazed |
| 34 | 7.916 | 41.48 | Grazed |
| 35 | 9.351 | 98.47 | Grazed |
| 36 | 7.066 | 40.15 | Grazed |
| 37 | 8.158 | 52.26 | Grazed |
| 38 | 7.382 | 46.64 | Grazed |
| 39 | 8.515 | 71.01 | Grazed |
| 40 | 8.53 | 83.03 | Grazed |
Plot the data
plot(ancdata$Root,ancdata$Fruit,pch=16,col="blue",xlab='Root size',ylab='Fruit')
If you look at the data, you see that there is another variable. In fact, there are two explanatory variables (one quantitative, one categorical) and one response variable (quantitative) in this example.
Another view of the data
plot(factor(ancdata$Grazing),ancdata$Fruit,col="lightgreen",ylab="Fruit")   # factor() makes this a boxplot by grazing group
From this plot (which does not include root size) it looks like grazing increases fruit yield.
Model: fruit depends on grazing status:
justGrazingModel=lm(Fruit~Grazing,data=ancdata)
summary(justGrazingModel)
Call:
lm(formula = Fruit ~ Grazing, data = ancdata)

Residuals:
    Min      1Q  Median      3Q     Max
-52.991 -18.028   2.915  14.050  48.110

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       67.941      5.236  12.976 1.54e-15 ***
GrazingUngrazed  -17.060      7.404  -2.304   0.0268 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.41 on 38 degrees of freedom
Multiple R-squared:  0.1226,	Adjusted R-squared:  0.09949
F-statistic: 5.309 on 1 and 38 DF,  p-value: 0.02678
What about root size? The formula Fruit~Root*Grazing fits main effects for Root and Grazing plus their interaction (it is shorthand for Root+Grazing+Root:Grazing), so the slope of Fruit on Root is allowed to differ between grazing groups.
fullmodel <- lm(Fruit~Root*Grazing,data=ancdata)
summary.aov(fullmodel)
AIC(justGrazingModel,fullmodel)
Model without interaction
In this model, fruit yield depends on both root size and grazing group, but the effect of root size on fruit is the same (a common slope) regardless of grazing group.
noInteractionModel=lm(Fruit~Grazing+Root,data=ancdata)
anova(noInteractionModel,fullmodel)
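The comparison above suggests the interaction is not needed. The key ANCOVA point is that once root size is held fixed, the sign of the grazing effect reverses: ungrazed plants produce more fruit than grazed plants of the same root size (the GrazingUngrazed coefficient is +36.103), even though the earlier boxplot suggested the opposite. A short sketch predicting fruit for both groups at the same root size (the value 7 is arbitrary, chosen for illustration):
newdat <- data.frame(Root=7, Grazing=c("Grazed","Ungrazed"))   # hypothetical common root size
predict(noInteractionModel, newdata=newdat)   # the ungrazed prediction is higher at equal root size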
plot(ancdata$Root,ancdata$Fruit,pch=21,bg=4*as.numeric(factor(ancdata$Grazing)))   # factor(): Grazed points blue (4), Ungrazed grey (8)
legend("topleft",c("grazed","ungrazed"),col=c(4,8),pch=16)
cf <- coef(noInteractionModel)    # (Intercept) -127.829, GrazingUngrazed 36.103, Root 23.56
abline(cf[1],cf[3],col=4)         # grazed line
abline(cf[1]+cf[2],cf[3],col=8)   # ungrazed line: same slope, intercept shifted up by 36.103
(End of Section 5. End of Module 2. Return to top.)