In these notes we are using a lot of Tidyverse functions. However, they are not essential for this homework, and merely reflect a stylistic preference.
We read in the data using scripts we have used before. We also load the tidyverse package.
# downloading data file from Dryad
dryadFileDownload <- function(filenum,filename,baseurl="https://datadryad.org/api/v2")
{
download.file(paste(baseurl,"/files/",filenum,"/download",sep=""),filename,mode="wb")
}
# load tidyverse
library(tidyverse)
# read in the dataset and show the top few lines
vernFileID <- "22636"
tmpfile <- tempfile(fileext="txt")
dryadFileDownload(vernFileID,tmpfile)
vern <- read.table(tmpfile)
head(vern)
── Attaching core tidyverse packages ────────────────────────────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.4.4 ✔ tibble 3.2.1 ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ✔ purrr 1.0.2 ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Genotype | Background | Background.simple | Treatment | Treatment.V | Chamber.ID | Chamber.Irradiance | Vernalization | Daylength | Temperature | ⋯ | Survival.Bolt | Bolt | Days.to.Bolt | Days.to.Flower | Rosette.leaf.num | Cauline.leaf.num | Blade.length.mm | Total.leaf.length.mm | Blade.ratio | Notes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | ⋯ | <chr> | <chr> | <int> | <int> | <int> | <int> | <dbl> | <dbl> | <dbl> | <chr> | |
1 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 7B | Low | NV | 16 | 12 | ⋯ | Y | Y | 35 | 48 | 18 | 7 | 15.6 | 29.7 | 0.5252525 | |
2 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 7B | Low | NV | 16 | 12 | ⋯ | Y | Y | 35 | 48 | 17 | 5 | 16.4 | 32.9 | 0.4984802 | |
3 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 4T | High | NV | 16 | 12 | ⋯ | Y | Y | 36 | 48 | 22 | 6 | 10.6 | 20.4 | 0.5196078 | |
4 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 4T | High | NV | 16 | 12 | ⋯ | Y | Y | 36 | 47 | 18 | 5 | 15.9 | 25.6 | 0.6210938 | |
5 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 4T | High | NV | 16 | 12 | ⋯ | Y | Y | 38 | 50 | 24 | 8 | 14.4 | 26.4 | 0.5454545 | |
6 | agl24-1 | Col | Col | 12ConLD | 12ConLDNV | 7B | Low | NV | 16 | 12 | ⋯ | Y | Y | 39 | 53 | 20 | 7 | 15.9 | 29.7 | 0.5353535 |
We select the data for the hua2-3
and hua2-3 FRI
genotypes.
# select the three columns of interest: genotype, vernalization, and phenotype
hua <- select(vern,Genotype,Vernalization,Cauline.leaf.num)
summary(hua)
Genotype Vernalization Cauline.leaf.num Length:3197 Length:3197 Min. : 1.000 Class :character Class :character 1st Qu.: 4.000 Mode :character Mode :character Median : 7.000 Mean : 6.474 3rd Qu.: 9.000 Max. :29.000 NA's :120
# filter only observations with `hua2-3` and `hua2-3 FRI` genotypes.
hua <- filter(hua, (Genotype == "hua2-3") | (Genotype == "hua2-3 FRI"))
summary(hua)
Genotype Vernalization Cauline.leaf.num Length:270 Length:270 Min. : 1.000 Class :character Class :character 1st Qu.: 3.000 Mode :character Mode :character Median : 5.000 Mean : 5.801 3rd Qu.: 8.000 Max. :16.000 NA's :3
We update factor levels after filtering the data frame, so that the levels we have dropped do not appear in the data.
# update factor levels
hua <- droplevels(hua)
summary(hua)
Genotype Vernalization Cauline.leaf.num Length:270 Length:270 Min. : 1.000 Class :character Class :character 1st Qu.: 3.000 Mode :character Mode :character Median : 5.000 Mean : 5.801 3rd Qu.: 8.000 Max. :16.000 NA's :3
Calculate the mean cauline leaf number for both genotypes in both vernalization conditions. We also calculate the sample sizes (number of non-missing observations) in each of the four categories. Note that they are not exactly balanced.
# calculate means grouping by genotype and vernalization
huameans <- group_by(hua, Genotype, Vernalization) %>%
summarise(meancauline = mean(Cauline.leaf.num, na.rm = T),
n = sum(!is.na(Cauline.leaf.num)))
huameans
`summarise()` has grouped output by 'Genotype'. You can override using the `.groups` argument.
Genotype | Vernalization | meancauline | n |
---|---|---|---|
<chr> | <chr> | <dbl> | <int> |
hua2-3 | NV | 5.720588 | 68 |
hua2-3 | V | 4.761194 | 67 |
hua2-3 FRI | NV | 6.453125 | 64 |
hua2-3 FRI | V | 6.294118 | 68 |
6.294118 - 6.453125 # vern effect in FRI
4.761194 - 5.720588 # vern effect in hua2-3
-0.159007 - (-0.959394000000001)
6.453125 - 5.720588
The main effect is defined as the difference in the means of the two genotype categories. In this case, since we have genotype means calculated for two vernalization conditions, we have to use a weighted mean (weighted by the number of observations).
# group the previous dataframe by genotype
# calculate a weighted mean since the two vernalization groups
# have difference sample sizes
huagenomeans <- group_by(huameans,Genotype) %>%
summarize(meancauline = weighted.mean(meancauline, w = n ))
huagenomeans
Genotype | meancauline |
---|---|
<chr> | <dbl> |
hua2-3 | 5.244444 |
hua2-3 FRI | 6.371212 |
# the main effect is the difference
huagenomeans$meancauline |> diff() |> round(4)
We could have used the direct definition, as the difference in genotype means as follows. In that case we don't have to use a weighted mean, as we are ignoring vernalization altogether.
# the main effect calulated from the hua data
group_by(hua,Genotype) %>%
summarize(meancauline = mean(Cauline.leaf.num,na.rm=T))
Genotype | meancauline |
---|---|
<chr> | <dbl> |
hua2-3 | 5.244444 |
hua2-3 FRI | 6.371212 |
We use the the definition of an interaction effect. The interaction term is the difference in the main effect of vernalization calculated in two genotype categories.
# interaction term is the difference in the vernalization main effects stratified by genotype
# main effect of vernalization in FRI
( (huameans$meancauline[4] - huameans$meancauline[3]) -
# main effect of vernalization in hua2-3
(huameans$meancauline[2] - huameans$meancauline[1]) ) |>
round(4) # round to 4 digits
lm
¶Note that the main effect is the same as what we calculated above.
# main effect of genotype
lm(Cauline.leaf.num ~ Genotype, data = hua) |> summary()
Call: lm(formula = Cauline.leaf.num ~ Genotype, data = hua) Residuals: Min 1Q Median 3Q Max -5.3712 -2.3712 -0.3712 2.6288 9.6288 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.2444 0.2564 20.45 < 2e-16 *** Genotypehua2-3 FRI 1.1268 0.3646 3.09 0.00221 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.979 on 265 degrees of freedom (3 observations deleted due to missingness) Multiple R-squared: 0.03478, Adjusted R-squared: 0.03114 F-statistic: 9.548 on 1 and 265 DF, p-value: 0.002215
# main effect of vernalization
lm(Cauline.leaf.num ~ Vernalization, data = hua) |> summary()
Call: lm(formula = Cauline.leaf.num ~ Vernalization, data = hua) Residuals: Min 1Q Median 3Q Max -4.5333 -3.0758 -0.5333 2.4667 9.9242 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.0758 0.2629 23.115 <2e-16 *** VernalizationV -0.5424 0.3697 -1.467 0.143 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.02 on 265 degrees of freedom (3 observations deleted due to missingness) Multiple R-squared: 0.00806, Adjusted R-squared: 0.004316 F-statistic: 2.153 on 1 and 265 DF, p-value: 0.1435
The same is true for the interaction term.
# interaction effect of genotype and vernalization
lm(Cauline.leaf.num ~ Genotype * Vernalization, data = hua) |> summary()
Call: lm(formula = Cauline.leaf.num ~ Genotype * Vernalization, data = hua) Residuals: Min 1Q Median 3Q Max -5.2941 -2.7206 -0.2941 2.5469 9.5469 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.7206 0.3602 15.883 <2e-16 *** Genotypehua2-3 FRI 0.7325 0.5172 1.416 0.1579 VernalizationV -0.9594 0.5112 -1.877 0.0617 . Genotypehua2-3 FRI:VernalizationV 0.8004 0.7273 1.101 0.2721 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.97 on 263 degrees of freedom (3 observations deleted due to missingness) Multiple R-squared: 0.04787, Adjusted R-squared: 0.03701 F-statistic: 4.408 on 3 and 263 DF, p-value: 0.004801
Notice how the genotype and vernalization effects change when the interaction term is included.
lmabc
¶The lmabc
package uses abundance weighted contrasts, i.e. a difference between the category mean from the overall (weighted) mean. This has the property that the value of the main effects does not change much, whether or not the interaction term is included.
library(lmabc)
lmabc(Cauline.leaf.num ~ Genotype, data = hua) |> summary()
Call: lmabc(formula = Cauline.leaf.num ~ Genotype, data = hua) Residuals: Min 1Q Median 3Q Max -5.3712 -2.3712 -0.3712 2.6288 9.6288 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.8015 0.1823 31.82 < 2e-16 *** Genotypehua2-3 -0.5571 0.1803 -3.09 0.00221 ** Genotypehua2-3 FRI 0.5697 0.1844 3.09 0.00221 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.979 on 265 degrees of freedom Multiple R-squared: 0.03478, Adjusted R-squared: 0.03114 F-statistic: 9.548 on 1 and 265 DF, p-value: 0.002215
lmabc(Cauline.leaf.num ~ Vernalization, data = hua) |> summary()
Call: lmabc(formula = Cauline.leaf.num ~ Vernalization, data = hua) Residuals: Min 1Q Median 3Q Max -4.5333 -3.0758 -0.5333 2.4667 9.9242 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.8015 0.1848 31.390 <2e-16 *** VernalizationNV 0.2743 0.1869 1.467 0.143 VernalizationV -0.2682 0.1828 -1.467 0.143 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.02 on 265 degrees of freedom Multiple R-squared: 0.00806, Adjusted R-squared: 0.004316 F-statistic: 2.153 on 1 and 265 DF, p-value: 0.1435
lmabc(Cauline.leaf.num ~ Genotype*Vernalization, data = hua) |> summary()
Call: lmabc(formula = Cauline.leaf.num ~ Genotype * Vernalization, data = hua) Residuals: Min 1Q Median 3Q Max -5.2941 -2.7206 -0.2941 2.5469 9.5469 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.8015 0.1818 31.919 < 2e-16 *** Genotypehua2-3 -0.5623 0.1798 -3.128 0.00196 ** Genotypehua2-3 FRI 0.5751 0.1838 3.128 0.00196 ** VernalizationNV 0.2851 0.1838 1.551 0.12216 VernalizationV -0.2788 0.1798 -1.551 0.12216 Genotypehua2-3:VernalizationNV 0.1963 0.1784 1.101 0.27210 Genotypehua2-3 FRI:VernalizationNV -0.2086 0.1895 -1.101 0.27210 Genotypehua2-3:VernalizationV -0.1992 0.1810 -1.101 0.27210 Genotypehua2-3 FRI:VernalizationV 0.1963 0.1784 1.101 0.27210 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.97 on 263 degrees of freedom Multiple R-squared: 0.04787, Adjusted R-squared: 0.03701 F-statistic: 4.408 on 3 and 263 DF, p-value: 0.004801
We select all six genotypes for this exercise, and the cauline leaf number trait. We check the numbers in each group which are approximately the same.
# filter by the six genotypes
vern1 <- filter(vern,(Genotype == "hua2-3") | (Genotype == "hua2-3 FRI")|
(Genotype == "vin3-4") | (Genotype == "vin3-4 FRI")|
(Genotype == "Col Ama")| (Genotype == "Col FRI")) %>% droplevels()
vern1$Genotype <- factor(vern1$Genotype)
# get counts by genotype
group_by(vern1,Genotype) %>% summarise(n=n())
Genotype | n |
---|---|
<fct> | <int> |
Col Ama | 135 |
Col FRI | 128 |
hua2-3 | 136 |
hua2-3 FRI | 134 |
vin3-4 | 135 |
vin3-4 FRI | 121 |
We calculate the mean trait for all six genotypes when the plants vere not vernalized. First, we select the non-vernalized plants, and then calculate means for each genotype group.
# filter by not vernalized
# group by genotype
# get mean cauline leaf number (make sure that NA's are removed)
cmeans <- filter(vern1, Vernalization == "NV") %>%
group_by(Genotype) %>%
summarise(meancauline = mean(Cauline.leaf.num,na.rm=T),
n = sum(!is.na(Cauline.leaf.num)))
cmeans
Genotype | meancauline | n |
---|---|---|
<fct> | <dbl> | <int> |
Col Ama | 5.803030 | 66 |
Col FRI | 8.176471 | 51 |
hua2-3 | 5.720588 | 68 |
hua2-3 FRI | 6.453125 | 64 |
vin3-4 | 5.878788 | 66 |
vin3-4 FRI | 7.729167 | 48 |
For this experiment (with six genotypes), there is no natural reference genotype, so we use the mean of all genotypes mean as the reference. Then we subtract each genotype mean from that number.
# main effect with the sum contrast is the genotype mean minus mean of all genotype means
dfMainEffect <- mutate(cmeans, maineffect = meancauline-mean(meancauline))
dfMainEffect
Genotype | meancauline | n | maineffect |
---|---|---|---|
<fct> | <dbl> | <int> | <dbl> |
Col Ama | 5.803030 | 66 | -0.8238311 |
Col FRI | 8.176471 | 51 | 1.5496091 |
hua2-3 | 5.720588 | 68 | -0.9062732 |
hua2-3 FRI | 6.453125 | 64 | -0.1737364 |
vin3-4 | 5.878788 | 66 | -0.7480736 |
vin3-4 FRI | 7.729167 | 48 | 1.1023052 |
levels(as.factor(vern1$Genotype))
Note that the main effects sum to zero.
# the sum of the main effects is zero
mutate(cmeans,maineffect = meancauline - mean(meancauline)) %>% summarise(sum = sum(maineffect)) %>% round(5)
sum |
---|
<dbl> |
0 |
lm
¶We change our options to use contr.sum
as the contrasts function. Then we use lm
to calculate the main effect of each genotype. The values agree with what we get above.
options()$contrasts
# change contrast option
options(contrasts=c("contr.sum","contr.poly"))
# use lm to get main effects
(out1 <- lm(Cauline.leaf.num ~ Genotype, data = vern1[vern1$Vernalization=="NV",])) %>% summary
Call: lm(formula = Cauline.leaf.num ~ Genotype, data = vern1[vern1$Vernalization == "NV", ]) Residuals: Min 1Q Median 3Q Max -5.1765 -2.7206 -0.1765 2.2708 9.5469 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.6269 0.1458 45.444 < 2e-16 *** Genotype1 -0.8238 0.3127 -2.635 0.00879 ** Genotype2 1.5496 0.3468 4.468 1.06e-05 *** Genotype3 -0.9063 0.3091 -2.932 0.00358 ** Genotype4 -0.1737 0.3165 -0.549 0.58337 Genotype5 -0.7481 0.3127 -2.392 0.01725 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.752 on 357 degrees of freedom (25 observations deleted due to missingness) Multiple R-squared: 0.1043, Adjusted R-squared: 0.09175 F-statistic: 8.314 on 5 and 357 DF, p-value: 1.898e-07
lmabc(Cauline.leaf.num ~ Genotype, data = vern1[vern1$Vernalization=="NV",]) |> summary()
Call: lmabc(formula = Cauline.leaf.num ~ Genotype, data = vern1[vern1$Vernalization == "NV", ]) Residuals: Min 1Q Median 3Q Max -5.1765 -2.7206 -0.1765 2.2708 9.5469 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.50413 0.14445 45.028 < 2e-16 *** GenotypeCol Ama -0.70110 0.30642 -2.288 0.02272 * GenotypeCol FRI 1.67234 0.35727 4.681 4.06e-06 *** Genotypehua2-3 -0.78354 0.30086 -2.604 0.00959 ** Genotypehua2-3 FRI -0.05101 0.31222 -0.163 0.87032 Genotypevin3-4 -0.62534 0.30642 -2.041 0.04200 * Genotypevin3-4 FRI 1.22503 0.37004 3.311 0.00103 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.752 on 357 degrees of freedom Multiple R-squared: 0.1043, Adjusted R-squared: 0.09175 F-statistic: 8.314 on 5 and 357 DF, p-value: 1.898e-07
options()$contrasts
One genotype level is left out of the lm
output. Since the sum of the main effects is zero, the one left out is the negative of the sum of the others.
# all but one main effect
(meff <- t(coef(out1)[-1])) %>% round(4)
Genotype1 | Genotype2 | Genotype3 | Genotype4 | Genotype5 |
---|---|---|---|---|
-0.8238 | 1.5496 | -0.9063 | -0.1737 | -0.7481 |
# negative of the sum
-sum(meff) %>% round(4)
Thus, the main effect of the "Col Ama" (which is first in alphabetical order) genotype is 1.1023. We can get this number by also releveling the factor to make "vin3-4 FRI" the reference.
# relevel
vern1$Genotype <- relevel(vern1$Genotype,ref="vin3-4 FRI")
# fit new model
out2 <- lm(Cauline.leaf.num ~ Genotype,
data = vern1[vern1$Vernalization=="NV",])
summary(out2)
Call: lm(formula = Cauline.leaf.num ~ Genotype, data = vern1[vern1$Vernalization == "NV", ]) Residuals: Min 1Q Median 3Q Max -5.1765 -2.7206 -0.1765 2.2708 9.5469 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.6269 0.1458 45.444 < 2e-16 *** Genotype1 1.1023 0.3556 3.100 0.00209 ** Genotype2 -0.8238 0.3127 -2.635 0.00879 ** Genotype3 1.5496 0.3468 4.468 1.06e-05 *** Genotype4 -0.9063 0.3091 -2.932 0.00358 ** Genotype5 -0.1737 0.3165 -0.549 0.58337 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.752 on 357 degrees of freedom (25 observations deleted due to missingness) Multiple R-squared: 0.1043, Adjusted R-squared: 0.09175 F-statistic: 8.314 on 5 and 357 DF, p-value: 1.898e-07
We can see that the intercept is unchanged, and that the main effect for the releveled factor indicates that the "Col Ama" genotype has main effect 1.1023.
data.frame(original=coef(out1),releveled=coef(out2))
original | releveled | |
---|---|---|
<dbl> | <dbl> | |
(Intercept) | 6.6268614 | 6.6268614 |
Genotype1 | -0.8238311 | 1.1023052 |
Genotype2 | 1.5496091 | -0.8238311 |
Genotype3 | -0.9062732 | 1.5496091 |
Genotype4 | -0.1737364 | -0.9062732 |
Genotype5 | -0.7480736 | -0.1737364 |