Author: Shantel A. Martinez
Purpose: Conduct genomic selection using a one-step approach instead of a two-step approach.
Last updated: 2019.03.24
Load: Packages and df
The dataframes were tidyed up in the D:/.../GS/PHS GS Input Data Prep.R
file.
We will start with all of the environments calculated together (no subsetting of env), and the kernel colored subsetted to i) Red, ii) White, and iii) Combined red and white.
load("D:/PHS Genomic Selection Project/Data Analysis/GS/PHS GS All Env myYs myGDs.RData")
library(BGLR); library(rrBLUP)
library(tidyverse); font<-element_text(face = "bold", size = 16); font2<-element_text(size = 16)
setwd("D:/PHS Genomic Selection Project/Data Analysis/GS/20190319")
I need to make lists of each of the KC groups for the genotypic data G
, the phenotype df PHS
and the names of each group name
.
G <- list(myGDc ,myGDr ,myGDw )
PHS <- list(PHSComb, PHSred7, PHSwhite)
name <- list("Comb", "Red", "White")
Define the parameters needed to run the model, in this case, rrBLUP.
h<-0.4 ; folds=5 ; i=2
The bandwidth parameter, h
, will not be needed until we run the RKHS models.
I only define i=2
until I get to the point of running 5-fold CV, then i
will be defined in a for()
loop
rrBLUP Function
Next, make a function allend_rrblup
for the complete rrBLUP model, so we can run the function on all three KC datasets.
NOTE: This is trained on 100% of the data
m1_rrblup <- function(G, PHS, name) {
rownames(G) = G[,1] ; G$GID <- NULL
K=A.mat(G)
obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
obsu <- obs$pred ; obsu <-as.data.frame(obsu)
obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
pred=ans$pred ; pred <- as.data.frame(pred)
pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
pred <- pred[order(pred$GID),] ; pred$KC <- name
write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
cor(tpred,tobsu)
}
Broken down into more detail:
After prepping the matrix G
for analysis, we ran A.mat
on the genetic matrix G
Then a mixed model was used to calulate the BLUPs of the RawMean sprouting score with Year
and Loc
used as fixed effects and we called the analysis obs
.
After organization, the dataframe obsu
now includes the GID lines names and the caluculated BLUPs.
Then a GEBVs were calulated from the RawMean sprouting score AND the kinship matrix K
with Year
and Loc
used as fixed effects and we called the analysis pred
.
After organization, the dataframe pred
now includes the GID lines names and the caluculated GEBVs translated into sprout estimates (using the intercept).
I knew the GID lines should be the same between obsu
and pred
, but I wasnt sure if they were in the same order, so I just ordered them by GID name. (This may be unneccessary)
The dataframes were saved in the wd as .csv files and the correlation between the observed phenotype obsu$BLUPs
were compared the the predicted phenotypes pred$GEBVs
accuracyr <- m1_rrblup(myGDr, PHSred7, "Red")
accuracyw <- m1_rrblup(myGDw, PHSwhite, "White")
accuracyc <- m1_rrblup(myGDc, PHSComb, "Comb")
write.csv(accuracyc,"accuracyc.csv")
write.csv(accuracyr,"accuracyr.csv")
write.csv(accuracyw,"accuracyw.csv")
accuracyr <- read.csv("accuracyr.csv")
accuracyw <- read.csv("accuracyw.csv")
accuracyc <-read.csv("accuracyc.csv")
df <- rbind(accuracyc, accuracyr, accuracyw); df$condition <- "AllEnv"
df1 <- na.omit(df)
write.csv(df1,"AllEnvrrBLUP_FixedYrLoc.csv", row.names=FALSE)
kc_ggplot(df1, "PA", "", "Sprout~GID+Yr+Loc", "rrBLUP")
ggsave("20190319_PA_FixedYrLoc.PNG", width = 10, height = 8, units = "in")
Setting up Five-fold Cross-Validation
Should I use a for()
loop or a foreach()
command. Using foreach()
will allow me to run 5 cores on the Small Grains Mac, however I cant test me code on my laptop. I also need to embedd the model into the code below, but that is not shown here just yet.
Below is just working out how to subset the data for training and testing, 5 randomly different times
PHS=PHSred7
G=myGDr
m1_rrblup <- function(G, PHS, name) {
accuracy = as.data.frame(matrix(NA,2,4))
names(accuracy) <- c("results", "KC", "model", "fixed")
obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
obsu <- obs$pred ; obsu <-as.data.frame(obsu)
obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
for(k in 1:5){
set.seed(k)
tst1 <- sample(rep(1:folds,length.out=nrow(G)))
G1 <- G
G1$tst <- tst1
Gr1 <- select(G1, GID, tst)
G1$tst <- NULL
rownames(G1) = G1[,1] ; G1$GID <- NULL
K=A.mat(G1)
results <- c()
foreach (i = 1:folds, .combine='c') %do% {
tst <- tst1 ; Gr <- Gr1
Gr$tst[which(i==Gr$tst)]<-NA
train <- PHS
train$index <- train$GIDx
train$index <- lapply(train$index, function(x) {
inds <- match(x, Gr$GID)
ifelse(is.na(inds),x, Gr$tst[inds])
})
train <- train[!(is.na(train$index)),]
ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
pred=ans$pred ; pred <- as.data.frame(pred)
pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
pred <- pred[order(pred$GID),] ; pred$KC <- name
write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
cor(tpred,tobsu)
results <- append(results,cor(tpred,tobsu))
results
}
resultsa <- as.data.frame(results)
resultsa$KC <- name
resultsa$model <- "rrBLUP"
resultsa$fixed <- "YrLoc"
accuracy <- rbind(accuracy, resultsa)
}
accuracy
}
obsu
can be calculated outside of the for
loop once, since it is the observed phenotype mixed model estimates on the whole panel. So it doesnt need to be calculated with every round of cross-fold validation.
Every time the loop starts, a new seed is set to i
, 1-5, creating the random selection of the subsets.
To clarify how we are subsetting:
- I am using the length of the
G
matrix as my total number of lines in the dataset, in the red KC case its 369.
- We will make a vector called
tst
with values 1-5, 369 times. What we are setting up, is to randomly select either 1, 2, 3, 4, or 5 and omit those GIDs from the training datsets (1/5 = 20%), leaving 80% of the dataset to train the model on.
- With
i=1
, for example, all of the 1
in the tst
column are called NA
(this is omitting the 20%)
- I then need to subset the phenotype file
PHS
that has all of the GIDs multiple times, over years and locations. However, since they have multiple replications, there is more than just 1-369. Which is why I needed to align the 1-5 tst values from the 1-369 GID names to the GID names in the PHS
df. This is done using thelappyly
function on the column train$index
matched to Gr$X
. Then all of the NAs
in the train$index
column are all then removed, which removes the 20% GIDs and leaves only the 80% GIDs (5-fold CV).
- Then the model is trainted on the
train
phenotype df.
- I then need to extract just the 20% GIDs from both the predicted model and the observed model called
tpred
and tobsu
, respectively.
- The prediction accuracy is calculated between the predicted values of the 20% GIDs compared to the observed values of the 20%
cor(tpred,tobsu)
---
title: "PHS GS One-Step Approach"
output:
  html_document:
    code_download: yes
    toc: true
    toc_depth: 5
    toc_float: true
    theme: flatly
  pdf_document: default
---
**Author:** Shantel A. Martinez  
**Purpose:** Conduct genomic selection using a one-step approach instead of a two-step approach.     
**Last updated:** 2019.03.24   

### Load: Packages and df  
The dataframes were tidyed up in the `D:/.../GS/PHS GS Input Data Prep.R` file.  
We will start with all of the environments calculated together (no subsetting of env), and the kernel colored subsetted to i) Red, ii) White, and iii) Combined red and white.  
```{R message=FALSE, warning=FALSE}
## All Env
load("D:/PHS Genomic Selection Project/Data Analysis/GS/PHS GS All Env myYs myGDs.RData")
library(BGLR); library(rrBLUP)
library(tidyverse); font<-element_text(face = "bold",  size = 16); font2<-element_text(size = 16)
setwd("D:/PHS Genomic Selection Project/Data Analysis/GS/20190319")
```
I need to make lists of each of the KC groups for the genotypic data `G`, the phenotype df `PHS` and the names of each group `name`.  

```{r}
G <- list(myGDc ,myGDr ,myGDw ) 
PHS <- list(PHSComb, PHSred7, PHSwhite)
name <- list("Comb", "Red", "White")
```

Define the parameters needed to run the model, in this case, rrBLUP.
```{r}
##Modeling Parameters~~~~~~~~~
h<-0.4 ; folds=5 ; i=2 
``` 
The bandwidth parameter, `h`, will not be needed until we run the RKHS models.  
I only define `i=2` until I get to the point of running 5-fold CV, then `i` will be defined in a `for()`loop  

### rrBLUP Function  
Next, make a function `allend_rrblup` for the complete rrBLUP model, so we can run the function on all three KC datasets.  
*NOTE: This is trained on 100% of the data*  
```{r}
m1_rrblup <- function(G, PHS, name) {
  rownames(G) = G[,1] ; G$GID <- NULL
  K=A.mat(G)

  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
  pred=ans$pred ; pred <-  as.data.frame(pred)
  pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
  pred <- pred[order(pred$GID),] ;  pred$KC <- name 
  write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
  tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
  tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
  cor(tpred,tobsu)  
}
```
**Broken down into more detail:**  
After prepping the matrix `G` for analysis, we ran `A.mat` on the genetic matrix `G`  

Then a mixed model was used to calulate the BLUPs of the RawMean sprouting score with `Year` and `Loc` used as fixed effects and we called the analysis `obs`.  
After organization, the dataframe `obsu` now includes the GID lines names and the caluculated BLUPs.  

Then a GEBVs were calulated from the RawMean sprouting score AND the kinship matrix `K` with `Year` and `Loc` used as fixed effects and we called the analysis `pred`.  
After organization, the dataframe `pred` now includes the GID lines names and the caluculated GEBVs translated into sprout estimates (using the intercept).   

I knew the GID lines should be the same between `obsu` and `pred`, but I wasnt sure if they were in the same order, so I just ordered them by GID name. (This may be unneccessary)  
The dataframes were saved in the wd as .csv files and the correlation between the observed phenotype `obsu$BLUPs` were compared the the predicted phenotypes `pred$GEBVs`  

```{r eval=FALSE}
accuracyr <- m1_rrblup(myGDr, PHSred7, "Red")
accuracyw <- m1_rrblup(myGDw, PHSwhite, "White")
accuracyc <- m1_rrblup(myGDc, PHSComb, "Comb")
write.csv(accuracyc,"accuracyc.csv")
write.csv(accuracyr,"accuracyr.csv")
write.csv(accuracyw,"accuracyw.csv")

accuracyr <- read.csv("accuracyr.csv")
accuracyw <- read.csv("accuracyw.csv")
accuracyc <-read.csv("accuracyc.csv")

df <- rbind(accuracyc, accuracyr, accuracyw); df$condition <- "AllEnv"
df1 <-  na.omit(df)#[!is.na(df)]
write.csv(df1,"AllEnvrrBLUP_FixedYrLoc.csv", row.names=FALSE)
kc_ggplot(df1, "PA", "", "Sprout~GID+Yr+Loc", "rrBLUP") 
ggsave("20190319_PA_FixedYrLoc.PNG", width = 10, height = 8, units = "in")

```


### Setting up Five-fold Cross-Validation 
Should I use a `for()` loop or a `foreach()` command. Using `foreach()` will allow me to run 5 cores on the Small Grains Mac, however I cant test me code on my laptop. I also need to embedd the model into the code below, but that is not shown here just yet.  

Below is just working out how to subset the data for training and testing, 5 randomly different times
```{r eval=FALSE}
PHS=PHSred7
G=myGDr

m1_rrblup <- function(G, PHS, name) {
  accuracy = as.data.frame(matrix(NA,2,4))
  names(accuracy) <- c("results", "KC", "model", "fixed")
  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  for(k in 1:5){
    set.seed(k)
    tst1 <-  sample(rep(1:folds,length.out=nrow(G))) #has 396 as an array
    G1 <- G
    G1$tst <-  tst1 #Make a 1-5 vector the length of myGD
    Gr1 <- select(G1, GID, tst)
    G1$tst <- NULL
    rownames(G1) = G1[,1] ; G1$GID <- NULL
    K=A.mat(G1)
    
    results <- c()
  
    foreach (i = 1:folds, .combine='c') %do% {
      #First subset train (80%) and test (20%) phenotypes
      tst <- tst1 ; Gr <- Gr1
      Gr$tst[which(i==Gr$tst)]<-NA #Now I want to make one of the 1-5 numbers NA
      
      train <- PHS #Align the Gr$tst to the PHS files based on X and GIDx names, Then omit columns with NA in them (the 20%)
      train$index <- train$GIDx
      train$index <- lapply(train$index, function(x) {
         inds <- match(x, Gr$GID)
         ifelse(is.na(inds),x, Gr$tst[inds]) 
      })
      train <- train[!(is.na(train$index)),]
    
      ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
      pred=ans$pred ; pred <-  as.data.frame(pred)
      pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
      pred <- pred[order(pred$GID),] ;  pred$KC <- name 
      write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
      tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
      tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
      cor(tpred,tobsu)  
      results <- append(results,cor(tpred,tobsu))
      results
    }
    resultsa <- as.data.frame(results)
    resultsa$KC <- name 
    resultsa$model <- "rrBLUP"
    resultsa$fixed <- "YrLoc"
    accuracy <- rbind(accuracy, resultsa)
  }
  accuracy
}

``` 
`obsu` can be calculated outside of the `for` loop once, since it is the observed phenotype mixed model estimates on the whole panel. So it doesnt need to be calculated with every round of cross-fold validation.  

Every time the loop starts, a new seed is set to `i`, 1-5, creating the random selection of the subsets.  
To clarify how we are subsetting:  

- I am using the length of the `G` matrix as my total number of lines in the dataset, in the red KC case its 369.   
- We will make a vector called `tst` with values 1-5, 369 times. What we are setting up, is to randomly select either 1, 2, 3, 4, or 5 and omit those GIDs from the training datsets (1/5 = 20%), leaving 80% of the dataset to train the model on.     
- With `i=1`, for example, all of the `1` in the `tst` column are called `NA` (this is omitting the 20%)  
- I then need to  subset the phenotype file `PHS` that has all of the GIDs multiple times, over years and locations. However, since they have multiple replications, there is more than just 1-369. Which is why I needed to align the 1-5 tst values from the 1-369 GID names to the GID names in the `PHS` df. This is done using the`lappyly` function on the column `train$index` matched to `Gr$X`. Then all of the `NAs` in the `train$index` column are all then removed, which removes the 20% GIDs and leaves only the 80% GIDs (5-fold CV).  
- Then the model is trainted on the `train` phenotype df.  
- I then need to extract just the 20% GIDs from both the predicted model and the observed model called `tpred` and `tobsu`, respectively. 
- The prediction accuracy is calculated between the predicted values of the 20% GIDs compared to the observed values of the 20% 

```{r eval=FALSE}
cor(tpred,tobsu)  
```
