• Load: Packages and df
  • rrBLUP Function
  • Setting up Five-fold Cross-Validation

Author: Shantel A. Martinez
Purpose: Conduct genomic selection using a one-step approach instead of a two-step approach.
Last updated: 2019.03.24

Load: Packages and df

The dataframes were tidyed up in the D:/.../GS/PHS GS Input Data Prep.R file.
We will start with all of the environments calculated together (no subsetting of env), and the kernel colored subsetted to i) Red, ii) White, and iii) Combined red and white.

## All Env
load("D:/PHS Genomic Selection Project/Data Analysis/GS/PHS GS All Env myYs myGDs.RData")
library(BGLR); library(rrBLUP)
library(tidyverse); font<-element_text(face = "bold",  size = 16); font2<-element_text(size = 16)
setwd("D:/PHS Genomic Selection Project/Data Analysis/GS/20190319")

I need to make lists of each of the KC groups for the genotypic data G, the phenotype df PHS and the names of each group name.

G <- list(myGDc ,myGDr ,myGDw ) 
PHS <- list(PHSComb, PHSred7, PHSwhite)
name <- list("Comb", "Red", "White")

Define the parameters needed to run the model, in this case, rrBLUP.

##Modeling Parameters~~~~~~~~~
h<-0.4 ; folds=5 ; i=2 

The bandwidth parameter, h, will not be needed until we run the RKHS models.
I only define i=2 until I get to the point of running 5-fold CV, then i will be defined in a for()loop

rrBLUP Function

Next, make a function allend_rrblup for the complete rrBLUP model, so we can run the function on all three KC datasets.
NOTE: This is trained on 100% of the data

m1_rrblup <- function(G, PHS, name) {
  rownames(G) = G[,1] ; G$GID <- NULL
  K=A.mat(G)

  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
  pred=ans$pred ; pred <-  as.data.frame(pred)
  pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
  pred <- pred[order(pred$GID),] ;  pred$KC <- name 
  write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
  tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
  tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
  cor(tpred,tobsu)  
}

Broken down into more detail:
After prepping the matrix G for analysis, we ran A.mat on the genetic matrix G

Then a mixed model was used to calulate the BLUPs of the RawMean sprouting score with Year and Loc used as fixed effects and we called the analysis obs.
After organization, the dataframe obsu now includes the GID lines names and the caluculated BLUPs.

Then a GEBVs were calulated from the RawMean sprouting score AND the kinship matrix K with Year and Loc used as fixed effects and we called the analysis pred.
After organization, the dataframe pred now includes the GID lines names and the caluculated GEBVs translated into sprout estimates (using the intercept).

I knew the GID lines should be the same between obsu and pred, but I wasnt sure if they were in the same order, so I just ordered them by GID name. (This may be unneccessary)
The dataframes were saved in the wd as .csv files and the correlation between the observed phenotype obsu$BLUPs were compared the the predicted phenotypes pred$GEBVs

accuracyr <- m1_rrblup(myGDr, PHSred7, "Red")
accuracyw <- m1_rrblup(myGDw, PHSwhite, "White")
accuracyc <- m1_rrblup(myGDc, PHSComb, "Comb")
write.csv(accuracyc,"accuracyc.csv")
write.csv(accuracyr,"accuracyr.csv")
write.csv(accuracyw,"accuracyw.csv")

accuracyr <- read.csv("accuracyr.csv")
accuracyw <- read.csv("accuracyw.csv")
accuracyc <-read.csv("accuracyc.csv")

df <- rbind(accuracyc, accuracyr, accuracyw); df$condition <- "AllEnv"
df1 <-  na.omit(df)#[!is.na(df)]
write.csv(df1,"AllEnvrrBLUP_FixedYrLoc.csv", row.names=FALSE)
kc_ggplot(df1, "PA", "", "Sprout~GID+Yr+Loc", "rrBLUP") 
ggsave("20190319_PA_FixedYrLoc.PNG", width = 10, height = 8, units = "in")

Setting up Five-fold Cross-Validation

Should I use a for() loop or a foreach() command. Using foreach() will allow me to run 5 cores on the Small Grains Mac, however I cant test me code on my laptop. I also need to embedd the model into the code below, but that is not shown here just yet.

Below is just working out how to subset the data for training and testing, 5 randomly different times

PHS=PHSred7
G=myGDr

m1_rrblup <- function(G, PHS, name) {
  accuracy = as.data.frame(matrix(NA,2,4))
  names(accuracy) <- c("results", "KC", "model", "fixed")
  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  for(k in 1:5){
    set.seed(k)
    tst1 <-  sample(rep(1:folds,length.out=nrow(G))) #has 396 as an array
    G1 <- G
    G1$tst <-  tst1 #Make a 1-5 vector the length of myGD
    Gr1 <- select(G1, GID, tst)
    G1$tst <- NULL
    rownames(G1) = G1[,1] ; G1$GID <- NULL
    K=A.mat(G1)
    
    results <- c()
  
    foreach (i = 1:folds, .combine='c') %do% {
      #First subset train (80%) and test (20%) phenotypes
      tst <- tst1 ; Gr <- Gr1
      Gr$tst[which(i==Gr$tst)]<-NA #Now I want to make one of the 1-5 numbers NA
      
      train <- PHS #Align the Gr$tst to the PHS files based on X and GIDx names, Then omit columns with NA in them (the 20%)
      train$index <- train$GIDx
      train$index <- lapply(train$index, function(x) {
         inds <- match(x, Gr$GID)
         ifelse(is.na(inds),x, Gr$tst[inds]) 
      })
      train <- train[!(is.na(train$index)),]
    
      ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
      pred=ans$pred ; pred <-  as.data.frame(pred)
      pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
      pred <- pred[order(pred$GID),] ;  pred$KC <- name 
      write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
      tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
      tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
      cor(tpred,tobsu)  
      results <- append(results,cor(tpred,tobsu))
      results
    }
    resultsa <- as.data.frame(results)
    resultsa$KC <- name 
    resultsa$model <- "rrBLUP"
    resultsa$fixed <- "YrLoc"
    accuracy <- rbind(accuracy, resultsa)
  }
  accuracy
}

obsu can be calculated outside of the for loop once, since it is the observed phenotype mixed model estimates on the whole panel. So it doesnt need to be calculated with every round of cross-fold validation.

Every time the loop starts, a new seed is set to i, 1-5, creating the random selection of the subsets.
To clarify how we are subsetting:

  • I am using the length of the G matrix as my total number of lines in the dataset, in the red KC case its 369.
  • We will make a vector called tst with values 1-5, 369 times. What we are setting up, is to randomly select either 1, 2, 3, 4, or 5 and omit those GIDs from the training datsets (1/5 = 20%), leaving 80% of the dataset to train the model on.
  • With i=1, for example, all of the 1 in the tst column are called NA (this is omitting the 20%)
  • I then need to subset the phenotype file PHS that has all of the GIDs multiple times, over years and locations. However, since they have multiple replications, there is more than just 1-369. Which is why I needed to align the 1-5 tst values from the 1-369 GID names to the GID names in the PHS df. This is done using thelappyly function on the column train$index matched to Gr$X. Then all of the NAs in the train$index column are all then removed, which removes the 20% GIDs and leaves only the 80% GIDs (5-fold CV).
  • Then the model is trainted on the train phenotype df.
  • I then need to extract just the 20% GIDs from both the predicted model and the observed model called tpred and tobsu, respectively.
  • The prediction accuracy is calculated between the predicted values of the 20% GIDs compared to the observed values of the 20%
cor(tpred,tobsu)  
---
title: "PHS GS One-Step Approach"
output:
  html_document:
    code_download: yes
    toc: true
    toc_depth: 5
    toc_float: true
    theme: flatly
  pdf_document: default
---
**Author:** Shantel A. Martinez  
**Purpose:** Conduct genomic selection using a one-step approach instead of a two-step approach.     
**Last updated:** 2019.03.24   

### Load: Packages and df  
The dataframes were tidyed up in the `D:/.../GS/PHS GS Input Data Prep.R` file.  
We will start with all of the environments calculated together (no subsetting of env), and the kernel colored subsetted to i) Red, ii) White, and iii) Combined red and white.  
```{R message=FALSE, warning=FALSE}
## All Env
load("D:/PHS Genomic Selection Project/Data Analysis/GS/PHS GS All Env myYs myGDs.RData")
library(BGLR); library(rrBLUP)
library(tidyverse); font<-element_text(face = "bold",  size = 16); font2<-element_text(size = 16)
setwd("D:/PHS Genomic Selection Project/Data Analysis/GS/20190319")
```
I need to make lists of each of the KC groups for the genotypic data `G`, the phenotype df `PHS` and the names of each group `name`.  

```{r}
G <- list(myGDc ,myGDr ,myGDw ) 
PHS <- list(PHSComb, PHSred7, PHSwhite)
name <- list("Comb", "Red", "White")
```

Define the parameters needed to run the model, in this case, rrBLUP.
```{r}
##Modeling Parameters~~~~~~~~~
h<-0.4 ; folds=5 ; i=2 
``` 
The bandwidth parameter, `h`, will not be needed until we run the RKHS models.  
I only define `i=2` until I get to the point of running 5-fold CV, then `i` will be defined in a `for()`loop  

### rrBLUP Function  
Next, make a function `allend_rrblup` for the complete rrBLUP model, so we can run the function on all three KC datasets.  
*NOTE: This is trained on 100% of the data*  
```{r}
m1_rrblup <- function(G, PHS, name) {
  rownames(G) = G[,1] ; G$GID <- NULL
  K=A.mat(G)

  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
  pred=ans$pred ; pred <-  as.data.frame(pred)
  pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
  pred <- pred[order(pred$GID),] ;  pred$KC <- name 
  write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
  tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
  tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
  cor(tpred,tobsu)  
}
```
**Broken down into more detail:**  
After prepping the matrix `G` for analysis, we ran `A.mat` on the genetic matrix `G`  

Then a mixed model was used to calulate the BLUPs of the RawMean sprouting score with `Year` and `Loc` used as fixed effects and we called the analysis `obs`.  
After organization, the dataframe `obsu` now includes the GID lines names and the caluculated BLUPs.  

Then a GEBVs were calulated from the RawMean sprouting score AND the kinship matrix `K` with `Year` and `Loc` used as fixed effects and we called the analysis `pred`.  
After organization, the dataframe `pred` now includes the GID lines names and the caluculated GEBVs translated into sprout estimates (using the intercept).   

I knew the GID lines should be the same between `obsu` and `pred`, but I wasnt sure if they were in the same order, so I just ordered them by GID name. (This may be unneccessary)  
The dataframes were saved in the wd as .csv files and the correlation between the observed phenotype `obsu$BLUPs` were compared the the predicted phenotypes `pred$GEBVs`  

```{r eval=FALSE}
accuracyr <- m1_rrblup(myGDr, PHSred7, "Red")
accuracyw <- m1_rrblup(myGDw, PHSwhite, "White")
accuracyc <- m1_rrblup(myGDc, PHSComb, "Comb")
write.csv(accuracyc,"accuracyc.csv")
write.csv(accuracyr,"accuracyr.csv")
write.csv(accuracyw,"accuracyw.csv")

accuracyr <- read.csv("accuracyr.csv")
accuracyw <- read.csv("accuracyw.csv")
accuracyc <-read.csv("accuracyc.csv")

df <- rbind(accuracyc, accuracyr, accuracyw); df$condition <- "AllEnv"
df1 <-  na.omit(df)#[!is.na(df)]
write.csv(df1,"AllEnvrrBLUP_FixedYrLoc.csv", row.names=FALSE)
kc_ggplot(df1, "PA", "", "Sprout~GID+Yr+Loc", "rrBLUP") 
ggsave("20190319_PA_FixedYrLoc.PNG", width = 10, height = 8, units = "in")

```


### Setting up Five-fold Cross-Validation 
Should I use a `for()` loop or a `foreach()` command. Using `foreach()` will allow me to run 5 cores on the Small Grains Mac, however I cant test me code on my laptop. I also need to embedd the model into the code below, but that is not shown here just yet.  

Below is just working out how to subset the data for training and testing, 5 randomly different times
```{r eval=FALSE}
PHS=PHSred7
G=myGDr

m1_rrblup <- function(G, PHS, name) {
  accuracy = as.data.frame(matrix(NA,2,4))
  names(accuracy) <- c("results", "KC", "model", "fixed")
  obs <- kin.blup(data = PHS ,geno="GIDx",pheno = "RawMean", fixed = c("Year","Loc"))
  obsu <- obs$pred ; obsu <-as.data.frame(obsu)
  obsu$GIDx <- rownames(obsu) ; names(obsu) <- c("BLUP","GID"); obsu <- obsu[c("GID","BLUP") ]
  obsu <- obsu[order(obsu$GID),]; obsu$KC <- name; obsu$model <- "rrBLUP"; obsu$fixed <- "YrLoc"
  write.csv(obsu,file=paste("AllEnvrrBLUP_",name,"_ObsPHS",sep='',".csv"), row.names=FALSE)
  
  for(k in 1:5){
    set.seed(k)
    tst1 <-  sample(rep(1:folds,length.out=nrow(G))) #has 396 as an array
    G1 <- G
    G1$tst <-  tst1 #Make a 1-5 vector the length of myGD
    Gr1 <- select(G1, GID, tst)
    G1$tst <- NULL
    rownames(G1) = G1[,1] ; G1$GID <- NULL
    K=A.mat(G1)
    
    results <- c()
  
    foreach (i = 1:folds, .combine='c') %do% {
      #First subset train (80%) and test (20%) phenotypes
      tst <- tst1 ; Gr <- Gr1
      Gr$tst[which(i==Gr$tst)]<-NA #Now I want to make one of the 1-5 numbers NA
      
      train <- PHS #Align the Gr$tst to the PHS files based on X and GIDx names, Then omit columns with NA in them (the 20%)
      train$index <- train$GIDx
      train$index <- lapply(train$index, function(x) {
         inds <- match(x, Gr$GID)
         ifelse(is.na(inds),x, Gr$tst[inds]) 
      })
      train <- train[!(is.na(train$index)),]
    
      ans <- kin.blup(data = train ,geno="GIDx",pheno = "RawMean", K=K, fixed = c("Year","Loc"))
      pred=ans$pred ; pred <-  as.data.frame(pred)
      pred$GIDx <- rownames(pred); names(pred) <- c("pred","GID"); pred <- pred[c("GID","pred") ]
      pred <- pred[order(pred$GID),] ;  pred$KC <- name 
      write.csv(pred,file=paste("Run",i,k,"_AllEnvrrBLUP_YrLoc_",name,"_PredPHS_",sep='',".csv"), row.names=FALSE)
      
      tpred <- pred$pred; tpred[which(i!=tst)]<-NA ; tpred <- tpred[!is.na(tpred)]
      tobsu <- obsu$BLUP; tobsu[which(i!=tst)]<-NA ; tobsu <- tobsu[!is.na(tobsu)]
      cor(tpred,tobsu)  
      results <- append(results,cor(tpred,tobsu))
      results
    }
    resultsa <- as.data.frame(results)
    resultsa$KC <- name 
    resultsa$model <- "rrBLUP"
    resultsa$fixed <- "YrLoc"
    accuracy <- rbind(accuracy, resultsa)
  }
  accuracy
}

``` 
`obsu` can be calculated outside of the `for` loop once, since it is the observed phenotype mixed model estimates on the whole panel. So it doesnt need to be calculated with every round of cross-fold validation.  

Every time the loop starts, a new seed is set to `i`, 1-5, creating the random selection of the subsets.  
To clarify how we are subsetting:  

- I am using the length of the `G` matrix as my total number of lines in the dataset, in the red KC case its 369.   
- We will make a vector called `tst` with values 1-5, 369 times. What we are setting up, is to randomly select either 1, 2, 3, 4, or 5 and omit those GIDs from the training datsets (1/5 = 20%), leaving 80% of the dataset to train the model on.     
- With `i=1`, for example, all of the `1` in the `tst` column are called `NA` (this is omitting the 20%)  
- I then need to  subset the phenotype file `PHS` that has all of the GIDs multiple times, over years and locations. However, since they have multiple replications, there is more than just 1-369. Which is why I needed to align the 1-5 tst values from the 1-369 GID names to the GID names in the `PHS` df. This is done using the`lappyly` function on the column `train$index` matched to `Gr$X`. Then all of the `NAs` in the `train$index` column are all then removed, which removes the 20% GIDs and leaves only the 80% GIDs (5-fold CV).  
- Then the model is trainted on the `train` phenotype df.  
- I then need to extract just the 20% GIDs from both the predicted model and the observed model called `tpred` and `tobsu`, respectively. 
- The prediction accuracy is calculated between the predicted values of the 20% GIDs compared to the observed values of the 20% 

```{r eval=FALSE}
cor(tpred,tobsu)  
```
