# Single-Step GBLUP¶

## Introduction¶

• Genotypes are available only on a few thousand (non-random) individuals

• Phenotype and pedigree information available on millions

• Often, phenotypes are not available on genotyped individuals (sires)

• Training (estimation of marker effects) based on deregressed EBV

• Marker-based EBV combined with pedigree-based EBV using selection index theory

• An alternative, single-step approach proposed by Legarra et al. @Legarra:2009:J-Dairy-Sci:19700729 [@Misztal:2009:J-Dairy-Sci:19700728; @Aguilar:2010:J-Dairy-Sci:20105546]

## Theory¶

### Marker Effect Model¶

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{M}\boldsymbol{\alpha}+\mathbf{e},$$

• $\boldsymbol{\beta}$ are fixed effects

• $\mathbf{X}$ incidence matrix for fixed effects

• $\mathbf{M}$ marker covariates

• $\boldsymbol{\alpha}\sim N(\mathbf{0},\mathbf{I}\sigma_{\alpha}^{2})$

• $\mathbf{e}\sim N(\mathbf{0},\mathbf{I\sigma_{e}^{2})}$

### Breeding Value Models¶

• Two mixed linear models are linearly equivalent and will lead to the same inferences if the vector $\mathbf{y}$ of observations has the same first and second moments in both models @Henderson.CR:1984.

• In this sense, a model that is equivalent to ([eq:2-1]) can be written as

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{g}+\mathbf{e},$$

• $\mathbf{g}=\mathbf{M}\boldsymbol{\alpha}$ has

• null means and

• covariance matrix

\begin{aligned} Var(\mathbf{g}|\mathbf{M}) &=Var(\mathbf{M}\boldsymbol{\alpha})\\ &=\mathbf{M}Var(\boldsymbol{\alpha})\mathbf{M}'. \end{aligned}

• Then, in both models ([eq:2-1]) and ([eq:GM]), the mean of $\mathbf{y}$ is $\mathbf{X}\boldsymbol{\beta}$ and

• the covariance matrix is $$Var(\mathbf{y}|\mathbf{M})=\mathbf{M}Var(\boldsymbol{\alpha})\mathbf{M}'+\mathbf{I}\sigma_{e}^{2}.$$ Thus, these two models are linearly equivalent and will lead to the same inferences.

• When the number of markers is large relative to the size of $\mathbf{g}$, BLUP of $\mathbf{g}$ can be obtained efficiently @VanRaden:2008:J-Dairy-Sci:18946147 [@Stranden:2009:J-Dairy-Sci:19448030] by solving the MME that correspond to ([eq:GM]).

• Under some assumptions,

$$\sigma_{\alpha}^{2}=\frac{\sigma_{g}^{2}}{\sum_{j}2p_{j}(1-p_{j})}$$

• So,

\begin{aligned} Var(\mathbf{g}|\mathbf{M}) & = \frac{\mathbf{MM}'}{\sum_{j}2p_{j}(1-p_{j})}\sigma_{g}^{2}\nonumber \\ & = \mathbf{G}\sigma_{g}^{2}. \end{aligned}

## BLUP combining genotype and pedigree relationships¶

• Suppose $\mathbf{g}$ is partitioned as $$\mathbf{g}=\left[\begin{array}{c} \mathbf{g}_{1}\\ \mathbf{g}_{2} \end{array}\right]=\left[\begin{array}{l} \mathbf{g}_{1}\\ \mathbf{M}_{2}\boldsymbol{\alpha} \end{array}\right],$$
• $\mathbf{g_{1}}$ are the genomic BVs of the animals with missing genotypes $\mathbf{M}_{1}$

• $\mathbf{g}_{2}$ are the BVs of those with observed genotypes $\mathbf{M}_{2}$.

• Following Legarra et al. @Legarra:2009:J-Dairy-Sci:19700729, the vector $\mathbf{g}_{1}$ can be written as

\begin{aligned} \mathbf{g}_{1} & = \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2}+(\mathbf{g}_{1}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2})\nonumber \\ & = \hat{\mathbf{g}}_{1}+\boldsymbol{\epsilon}, \end{aligned}

• $\mathbf{A}_{ij}$ are partitions of $\mathbf{A}$ that correspond to $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$.

• The first term in ([eq:g1.2]) is the best linear predictor (BLP) of $\mathbf{g}_{1}$ given $\mathbf{g}_{2}$,

• the second is a residual.

• It is easy to see that $\boldsymbol{\epsilon}$ in ([eq:g1.2]) is uncorrelated to $\mathbf{g}_{2}$,

• therefore if $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$ are multivariate normal, $\mathbf{\epsilon}$ and $\mathbf{g}_{2}$ are independent.

• Consider first the conditional distribution of $\mathbf{g}_{1}$ given $\mathbf{P}$. Then, as expected, the

• variance of $\mathbf{g}_{1}$ is

\begin{aligned} Var(\mathbf{g_{1}|}\mathbf{P}) & = [\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})]\sigma_{g}^{2}\\ & = \mathbf{A}_{11}\sigma_{g}^{2},\end{aligned}

where the first term of ([eq:varg1]) is the variance of the $\hat{{\bf{g}}}_{1}$ and the second term is the variance of $\boldsymbol{\epsilon}$.

• Similarly, $Var(\mathbf{g}_{2}|\mathbf{P})=\mathbf{A}_{22}\sigma_{g}^{2}$.

• Consider now the conditional distribution of $\mathbf{g}_{1}$ given $\mathbf{M}_{2}$.

• Note that, given the observed genotypes $\mathbf{M}_{2}$, the distribution of $\mathbf{g}_{2}$ changes to a multivariate normal with

• mean $\mathbf{0}$ and

• covariance matrix $\mathbf{M}_{2}\mathbf{M}_{2}'\sigma_{\alpha}^{2}$.

• As explained below, this change in the distribution of $\mathbf{g}_{2}$, produces an associated change in the distribution of $\mathbf{g}_{1}$ to a normal with

• mean $\mathbf{0}$ and
• covariance matrix: $$Var(\mathbf{g_{1}|M}_{2})=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}\mathbf{M}'_{2}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}\sigma_{\alpha}^{2}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})\sigma_{g}^{2},$$
• where now the vector $\hat{\mbox{g}}_{1}=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2}$ has covariance matrix given by the first term of ([eq:varg|z]),
• because $\boldsymbol{\epsilon}$ is independent of $\mathbf{g}_{2}$, the second term of ([eq:varg|z]) remains identical to that of ([eq:varg1]).
• Similarly, the covariance between $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$ conditional on $\mathbf{M}_{2}$ is $$Cov(\mathbf{g}_{1},\mathbf{g}_{2})=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}\mathbf{M}'_{2}\sigma_{\alpha}^{2}.$$
• Further, assuming ([eq:MarkerGenVar]), the above results can be combined to show that conditional on $\mathbf{M}_{2}$,

• $\mathbf{g}$ has a multivariate normal distribution with null mean and covariance matrix:

$$Var(\mathbf{g}|\mathbf{M}_{2})=\mathbf{H}=\left[\begin{array}{cc} \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{G}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}) & \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{G}\\ \mathbf{G}\mathbf{A}_{22}^{-1}\mbox{A}_{21} & \mathbf{G} \end{array}\right]\sigma_{g}^{2},$$

where $\mathbf{G=}\mathbf{M}_{2}\mathbf{M}_{2}'/[\sum2p_{i}(1-p_{i})]$.

• The inverse of this matrix is needed to setup the MME, and this is computed as $$\mathbf{H}^{-1}=\mathbf{A}^{-1}+\left[\begin{array}{ll} \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{G}^{-1}-\mathbf{A}_{22}^{-1} \end{array}\right].$$

• Note that this requires computing both $\mathbf{G}^{-1}$ and $\mathbf{A}_{22}^{-1}$, which are dense and not easy to compute.

• Due to the increased adoption of SNP genotyping in livestock, $\mathbf{G}^{-1}$ and $\mathbf{A}_{22}^{-1}$ are becoming too large for SS-GBLUP to remain computationally feasible (e.g. >100,000 animals).

• A second problem in SS-GBLUP is related to the scaling that is done using the SNP frequencies.

• As mentioned earlier, when all data that were used for selection are available for computing the conditional mean, it can be computed as if selection had not taken place @Gianola.D.ea:1986 [@Im.S.ea:1989; @Sorensen.DA.ea:2001].

• If selection has taken place, this requires using SNP frequencies from the founders, as these frequencies are not changed by selection.

• In most situations, however, SNP genotypes are not available in the founders and frequencies observed in the genotyped animals are used, which can lead to biased evaluations, particularly in a multi-breed context.

• An approach very similar to that of using ([eq:g1.2]) to model missing genotypes was proposed by Fernando (see equation (43) [email protected]:_genom) in the context of genomic prediction using kernel regression, where missing genotypes were replaced by their conditional expectation computed using all available information, in contrast to using BLP as in SS-GBLUP.

• Also, a residual that is similar to $\boldsymbol{\epsilon}$ was included in the model.

• When these residuals are modeled exactly, the inverse of their covariance matrix is not sparse and SS-GBLUP would not be computationally feasible.