Genotypes are available only on a few thousand (non-random) individuals

Phenotype and pedigree information available on millions

Often, phenotypes are not available on genotyped individuals (sires)

Training (estimation of marker effects) based on deregressed EBV

Marker-based EBV combined with pedigree-based EBV using selection index theory

An alternative, single-step approach proposed by Legarra et al. @Legarra:2009:J-Dairy-Sci:19700729 [@Misztal:2009:J-Dairy-Sci:19700728; @Aguilar:2010:J-Dairy-Sci:20105546]

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{M}\boldsymbol{\alpha}+\mathbf{e},$$

$\boldsymbol{\beta}$ are fixed effects

$\mathbf{X}$ incidence matrix for fixed effects

$\mathbf{M}$ marker covariates

$\boldsymbol{\alpha}\sim N(\mathbf{0},\mathbf{I}\sigma_{\alpha}^{2})$

$\mathbf{e}\sim N(\mathbf{0},\mathbf{I\sigma_{e}^{2})}$

Two mixed linear models are linearly equivalent and will lead to the same inferences if the vector $\mathbf{y}$ of observations has the same first and second moments in both models @Henderson.CR:1984.

In this sense, a model that is equivalent to ([eq:2-1]) can be written as

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{g}+\mathbf{e},$$

$\mathbf{g}=\mathbf{M}\boldsymbol{\alpha}$ has

null means and

covariance matrix

$$\begin{aligned} Var(\mathbf{g}|\mathbf{M}) &=Var(\mathbf{M}\boldsymbol{\alpha})\\ &=\mathbf{M}Var(\boldsymbol{\alpha})\mathbf{M}'. \end{aligned}$$

Then, in both models ([eq:2-1]) and ([eq:GM]), the mean of $\mathbf{y}$ is $\mathbf{X}\boldsymbol{\beta}$ and

the covariance matrix is $$Var(\mathbf{y}|\mathbf{M})=\mathbf{M}Var(\boldsymbol{\alpha})\mathbf{M}'+\mathbf{I}\sigma_{e}^{2}.$$ Thus, these two models are linearly equivalent and will lead to the same inferences.

When the number of markers is large relative to the size of $\mathbf{g}$, BLUP of $\mathbf{g}$ can be obtained efficiently @VanRaden:2008:J-Dairy-Sci:18946147 [@Stranden:2009:J-Dairy-Sci:19448030] by solving the MME that correspond to ([eq:GM]).

Under some assumptions,

$$\sigma_{\alpha}^{2}=\frac{\sigma_{g}^{2}}{\sum_{j}2p_{j}(1-p_{j})}$$

So,

$$\begin{aligned} Var(\mathbf{g}|\mathbf{M}) & = \frac{\mathbf{MM}'}{\sum_{j}2p_{j}(1-p_{j})}\sigma_{g}^{2}\nonumber \\ & = \mathbf{G}\sigma_{g}^{2}. \end{aligned}$$

- Suppose $\mathbf{g}$ is partitioned as $$\mathbf{g}=\left[\begin{array}{c} \mathbf{g}_{1}\\ \mathbf{g}_{2} \end{array}\right]=\left[\begin{array}{l} \mathbf{g}_{1}\\ \mathbf{M}_{2}\boldsymbol{\alpha} \end{array}\right],$$

$\mathbf{g_{1}}$ are the genomic BVs of the animals with missing genotypes $\mathbf{M}_{1}$

$\mathbf{g}_{2}$ are the BVs of those with observed genotypes $\mathbf{M}_{2}$.

Following Legarra et al. @Legarra:2009:J-Dairy-Sci:19700729, the vector $\mathbf{g}_{1}$ can be written as

$$\begin{aligned} \mathbf{g}_{1} & = \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2}+(\mathbf{g}_{1}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2})\nonumber \\ & = \hat{\mathbf{g}}_{1}+\boldsymbol{\epsilon}, \end{aligned}$$

$\mathbf{A}_{ij}$ are partitions of $\mathbf{A}$ that correspond to $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$.

The first term in ([eq:g1.2]) is the best linear predictor (BLP) of $\mathbf{g}_{1}$ given $\mathbf{g}_{2}$,

the second is a residual.

It is easy to see that $\boldsymbol{\epsilon}$ in ([eq:g1.2]) is uncorrelated to $\mathbf{g}_{2}$,

therefore if $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$ are multivariate normal, $\mathbf{\epsilon}$ and $\mathbf{g}_{2}$ are independent.

Consider first the conditional distribution of $\mathbf{g}_{1}$ given $\mathbf{P}$. Then, as expected, the

variance of $\mathbf{g}_{1}$ is

$$\begin{aligned} Var(\mathbf{g_{1}|}\mathbf{P}) & = [\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})]\sigma_{g}^{2}\\ & = \mathbf{A}_{11}\sigma_{g}^{2},\end{aligned}$$

where the first term of ([eq:varg1]) is the variance of the $\hat{{\bf{g}}}_{1}$ and the second term is the variance of $\boldsymbol{\epsilon}$.

Similarly, $Var(\mathbf{g}_{2}|\mathbf{P})=\mathbf{A}_{22}\sigma_{g}^{2}$.

Consider now the conditional distribution of $\mathbf{g}_{1}$ given $\mathbf{M}_{2}$.

Note that, given the observed genotypes $\mathbf{M}_{2}$, the distribution of $\mathbf{g}_{2}$ changes to a multivariate normal with

mean $\mathbf{0}$ and

covariance matrix $\mathbf{M}_{2}\mathbf{M}_{2}'\sigma_{\alpha}^{2}$.

As explained below, this change in the distribution of $\mathbf{g}_{2}$, produces an associated change in the distribution of $\mathbf{g}_{1}$ to a normal with

- mean $\mathbf{0}$ and
- covariance matrix: $$ Var(\mathbf{g_{1}|M}_{2})=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}\mathbf{M}'_{2}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}\sigma_{\alpha}^{2}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})\sigma_{g}^{2}, $$
- where now the vector $\hat{\mbox{g}}_{1}=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{g}_{2}$ has covariance matrix given by the first term of ([eq:varg|z]),
- because $\boldsymbol{\epsilon}$ is independent of $\mathbf{g}_{2}$, the second term of ([eq:varg|z]) remains identical to that of ([eq:varg1]).

- Similarly, the covariance between $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$ conditional on $\mathbf{M}_{2}$ is $$Cov(\mathbf{g}_{1},\mathbf{g}_{2})=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}\mathbf{M}'_{2}\sigma_{\alpha}^{2}.$$

Further, assuming ([eq:MarkerGenVar]), the above results can be combined to show that conditional on $\mathbf{M}_{2}$,

$\mathbf{g}$ has a multivariate normal distribution with null mean and covariance matrix:

$$Var(\mathbf{g}|\mathbf{M}_{2})=\mathbf{H}=\left[\begin{array}{cc} \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{G}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}+(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}) & \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{G}\\ \mathbf{G}\mathbf{A}_{22}^{-1}\mbox{A}_{21} & \mathbf{G} \end{array}\right]\sigma_{g}^{2},$$

where $\mathbf{G=}\mathbf{M}_{2}\mathbf{M}_{2}'/[\sum2p_{i}(1-p_{i})]$.

The inverse of this matrix is needed to setup the MME, and this is computed as $$\mathbf{H}^{-1}=\mathbf{A}^{-1}+\left[\begin{array}{ll} \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{G}^{-1}-\mathbf{A}_{22}^{-1} \end{array}\right].$$

Note that this requires computing both $\mathbf{G}^{-1}$ and $\mathbf{A}_{22}^{-1}$, which are dense and not easy to compute.

Due to the increased adoption of SNP genotyping in livestock, $\mathbf{G}^{-1}$ and $\mathbf{A}_{22}^{-1}$ are becoming too large for SS-GBLUP to remain computationally feasible (e.g. >100,000 animals).

A second problem in SS-GBLUP is related to the scaling that is done using the SNP frequencies.

As mentioned earlier, when all data that were used for selection are available for computing the conditional mean, it can be computed as if selection had not taken place @Gianola.D.ea:1986 [@Im.S.ea:1989; @Sorensen.DA.ea:2001].

If selection has taken place, this requires using SNP frequencies from the founders, as these frequencies are not changed by selection.

In most situations, however, SNP genotypes are not available in the founders and frequencies observed in the genotyped animals are used, which can lead to biased evaluations, particularly in a multi-breed context.

An approach very similar to that of using ([eq:g1.2]) to model missing genotypes was proposed by Fernando (see equation (43) [email protected]:_genom) in the context of genomic prediction using kernel regression, where missing genotypes were replaced by their conditional expectation computed using all available information, in contrast to using BLP as in SS-GBLUP.

Also, a residual that is similar to $\boldsymbol{\epsilon}$ was included in the model.

When these residuals are modeled exactly, the inverse of their covariance matrix is not sparse and SS-GBLUP would not be computationally feasible.