Single-Step Bayesian Regression


The mixed linear model for the phenotypic values can be expressed in terms of a BVM ([eq:MarkerModel]) or an MEM ([eq:MarkerModel2]) as

$$\begin{aligned} \mathbf{y} & = \mathbf{X}\boldsymbol{\beta}+\mathbf{Zg}+\mathbf{e} \\ & = \mathbf{X}\boldsymbol{\beta}+\mathbf{ZM}\boldsymbol{\alpha}+\mathbf{e}, \end{aligned}$$

where we have introduced the incidence matrix $\mathbf{Z}$ to accommodate animals with repeated records or animals without records. As in SSBV-BLUP, suppose $\mathbf{M}_{1}$ is not observed. Then it is not possible to use ([eq:MarkerModel2]) as the basis for the MEM. Note that $\mathbf{M}_{1}\boldsymbol{\alpha}$ is equal to $\mathbf{g}_{1}$. So, using ([eq:g1.2]) for $\mathbf{g}_{1}$ and writing $\mathbf{g}_{2}=\mathbf{M}_{2}\boldsymbol{\alpha}$, the model for the phenotypic values becomes

$$\begin{aligned} \mathbf{\left[\begin{array}{c} \mathbf{y}_{1}\\ \mathbf{y}_{2} \end{array}\right]} & = \left[\begin{array}{c} \mathbf{X}_{1}\\ \mathbf{X}_{2} \end{array}\right]\boldsymbol{\beta}+\left[\begin{array}{cc} \mathbf{Z}_{1} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_{2} \end{array}\right]\left[\begin{array}{c} \mathbf{g}_{1}\\ \mathbf{g}_{2} \end{array}\right]+\mathbf{e}\\ &= \left[\begin{array}{c} \mathbf{X}_{1}\\ \mathbf{X}_{2} \end{array}\right]\boldsymbol{\beta}+\left[\begin{array}{cc} \mathbf{Z}_{1} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_{2} \end{array}\right]\left[\begin{array}{l} \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}\boldsymbol{\alpha}+\boldsymbol{\epsilon}\\ \mathbf{M}_{2}\boldsymbol{\alpha} \end{array}\right]+\mathbf{e}\\ & = \left[\begin{array}{c} \mathbf{X}_{1}\\ \mathbf{X}_{2} \end{array}\right]\boldsymbol{\beta}+\left[\begin{array}{cc} \mathbf{Z}_{1} & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_{2} \end{array}\right]\left[\begin{array}{l} \hat{\mathbf{M}}_{1}\boldsymbol{\alpha}+\boldsymbol{\epsilon}\\ \mathbf{M}_{2}\boldsymbol{\alpha} \end{array}\right]+\mathbf{e}\\ &=\mathbf{X}\boldsymbol{\beta}+\mathbf{W}\boldsymbol{\alpha}+\mathbf{U}\boldsymbol{\epsilon}+\mathbf{e} \end{aligned}$$


$$\mathbf{U=\left[\begin{array}{c} \mathbf{Z}_{1}\\ 0 \end{array}\right],}\,\, \mathbf{X}=\left[\begin{array}{c} \mathbf{X}_{1}\\ \mathbf{X}_{2} \end{array}\right]\; \text{and } \mathbf{W}=\left[\begin{array}{c} \mathbf{W}_{1}\\ \mathbf{W}_{2} \end{array}\right]\ =\left[\begin{array}{c} \mathbf{Z}_{1}\hat{\mathbf{M}}_{1}\\ \mathbf{Z}_{2}\mathbf{M}_{2} \end{array}\right] $$

The matrix $\hat{\mathbf{M}}_{1}=\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{M}_{2}$ of imputed SNP covariates can be obtained efficiently, using partitioned inverse results, by solving the easily formed very sparse system:


where $\mathbf{A}^{ij}$ are partitions of $\mathbf{A}^{-1}$ that correspond to partitioning $\mathbf{g}$ into $\mathbf{g}_{1}$ and $\mathbf{g}_{2}$.

The differences between this MEM ([eq:SSBR]) and the model that is currently used for Bayesian regression (BR) are: 1) that some of the covariates in ([eq:SSBR]) are imputed, and 2) a residual term $\boldsymbol{\epsilon}$ has been introduced to account for the deviations of the imputed genotype $ $covariates from their unobserved, actual values. Regardless of the prior used for $\boldsymbol{\alpha}$, the distribution of the vector $\boldsymbol{\epsilon}$ of imputation residuals will be approximated by a multivariate normal vector with null mean and covariance matrix $(\mathbf{A}_{11}-\mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})\sigma_{g}^{2}$ (see equation [eq:varg|z]), where $\sigma_{g}^{2}$ is assigned a scaled inverse chi-square distribution with scale parameter $S_{g}^{2}$ and degrees of freedom $\nu_{g}$. Imputing the covariates needs to be done only once, and it can be done efficiently in parallel. Imputation of unobserved SNP covariates will not add significantly to overall computing time.

The MME that correspond to ([eq:SSBR]) for BayesC with $\pi=0$ are

$$\left[\begin{array}{lll} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}_{1}'\mathbf{Z}_{1}\\ \mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W}+\mathbf{I\dfrac{\sigma_{e}^{2}}{\sigma_{\alpha}^{2}}} & \mathbf{W}'_{1}\mathbf{Z}_{1}\\ \mathbf{Z}'_{1}\mathbf{X}_{1} & \mathbf{Z}'_{1}\mathbf{W}_{1} & \mathbf{Z}_{1}'\mathbf{Z}_{1}+\mathbf{A}^{11}\dfrac{\sigma_{e}^{2}}{\sigma_{g}^{2}} \end{array}\right]\left[\begin{array}{c} \hat{\boldsymbol{\beta}}\\ \hat{\boldsymbol{\alpha}}\\ \hat{\boldsymbol{\epsilon}} \end{array}\right]=\left[\begin{array}{c} \mathbf{X}'\mathbf{y}\\ \mathbf{W}'\mathbf{y}\\ \mathbf{Z}_{1}'\mathbf{y}_{1} \end{array}\right].$$

The submatrix of these MME that correspond to $\boldsymbol{\epsilon}$ are identical to those for $\mathbf{g}_{1}$ from a pedigree-based analysis and are very sparse. Thus as explained in the next section, conditional on $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, $\boldsymbol{\epsilon}$ can sampled efficiently by using either a blocking-Gibbs sampler @garcia-cortes96:_gibbs [@sorensenGianolaBook] or a single-site, Gibbs sampler used in pedigree-based analyses @sorensenGianolaBook. Note that these MME, which do not have $\mathbf{G}$ or its inverse, may be used to overcome the computational problems facing SSBV-BLUP. The predicted BVs can be written as

$$\tilde{\mathbf{g}}=\left[\begin{array}{l} \hat{\mathbf{M}}_{1}\\ \mathbf{M}_{2} \end{array}\right]\hat{\boldsymbol{\alpha}}+\mathbf{U}\hat{\boldsymbol{\epsilon}}=\left[\begin{array}{l} \hat{\mathbf{M}}_{1}\\ \mathbf{M}_{2} \end{array}\right]\hat{\boldsymbol{\alpha}}+\mathbf{\left[\begin{array}{c} \mathbf{Z}_{1}\\ 0 \end{array}\right]}\hat{\boldsymbol{\epsilon}}.$$

A similar system of MME without $\boldsymbol{\epsilon}$ was solved by iteration for a MEM @VanRaden:2008:J-Dairy-Sci:18946147 but using only genotyped animals. The MME given by ([eq:MMERLF]) has the advantage that it will not grow in size as more animals are genotyped, in contrast to the MME corresponding to ([eq:SSBLUPModel]) that is given by Aguilar et al. @Aguilar:2010:J-Dairy-Sci:20105546, but assuming ([eq:markerGenVar]) they give identical EBV.

Numerical Example

Consider the pedigree in Table [tab:ped].

Suppose genotypes are available only on the parents, individuals 1, 2, and 3. Genotypes ($\mathbf{M}_{2}$) at 10 markers are given in Table [tab:mrk].

Following Legarra et al. @Legarra:2009:J-Dairy-Sci:19700729, the relationship matrix is rearranged such that $\mathbf{A}_{11}$ are relationships among individuals 4, 5, and 6, which do not have genotypes, and $\mathbf{A}_{22}$ are relationships among the parents, 1, 2, and 3, which have genotypes given in Table [tab:mrk]. The inverse of the rearranged relationship matrix is given in Table [tab:Ai].

The imputed genotypes $\hat{\mathbf{M}}_{1}$ could be obtained efficiently by solving the system ([eq:impute]) and are given in Table [tab:M1Hat]. %

Now, to setup the MME we will assume that $\sigma_{\alpha}^{2}=\frac{\sigma_{g}^{2}}{10}$, $\sigma_{g}^{2}=\sigma_{e}^{2}$, and that $\mu$, an effect common to all the observations, is the only fixed effect. Then, the MME ([eq:MMERLF]) and solutions corresponding to the marker effects model ([eq:SSBR]) is given in Table [tab:MMESSBRLHS]. For comparison, the MME and solutions for the single-step BV model are given in Table [tab:MMESSBV]. The solutions for $\mu$ are identical from the two sets of MME, and BLUP of $\mathbf{g}$ obtained as ([eq:ghatSSBR]), using the solutions to ([eq:MMERLF]), are identical to the solutions to $\mathbf{g}$ given in Table [tab:MMESSBV].