kin.blup: Genotypic value prediction based on kinship

Description

Genotypic value prediction by G-BLUP, where the genotypic covariance G can be additive or based on a Gaussian kernel.

Usage

kin.blup(data,geno,pheno,GAUSS=FALSE,K=NULL,fixed=NULL,covariate=NULL,
         PEV=FALSE,n.core=1,theta.seq=NULL,reduce=TRUE)

Arguments

data

Data frame with columns for the phenotype, the genotype identifier, and any environmental variables.

geno

Character string for the name of the column in the data frame that contains the genotype identifier.

pheno

Character string for the name of the column in the data frame that contains the phenotype.

GAUSS

To model genetic covariance with a Gaussian kernel, set GAUSS=TRUE and pass the Euclidean distance for K (see below).

There are three options for specifying kinship: (1) If K=NULL, genotypes are assumed to be independent $(G=I \: V_g)$. (2) For breeding value prediction, set GAUSS=FALSE and use an additive relationship matrix for K to create the model $(G=K \: V_g)$

fixed

An array of strings containing the names of columns that should be included as (categorical) fixed effects in the mixed model.

covariate

An array of strings containing the names of columns that should be included as covariates in the mixed model.

PEV

When PEV=TRUE, the function returns the prediction error variance for the genotypic values ($PEV_i = Var[g^*_i-g_i]$).

n.core

Specifies the number of cores to use for parallel execution of the Gaussian kernel method. Requires R package multicore and cannot be used with the R GUI.

theta.seq

The scale parameter for the Gaussian kernel is set by maximizing the restricted log-likelihood over a grid of values. By default, the grid is constructed by dividing the interval (0,max(K)] into 10 points. Passing a numeric array to this variable (theta.

reduce

When reduce=TRUE and there are repeated observations, kin.blup reduces the dimension of the problem to equal the number of lines (see details). To disable this feature, set reduce=FALSE.

Value

If GAUSS = FALSE and PEV = FALSE, a list containing [object Object],[object Object],[object Object] If PEV = TRUE, the list also includes [object Object] If GAUSS = TRUE, the list also includes [object Object]

Details

This function is a wrapper for mixed.solve and thus solves mixed models of the form: $$y = X \beta + [Z \: 0] g + \varepsilon$$ where $\beta$ is a vector of fixed effects, $g$ is a vector of random genotypic values with covariance $G = Var[g]$, and the residuals are i.i.d. ($Var[\varepsilon] = I \: V_e$). The design matrix for the genetic values has been partitioned to illustrate that not all lines need phenotypes (i.e., for genomic selection). Unlike mixed.solve, this function does not return estimates of the fixed effects, only the BLUP solution for the genotypic values. It was designed to replace kinship.BLUP and to relieve the user of having to explicitly construct design matrices. Variance components are estimated by REML and BLUP values are returned for every entry in K, regardless of whether it has been phenotyped. The rownames of K must match the genotype labels in the data frame for phenotyped lines; missing phenotypes (NA) are simply omitted. Unlike its predecessor, this function does not handle marker data directly. For breeding value prediction, the user must supply a relationship matrix, which can be calculated from markers with A.mat. For Gaussian kernel predictions, pass the Euclidean distance matrix for K, which can be calculated with dist. In the terminology of mixed models, both the "fixed" and "covariate" variables are fixed effects ($\beta$ in the above equation): the former are treated as factors with distinct levels while the latter are continuous with one coefficient per variable. If no fixed effects of any kind are specified, an overall mean is included in the model. The prediction error variance (PEV) is the square of the SE of the BLUPs (see mixed.solve) and can be used to estimate the accuracy of BLUP predictions. When the number of observations exceeds the number of lines, as happens with replicated trials or measurements, genomic predictions can be done by explicitly modeling the environmental variables in a one-step approach. For unbalanced trials, this accounts for differing degrees of replication but is more computationally intensive than a two-step model. When reduce=TRUE, the original mixed model is multiplied by $L = (Z'Z)^{-1/2}Z'$ to generate a new mixed model with dimension equal to the number of lines, i.i.d. residuals, and the same variance parameters as in the original model. (In general, only linear combinations of the original fixed effects parameters $\beta$ are estimable in the reduced model--this is handled by the software.) The reduced problem is similar to a two-step prediction with rotation of the adjusted means (Schulz-Streeck et al. 2012), but it is one-step in the sense that variance components and BLUPs are only estimated once.

References

Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi: 10.3835/plantgenome2011.08.0024 Schulz-Streeck, T., J.O. Ogutu and H.-P. Piepho. 2012. Comparisons of single-stage and two-stage approaches to genomic selection. Theoretical and Applied Genetics. doi: 10.1007/s00122-012-1960-1

Examples

Run this code

#random population of 200 lines with 1000 markers
M <- matrix(rep(0,200*1000),200,1000)
for (i in 1:200) {
  M[i,] <- ifelse(runif(1000)<0.5,-1,1)
}
rownames(M) <- 1:200
A <- A.mat(M)

#random phenotypes
u <- rnorm(1000)
g <- as.vector(crossprod(t(M),u))
h2 <- 0.5  #heritability
y <- g + rnorm(200,mean=0,sd=sqrt((1-h2)/h2*var(g)))

data <- data.frame(y=y,gid=1:200)

#predict breeding values
ans <- kin.blup(data=data,geno="gid",pheno="y",K=A)
accuracy <- cor(g,ans$g)

Run the code above in your browser using DataLab