fitBLUP: Function fitBLUP

Description

Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)

Usage

fitBLUP(y, X = NULL, Z = NULL, K = NULL, U = NULL, 
        d = NULL, theta = NULL, BLUP = TRUE, 
        method = c("REML","ML"), return.Hinv = FALSE, 
        tol = 1E-5, maxiter = 1000, interval = c(1E-9,1E9),
        warn = TRUE)

Value

Returns a list object that contains the elements:

b: (vector) fixed effects solutions (including the intercept).
u: (vector) random effects solutions.
varU: random effect variance.
varE: residual variance.
h2: heritability.
convergence: (logical) whether Brent's method converged.
method: either 'REML' or 'ML' method used.

Arguments

y: (numeric vector) Response variable
X: (numeric matrix) Design matrix for fixed effects. When X=NULL a vector of ones is constructed only for the intercept (default)
Z: (numeric matrix) Design matrix for random effects. When Z=NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used
K: (numeric matrix) Kinship relationships. This can be of the "float32" type as per the 'float' R-package, or a (character) name of a binary file where the matrix is stored
U: (numeric matrix) Eigenvectors from spectral value decomposition of G = U D U'
d: (numeric vector) Eigenvalues from spectral value decomposition of G = U D U'
theta: (numeric) Residual/genetic variances ratio. When it is not NULL, the optimization of the likelihood function (REML or ML) is not performed
BLUP: TRUE or FALSE to whether return the random effects estimates
method: (character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood)
return.Hinv: TRUE or FALSE to whether return the inverse of the matrix H
tol: (numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative
maxiter: (integer) Maximum number of iterations to run before convergence is reached
interval: (numeric vector) Range of values in which the root is searched
warn: TRUE or FALSE to whether show warnings

Author

Paulino Perez-Rodriguez, Marco Lopez-Cruz (maraloc@gmail.com) and Gustavo de los Campos

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z u + e

where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic values of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices conecting the fixed and genetic effects with replicates. Genetic values are assumed to follow a Normal distribution as u ~ N(0,σ²_uK), and the error terms are assumed e ~ N(0,σ²_eD), with D=I being an identity matrix.

The vector of genetic values g = Z u will therefore follow g ~ N(0,σ²_uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence g = u and G = K.

The predicted values u_trn = (u_i), i = 1,2,...,n_trn, corresponding to observed data (training set) are derived as

u_trn = H (y_trn - X_trnb)

where H is a matrix of weights given by

H = G_trn (G_trn + θD)^-1

where G_trn is the sub-matrix corresponding to the training set, and θ = σ²_e/σ²_u is the residual/genetic variances ratio representing a shrinkage parameter. This parameter is expressed in terms of the heritability, h² = σ²_u/(σ²_u + σ²_e), as θ = (1 - h²)/h².

The predictions of u_tst corresponding to un-observed data (testing set) can be obtained by using

H = G_tst,trn (G_trn + θD)^-1

where G_tst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).

Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ²_u and σ²_e are calculated. Finally, b is obtained using Generalized Least Squares.

References

VanRaden PM (2008). Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11), 4414–4423.

Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824

Examples

Run this code

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:6)     # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  y = as.vector(scale(Y[,"E1"]))      # Scale response variable
  
  # Training and testing sets
  tst = which(Y$trial == 2)
  trn = which(Y$trial != 2)

  yNA <- y
  yNA[tst] <- NA
  fm1 = fitBLUP(yNA, K=G)
  plot(y[tst],fm1$u[tst])      # Predicted vs observed values in testing set
  cor(y[tst],fm1$u[tst])       # Prediction accuracy in testing set
  cor(y[trn],fm1$u[trn])       # Prediction accuracy in training set
  fm1$theta                    # Residual/Genetic variances ratio
  fm1$h2                       # Heritability
  # \donttest{
  if(requireNamespace("float")){
   # Using a 'float' type variable
   G2 = float::fl(G)
   fm2 = fitBLUP(yNA, K=G2)
   max(abs(fm1$u-fm2$u))  # Check for discrepances
  }
  # }

Run the code above in your browser using DataLab