synbreed (version 0.12-6)

crossVal: Cross validation of different prediction models

Description

Applies cross validation to prediction models with fixed and random effects. Covariance matrices must be passed to the function; variance components can either be passed ("committed") or reestimated with ASReml or the BLR function.

Usage

crossVal(gpData, trait=1, cov.matrix = NULL,  k = 2, Rep = 1, Seed = NULL,
         sampling = c("random", "within popStruc", "across popStruc","commit"),
         TS=NULL,ES=NULL, varComp = NULL, popStruc = NULL, VC.est = c("commit",
         "ASReml","BRR","BL"),verbose=FALSE,...)

Arguments

gpData
Object of class gpData
trait
numeric or character. The name or number of the trait in the gpData object to be used as trait.
cov.matrix
list including covariance matrices for the random effects. The rows and columns should match the rownames of y in number and order. If no covariance matrix is given, an identity matrix and the marker genotypes are used for a marker regression. In general, a covariance matrix should be non-singular and positive definite so that it can be inverted. If this is not the case, a constant of 1e-5 is added to the diagonal elements of the covariance matrix.
k
numeric. Number of folds for k-fold cross validation, thus k should be in [2,nrow(y)] (default=2).
Rep
numeric. Number of replications (default = 1).
Seed
numeric. Number for set.seed() to make results reproducible.
sampling
Sampling strategy; one of "random", "within popStruc" or "across popStruc". If sampling is "commit", the test sets have to be specified in TS (see Details).
TS
An (optional) list of vectors with IDs for the test set in each fold within a list of replications, same layout as the output id.TS.
ES
An (optional) list of IDs for the estimation set in each fold within each replication.
varComp
A vector of variance components for the random effects, which has to be specified if VC.est="commit". The first variance components should be in the same order as the given covariance matrices; the last variance component is for the residuals.
popStruc
Vector of length nrow(y) assigning individuals to a population structure. If no popStruc is defined, the family information of gpData is used. Only required for the options sampling="within popStruc" or sampling="across popStruc".
VC.est
Should variance components be reestimated with "ASReml" or with Bayesian Ridge Regression "BRR" or Bayesian Lasso "BL" of the BLR package within the estimation set of each fold in the cross validation? If VC.est="commit", the variance components have to be defined in varComp. For ASReml, ASReml software has to be installed on the system.
verbose
Logical. Whether output shows replications and folds.
...
further arguments to be used by the genomic prediction models, i.e. prior values and MCMC options for the BLR function (see BLR).
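
The nested layout described for TS (one list element per replication, each holding one vector of IDs per fold) can be sketched in base R as follows. This is only an illustration under that assumption: the helper make_TS and the ID strings are hypothetical and not part of synbreed, and whether crossVal expects particular element names should be checked against the id.TS element of an existing result.

```r
# Hypothetical helper (not part of synbreed): build a nested test-set list,
# outer list = replications, inner list = folds, each fold a vector of IDs.
make_TS <- function(ids, k, Rep, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  lapply(seq_len(Rep), function(r) {
    # assign every ID to one of k folds of (nearly) equal size, in random order
    fold <- sample(rep(seq_len(k), length.out = length(ids)))
    unname(split(ids, fold))
  })
}

ids <- paste0("ID", 1:10)   # made-up individual IDs
TS  <- make_TS(ids, k = 5, Rep = 2, seed = 123)
str(TS)
```

Within each replication every ID appears in exactly one fold, so the folds together cover the whole data set.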

Value

An object of class list with the following items:
bu
Estimated fixed and random effects of each fold within each replication.
n.DS
Size of the data set (ES+TS) in each fold.
y.TS
Predicted values of all test sets within each replication.
n.TS
Size of the test set in each fold.
id.TS
List of IDs of each test set within a list of each replication.
PredAbi
Predictive ability of each fold within each replication calculated as correlation coefficient \(r(y_{TS},\hat y_{TS})\).
rankCor
Spearman's rank correlation of each fold within each replication calculated between \(y_{TS}\) and \(\hat y_{TS}\).
mse
Mean squared error of each fold within each replication calculated between \(y_{TS}\) and \(\hat y_{TS}\).
bias
Regression coefficients of a regression of the observed values on the predicted values in the TS. A regression coefficient \(< 1\) implies inflation of predicted values, and a coefficient of \(> 1\) deflation of predicted values.
m10
Mean of the observed values of the 10% best-predicted individuals in each replication. The k test sets are pooled within each replication.
k
Number of folds
Rep
Replications
sampling
Sampling method
Seed
Seed for set.seed()
rep.seed
Calculated seeds for each replication
nr.ranEff
Number of random effects
VC.est.method
Method for the variance components (committed or reestimated with ASReml/BRR/BL)
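
The accuracy measures listed above (PredAbi, rankCor, mse, bias, m10) can be reproduced for a single test set with base R. The sketch below uses made-up observed and predicted values and is not taken from the crossVal source; in particular, how crossVal rounds the 10% share for m10 is an assumption here.

```r
# Toy observed and predicted values for one test set (made-up numbers)
y_obs  <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.0)
y_pred <- c(10.0, 11.0, 10.1, 11.8, 10.5, 11.4)

pred_abi <- cor(y_obs, y_pred)                       # Pearson correlation
rank_cor <- cor(y_obs, y_pred, method = "spearman")  # Spearman's rank correlation
mse      <- mean((y_obs - y_pred)^2)                 # mean squared error
bias     <- unname(coef(lm(y_obs ~ y_pred))[2])      # slope of observed on predicted

# Mean observed value of the 10% best-predicted individuals
# (assumption: round down, but keep at least one individual)
n_top <- max(1, floor(0.1 * length(y_pred)))
m10   <- mean(y_obs[order(y_pred, decreasing = TRUE)[seq_len(n_top)]])

c(PredAbi = pred_abi, rankCor = rank_cor, mse = mse, bias = bias, m10 = m10)
```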

Details

In cross validation the data set is split into an estimation set (ES) and a test set (TS). The effects are estimated with the ES and used to predict the observations in the TS. For sampling into ES and TS, k-fold cross validation is applied: the data set is split into k subsets, of which k-1 form the ES and the remaining one is the TS, and this is repeated so that each subset serves as the TS once. To account for the family structure (Albrecht et al. 2011), sampling can be defined as:
random
Does not account for family structure, random sampling within the complete data set
within popStruc
Accounts for within population structure information, e.g. each family is split into k subsets
across popStruc
Accounts for across population structure information, e.g. ES and TS each contain a set of complete families
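
The three sampling strategies can be illustrated with base R fold assignments. This is a minimal sketch on made-up IDs and family labels, not the code used inside crossVal; the object names are hypothetical.

```r
set.seed(1)
k        <- 2
ids      <- paste0("ID", 1:12)   # made-up IDs
popStruc <- rep(c("famA", "famB", "famC", "famD"), each = 3)

# random: ignore families, shuffle fold labels over all individuals
fold_random <- sample(rep(seq_len(k), length.out = length(ids)))

# within popStruc: shuffle fold labels separately inside each family
fold_within <- ave(seq_along(ids), popStruc, FUN = function(i)
  sample(rep(seq_len(k), length.out = length(i))))

# across popStruc: assign whole families to folds
fams        <- unique(popStruc)
fam_fold    <- setNames(sample(rep(seq_len(k), length.out = length(fams))), fams)
fold_across <- unname(fam_fold[popStruc])

table(popStruc, fold_within)   # every family is spread over both folds
table(popStruc, fold_across)   # every family falls into exactly one fold
```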
The following mixed model is used for VC.est="commit": $$\mathbf{y}=\mathbf{Xb}+\mathbf{Zu}+\mathbf{e}$$ with $$\mathbf{u} \sim N(\mathbf{0},\mathbf{G}\sigma^2_u),$$ which gives the mixed model equations $$\left(\begin{array}{cc} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}^{-1}\frac{\sigma^2_e}{\sigma^2_u} \end{array} \right) \left( \begin{array}{c} \mathbf{b} \\ \mathbf{u} \end{array}\right) = \left(\begin{array}{c} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{array} \right)$$
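
For a single random effect, the mixed model equations can be set up and solved directly in base R. The sketch below uses toy data with made-up variance components; the matrices X, Z, G and the vector y follow the notation of the equations, while all other names and numbers are assumptions for illustration and do not reproduce crossVal's internal implementation.

```r
set.seed(42)
n <- 6
X <- matrix(1, n, 1)        # fixed effects design: intercept only
Z <- diag(n)                # one record per individual
G <- diag(n) * 0.5 + 0.5    # toy covariance: 1 on the diagonal, 0.5 elsewhere
y <- rnorm(n, mean = 10)    # simulated phenotypes

sigma2_u <- 2               # made-up variance components
sigma2_e <- 4
lambda   <- sigma2_e / sigma2_u

# coefficient matrix and right-hand side of the mixed model equations
LHS <- rbind(cbind(crossprod(X),    crossprod(X, Z)),
             cbind(crossprod(Z, X), crossprod(Z) + solve(G) * lambda))
RHS <- rbind(crossprod(X, y), crossprod(Z, y))

sol   <- solve(LHS, RHS)
b_hat <- sol[1, 1]          # solution for the fixed effect
u_hat <- sol[-1, 1]         # solutions for the random effects
```

In a cross validation fold, X, Z and y would contain only the individuals of the ES, and the TS observations would then be predicted from the resulting solutions.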

References

Albrecht T, Wimmer V, Auinger HJ, Erbe M, Knaak C, Ouzunova M, Simianer H, Schoen CC (2011) Genome-based prediction of testcross values in maize. Theor Appl Genet 123:339-350
Mosier CI (1951) I. Problems and design of cross-validation 1. Educ Psychol Measurement 11:5-11
Crossa J, de los Campos G, Perez P, Gianola D, Burgueno J, et al. (2010) Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186:713-724
de los Campos G, Perez Rodriguez P (2010) BLR: Bayesian Linear Regression. R package version 1.2. http://CRAN.R-project.org/package=BLR

See Also

summary.cvData

Examples

# loading the maize data set
## Not run: ------------------------------------
# library(synbreedData)
# data(maize)
# maize2 <- codeGeno(maize)
# U <- kin(maize2,ret="realized")
# # cross validation
# cv.maize  <- crossVal(maize2,cov.matrix=list(U),k=5,Rep=1,
#             Seed=123,sampling="random",varComp=c(26.5282,48.5785),VC.est="commit")
# cv.maize2 <- crossVal(maize2,k=5,Rep=1,
#              Seed=123,sampling="random",varComp=c(0.0704447,48.5785),VC.est="commit")
# # comparing results, both are equal!
# cv.maize$PredAbi
# cv.maize2$PredAbi
# summary(cv.maize)
# summary(cv.maize2)
## ---------------------------------------------
