x.val: Estimate genome-wide prediction accuracy using cross-validation

Description

x.val performs cross-validation (CV) to estimate the accuracy of genome-wide prediction (also known as genomic selection) for a specific training population (TP), i.e. a set of individuals for which phenotypic and genotypic data are available. CV can be conducted via one of two methods within x.val; see Details for more information.

NOTE - x.val, specifically BGLR, writes and reads files to disk, so it is highly recommended to set your working directory before running it.

Usage

x.val(
  G.in = NULL,
  y.in = NULL,
  min.maf = 0.01,
  mkr.cutoff = 0.5,
  entry.cutoff = 0.5,
  remove.dups = TRUE,
  impute = "EM",
  frac.train = 0.6,
  nCV.iter = 100,
  nFold = NULL,
  nFold.reps = 1,
  return.estimates = FALSE,
  CV.burnIn = 750,
  CV.nIter = 1500,
  models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"),
  saveAt = tempdir()
)

Value

A list containing:

- CVs A data frame of CV results for each trait/model combination specified

If return.estimates is TRUE, the following additional items will be returned:
- models.used A list of the models chosen to estimate marker effects for each trait
- mkr.effects A vector of marker effect estimates for each trait generated by the respective prediction model used
- betas A list of beta values for each trait generated by the respective prediction model used

Arguments

G.in

Matrix of genotypic data. First row contains marker names and the first column contains entry (taxa) names. Genotypes should be coded as follows:

1: homozygous for minor allele
0: heterozygous
-1: homozygous for major allele
NA: missing data
Imputed genotypes can be passed, see impute below for details

TIP - Set header=FALSE within read.table or read.csv when importing a tab-delimited file containing data for G.in.
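The layout described above can be sketched with a small hand-built matrix. The entry names, marker names, genotype values, and the "entry" label in the top-left cell are all hypothetical; this only illustrates the expected shape and is not taken from the package's example data.

```r
# Hypothetical toy data: 3 entries scored at 4 markers,
# coded -1/0/1 with NA for missing data
geno <- matrix(c( 1,  0, -1, NA,
                 -1, -1,  1,  0,
                  0,  1,  1, -1),
               nrow = 3, byrow = TRUE)
entries <- c("line_A", "line_B", "line_C")
markers <- c("mkr_1", "mkr_2", "mkr_3", "mkr_4")

# Marker names go in the first row, entry names in the first column.
# The top-left cell ("entry") is an arbitrary placeholder (assumption).
G.toy <- rbind(c("entry", markers),
               cbind(entries, geno))
dimnames(G.toy) <- NULL

G.toy
```

Because names and genotypes share one matrix, every cell is stored as character, which is what reading a file with header=FALSE would also tend to produce.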

y.in

Matrix of phenotypic data. First column contains entry (taxa) names found in G.in, regardless of whether the entry has a phenotype for any or all traits. Additional columns contain phenotypic data; column names should reflect the trait name(s).

TIP - Set header=TRUE within read.table or read.csv when importing a tab-delimited file containing data for y.in.

min.maf

Optional numeric indicating a minimum minor allele frequency (MAF) when filtering G.in. Markers with an MAF < min.maf will be removed. Default is 0.01 to remove monomorphic markers. Set to 0 for no filtering.
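As a rough illustration of what MAF filtering means for the -1/0/1 coding, the sketch below computes MAF per marker in base R and drops markers below the threshold. It is not the package's internal code; the toy matrix and the variable names are assumptions made for the example.

```r
# Toy genotype matrix (entries in rows, markers in columns), coded -1/0/1
geno <- matrix(c( 1, -1, -1,
                 -1, -1, -1,
                  0, -1,  1,
                 -1, -1, NA),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL, c("m1", "m2", "m3")))

# Frequency of the allele coded +1: (mean genotype + 1) / 2, ignoring NAs
freq <- (colMeans(geno, na.rm = TRUE) + 1) / 2

# The minor allele frequency is the smaller of the two allele frequencies
maf <- pmin(freq, 1 - freq)

# Keep markers whose MAF is at least min.maf; m2 is monomorphic and is dropped
min.maf <- 0.01
geno.filtered <- geno[, maf >= min.maf, drop = FALSE]

colnames(geno.filtered)
```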

mkr.cutoff

Optional numeric indicating the maximum missing data per marker when filtering G.in. Markers missing > mkr.cutoff data will be removed. Default is 0.50. Set to 1 for no filtering.

entry.cutoff

Optional numeric indicating the maximum missing genotypic data per entry allowed when filtering G.in. Entries missing > entry.cutoff marker data will be removed. Default is 0.50. Set to 1 for no filtering.
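The two missing-data filters, mkr.cutoff and entry.cutoff, can be pictured with the base R sketch below. The order shown (markers first, then entries) and the toy data are assumptions for illustration; the function may apply the filters differently internally.

```r
# Toy genotype matrix with missing calls (entries in rows, markers in columns)
geno <- matrix(c( 1, NA, -1,  0,
                 NA, NA, NA,  1,
                  0, NA,  1, -1,
                 -1,  1,  1,  0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("e", 1:4), paste0("m", 1:4)))

mkr.cutoff   <- 0.5
entry.cutoff <- 0.5

# Fraction of missing calls per marker; drop markers missing > mkr.cutoff
mkr.missing <- colMeans(is.na(geno))
geno.kept <- geno[, mkr.missing <= mkr.cutoff, drop = FALSE]

# Fraction of missing calls per entry, computed on the remaining markers;
# drop entries missing > entry.cutoff
entry.missing <- rowMeans(is.na(geno.kept))
geno.kept <- geno.kept[entry.missing <= entry.cutoff, , drop = FALSE]

dimnames(geno.kept)
```

Here marker m2 (75% missing) is removed first, after which entry e2 is missing two of its three remaining markers and is removed as well.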

remove.dups

Optional logical. If TRUE, duplicate entries in the genotype matrix, if present, will be removed. This step may be necessary for missing marker imputation (see impute). Default is TRUE.

impute

Options include c("EM", "mean", "pass"). By default (i.e. "EM"), missing genotypic data will be imputed after filtering via the EM algorithm implemented in the rrBLUP package (Endelman, 2011; Poland et al., 2012). If "mean", missing genotypic data will be imputed via the 'marker mean' method, also implemented in the rrBLUP package. Enter "pass" if a pre-filtered and imputed genotype matrix is provided to G.in.
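To make the 'marker mean' idea concrete, here is a minimal base R sketch that replaces each missing call with the mean of the observed calls at that marker. The helper name impute_marker_mean is hypothetical, and this is a stand-in for illustration rather than the rrBLUP implementation that x.val relies on.

```r
# Toy genotype matrix with missing calls (entries in rows, markers in columns)
geno <- matrix(c( 1, NA,
                 -1,  0,
                  1,  1,
                 NA, -1),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL, c("m1", "m2")))

# Hypothetical helper: fill each NA with the mean of its marker (column)
impute_marker_mean <- function(G) {
  mkr.means <- colMeans(G, na.rm = TRUE)
  idx <- which(is.na(G), arr.ind = TRUE)  # row/col positions of the NAs
  G[idx] <- mkr.means[idx[, "col"]]
  G
}

geno.imp <- impute_marker_mean(geno)
geno.imp
```

Imputed values are generally fractional (for example 1/3 for m1 here), so an imputed matrix passed with impute = "pass" need not contain only -1, 0, and 1.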

frac.train

Optional numeric indicating the fraction of the TP that is used to estimate marker effects (i.e. the training set) under cross-validation (CV) method 1 (see Details). The remaining \((1-frac.train)\) of the TP will then comprise the validation set. Default is 0.6.

nCV.iter

Optional integer indicating the number of times to iterate CV method 1 described in Details. Default is 100.

nFold

Optional integer. If a number is provided, denoting the number of "folds", then CV will be conducted using CV method 2 (see Details). Default is NULL, resulting in the default use of the CV method 1.

nFold.reps

Optional integer indicating the number of times CV method 2 is repeated. The CV accuracy returned is the average r of each rep. Default is 1.

return.estimates

Optional logical. If TRUE, additional items including the marker effect and beta estimates from the selected prediction model (i.e. the model with the highest CV accuracy) will be returned. Default is FALSE.

CV.burnIn

Optional integer argument used by BGLR when fitting Bayesian models. Default is 750.

CV.nIter

Optional integer argument used by BGLR (de los Campos and Rodriguez, 2014) when fitting Bayesian models. Default is 1500.

models

Optional character vector of the regression models to be used in CV and to estimate marker effects. Options include rrBLUP, BayesA, BayesB, BayesC, BL, and BRR; one or more may be included at a time. By default all models are tested.

saveAt

When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files that are produced by the BGLR function. Default is tempdir().
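One way to keep the temporary files in a known location is to build a dedicated directory and pass a path-plus-prefix string to saveAt. The directory name "xval_run" and the prefix "xval_" below are hypothetical choices for this sketch.

```r
# Hypothetical run directory inside the session's temporary directory
run.dir <- file.path(tempdir(), "xval_run")
dir.create(run.dir, showWarnings = FALSE, recursive = TRUE)

# Path plus file-name prefix; files written with this prefix
# would land inside run.dir
save.prefix <- file.path(run.dir, "xval_")

# The prefix would then be supplied as, e.g.:
# x.val(G.in = G.in_ex, y.in = y.in_ex, saveAt = save.prefix)
save.prefix
```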

Details

Two CV methods are available within PopVar:

CV method 1: During each iteration, a training (i.e. model training) set of size \(N*(frac.train)\), where N is the size of the TP, is randomly sampled from the TP, and the remainder of the TP is assigned to the validation set. The accuracies of individual models are expressed as the average Pearson correlation coefficient (r) between the genomic estimated breeding values (GEBVs) and observed phenotypic values in the validation set across all nCV.iter iterations. Due to its amenability to various TP sizes, CV method 1 is the default CV method in pop.predict.
CV method 2: nFold independent validation sets are sampled from the TP and predicted by the remainder. For example, if \(nFold = 10\) the TP will be split into 10 equal sets, each containing \(1/10\)-th of the TP, which will be predicted by the remaining \(9/10\)-ths of the TP. The accuracies of individual models are expressed as the average r between the GEBVs and observed phenotypic values in the validation set across all nFold folds. The process can be repeated nFold.reps times with nFold new independent sets being sampled each replication, in which case the reported prediction accuracies are averages across all folds and replications.
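The two sampling schemes can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, a ridge regression with an arbitrary fixed penalty stands in for the rrBLUP and Bayesian models, the helper names (fit_predict, cv_method1, cv_method2) are hypothetical, and rounding the training-set size is an assumption. It is not the code used inside x.val.

```r
set.seed(1)

# Simulated toy TP: 60 entries, 40 markers coded -1/0/1, one trait
n <- 60; m <- 40
X <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
y <- as.vector(X %*% rnorm(m, sd = 0.3) + rnorm(n))

# Stand-in prediction model: ridge regression with a fixed, arbitrary penalty
fit_predict <- function(X.tr, y.tr, X.val, lambda = 10) {
  mu <- mean(y.tr)
  beta <- solve(crossprod(X.tr) + lambda * diag(ncol(X.tr)),
                crossprod(X.tr, y.tr - mu))
  as.vector(mu + X.val %*% beta)
}

# CV method 1: repeated random training/validation splits
cv_method1 <- function(X, y, frac.train = 0.6, nCV.iter = 25) {
  n <- nrow(X)
  r <- replicate(nCV.iter, {
    train <- sample(n, size = round(n * frac.train))
    pred <- fit_predict(X[train, , drop = FALSE], y[train],
                        X[-train, , drop = FALSE])
    cor(pred, y[-train])
  })
  mean(r)  # average r across iterations
}

# CV method 2: nFold-fold CV, repeated nFold.reps times
cv_method2 <- function(X, y, nFold = 5, nFold.reps = 3) {
  n <- nrow(X)
  r <- replicate(nFold.reps, {
    fold <- sample(rep(seq_len(nFold), length.out = n))  # new folds each rep
    mean(sapply(seq_len(nFold), function(k) {
      val <- which(fold == k)
      pred <- fit_predict(X[-val, , drop = FALSE], y[-val],
                          X[val, , drop = FALSE])
      cor(pred, y[val])
    }))
  })
  mean(r)  # average r across folds and replications
}

r1 <- cv_method1(X, y)
r2 <- cv_method2(X, y)
c(method1 = r1, method2 = r2)
```

Method 1 lets the training fraction be chosen freely, whereas method 2 guarantees that every entry is predicted exactly once per replication.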

Examples


# \donttest{
## CV using method 1 with 25 iterations
CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25)
CV.mthd1$CVs

## CV using method 2 with 5 folds and 3 replications
x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3)
# }
