x.val performs cross-validation (CV) to estimate the accuracy of genome-wide prediction (otherwise known as genomic selection) for a specific training population (TP), i.e. a set of individuals for which phenotypic and genotypic data are available. Cross-validation can be conducted via one of two methods within x.val; see Details for more information.
NOTE - x.val, specifically BGLR, writes and reads files to disk, so it is highly recommended to set your working directory.
x.val(
G.in = NULL,
y.in = NULL,
min.maf = 0.01,
mkr.cutoff = 0.5,
entry.cutoff = 0.5,
remove.dups = TRUE,
impute = "EM",
frac.train = 0.6,
nCV.iter = 100,
nFold = NULL,
nFold.reps = 1,
return.estimates = FALSE,
CV.burnIn = 750,
CV.nIter = 1500,
models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"),
saveAt = tempdir()
)
A list containing:
CVs
A dataframe
of CV results for each trait/model combination specified
If return.estimates
is TRUE
the following additional items will be returned:
models.used
A list
of the models chosen to estimate marker effects for each trait
mkr.effects
A vector
of marker effect estimates for each trait generated by the respective prediction model used
betas
A list
of beta values for each trait generated by the respective prediction model used
Matrix
of genotypic data. First row contains marker names and the first column contains entry (taxa) names. Genotypes should be coded as follows:
1
: homozygous for minor allele
0
: heterozygous
-1
: homozygous for major allele
NA
: missing data
Imputed genotypes can be passed, see impute
below for details
TIP - Set header=FALSE
within read.table
or read.csv
when importing a tab-delimited file containing data for G.in
.
Matrix
of phenotypic data. First column contains entry (taxa) names found in G.in
, regardless of whether the entry has a phenotype for any or all traits. Additional columns contain phenotypic data; column names should reflect the trait name(s). TIP - Set header=TRUE
within read.table
or read.csv
when importing a tab-delimited file containing data for y.in.
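The two import tips above can be sketched with base R as follows. The file contents, marker names, and trait names are hypothetical and are inlined as text so the example is self-contained; with real data you would pass a file path to read.table instead of the text argument.

```r
# Hypothetical tab-delimited contents (assumption: not data shipped with PopVar).
geno_txt  <- "entry\tm1\tm2\tm3\nline1\t1\t0\t-1\nline2\t-1\tNA\t1\n"
pheno_txt <- "entry\tyield\theight\nline1\t5.2\t90\nline2\tNA\t85\n"

# header = FALSE keeps the marker names as the first ROW of the matrix,
# which is the layout described for G.in.
G.in <- as.matrix(read.table(text = geno_txt, header = FALSE, sep = "\t",
                             stringsAsFactors = FALSE))

# header = TRUE turns the first row into column names, so the trait names
# become the column names described for y.in.
y.in <- read.table(text = pheno_txt, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)

G.in[1, ]        # marker names sit in the first row
colnames(y.in)   # trait names sit in the column names
```

Note that read.table returns a data frame for y.in; whether it needs converting before being passed on is not stated here, so check the object class against what the function expects.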
Optional numeric
indicating a minimum minor allele frequency (MAF) when filtering G.in
. Markers with an MAF < min.maf
will be removed. Default is 0.01
to remove monomorphic markers. Set to 0
for no filtering.
Optional numeric
indicating the maximum missing data per marker when filtering G.in
. Markers missing > mkr.cutoff
data will be removed. Default is 0.50
. Set to 1
for no filtering.
Optional numeric
indicating the maximum missing genotypic data per entry allowed when filtering G.in
. Entries missing > entry.cutoff
marker data will be removed. Default is 0.50
. Set to 1
for no filtering.
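As a rough illustration of what the three filters above describe, the sketch below applies them to a toy numeric matrix coded -1/0/1. It is not PopVar's internal code: the order of the steps, the helper names, and the use of a plain numeric matrix (entries in rows, markers in columns, names stored as dimnames) are all assumptions made for the example.

```r
# Toy genotype matrix coded -1/0/1 with NA for missing data.
geno <- matrix(c( 1,  1, NA,  0,
                 -1,  1, NA, -1,
                 -1,  1,  1,  1,
                  1,  1, NA, -1,
                 -1,  1, -1,  0),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("line", 1:5), paste0("m", 1:4)))

min.maf      <- 0.01
mkr.cutoff   <- 0.5
entry.cutoff <- 0.5

# Frequency of the allele coded +1, folded to the minor allele frequency.
p   <- colMeans(geno + 1, na.rm = TRUE) / 2
maf <- pmin(p, 1 - p)

# Fraction of missing calls per marker.
mkr_missing <- colMeans(is.na(geno))

# Keep markers that pass both the MAF and the missing-data thresholds.
keep_mkr <- maf >= min.maf & mkr_missing <= mkr.cutoff
geno_f   <- geno[, keep_mkr, drop = FALSE]

# Then drop entries that are missing too much of the remaining marker data.
entry_missing <- rowMeans(is.na(geno_f))
geno_f <- geno_f[entry_missing <= entry.cutoff, , drop = FALSE]

colnames(geno_f)   # m2 (monomorphic) and m3 (60% missing) are removed
```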
Optional logical
. If TRUE
duplicate entries in the genotype matrix, if present, will be removed. This step may be necessary for missing marker imputation (see impute
). Default is TRUE
.
Options include c("EM", "mean", "pass")
. By default (i.e. "EM"
), after filtering, missing genotypic data will be imputed via the EM algorithm implemented in rrBLUP-package
(Endelman, 2011; Poland et al., 2012). If "mean"
missing genotypic data will be imputed via the 'marker mean' method, also implemented in rrBLUP-package
. Enter "pass"
if a pre-filtered and imputed genotype matrix is provided to G.in
.
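To make the 'marker mean' option concrete, here is a minimal base R sketch of the idea: each missing call is replaced by the mean of the observed calls for that marker. The function name impute_marker_mean is hypothetical, and this is an illustration of the concept rather than the rrBLUP implementation that the package relies on.

```r
# Toy matrix coded -1/0/1 with NA for missing data (entries in rows).
geno <- matrix(c( 1, NA, -1,
                 -1,  1, NA,
                  1,  1, -1,
                 NA,  1,  1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("line", 1:4), paste0("m", 1:3)))

# Hypothetical helper: fill each NA with the mean of its marker (column).
impute_marker_mean <- function(M) {
  mkr_means <- colMeans(M, na.rm = TRUE)
  miss <- which(is.na(M), arr.ind = TRUE)   # row/col positions of the NAs
  M[miss] <- mkr_means[miss[, "col"]]
  M
}

geno_imp <- impute_marker_mean(geno)
geno_imp
```

Imputed values are generally fractional (for example 1/3 for m1 here), which is expected when the mean of -1/0/1 calls is used.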
Optional numeric
indicating the fraction of the TP that is used to estimate marker effects (i.e. the training set) under cross-validation (CV) method 1 (see Details
). The remaining \((1-frac.train)\) of the TP will then comprise the validation set. Default is 0.6.
Optional integer
indicating the number of times to iterate CV method 1 described in Details
. Default is 100
.
Optional integer
. If a number is provided, denoting the number of "folds", then CV will be conducted using CV method 2 (see Details
). Default is NULL
, resulting in the use of CV method 1.
Optional integer
indicating the number of times CV method 2 is repeated. The CV accuracy returned is the average r across all reps. Default is 1
.
Optional logical
. If TRUE
additional items including the marker effect and beta estimates from the selected prediction model (i.e. highest CV accuracy) will be returned.
Optional integer
argument used by BGLR
when fitting Bayesian models. Default is 750
.
Optional integer
argument used by BGLR
(de los Campos and Rodriguez, 2014) when fitting Bayesian models. Default is 1500
.
Optional character vector
of the regression models to be used in CV and to estimate marker effects. Options include rrBLUP, BayesA, BayesB, BayesC, BL, BRR
; one or more may be included at a time. By default, all models are tested.
When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files
that are produced by the BGLR
function.
Two CV methods are available within PopVar
:
CV method 1
: During each iteration a training (i.e. model training) set of size \(N*(frac.train)\), where N is the size of the TP, will be randomly sampled from the TP, and the remainder of the TP is assigned to the validation set. The accuracies of individual models are expressed as the average Pearson correlation coefficient (r) between the genomic estimated breeding value (GEBV) and observed phenotypic values in the validation set across all nCV.iter
iterations. Due to its amenability to various TP sizes, CV method 1 is the default CV method in pop.predict
.
CV method 2
: nFold
independent validation sets are sampled from the TP and predicted by the remainder. For example, if \(nFold = 10\) the TP will be split into 10 equal sets, each containing \(1/10\)-th of the TP, which will be predicted by the remaining \(9/10\)-ths of the TP. The accuracies of individual models are expressed as the average (r) between the GEBV and observed phenotypic values in the validation set across all nFold
folds. The process can be repeated nFold.reps
times with nFold
new independent sets being sampled each replication, in which case the reported prediction accuracies are averages across all folds and replications.
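The two sampling schemes described above can be sketched in base R as follows. This is only an illustration of the splitting and averaging logic: the data are simulated, the helper names (ridge_predict, cv_method1, cv_method2) are hypothetical, the training-set size is rounded as an assumption, and a fixed-penalty ridge regression stands in for the rrBLUP and Bayesian models that the package actually fits.

```r
set.seed(1)

# Simulated stand-in data: 100 entries, 50 markers coded -1/0/1.
N <- 100; M <- 50
X <- matrix(sample(c(-1, 0, 1), N * M, replace = TRUE), nrow = N)
y <- as.vector(X %*% rnorm(M, sd = 0.3) + rnorm(N))

# Stand-in predictor: ridge regression with an arbitrary fixed penalty.
ridge_predict <- function(X_tr, y_tr, X_val, lambda = 10) {
  mu   <- mean(y_tr)
  beta <- solve(crossprod(X_tr) + diag(lambda, ncol(X_tr)),
                crossprod(X_tr, y_tr - mu))
  as.vector(mu + X_val %*% beta)
}

# CV method 1: repeated random training/validation splits.
cv_method1 <- function(X, y, frac.train = 0.6, nCV.iter = 25) {
  N <- nrow(X)
  r <- replicate(nCV.iter, {
    tr <- sample(N, round(N * frac.train))      # training set indices
    cor(ridge_predict(X[tr, ], y[tr], X[-tr, ]), y[-tr])
  })
  mean(r)                                        # average r over iterations
}

# CV method 2: nFold disjoint validation sets, optionally replicated.
cv_method2 <- function(X, y, nFold = 5, nFold.reps = 1) {
  N <- nrow(X)
  r <- replicate(nFold.reps, {
    fold <- sample(rep_len(seq_len(nFold), N))   # new fold labels each rep
    sapply(seq_len(nFold), function(k) {
      val <- which(fold == k)
      cor(ridge_predict(X[-val, ], y[-val], X[val, ]), y[val])
    })
  })
  mean(r)                                        # average over folds and reps
}

acc1 <- cv_method1(X, y)
acc2 <- cv_method2(X, y, nFold = 5, nFold.reps = 3)
```

Under method 1 the validation sets of different iterations can overlap, whereas under method 2 every entry is predicted exactly once per replication.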
## CV using method 1 with 25 iterations
CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25)
CV.mthd1$CVs
## CV using method 2 with 5 folds and 3 replications
x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3)