perGeno: Genomic prediction using glmnet, with a genotype-specific penalized regression model.

Description

.... These models can be fitted either on the original data or on the residuals of a model with only main effects.
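
The difference between the two fitting modes can be illustrated with a minimal base-R sketch. It is not the implementation of perGeno: the data are simulated, the index column temp and all object names are hypothetical, and ordinary least squares (lm) stands in for the penalized glmnet fit.

```r
# Toy multi-environment data (simulated; all names are illustrative)
set.seed(1)
dat <- expand.grid(G = paste0("g", 1:3), E = paste0("e", 1:5),
                   stringsAsFactors = FALSE)
envTemp <- c(e1 = 14, e2 = 17, e3 = 20, e4 = 23, e5 = 26)
dat$temp <- unname(envTemp[dat$E])
dat$Y <- 5 + 0.2 * dat$temp + rnorm(nrow(dat), sd = 0.5)

# Fitting on residuals: first remove the environmental main effects
mainFit <- lm(Y ~ E, data = dat)
dat$res <- residuals(mainFit)

# Genotype-specific regressions of the residuals on the index
# (lm is an unpenalized stand-in for glmnet)
slopesRes <- sapply(split(dat, dat$G),
                    function(d) coef(lm(res ~ temp, data = d))[["temp"]])

# Fitting on the original data: regress the trait directly on the index
slopesRaw <- sapply(split(dat, dat$G),
                    function(d) coef(lm(Y ~ temp, data = d))[["temp"]])
```

In this balanced toy example the residual-based slopes sum to zero over genotypes, so they can be read as deviations from the average response to the index, whereas the slopes fitted on the original data also contain that average response.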

Usage

perGeno(
  dat,
  Y,
  G,
  E,
  indices = NULL,
  indicesData = NULL,
  testEnv = NULL,
  weight = NULL,
  useRes = TRUE,
  outputFile = NULL,
  corType = c("pearson", "spearman"),
  partition = data.frame(),
  nfolds = 10,
  alpha = 1,
  scaling = c("train", "all", "no"),
  quadratic = FALSE,
  verbose = FALSE
)

Value

A list with the following elements:

predTrain: Vector with predictions for the training set (to do: Add the factors genotype and environment; make a data.frame)
predTest: Vector with predictions for the test set (to do: Add the factors genotype and environment; make a data.frame). To do: add estimated environmental main effects, not only predicted environmental main effects
mu: the estimated overall (grand) mean
envInfoTrain: The estimated environmental main effects, and the predicted effects, obtained when the former are regressed on the averaged indices, using penalized regression.
envInfoTest: The predicted environmental main effects for the test environments, obtained from penalized regression using the estimated main effects for the training environments and the averaged indices.
parGeno: data.frame containing the estimated genotypic main effects (first column) and sensitivities (subsequent columns)
testAccuracyEnv: a data.frame with the accuracy (r) for each test environment
trainAccuracyEnv: a data.frame with the accuracy (r) for each training environment
trainAccuracyGeno: a data.frame with the accuracy (r) for each genotype, averaged over the training environments
testAccuracyGeno: a data.frame with the accuracy (r) for each genotype, averaged over the test environments
RMSEtrain: The root mean squared error on the training environments
RMSEtest: The root mean squared error on the test environments
Y: The name of the trait that was predicted, i.e. the column name in dat that was used
G: The genotype label that was used, i.e. the argument G that was used
E: The environment label that was used, i.e. the argument E that was used
indices: The indices that were used, i.e. the argument indices that was used
lambdaOpt
pargeno
quadratic: The quadratic option that was used
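
As a rough illustration of the accuracy and RMSE elements above, the following base-R sketch computes a per-environment correlation and a root mean squared error from observed and predicted values. The values are simulated rather than produced by perGeno, the column names Env and r are hypothetical, and the function itself may aggregate differently.

```r
# Toy observed and predicted values (simulated; not output of perGeno)
set.seed(2)
obs <- data.frame(E = rep(c("e1", "e2", "e3"), each = 10),
                  Y = rnorm(30, mean = 10),
                  stringsAsFactors = FALSE)
obs$pred <- obs$Y + rnorm(30, sd = 0.3)

# Accuracy per environment: correlation between observed and predicted values
# (method = "spearman" would give the rank correlation instead)
rEnv <- sapply(split(obs, obs$E),
               function(d) cor(d$Y, d$pred, method = "pearson"))
accuracyEnv <- data.frame(Env = names(rEnv), r = unname(rEnv))

# Root mean squared error over all observations
rmse <- sqrt(mean((obs$Y - obs$pred)^2))
```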

Arguments

dat: A data.frame with data from multi-environment trials. Each row corresponds to a particular genotype in a particular environment. The data do not need to be balanced, i.e. an environment does not need to contain all genotypes. dat should contain the training as well as the test environments (see testEnv)
Y: The trait to be analyzed: either of type character, in which case it should be one of the column names in dat, or numeric, in which case the Yth column of dat will be analyzed.
G: The column in dat containing the factor genotype (either character or numeric).
E: The column in dat containing the factor environment (either character or numeric).
indices: The columns in dat containing the environmental indices (vector of type character). Alternatively, if the indices are always constant within environments (i.e. not genotype dependent), the environmental data can also be provided using the argument indicesData (see below).
indicesData: An optional data.frame containing environmental indices (covariates); one value for each environment and index. It should have the environment names as row names (corresponding to the names contained in dat$E); the column names are the indices. If the argument indices (see above) is also provided, indices will be ignored.
testEnv: vector (character). Data from these environments are not used for fitting the model. Accuracy is evaluated for training and test environments separately. The default is NULL, i.e. no test environments, in which case the whole data set is used for training. It is also possible that there are test environments without any data; in this case, no accuracy is reported for test environments (CHECK correctness).
weight: Numeric vector of length nrow(dat), specifying the weight (inverse variance) of each observation, used in glmnet. Default NULL, giving constant weights.
useRes: Indicates whether the genotype-specific regressions are to be fitted on the residuals of a model with main effects. If TRUE, the residuals of a model with environmental main effects are used; if FALSE, the regressions are fitted on the original data.
outputFile: The file name of the output files, without the .csv extension, which is added by the function. If not NULL, the prediction accuracies for training and test environments are written to separate files. If NULL, the output is not written to file.
corType: type of correlation: Pearson (default) or Spearman rank correlation.
partition: data.frame with columns E and partition. The column E should contain the training environments (type character); partition should be of type integer. Environments in the same fold should have the same integer value. Default is data.frame(), in which case the function uses leave-one-environment-out cross-validation. If NULL, the (inner) training sets used for cross-validation will be drawn randomly from all observations, ignoring the environment structure. In the latter case, the number of folds (nfolds) can be specified.
nfolds: Default 10. If partition == NULL, this can be used to specify the number of folds to be used in glmnet.
alpha: Type of penalty, as in glmnet (1 = LASSO, 0 = ridge; in between = elastic net). Default is 1.
scaling: determines how the environmental variables are scaled. "train": all data (test and training environments) are scaled using the mean and standard deviation in the training environments. "all": all data are scaled using the mean and standard deviation of all environments. "no": no scaling.
quadratic: logical; default FALSE. If TRUE, quadratic terms (i.e., squared indices) are added to the model.
verbose: logical; default FALSE. If TRUE, the accuracies per environment are printed on screen.
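
The layout expected for indicesData and partition, and the difference between scaling = "train" and scaling = "all", can be sketched in base R as follows. The environment names, the indices temp and rain, and their values are made up for illustration, and the scaling code only mirrors the description above; it is not taken from the function's source.

```r
# Toy environments: three for training, one for testing (illustrative names)
envs <- c("e1", "e2", "e3", "e4")
testEnv <- "e4"
trainEnvs <- setdiff(envs, testEnv)

# indicesData: one row per environment, one column per index,
# with the environment names as row names
indicesData <- data.frame(temp = c(14, 18, 22, 26),
                          rain = c(300, 450, 500, 350),
                          row.names = envs)

# partition: assign each training environment to a cross-validation fold;
# here e1 and e2 share fold 1 and e3 forms fold 2
partition <- data.frame(E = trainEnvs,
                        partition = c(1L, 1L, 2L),
                        stringsAsFactors = FALSE)

# scaling = "train": center and scale all environments with the mean and
# standard deviation computed from the training environments only
trainMean <- colMeans(indicesData[trainEnvs, , drop = FALSE])
trainSd <- apply(indicesData[trainEnvs, , drop = FALSE], 2, sd)
scaledTrain <- scale(indicesData, center = trainMean, scale = trainSd)

# scaling = "all": use the mean and standard deviation of all environments
scaledAll <- scale(indicesData)
```

With objects like these, and a data.frame dat whose trait, genotype and environment columns are named, say, "yield", "geno" and "env" (hypothetical names), a call could take the form perGeno(dat = dat, Y = "yield", G = "geno", E = "env", indicesData = indicesData, testEnv = testEnv, partition = partition, scaling = "train").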