TrainSpectralModel: Train a model to predict reference values with spectral data

Description

Trains spectral prediction models using one of several algorithms and sampling procedures.

Gets the mode of a set of numbers. Used when summarizing results within TrainSpectralModel().

Usage

TrainSpectralModel(df, num.iterations, test.data = NULL,
  tune.length = 50, model.method = "pls", output.summary = TRUE,
  return.model = FALSE, best.model.metric = "RMSE",
  rf.variable.importance = FALSE, stratified.sampling = TRUE,
  cv.scheme = NULL, trial1 = NULL, trial2 = NULL, trial3 = NULL,
  split.test = FALSE, verbose = TRUE)

Arguments

df

data.frame object. The first column contains unique identifiers and the second contains reference values, followed by spectral columns. Include no other columns to the right of the spectra! Column names of spectra must start with "X" and the reference column must be named "reference".
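
The layout rules above can be checked before training. The following is a minimal sketch in base R; the helper name `check_spectral_df` and the toy values are hypothetical and are not part of the package.

```r
# Hypothetical helper (not part of the package): checks the layout described above
check_spectral_df <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) >= 3)
  # The second column must hold the reference values
  ref_ok <- names(df)[2] == "reference"
  # Every column after the second must be a spectral column named "X..."
  spectra_ok <- all(startsWith(names(df)[-(1:2)], "X"))
  # The first column must contain unique identifiers
  ids_ok <- !anyDuplicated(df[[1]])
  ref_ok && spectra_ok && ids_ok
}

# Toy data: 4 samples, 3 wavelengths (all values are made up)
toy <- data.frame(
  sample.id = paste0("s", 1:4),
  reference = c(30.1, 35.4, 28.7, 40.2),
  X350 = c(0.51, 0.48, 0.55, 0.43),
  X351 = c(0.52, 0.49, 0.56, 0.44),
  X352 = c(0.53, 0.50, 0.57, 0.45)
)
check_spectral_df(toy)
```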

num.iterations

Number of training iterations to perform

test.data

data.frame with the same specifications as df. Use if a specific test set is desired for hyperparameter tuning. If NULL, the function will automatically train with a stratified sample of 70%. Default is NULL.

tune.length

Number delineating search space for tuning of the PLSR hyperparameter ncomp. Default is 50.

model.method

Model type to use for training. Valid options include:

"pls": Partial least squares regression (Default)
"rf": Random forest
"svmLinear": Support vector machine with linear kernel
"svmRadial": Support vector machine with radial kernel

output.summary

boolean that controls function output.

If TRUE, a summary df will be output (1st row = means, 2nd row = standard deviations). Default is TRUE.
If FALSE, entire results data frame will be output
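
To illustrate the difference between the two output shapes, here is a base R sketch that collapses a long-format results table into a mean row and a standard deviation row. The column names and values are made up for illustration and may not match the function's actual output.

```r
# Made-up long-format results: one row per training iteration
results_long <- data.frame(
  Iteration = 1:3,
  RMSEP = c(1.20, 1.35, 1.10),
  R2p   = c(0.82, 0.78, 0.85)
)

metrics <- results_long[, c("RMSEP", "R2p")]

# Collapse to two rows: means first, standard deviations second
results_summary <- rbind(
  Mean = sapply(metrics, mean),
  SD   = sapply(metrics, sd)
)
results_summary
```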

return.model

boolean that, if TRUE, causes the function to return the trained model in addition to the results data frame.

If TRUE, the function returns a list of [model, results].
If FALSE, returns results data frame without model. Default is FALSE.

best.model.metric

Metric used to decide which model is best. Must be either "RMSE" or "Rsquared". Default is "RMSE".

rf.variable.importance

boolean that controls whether random forest variable importance is returned.

If TRUE, model.method must be set to "rf". Returns a list with a model performance data.frame and a second data.frame with variable importance values for each wavelength for each training iteration (nrow = num.iterations, ncol = length(wavelengths)). If return.model is also TRUE, returns a list of three elements with the trained model first, model performance second, and variable importance last.
If FALSE, no variable importance is returned. Default is FALSE.

stratified.sampling

If TRUE, training and test sets will be selected using stratified random sampling. This term is only used if test.data == NULL. Default is TRUE.
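
As a rough illustration of stratified random sampling on a continuous reference trait, the sketch below bins made-up reference values into quartiles and draws 70% of each bin for training. This is an assumption about the general technique, written in base R; the function's internal sampling procedure may differ.

```r
set.seed(1)  # for a reproducible illustration

# Made-up reference values for 40 samples
reference <- rnorm(40, mean = 35, sd = 4)

# Bin the reference values into quartile strata
strata <- cut(reference,
              breaks = quantile(reference, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)

# Draw 70% of the row indices from each stratum for training
train_idx <- unlist(
  lapply(split(seq_along(reference), strata), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

# The remaining rows form the test set
test_idx <- setdiff(seq_along(reference), train_idx)
```

Sampling within strata keeps the training and test sets similar in their spread of reference values, which a simple random split does not guarantee for small data sets.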

cv.scheme

A cross-validation (CV) scheme from Jarquín et al., 2017. Options for cv.scheme include:

"CV1": untested lines in tested environments
"CV2": tested lines in tested environments
"CV0": tested lines in untested environments
"CV00": untested lines in untested environments

trial1

data.frame object that is for use only when cv.scheme is provided. Contains the trial to be tested in subsequent model training functions. The first column contains unique identifiers, second contains genotypes, third contains reference values, followed by spectral columns. Include no other columns to right of spectra! Column names of spectra must start with "X", reference column must be named "reference", and genotype column must be named "genotype".

trial2

data.frame object that is for use only when cv.scheme is provided. This data.frame contains a trial whose genotypes overlap with those in trial1 but that was grown in a different site/year (different environment). Formatting must be consistent with trial1.

trial3

data.frame object that is for use only when cv.scheme is provided. This data.frame contains a trial that may or may not contain genotypes that overlap with trial1. Formatting must be consistent with trial1.

split.test

boolean that allows for a fixed training set and a split test set. For example, train a model on data from two breeding programs and a stratified subset (70%) of a third, then test on the remaining samples (30%) of the third. If FALSE, the entire provided test set test.data will remain as a testing set or, if none is provided, 30% of the provided df will be used for testing. Default is FALSE.

verbose

If TRUE, the number of rows removed through filtering will be printed to the console. Default is TRUE.

vector.input

The mode of this vector of numbers will be calculated by this function

Value

data.frame with model performance statistics either in summary format (2 rows, one with the mean and one with the standard deviation of all training iterations) or in long format (number of rows = num.iterations). If return.model is TRUE, the trained model is returned as well; if FALSE (the default), only the results data.frame is returned. Included summary statistics:

Tuned parameters depending on the model algorithm:
- Best.n.comp, the best number of components
- Best.ntree, the best number of trees in an RF model
- Best.mtry, the best number of variables to include at every decision point in an RF model
RMSECV, the root mean squared error of cross-validation
R2cv, the coefficient of multiple determination of cross-validation for PLSR models
RMSEP, the root mean squared error of prediction
R2p, the squared Pearson's correlation between predicted and observed test set values
RPD, the ratio of standard deviation of observed test set values to RMSEP
RPIQ, the ratio of performance to interquartile difference
CCC, the concordance correlation coefficient
Bias, the average difference between the predicted and observed values
SEP, the standard error of prediction
R2sp, the squared Spearman's rank correlation between predicted and observed test set values
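
The test-set statistics listed above can be sketched in base R as follows. The observed and predicted values are made up, and the formulas follow common chemometrics definitions (for example, RPIQ as the interquartile range divided by RMSEP, and Lin's CCC with 1/n moments); these are assumptions, so the function's own calculations may differ in detail.

```r
# Made-up observed and predicted test-set values
observed  <- c(30.1, 35.4, 28.7, 40.2, 33.3, 37.8)
predicted <- c(31.0, 34.6, 29.5, 38.9, 33.9, 36.7)

err <- predicted - observed
n <- length(observed)

RMSEP <- sqrt(mean(err^2))                                # root mean squared error of prediction
R2p   <- cor(predicted, observed)^2                       # squared Pearson correlation
R2sp  <- cor(predicted, observed, method = "spearman")^2  # squared Spearman correlation
RPD   <- sd(observed) / RMSEP                             # SD of observed values over RMSEP
RPIQ  <- IQR(observed) / RMSEP                            # interquartile range over RMSEP
Bias  <- mean(err)                                        # mean of predicted minus observed
SEP   <- sqrt(sum((err - Bias)^2) / (n - 1))              # bias-corrected standard error

# Lin's concordance correlation coefficient, using population (1/n) moments
CCC <- 2 * mean((observed - mean(observed)) * (predicted - mean(predicted))) /
  (mean((observed - mean(observed))^2) +
     mean((predicted - mean(predicted))^2) +
     (mean(observed) - mean(predicted))^2)

round(c(RMSEP = RMSEP, R2p = R2p, RPD = RPD, RPIQ = RPIQ,
        CCC = CCC, Bias = Bias, SEP = SEP, R2sp = R2sp), 3)
```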

mode of the numbers in `vector.input`
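
A mode calculation of this kind can be sketched in base R as shown here. The name `mode_sketch` is hypothetical and this is not necessarily the package's own implementation; in this version, ties are resolved in favor of the value that appears first.

```r
# Hypothetical stand-in for a mode helper (not the package's own code)
mode_sketch <- function(vector.input) {
  unique.values <- unique(vector.input)
  # Count how often each unique value occurs, in order of first appearance
  counts <- tabulate(match(vector.input, unique.values))
  # which.max() returns the first maximum, so ties go to the earliest value
  unique.values[which.max(counts)]
}

mode_sketch(c(3, 7, 7, 2, 7, 3))
```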

Examples


# NOT RUN {
library(magrittr)
ikeogu.2017 %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::rename(reference = DMC.oven) %>%
  dplyr::select(sample.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  TrainSpectralModel(df = .,
                     tune.length = 3,
                     num.iterations = 3,
                     output.summary = TRUE,
                     return.model = FALSE,
                     best.model.metric = "RMSE",
                     stratified.sampling = TRUE)
# }
