Wrapper that trains models on spectral data to predict reference values and reports model performance statistics.
test_spectra(
train.data,
num.iterations,
test.data = NULL,
pretreatment = 1,
k.folds = 5,
proportion.train = 0.7,
tune.length = 50,
model.method = "pls",
best.model.metric = "RMSE",
stratified.sampling = TRUE,
cv.scheme = NULL,
trial1 = NULL,
trial2 = NULL,
trial3 = NULL,
split.test = FALSE,
seed = 1,
verbose = TRUE,
wavelengths = deprecated(),
preprocessing = deprecated(),
output.summary = deprecated(),
rf.variable.importance = deprecated()
)
A list of 5 objects:
`model.list` is a list of trained model objects, one for each
pretreatment method specified by the pretreatment argument.
Each model is trained with all rows of train.data.
`summary.model.performance` is a data.frame containing summary
statistics across all model training iterations and pretreatments.
See below for a description of the summary statistics provided.
`model.performance` is a data.frame containing performance
statistics for each iteration of model training separately (see below).
`predictions` is a data.frame containing both reference and
predicted values for each test set entry in each iteration of
model training.
`importance` is a data.frame containing variable importance
results for each wavelength at each iteration of model training.
If model.method is not "pls" or "rf", this list item is NULL.
Summary statistics in the `summary.model.performance` and
`model.performance` data.frames include:
Tuned parameters depending on the model algorithm:
Best.n.comp, the best number of components
Best.ntree, the best number of trees in an RF model
Best.mtry, the best number of variables to include at every decision point in an RF model
RMSECV, the root mean squared error of cross-validation
R2cv, the coefficient of multiple determination of cross-validation for PLSR models
RMSEP, the root mean squared error of prediction
R2p, the squared Pearson’s correlation between predicted and observed test set values
RPD, the ratio of standard deviation of observed test set values to RMSEP
RPIQ, the ratio of performance to interquartile difference
CCC, the concordance correlation coefficient
Bias, the average difference between the predicted and observed values
SEP, the standard error of prediction
R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
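The test set statistics listed above can be sketched in base R as follows. This is an illustration of the definitions only, not the package's internal code: the `prediction_stats()` helper and the example vectors are hypothetical, and the denominators used for SEP (n - 1) and CCC (population moments, as in Lin's concordance correlation coefficient) are assumptions.

```r
# Hypothetical observed (reference) and predicted test set values
observed  <- c(30.1, 34.5, 28.7, 41.2, 36.8, 33.0, 39.4, 31.6)
predicted <- c(31.0, 33.8, 29.9, 39.5, 37.1, 32.2, 40.3, 30.4)

# Hypothetical helper; not part of the package
prediction_stats <- function(observed, predicted) {
  n     <- length(observed)
  resid <- predicted - observed
  rmsep <- sqrt(mean(resid^2))
  bias  <- mean(resid)
  # Bias-corrected standard error of prediction (n - 1 denominator assumed)
  sep   <- sqrt(sum((resid - bias)^2) / (n - 1))
  # Concordance correlation coefficient using population (1/n) moments
  s_op  <- mean((observed - mean(observed)) * (predicted - mean(predicted)))
  s_o2  <- mean((observed - mean(observed))^2)
  s_p2  <- mean((predicted - mean(predicted))^2)
  ccc   <- 2 * s_op / (s_o2 + s_p2 + (mean(observed) - mean(predicted))^2)
  c(
    RMSEP = rmsep,
    R2p   = cor(observed, predicted)^2,
    RPD   = sd(observed) / rmsep,
    RPIQ  = IQR(observed) / rmsep,
    CCC   = ccc,
    Bias  = bias,
    SEP   = sep,
    R2sp  = cor(observed, predicted, method = "spearman")^2
  )
}

stats <- prediction_stats(observed, predicted)
round(stats, 3)
```

Because RMSEP combines systematic and random error, comparing it with Bias and SEP shows how much of the prediction error is a constant offset.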
`train.data`: data.frame object of spectral data for input into a
spectral prediction model. The first column contains unique identifiers, the
second contains reference values, followed by spectral columns. Include no
other columns to the right of the spectra! Column names of spectra must start
with "X" and the reference column must be named "reference".
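A minimal sketch of a data.frame in this layout, built from simulated values. The sample identifiers, wavelengths, and numbers are hypothetical; only the column order and naming rules come from the description above.

```r
set.seed(1)
n.samples   <- 6
wavelengths <- seq(350, 370, by = 5)  # hypothetical wavelengths

# Simulated spectra with column names starting with "X"
spectra <- matrix(runif(n.samples * length(wavelengths)),
                  nrow = n.samples,
                  dimnames = list(NULL, paste0("X", wavelengths)))

train.data <- data.frame(
  unique.id = paste0("sample_", seq_len(n.samples)),
  reference = rnorm(n.samples, mean = 35, sd = 4),
  spectra
)

# Checks mirroring the layout rules: reference in column 2, unique
# identifiers, and spectral columns occupying every remaining column
spectral.cols <- grep("^X", names(train.data))
stopifnot(
  names(train.data)[2] == "reference",
  !anyDuplicated(train.data$unique.id),
  identical(spectral.cols, 3:ncol(train.data))
)
names(train.data)
```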
`num.iterations`: Number of training iterations to perform.
`test.data`: data.frame with the same specifications as train.data. Use
if a specific test set is desired for hyperparameter tuning. If NULL, the
function will automatically train with a stratified sample of 70%. Default
is NULL.
`pretreatment`: Number or list of numbers 1:13 corresponding to the desired pretreatment method(s):
1. Raw data (default)
2. Standard normal variate (SNV)
3. SNV and first derivative
4. SNV and second derivative
5. First derivative
6. Second derivative
7. Savitzky–Golay filter (SG)
8. SNV and SG
9. Gap-segment derivative (window size = 11)
10. SG and first derivative (window size = 5)
11. SG and first derivative (window size = 11)
12. SG and second derivative (window size = 5)
13. SG and second derivative (window size = 11)
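To illustrate what two of these pretreatments do to a spectrum, the sketch below applies SNV and a simple first derivative in base R. It is not the implementation used by pretreat_spectra, which may rely on dedicated derivative routines. The `snv()` and `first_derivative()` helpers and the example matrix are hypothetical, and the derivative here is a plain difference between adjacent wavelengths.

```r
# Hypothetical spectra: 3 samples x 6 wavelengths
spectra <- rbind(
  c(0.10, 0.12, 0.15, 0.19, 0.24, 0.30),
  c(0.20, 0.23, 0.27, 0.32, 0.38, 0.45),
  c(0.05, 0.06, 0.08, 0.11, 0.15, 0.20)
)
colnames(spectra) <- paste0("X", seq(350, 375, by = 5))

# SNV: center and scale each spectrum (row) by its own mean and sd
snv <- function(x) {
  t(apply(x, 1, function(row) (row - mean(row)) / sd(row)))
}

# First derivative as a finite difference between adjacent wavelengths
first_derivative <- function(x) {
  t(apply(x, 1, diff))
}

snv.spectra <- snv(spectra)
snv.d1      <- first_derivative(snv.spectra)  # SNV followed by first derivative
dim(snv.d1)
```

Note that differencing removes one column, so derivative pretreatments return fewer spectral columns than the raw data.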
`k.folds`: Number indicating the number of folds for k-fold cross-validation during model training. Default is 5.
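As a rough illustration of k-fold cross-validation, the sketch below assigns samples to folds at random and shows the fit and validation set sizes for each fold. The sample count is hypothetical, and the fold assignment performed inside the function may differ.

```r
set.seed(1)
n.samples <- 23  # hypothetical number of training samples
k.folds   <- 5

# Randomly assign each training sample to one of k folds of near-equal size
fold.id <- sample(rep(seq_len(k.folds), length.out = n.samples))

# Each fold is held out once while the model is fit to the remaining folds
for (k in seq_len(k.folds)) {
  holdout <- which(fold.id == k)
  fit.set <- which(fold.id != k)
  cat(sprintf("Fold %d: fit on %d samples, validate on %d\n",
              k, length(fit.set), length(holdout)))
}
```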
`proportion.train`: Fraction of samples to include in the training set. Default is 0.7.
`tune.length`: Number delineating the search space for tuning of the PLSR
hyperparameter ncomp. Must be set to 5 when using the random forest
algorithm (model.method == "rf"). Default is 50.
`model.method`: Model type to use for training. Valid options include:
"pls": Partial least squares regression (Default)
"rf": Random forest
"svmLinear": Support vector machine with linear kernel
"svmRadial": Support vector machine with radial kernel
`best.model.metric`: Metric used to decide which model is best. Must be either "RMSE" or "Rsquared". Default is "RMSE".
`stratified.sampling`: If TRUE, training and test sets will be
selected using stratified random sampling. This argument is only used if
test.data == NULL. Default is TRUE.
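One way to picture stratified random sampling on a continuous reference value is to bin the samples by quantile and draw the training fraction within each bin. The sketch below does this in base R. The binning into quartiles and the simulated reference values are assumptions for illustration; the partitioning routine actually used by the function may differ.

```r
set.seed(1)
reference        <- rnorm(100, mean = 35, sd = 5)  # hypothetical reference values
proportion.train <- 0.7

# Bin samples into quartiles of the reference value
breaks <- quantile(reference, probs = seq(0, 1, by = 0.25))
strata <- cut(reference, breaks = breaks, include.lowest = TRUE)

# Draw the training fraction within each bin
train.idx <- unlist(
  lapply(split(seq_along(reference), strata), function(idx) {
    idx[sample.int(length(idx), size = round(proportion.train * length(idx)))]
  }),
  use.names = FALSE
)
test.idx <- setdiff(seq_along(reference), train.idx)

# Realized training fraction
length(train.idx) / length(reference)
```

Sampling within bins keeps the range of reference values in the training and test sets similar, which a simple random split does not guarantee for small data sets.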
`cv.scheme`: A cross-validation (CV) scheme from Jarquín et al., 2017. Default is NULL.
Options for cv.scheme include:
"CV1": untested lines in tested environments
"CV2": tested lines in tested environments
"CV0": tested lines in untested environments
"CV00": untested lines in untested environments
`trial1`: data.frame object that is for use only when
cv.scheme is provided. Contains the trial to be tested in subsequent
model training functions. The first column contains unique identifiers, the
second contains genotypes, the third contains reference values, followed by
spectral columns. Include no other columns to the right of the spectra! Column
names of spectra must start with "X", the reference column must be named
"reference", and the genotype column must be named "genotype".
`trial2`: data.frame object that is for use only when
cv.scheme is provided. This data.frame contains a trial that has
genotypes overlapping with trial1 but grown in a different
site/year (different environment). Formatting must be consistent with
trial1.
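Before passing trials to a CV scheme, it can help to confirm that the two data.frames share the same layout and to see which genotypes overlap. A minimal base R sketch with hypothetical trial data:

```r
# Hypothetical trials sharing some genotypes across two environments
trial1 <- data.frame(
  unique.id = paste0("t1_", 1:4),
  genotype  = c("G1", "G2", "G3", "G4"),
  reference = c(30.2, 35.1, 28.9, 40.3),
  X350      = c(0.11, 0.14, 0.10, 0.18),
  X355      = c(0.12, 0.16, 0.11, 0.20)
)
trial2 <- data.frame(
  unique.id = paste0("t2_", 1:4),
  genotype  = c("G2", "G3", "G5", "G6"),
  reference = c(33.7, 27.5, 38.0, 31.4),
  X350      = c(0.13, 0.09, 0.17, 0.12),
  X355      = c(0.15, 0.10, 0.19, 0.13)
)

# Formatting must match between trials
stopifnot(identical(names(trial1), names(trial2)))

# Genotypes present in both environments
shared <- intersect(trial1$genotype, trial2$genotype)
shared
```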
`trial3`: data.frame object that is for use only when
cv.scheme is provided. This data.frame contains a trial that may or
may not contain genotypes that overlap with trial1. Formatting must
be consistent with trial1.
`split.test`: Boolean that allows for a fixed training set and a split
test set. For example, train a model on data from two breeding programs and a
stratified subset (70%) of a third, and test on the remaining samples
(30%) of the third. If FALSE, the entire provided test set
test.data will remain as a testing set or, if none is provided, 30%
of the provided train.data will be used for testing. Default is
FALSE.
`seed`: Integer to be used internally as input for set.seed().
Only used if stratified.sampling = TRUE. In all other cases, seed
is set to the current iteration number. Default is 1.
`verbose`: If TRUE, the number of rows removed through filtering
will be printed to the console. Default is TRUE.
`wavelengths`: DEPRECATED. wavelengths is no
longer supported; this information is now inferred from the train.data
column names.
`preprocessing`: DEPRECATED. Please use
pretreatment to specify the specific pretreatment(s) to test.
For behavior identical to that of preprocessing = TRUE, set
pretreatment = 1:13.
`output.summary`: DEPRECATED. output.summary = FALSE
is no longer supported; a summary of output is always returned alongside
the full performance statistics.
`rf.variable.importance`: DEPRECATED.
rf.variable.importance = FALSE is no longer supported; variable
importance results are always returned if model.method is
set to "pls" or "rf".
Jenna Hershberger jmh579@cornell.edu
Calls pretreat_spectra, format_cv,
and train_spectra functions.
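library(magrittr)
library(waves) # provides test_spectra() and the ikeogu.2017 example data

ikeogu.2017 %>%
  # Rename columns to the names test_spectra() expects
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  # Keep only the identifier, reference, and spectral columns
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  test_spectra(
    train.data = .,
    tune.length = 3,
    num.iterations = 3,
    pretreatment = 1
  )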