Use the k-fold cross-validation method to calculate parameter inputs that optimize benchmark performance while attempting to account for out-of-sample error.

Usage
cvFPM(
data,
paramList,
FN_crit = seq(0.1, 0.9, 0.05),
k = 5,
seed = NULL,
plot = TRUE,
tryStop = 10,
simplify = TRUE,
which = c(1, 2),
...
...)

Value

A list with 1 or 3 objects (depending on whether simplify = TRUE); these include: 1) optim_FN, the optimized FN_crit value; 2) CV_OR, a detailed breakdown of overall reliability values for each FN_crit value; and 3) CV_FPM, floating percentile model benchmark statistics based on all cross-validation runs for each FN_crit.
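As a brief illustration (a hypothetical sketch; it assumes the three objects above are returned as named list elements when simplify = FALSE):

# hypothetical: inspect the full output when simplify = FALSE
paramList <- c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
out <- cvFPM(data = h.tristate, paramList = paramList, simplify = FALSE)
out$optim_FN  # optimized FN_crit value
out$CV_OR     # overall reliability for each FN_crit
out$CV_FPM    # FPM benchmark statistics across cross-validation runs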
Arguments

data: data.frame containing, at a minimum, chemical concentrations as columns and a logical Hit column classifying toxicity

paramList: character vector of column names of the chemical concentration variables in data

FN_crit: numeric vector of values between 0 and 1 indicating the false negative threshold for floating percentile model benchmark selection (default = seq(0.1, 0.9, 0.05))

k: numeric value > 1 indicating the number of folds to use in cross-validation (default = 5)

seed: random seed to set for reproducible results (default = NULL, i.e., no seed)

plot: logical; whether to plot the output of cvFPM (default = TRUE)

tryStop: number of times the cross-validation algorithm will attempt to run before ending (see Details; default = 10)

simplify: logical; whether to return just the optimized FN_crit value or more detailed diagnostic information (default = TRUE)

which: numeric or character indicating which type of plot to generate (see Details; default = c(1, 2))

...: additional arguments passed to chemSig and FPM
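For orientation, a minimal sketch of the expected input structure (the chemical columns and values here are hypothetical; only the logical Hit column is required by name):

# toy data.frame: chemical concentration columns plus a logical Hit column
dat <- data.frame(
  Cd  = runif(20, 0, 5),
  Zn  = runif(20, 10, 200),
  Hit = sample(c(TRUE, FALSE), 20, replace = TRUE)
)
paramList <- c("Cd", "Zn")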
Details

cvFPM allows users to "tune" the FN_crit argument of FPM. This is achieved by splitting the empirical dataset into "test" and "training" subsets, calculating benchmarks from the training subset, and then calculating the benchmarks' prediction errors using the test subset. This process is repeated k times (once per fold), and the results are summarized statistically. Lastly, the process is repeated for each FN_crit value specified by the user, resulting in comparable statistics for each FN_crit. The console output indicates which FN_crit value consistently produced the optimal benchmarks (meaning the highest overall reliability or the most balanced error rates).
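Conceptually, the resampling resembles the following generic k-fold loop (a simplified sketch of the idea, using the toy dat from above, not cvFPM's internal code):

# assign each row of dat to one of k roughly equal folds
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))
for (i in seq_len(k)) {
  train <- dat[folds != i, ]  # benchmarks are derived from the training subset
  test  <- dat[folds == i, ]  # prediction errors are evaluated on the test subset
}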
By setting plot = TRUE (the default), the outcome of cross-validation can be visualized over the range of FN_crit values considered. Visualizing the results
can inform the user about variability in the cross-validation process, ranges of potentially reasonable FN_crit values, etc.
cvFPM does not currently support optimization of the alpha parameter of FPM;
optimFPM allows the user to optimize alpha, but only using the empirical data (not through cross-validation).
Errors may be encountered if the value of k is set too high or too low, leaving cvFPM unable
to generate meaningful subsets for testing and for floating percentile model calculation. Groups for subsetting are applied roughly
evenly within the cross-validation method, so it is reasonable to expect any given test subset to contain approximately
ceiling(nrow(data)/k) samples, with nrow(data) - ceiling(nrow(data)/k) samples making up the training subset. If a dataset with a large number
of samples still generates an error, consider increasing the tryStop value and rerunning cvFPM. The easiest way
to avoid this type of error is to keep k low relative to nrow(data) (bearing in mind that k must be > 1).
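For example, the expected subset sizes can be checked directly before running cvFPM:

n <- nrow(h.tristate)  # size of the example dataset
k <- 5
ceiling(n / k)         # approximate test subset size
n - ceiling(n / k)     # approximate training subset size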
The which argument specifies which of two plots to generate when plot = TRUE: the optimization results based on
the overall reliability metric, or those based on balancing the rates of false positives and false negatives. The default input
to which is c(1, 2) (i.e., both plots), but flexible character inputs can also be used, for example which = "OR" or which = "balanced".
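For example, either of the following calls (using the paramList defined in the Examples below) should restrict output to the overall reliability plot:

cvFPM(data = h.tristate, paramList = paramList, which = 1)
cvFPM(data = h.tristate, paramList = paramList, which = "OR")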
See Also

chemSig, FPM, seq
paramList = c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
cvFPM(data = h.tristate, paramList = paramList, FN_crit = seq(0.1, 0.9, 0.1), which = "OR")