Use the k-fold cross-validation method to calculate parameter inputs that optimize benchmark performance while attempting to account for out-of-sample error.

Usage
cvFPM(
data,
paramList,
FN_crit = seq(0.1, 0.9, 0.05),
k = 5,
seed = NULL,
plot = TRUE,
tryStop = 10,
simplify = TRUE,
which = c(1, 2),
...
...)

Value

A list with 1 or 3 objects (depending on whether simplify = TRUE); these include: 1) optim_FN, the optimized FN_crit value; 2) CV_OR, a detailed breakdown of overall reliability values for each FN_crit value; and 3) CV_FPM, floating percentile model benchmark statistics based on all cross-validation runs for each FN_crit.
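As a brief illustration (a hypothetical sketch; it assumes the three objects above are returned as named list elements when simplify = FALSE):

# hypothetical: inspect the full output when simplify = FALSE
paramList <- c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
out <- cvFPM(data = h.tristate, paramList = paramList, simplify = FALSE)
out$optim_FN  # optimized FN_crit value
out$CV_OR     # overall reliability for each FN_crit
out$CV_FPM    # FPM benchmark statistics across cross-validation runs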
Arguments

data: data.frame containing, at a minimum, chemical concentrations as columns and a logical Hit column classifying toxicity

paramList: character vector of column names of the chemical concentration variables in data

FN_crit: numeric vector of values between 0 and 1 indicating the false negative threshold for floating percentile model benchmark selection (default = seq(0.1, 0.9, 0.05))

k: numeric value > 1 indicating the number of folds to use in cross-validation (default = 5)

seed: random seed to set for reproducible results (default = NULL, i.e., no seed)

plot: logical; whether to plot the output of cvFPM (default = TRUE)

tryStop: number of times the cross-validation algorithm will attempt to run before ending (see Details; default = 10)

simplify: logical; whether to return just the optimized FN_crit value or more detailed diagnostic information (default = TRUE)

which: numeric or character indicating which type of plot to generate (see Details; default = c(1, 2))

...: additional arguments passed to chemSig and FPM
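For orientation, a minimal sketch of the expected input structure (the chemical columns and values here are hypothetical; only the logical Hit column is required by name):

# toy data.frame: chemical concentration columns plus a logical Hit column
dat <- data.frame(
  Cd  = runif(20, 0, 5),
  Zn  = runif(20, 10, 200),
  Hit = sample(c(TRUE, FALSE), 20, replace = TRUE)
)
paramList <- c("Cd", "Zn")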
Details

cvFPM allows users to "tune" the FN_crit argument of FPM. This is achieved by splitting the empirical dataset into "test" and "training" subsets, calculating benchmarks from the training subset, and then calculating the benchmarks' prediction errors using the test subset. This process is repeated k times (once per fold), and the results are summarized statistically. Lastly, the process is repeated for each FN_crit value specified by the user, resulting in comparable statistics for each FN_crit. The console output indicates which FN_crit value consistently produced the optimal benchmarks (meaning the highest overall reliability or the most balanced error rates).
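Conceptually, the resampling resembles the following generic k-fold loop (a simplified sketch of the idea, using the toy dat from above, not cvFPM's internal code):

# assign each row of dat to one of k roughly equal folds
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))
for (i in seq_len(k)) {
  train <- dat[folds != i, ]  # benchmarks are derived from the training subset
  test  <- dat[folds == i, ]  # prediction errors are evaluated on the test subset
}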
By setting plot = TRUE (the default), the outcome of cross-validation can be visualized over the range of FN_crit values considered. Visualizing the results
can inform the user about variability in the cross-validation process, ranges of potentially reasonable FN_crit values, etc.
cvFPM does not currently support optimization of the alpha parameter of FPM;
optimFPM allows the user to optimize alpha, but only using the empirical data (not through cross-validation).
Errors may be encountered if the value of k is set too high or too low, leaving cvFPM unable
to generate meaningful subsets for testing and for floating percentile model calculation. Groups for subsetting are applied roughly
evenly within the cross-validation method, so it is reasonable to expect any given test subset to contain approximately
ceiling(nrow(data)/k) samples, with nrow(data) - ceiling(nrow(data)/k) samples making up the training subset. If a dataset with a large number
of samples still generates an error, consider increasing the tryStop value and rerunning cvFPM. The easiest way
to avoid this type of error is to keep k low relative to nrow(data) (bearing in mind that k must be > 1).
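For example, the expected subset sizes can be checked directly before running cvFPM:

n <- nrow(h.tristate)  # size of the example dataset
k <- 5
ceiling(n / k)         # approximate test subset size
n - ceiling(n / k)     # approximate training subset size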
The which argument specifies which of two plots to generate when plot = TRUE: the optimization results based on
the overall reliability metric, or those based on balancing the rates of false positives and false negatives. The default input
to which is c(1, 2) (i.e., both plots), but flexible character inputs can also be used, for example which = "OR" or which = "balanced".
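For example, either of the following calls (using the paramList defined in the Examples below) should restrict output to the overall reliability plot:

cvFPM(data = h.tristate, paramList = paramList, which = 1)
cvFPM(data = h.tristate, paramList = paramList, which = "OR")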
See Also

chemSig, FPM, seq
paramList = c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
cvFPM(data = h.tristate, paramList = paramList, FN_crit = seq(0.1, 0.9, 0.1), which = "OR")