trainControl: Control parameters for train

Description

Control the computational nuances of the train function

Usage

trainControl(method = "boot",
             number = ifelse(grepl("cv", method), 10, 25),
             repeats = ifelse(grepl("cv", method), 1, number),
             p = 0.75,
             search = "grid",
             initialWindow = NULL,
             horizon = 1,
             fixedWindow = TRUE,
             verboseIter = FALSE,
             returnData = TRUE,
             returnResamp = "final",
             savePredictions = FALSE,
             classProbs = FALSE,
             summaryFunction = defaultSummary,
             selectionFunction = "best",
             preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5),
             sampling = NULL,
             index = NULL,
             indexOut = NULL,
             timingSamps = 0,
             predictionBounds = rep(FALSE, 2),
             seeds = NA,
             adaptive = list(min = 5, alpha = 0.05,
                             method = "gls", complete = TRUE),
             trim = FALSE,
             allowParallel = TRUE)

Arguments

method

The resampling method: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV" (for repeated training/test splits), "none" (only fits one model to the entire t

number

Either the number of folds or number of resampling iterations

repeats

For repeated k-fold cross-validation only: the number of complete sets of folds to compute

verboseIter

A logical for printing a training log.

returnData

A logical for saving the data

returnResamp

A character string indicating how much of the resampled summary metrics should be saved. Values can be "final", "all" or "none"

savePredictions

an indicator of how much of the hold-out predictions for each resample should be saved. Values can be either "all", "final", or "none". A logical value can also be used that convert to "all" (for true) o

p

For leave-group out cross-validation: the training percentage

search

Either "grid" or "random", describing how the tuning parameter grid is determined. See details below.

initialWindow, horizon, fixedWindow

possible arguments to createTimeSlices

classProbs

a logical; should class probabilities be computed for classification models (along with predicted values) in each resample?

summaryFunction

a function to compute performance metrics across resamples. The arguments to the function should be the same as those in defaultSummary.

selectionFunction

the function used to select the optimal tuning parameter. This can be a name of the function or the function itself. See best for details and other options.

preProcOptions

A list of options to pass to preProcess. The type of pre-processing (e.g. center, scaling etc) is passed in via the preProc option in train.

sampling

a single character value describing the type of additional sampling that is conducted after resampling (usually to resolve class imbalances). Values are "none", "down", "up", "smote", or "rose"

index

a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.

indexOut

a list (the same length as index) that dictates which data are held-out for each resample (as integers). If NULL, then the unique set of samples not contained in index is used.

timingSamps

the number of training set samples that will be used to measure the time for predicting samples (zero indicates that the prediction time should not be estimated).

predictionBounds

a logical or numeric vector of length 2 (regression only). If logical, the predictions can be constrained to be within the limit of the training set outcomes. For example, a value of c(TRUE, FALSE) would only constrain the lower end of predic

seeds

an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes while a value of

adaptive

a list used when method is "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV". See Details below.

trim

a logical. If TRUE the final model in object$finalModel may have some components of the object removed to reduce the size of the saved object. The predict method will still work, but some other features of the model

allowParallel

if a parallel backend is loaded and available, should the function use it?
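
As an illustration of the index and indexOut arguments described above, the following minimal sketch builds three training/hold-out splits by hand using base R only. The 75% training fraction, the number of splits, and the "Resample" element names are assumptions chosen for the example, not requirements stated on this page.

```r
# Build three custom training/hold-out splits for the iris data.
set.seed(42)
n <- nrow(iris)

# Each list element holds the integer row numbers used for training
# in one resampling iteration (here, 75% of the rows).
index <- lapply(1:3, function(i) sort(sample.int(n, size = floor(0.75 * n))))
names(index) <- paste0("Resample", 1:3)  # illustrative names

# Held-out rows for each iteration: every row not used for training.
# This is the same set that is used when indexOut is left as NULL.
indexOut <- lapply(index, function(train_rows) setdiff(seq_len(n), train_rows))

lengths(index)
lengths(indexOut)
```

The two lists could then be supplied as trainControl(index = index, indexOut = indexOut).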

Value

An echo of the parameters specified

Details

When setting the seeds manually, the number of models being evaluated is required. This may not be obvious, as train does some optimizations for certain models. For example, when tuning over a PLS model, the only model that is fit is the one with the largest number of components. So if the model is being tuned over ncomp in 1:10, the only model fit is ncomp = 10. However, if the vector of integers used in the seeds argument is longer than actually needed, no error is thrown.
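
The bookkeeping above can be sketched in base R. The helper make_seeds is hypothetical (it is not part of caret); it assumes the layout used in the Examples section, namely one integer vector per resample with at least as many entries as models evaluated, followed by a single integer for the final model.

```r
# Hypothetical helper (not part of caret): builds a seeds list for
# B resampling iterations with up to M models evaluated in each one.
make_seeds <- function(B, M, max_seed = 10000L) {
  seeds <- vector(mode = "list", length = B + 1)
  # One integer vector per resample, one seed per model
  for (i in seq_len(B)) seeds[[i]] <- sample.int(max_seed, M)
  # A single seed for the final model fit
  seeds[[B + 1]] <- sample.int(max_seed, 1)
  seeds
}

set.seed(123)
# 10-fold CV repeated 5 times gives 50 resamples; 12 candidate models
seeds <- make_seeds(B = 10 * 5, M = 12)
length(seeds)
```

Because extra seeds do not cause an error, choosing M larger than the number of models actually fit is a safe default when that number is uncertain.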

Using method = "none" and specifying more than one model in train's tuneGrid or tuneLength arguments will result in an error.

With adaptive resampling (when method is "adaptive_cv", "adaptive_boot" or "adaptive_LGOCV"), the full set of resamples is not run for each model. As resampling continues, a futility analysis is conducted and models with a low probability of being optimal are removed. These features are experimental. See Kuhn (2014) for more details. The options for this procedure are:

min: the minimum number of resamples used before models are removed
alpha: the confidence level of the one-sided intervals used to measure futility
method: either generalized least squares (method = "gls") or a Bradley-Terry model (method = "BT")
complete: if a single parameter value is found before the end of resampling, should the full set of resamples be computed for that parameter?
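
A minimal sketch of how these options might be passed, assuming the caret package is installed. The values are the defaults shown in the Usage section, written out explicitly; the object name ctrl_adaptive is illustrative.

```r
library(caret)

# Adaptive cross-validation with the futility-analysis options spelled out.
ctrl_adaptive <- trainControl(method = "adaptive_cv",
                              number = 10,
                              repeats = 5,
                              adaptive = list(min = 5,         # resamples before any model is dropped
                                              alpha = 0.05,    # level for the one-sided intervals
                                              method = "gls",  # or "BT" for a Bradley-Terry model
                                              complete = TRUE))
```

The resulting object is passed to train through its trControl argument, as in the Examples section.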

The option search = "grid" uses the default grid search routine. When search = "random", a random search procedure is used (Bergstra and Bengio, 2012). See http://topepo.github.io/caret/random.html for details and an example.

References

Bergstra and Bengio (2012), "Random Search for Hyper-Parameter Optimization", Journal of Machine Learning Research, 13(Feb):281-305

Kuhn (2014), "Futility Analysis in the Cross-Validation of Machine Learning Models", http://arxiv.org/abs/1405.6974

Package website for subsampling: http://topepo.github.io/caret/sampling.html

Examples

## Do 5 repeats of 10-Fold CV for the iris data. We will fit
## a KNN model that evaluates 12 values of k and set the seed
## at each iteration.

set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)

## For the last model:
seeds[[51]] <- sample.int(1000, 1)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     seeds = seeds)

set.seed(1)
mod <- train(Species ~ ., data = iris,
             method = "knn",
             tuneLength = 12,
             trControl = ctrl)


ctrl2 <- trainControl(method = "adaptive_cv",
                      repeats = 5,
                      verboseIter = TRUE,
                      seeds = seeds)

set.seed(1)
mod2 <- train(Species ~ ., data = iris,
              method = "knn",
              tuneLength = 12,
              trControl = ctrl2)
