Control parameters for GA and SA feature selection
Many of these options are the same as those described for
trainControl. More extensive documentation and examples
can be found on the caret website at
The functions component contains the information about how the model
should be fit and summarized. It also contains the elements needed for the
GA and SA modules (e.g. cross-over).
The elements of
functions that are the same for GAs and SAs are:
fit, with arguments
..., is used to fit the classification or regression model
pred, with arguments
x, predicts new samples
fitness_intern, with arguments
p, summarizes performance for the internal estimates of fitness
fitness_extern, with arguments
model, summarizes performance using the externally held-out samples
selectIter, with arguments
maximize, determines the best search iteration for feature selection.
The elements of
functions specific to genetic algorithms are:
initial, with arguments
..., creates an initial population.
selection, with arguments
..., conducts selection of individuals.
crossover, with arguments
..., controls genetic reproduction.
mutation, with arguments
..., adds mutations.
The elements of
functions specific to simulated annealing are:
initial, with arguments
..., creates the initial subset.
perturb, with arguments
number, makes incremental changes to the subsets.
prob, with arguments
iteration, computes the acceptance probabilities.
The pages http://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html and http://topepo.github.io/caret/feature-selection-using-simulated-annealing.html have more details about each of these functions.
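The element lists above can be checked against the function sets that ship with the package. The following is a minimal sketch, assuming the caret package is installed and that its built-in lists caretGA and caretSA are used as the starting point; the variable names (shared, ga_only, sa_only, myGA) are illustrative only.

```r
library(caret)

# Elements described above as common to GA and SA searches
shared <- c("fit", "pred", "fitness_intern", "fitness_extern", "selectIter")

# Elements described above as specific to each search type
ga_only <- c("initial", "selection", "crossover", "mutation")
sa_only <- c("initial", "perturb", "prob")

# Which documented elements, if any, are absent from the built-in lists?
missing_ga <- setdiff(c(shared, ga_only), names(caretGA))
missing_sa <- setdiff(c(shared, sa_only), names(caretSA))

# Work on a copy when customizing, so the built-in list stays untouched
myGA <- caretGA
print(names(myGA))
```

A copy such as myGA can then have individual elements replaced before it is passed as the functions argument.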
holdout can be used to hold out samples for computing the internal
fitness value. Note that this is independent of the external resampling
step. Suppose 10-fold CV is being used. Within a resampling iteration,
holdout can be used to sample an additional proportion of the 90%
resampled data to use for estimating fitness. This may not be a good idea
unless you have a very large training set and want to avoid an internal
resampling procedure to estimate fitness.
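The arithmetic of the 10-fold example can be sketched in base R. This only illustrates how the rows are partitioned; it is not the package's internal sampling code, and the sample size, seed, and holdout value are hypothetical.

```r
set.seed(1)
n <- 1000          # hypothetical training set size
holdout <- 0.2     # hypothetical proportion held back for internal fitness

# One external resample of 10-fold CV: 10% held out, 90% left for the search
folds <- sample(rep(1:10, length.out = n))
external_out <- which(folds == 1)
search_rows <- setdiff(seq_len(n), external_out)

# Within the 90%, hold back a further proportion to estimate internal fitness
n_internal <- floor(holdout * length(search_rows))
internal_out <- sample(search_rows, n_internal)
fit_rows <- setdiff(search_rows, internal_out)

# Rows used for model fitting, internal fitness, and external fitness
sizes <- c(fit = length(fit_rows),
           internal = length(internal_out),
           external = length(external_out))
print(sizes)
```

With these values, 720 rows are used to fit models, 180 to estimate internal fitness, and 100 for the external estimate, which shows why the approach is only attractive when the training set is large.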
The search algorithms can be parallelized in several places:
each externally resampled GA or SA can be run independently (controlled by the
within a GA, the fitness calculations at a particular generation can be run in parallel over the current set of individuals (see the
if inner resampling is used, these can be run in parallel (controls depend on the function used. See, for example,
any parallelization of the individual model fits. This is also specific to the modeling function.
It is probably best to pick one of these areas for parallelization. The first is likely to produce the largest decrease in run-time since it is the least likely to incur multiple re-starting of the worker processes. Keep in mind that if multiple levels of parallelization occur, this can affect the number of workers and the amount of memory required exponentially.
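Choosing a single level of parallelization can be expressed through the two logical arguments shown in the usage of gafsControl. The following is a sketch, assuming caret is installed and that a parallel backend is registered separately (nothing is registered here, so the settings have no effect on their own); the object names are illustrative.

```r
library(caret)

# Option 1: parallelize across the external resamples only
ga_ctrl_resamples <- gafsControl(functions = caretGA,
                                 method = "cv", number = 10,
                                 allowParallel = TRUE,
                                 genParallel = FALSE)

# Option 2: request parallel fitness calculations within each generation
# and leave the external resamples sequential
ga_ctrl_generation <- gafsControl(functions = caretGA,
                                  method = "cv", number = 10,
                                  allowParallel = FALSE,
                                  genParallel = TRUE)

# The control object stores the flags that were requested
print(ga_ctrl_resamples[c("allowParallel", "genParallel")])
```

Setting only one flag at a time keeps the number of worker processes predictable.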
gafsControl(functions = NULL, method = "repeatedcv", metric = NULL, maximize = NULL, number = ifelse(grepl("cv", method), 10, 25), repeats = ifelse(grepl("cv", method), 1, 5), verbose = FALSE, returnResamp = "final", p = 0.75, index = NULL, indexOut = NULL, seeds = NULL, holdout = 0, genParallel = FALSE, allowParallel = TRUE)
safsControl(functions = NULL, method = "repeatedcv", metric = NULL, maximize = NULL, number = ifelse(grepl("cv", method), 10, 25), repeats = ifelse(grepl("cv", method), 1, 5), verbose = FALSE, returnResamp = "final", p = 0.75, index = NULL, indexOut = NULL, seeds = NULL, holdout = 0, improve = Inf, allowParallel = TRUE)
a list of functions for model fitting, prediction etc (see Details below)
The resampling method:
LGOCV (for repeated training/test splits)
a two-element string that specifies what summary metric will be used to select the optimal number of iterations from the external fitness value and which metric should guide subset selection. If specified, this vector should have names
safs for explanations of the difference.
a two-element logical: should the metrics be maximized or minimized? Like the
metric argument, this vector should have names
Either the number of folds or number of resampling iterations
For repeated k-fold cross-validation only: the number of complete sets of folds to compute
a logical for printing results
A character string indicating how much of the resampled summary metrics should be saved. Values can be "all" or "none"
For leave-group-out cross-validation: the training percentage
a list with elements for each resampling iteration. Each list element is the sample rows used for training at that iteration.
a list (the same length as
index) that dictates which samples are held out for each resample. If
NULL, then the unique set of samples not contained in
a vector of integers that can be used to set the seed during each search. The number of seeds must be equal to the number of resamples plus one.
the proportion of data in [0, 1) to be held back from
y to calculate the internal fitness values
if a parallel backend is loaded and available, should
gafs use it to parallelize the fitness calculations within a generation within a resample?
if a parallel backend is loaded and available, should the function use it?
the number of iterations without improvement before
safs reverts to the previous optimal subset
An echo of the parameters specified
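The returned list can be inspected directly to confirm what was requested. The following is a sketch, assuming caret is installed and using its built-in caretSA function list; the seed values and the improve setting are arbitrary choices for illustration. The seeds vector follows the rule stated above: one seed per resample, plus one.

```r
library(caret)

number <- 10                      # 10-fold CV gives 10 resamples
set.seed(42)
seeds <- sample.int(100000, number + 1)   # number of resamples plus one

sa_ctrl <- safsControl(functions = caretSA,
                       method = "cv",
                       number = number,
                       seeds = seeds,
                       holdout = 0,
                       improve = 25)

# The control object echoes the values that were passed in
str(sa_ctrl[c("method", "number", "holdout", "improve")])
print(length(sa_ctrl$seeds))
```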