safs: Simulated annealing feature selection

Description

Supervised feature selection using simulated annealing

safs conducts a supervised binary search of the predictor space using simulated annealing (SA). See Kirkpatrick et al (1983) for more information on this search algorithm.

This function conducts the search of the feature space repeatedly within resampling iterations. First, the training data are split be whatever resampling method was specified in the control function. For example, if 10-fold cross-validation is selected, the entire simulated annealing search is conducted 10 separate times. For the first fold, nine tenths of the data are used in the search while the remaining tenth is used to estimate the external performance since these data points were not used in the search.

During the search, a measure of fitness (i.e. SA energy value) is needed to guide the search. This is the internal measure of performance. During the search, the data that are available are the instances selected by the top-level resampling (e.g. the nine tenths mentioned above). A common approach is to conduct another resampling procedure. Another option is to use a holdout set of samples to determine the internal estimate of performance (see the holdout argument of the control function). While this is faster, it is more likely to cause overfitting of the features and should only be used when a large amount of training data are available. Yet another idea is to use a penalized metric (such as the AIC statistic) but this may not exist for some metrics (e.g. the area under the ROC curve).

The internal estimates of performance will eventually overfit the subsets to the data. However, since the external estimate is not used by the search, it is able to make better assessments of overfitting. After resampling, this function determines the optimal number of iterations for the SA.

Finally, the entire data set is used in the last execution of the simulated annealing algorithm search and the final model is built on the predictor subset that is associated with the optimal number of iterations determined by resampling (although the update function can be used to manually set the number of iterations).

This is an example of the output produced when safsControl(verbose = TRUE) is used:

Fold03 1 0.401 (11) 
Fold03 2 0.401->0.410 (11+1, 91.7%) *
Fold03 3 0.410->0.396 (12+1, 92.3%) 0.969 A 
Fold03 4 0.410->0.370 (12+2, 85.7%) 0.881 
Fold03 5 0.410->0.399 (12+2, 85.7%) 0.954 A 
Fold03 6 0.410->0.399 (12+1, 78.6%) 0.940 A 
Fold03 7 0.410->0.428 (12+2, 73.3%) *

The text "Fold03" indicates that this search is for the third cross-validation fold. The initial subset of 11 predictors had a fitness value of 0.401. The next iteration added a single feature the the existing best subset of 11 (as indicated by "11+1") that increased the fitness value to 0.410. This new solution, which has a Jaccard similarity value of 91.7% to the current best solution, is automatically accepted. The third iteration adds another feature to the current set of 12 but does not improve the fitness. The acceptance probability for this difference is shown to be 95.6% and the "A" indicates that this new sub-optimal subset is accepted. The fourth iteration does not show an increase and is not accepted. Note that the Jaccard similarity value of 85.7% is the similarity to the current best solution (from iteration 2) and the "12+2" indicates that there are two additional features added from the current best that contains 12 predictors.

The search algorithm can be parallelized in several places:

each externally resampled SA can be run independently (controlled by the allowParallel option of safsControl)
if inner resampling is used, these can be run in parallel (controls depend on the function used. See, for example, trainControl)
any parallelization of the individual model fits. This is also specific to the modeling function.

It is probably best to pick one of these areas for parallelization and the first is likely to produces the largest decrease in run-time since it is the least likely to incur multiple re-starting of the worker processes. Keep in mind that if multiple levels of parallelization occur, this can effect the number of workers and the amount of memory required exponentially.

Usage

safs(x, ...)
# S3 method for default
safs(x, y, iters = 10, differences = TRUE,
  safsControl = safsControl(), ...)

Arguments

an object where samples are in rows and features are in columns. This could be a simple matrix, data frame or other type (e.g. sparse matrix). See Details below.

…

arguments passed to the classification or regression routine specified in the function safsControl$functions$fit

a numeric or factor vector containing the outcome for each sample.

iters

number of search iterations

differences

a logical: should the difference in fitness values with and without each predictor be calculated

safsControl

a list of values that define how this function acts. See safsControl and URL.

Value

an object of class safs

References

http://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html

http://topepo.github.io/caret/feature-selection-using-simulated-annealing.html

Kuhn and Johnson (2013), Applied Predictive Modeling, Springer

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671.

Examples

Run this code

# NOT RUN {
# }
# NOT RUN {
set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10,  noiseVars = 10)

## A short example
ctrl <- safsControl(functions = rfSA,
                    method = "cv",
                    number = 3)

rf_search <- safs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  safsControl = ctrl)

rf_search
# }
# NOT RUN {
# }