subselect (version 0.1-1)

anneal: SIMULATED ANNEALING SEARCH FOR AN OPTIMAL k-VARIABLE SUBSET

Description

Given a set of variables, a Simulated Annealing algorithm seeks a k-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion.

Usage

anneal( mat, kmin, kmax = kmin, nsol = 1, niter = 1000, exclude
= NULL, include = NULL, improvement = TRUE, printfile = FALSE, iseed =
c(sample(0:4095,3),sample(0:2047,1)*2+1), cooling = 0.05,  temp = 1,
criterion = "RM", pcindices = 1:kmax))

Arguments

mat
a covariance or correlation matrix of the variables from which the k-subset is to be selected.
kmin
the cardinality of the smallest subset that is wanted.
kmax
the cardinality of the largest subset that is wanted.
nsol
the number of initial subsets (runs of the algorithm).
niter
the number of iterations of the algorithm for each initial subset.
exclude
a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly excluded from the subsets.
include
a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly included in the subsets.
improvement
a logical variable indicating whether or not the best final subset (for each cardinality) is to be passed as input to a local improvement algorithm (see improve).
printfile
a logical variable indicating whether results in print-friendly format are to be written to a file.
iseed
a vector of 4 integers (as requested by the LAPACK routine SLARUV) for the seeds of the random number generation.
cooling
variable in the ]0,1[ interval indicating the rate of geometric cooling for the Simulated Annealing algorithm.
temp
positive variable indicating the initial temperature for the Simulated Annealing algorithm.
criterion
Indicates which criterion is to be used in judging the quality of the subsets. Currently, only the RM, RV and GCD criteria are supported (see References, rm.coef,
pcindices
a vector of ranks of Principal Components that are to be used for comparison with the k-variable subsets (for the GCD criterion only, see gcd.coef).

Value

  • A list with four items:
  • subsetsAn nsol x kmax x length(kmin:kmax) 3-dimensional array, giving for each cardinality (dimension 3) and each solution (dimension 1) the list of variables (referenced by their row/column numbers in matrix mat) in the subset (dimension 2). (For cardinalities smaller than kmax, the extra positions are set to zero).
  • valuesAn nsol x length(kmin:kmax) matrix, giving for each cardinality (columns), the criterion values of the nsol (rows) subsets obtained.
  • bestvaluesA length(kmin:kmax) vector giving the best values of the criterion obtained for each cardinality. If improvement is TRUE, these values result from the final restricted local search algorithm (and may therefore exceed the largest value for that cardinality in values).
  • bestsetsA length(kmin:kmax) x kmax matrix, giving, for each cardinality (rows), the variables (referenced by their row/column numbers in matrix mat) in the best k-subset that was found.

Details

An initial k-variable subset (for k ranging from kmin to kmax) of a full set of p (p not exceeding 300) variables is randomly selected and passed on to a Simulated Annealing algorithm. The algorithm then selects a random subset in the neighbourhood of the current subset (neighbourhood of a subset S being defined as the family of all k-variable subsets which differ from S by a single variable), and decides whether to replace the current subset according to the Simulated Annealing rule, i.e., either (i) always, if the alternative subset's value of the criterion is higher; or (ii) with probability $$\exp^{\frac{ac-cc}{t}}$$ if the alternative subset's value of the criterion (ac) is lower than that of the current solution (cc), where the parameter t (temperature) decreases throughout the iterations of the algorithm. For each cardinality k, the stopping criterion for the algorithm is the number of iterations (niter) which is controlled by the user. Also controlled by the user are the initial temperature (temp) and the rate of geometric cooling of the temperature (cooling).

Optionally, the best k-variable subset produced by Simulated Annealing may be passed as input to a restricted local search algorithm, for possible further improvement.

The user may force variables to be included and/or excluded from the k-subsets.

For each cardinality k, the total number of calls to the procedure which computes the criterion values is nsol x (niter + 1). These calls are the dominant computational effort in each iteration of the algorithm.

In order to improve computation times, the bulk of computations are carried out by a Fortran routine which is called by this function. Further details about the Simulated Annealing algorithm can be found in Reference 1 and in the comments to the Fortran code (in the src subdirectory for this package). For p > 300, it is necessary to change the declarative statements in the Fortran code.

References

1) Cadima, J., Cerdeira, J. Orestes and Minhoto, M. A Comparative Study of Algorithms for Variable Selection in Multivariate Statistics, submitted for publication in American Journal of Mathematical and Management Sciences.

2) Cadima, J. and Jolliffe, I.T. (2001). Variable Selection and the Interpretation of Principal Subspaces, Journal of Agricultural, Biological and Environmental Statistics, Vol. 6, 62-79.

See Also

rm.coef, rv.coef, gcd.coef, genetic, improve.

Examples

Run this code
# For illustration of use, a small data set with very few iterations
# of the algorithm.  (iseed is used to ensure the same solution,
# given the random nature of the algorithm).

data(swiss)
anneal(cor(swiss),2,3,nsol=4,niter=10,criterion="RM",iseed=c(2345,3276,1568,3187))
## $subsets
## , , Card.2
## 
##            Var.1 Var.2 Var.3
## Solution 1     3     6     0
## Solution 2     1     2     0
## Solution 3     3     6     0
## Solution 4     3     6     0
##
## , , Card.3
##
##            Var.1 Var.2 Var.3
## Solution 1     3     5     6
## Solution 2     1     2     5
## Solution 3     1     2     5
## Solution 4     4     5     6
##
##
## $values
##               card.2    card.3
## Solution 1 0.8016409 0.8769672
## Solution 2 0.7945390 0.8791856
## Solution 3 0.8016409 0.8791856
## Solution 4 0.8016409 0.9043760
##
## $bestvalues
##    Card.2    Card.3 
## 0.8016409 0.9043760 
##
## $bestsets
##        Var.1 Var.2 Var.3
## Card.2     3     6     0
## Card.3     4     5     6
##


#
#
# Excluding variable number 6 from the subsets.
# 

data(swiss)
anneal(cor(swiss),2,3,nsol=4,niter=10,criterion="RM",exclude=c(6),iseed=c(2345,3276,1568,3187))
## $subsets
## , , Card.2
##
##            Var.1 Var.2 Var.3
## Solution 1     4     5     0
## Solution 2     4     5     0
## Solution 3     4     5     0
## Solution 4     4     5     0
##
## , , Card.3
##
##            Var.1 Var.2 Var.3
## Solution 1     1     2     5
## Solution 2     1     2     5
## Solution 3     1     2     5
## Solution 4     1     2     5
##
##
## $values
##               card.2    card.3
## Solution 1 0.7982296 0.8791856
## Solution 2 0.7982296 0.8791856
## Solution 3 0.7982296 0.8791856
## Solution 4 0.7982296 0.8791856
##
## $bestvalues
##    Card.2    Card.3 
## 0.7982296 0.8791856 
##
## $bestsets
##        Var.1 Var.2 Var.3
## Card.2     4     5     0
## Card.3     1     2     5

Run the code above in your browser using DataLab