improve: RESTRICTED LOCAL IMPROVEMENT SEARCH FOR AN OPTIMAL k-VARIABLE SUBSET

Description

Given a set of variables, a Restricted Local Improvement algorithm seeks a k-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion.

Usage

improve( mat, kmin, kmax = kmin, nsol = 1, exclude = NULL,
include = NULL, printfile = FALSE, iseed =
c(sample(0:4095,3),sample(0:2047,1)*2+1), criterion = "RM", pcindices
= 1:kmax)

Arguments

mat

a covariance or correlation matrix of the variables from which the k-subset is to be selected.

kmin

the cardinality of the smallest subset that is wanted.

kmax

the cardinality of the largest subset that is wanted.

nsol

the number of different subsets (runs of the algorithm) wanted.

exclude

a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly excluded from the subsets.

include

a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly included from the subsets.

printfile

a logical variable indicating whether results in print-friendly format are to be printed out to a file.

iseed

a vector of 4 integers (as requested by the LAPACK routine SLARUV) for the seeds of the random number generation.

criterion

Indicates which criterion is to be used in judging the quality of the subsets. Currently, only the RM, RV and GCD criteria are supported (see References, rm.coef,

pcindices

a vector of ranks of Principal Components that are to be used for comparison with the k-variable subsets (for the GCD criterion only, see gcd.coef).

Value

A list with four items:
subsetsAn nsol x kmax x length(kmin:kmax) 3-dimensional array, giving for each cardinality (dimension 3) and each solution (dimension 1) the list of variables (referenced by their row/column numbers in matrix mat) in the subset (dimension 2). (For cardinalities smaller than kmax, the extra positions are set to zero).
valuesAn nsol x length(kmin:kmax) matrix, giving for each cardinality (columns), the criterion values of the nsol (rows) solutions obtained.
bestvaluesA length(kmin:kmax) vector giving the best values of the criterion obtained for each cardinality.
bestsetsA length(kmin:kmax) x kmax matrix, giving, for each cardinality (rows), the variables (referenced by their row/column numbers in matrix mat) in the best k-subset that was found.

Details

An initial k-variable subset (for k ranging from kmin to kmax) of a full set of p (p not exceeding 300) variables is randomly selected and the variables not belonging to this subset are placed in a queue. The possibility of replacing a variable in the current k-subset with a variable from the queue is then explored. More precisely, a variable is selected, removed from the queue, and the k values of the criterion which would result from swapping this selected variable with each variable in the current subset are computed. If the best of these values improves the current criterion value, the current subset is updated accordingly. In this case, the variable which leaves the subset is added to the queue, but only if it has not previously been in the queue (i.e., no variable can enter the queue twice). The algorithm proceeds until the queue is emptied.

The user may force variables to be included and/or excluded from the k-subsets.

For each cardinality k, the total number of calls to the procedure which computes the criterion values is O(nsol x k x p). These calls are the dominant computational effort in each iteration of the algorithm.

In order to improve computation times, the bulk of computations are carried out in a Fortran routine which is called by this function. Further details about the algorithm can be found in Reference 1 and in the comments to the Fortran code (in the src subdirectory for this package). For p > 300, it is necessary to change the declarative statements in the Fortran code.

References

1) Cadima, J., Cerdeira, J. Orestes and Minhoto, M. A Comparative Study of Algorithms for Variable Selection in Multivariate Statistics, submitted for publication in American Journal of Mathematical and Management Sciences.

2) Cadima, J. and Jolliffe, I.T. (2001). Variable Selection and the Interpretation of Principal Subspaces, Journal of Agricultural, Biological and Environmental Statistics, Vol. 6, 62-79.

Examples

Run this code

# For illustration of use, a small data set with very few iterations
# of the algorithm. 
#

data(swiss)
improve(cor(swiss),2,3,nsol=4,iseed=c(2345,3276,1568,3187),criterion="GCD")
## $subsets
## , , Card.2
##
##            Var.1 Var.2 Var.3
## Solution 1     3     6     0
## Solution 2     3     6     0
## Solution 3     3     6     0
## Solution 4     3     6     0
##
## , , Card.3
##
##            Var.1 Var.2 Var.3
## Solution 1     4     5     6
## Solution 2     4     5     6
## Solution 3     4     5     6
## Solution 4     4     5     6
##
##
## $values
##               card.2   card.3
## Solution 1 0.8487026 0.925372
## Solution 2 0.8487026 0.925372
## Solution 3 0.8487026 0.925372
## Solution 4 0.8487026 0.925372
##
## $bestvalues
##    Card.2    Card.3 
## 0.8487026 0.9253720 
##
## $bestsets
##        Var.1 Var.2 Var.3
## Card.2     3     6     0
## Card.3     4     5     6

#
#
# Forcing the inclusion of variable 1 in the subset
#
 improve(cor(swiss),2,3,nsol=4,iseed=c(2345,3276,1568,3187),criterion="GCD",include=c(1))
## $subsets
## , , Card.2
##
##            Var.1 Var.2 Var.3
## Solution 1     1     6     0
## Solution 2     1     6     0
## Solution 3     1     6     0
## Solution 4     1     6     0
##
## , , Card.3
##
##            Var.1 Var.2 Var.3
## Solution 1     1     5     6
## Solution 2     1     5     6
## Solution 3     1     5     6
## Solution 4     1     5     6
##
##
## $values
##               card.2    card.3
## Solution 1 0.7284477 0.8048528
## Solution 2 0.7284477 0.8048528
## Solution 3 0.7284477 0.8048528
## Solution 4 0.7284477 0.8048528
##
## $bestvalues
##    Card.2    Card.3 
## 0.7284477 0.8048528 
##
## $bestsets
##        Var.1 Var.2 Var.3
## Card.2     1     6     0
## Card.3     1     5     6

Run the code above in your browser using DataLab