GENETIC ALGORITHM FOR AN OPTIMAL k-VARIABLE SUBSET
Given a set of variables, a Genetic Algorithm algorithm seeks a k-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion.
genetic( mat, kmin, kmax = kmin, popsize = 100, nger = 20, mutate = FALSE, maxclone = 5, exclude = NULL, include = NULL, improvement = TRUE, printfile = FALSE, iseed = c(sample(0:4095,3),sample(0:2047,1)*2+1), criterion = "RM", pcindices = 1:kmax)
- a covariance or correlation matrix of the variables from which the k-subset is to be selected.
- the cardinality of the smallest subset that is wanted.
- the cardinality of the largest subset that is wanted.
- integer variable indicating the size of the population.
- integer variable giving the number of generations for which the genetic algorithm will run.
- logical variable indicating whether each child undergoes a mutation. By default, FALSE.
- integer variable specifying the maximum number of identical replicates (clones) of individuals that is acceptable in the population. Serves to ensure that the population has sufficient genetic diversity, which is necessary to enable the algorithm to
- a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly excluded from the subsets.
- a vector of variables (referenced by their row/column numbers in matrix mat) that are to be forcibly included in the subsets.
- a logical variable indicating whether or not the
best final subset (for each cardinality) is to be passed as input to a
local improvement algorithm (see
- a logical variable indicating whether results in print-friendly format are to be written to a file.
- a vector of 4 integers (as requested by the LAPACK routine SLARUV) for the seeds of the random number generation.
- Indicates which criterion is to be used in judging
the quality of the subsets. Currently, only the RM, RV and GCD criteria
are supported (see References,
- a vector of ranks of Principal Components that are to be
used for comparison with the variable subsets (for the GCD
criterion only, see
For each cardinality k (with k ranging from kmin to kmax),
an initial population of popsize k-variable subsets is randomly
selected from a full set of p (p not exceeding 300) variables.
In each iteration, popsize/2 couples
are formed from among the population and each couple generates a child
(a new k-variable subset)
which inherits properties of its parents (specifically, it inherits
all variables common to both parents and a random selection of
variables in the symmetric difference of its parents' genetic makeup).
Each offspring may optionally undergo a mutation (in the form of a
local improvement algorithm (see
improve). The parents
and offspring are ranked according to their criterion value, and the
best popsize of these k-subsets will make up the next
generation, which is used as the current population in the subsequent iteration.
The stopping rule for the algorithm is the number of generations (nger).
Optionally, the best k-variable subset produced by the Genetic Algorithm may be passed as input to a restricted local search algorithm, for possible further improvement.
The user may force variables to be included and/or excluded from the k-subsets.
For each cardinality k, the total number of calls to the procedure which computes the criterion values is $popsize + nger$ x $popsize/2$. These calls are the dominant computational effort in each iteration of the algorithm.
In order to improve computation times, the bulk of computations are
carried out by a Fortran routine which is called by this
function. Further details about the Genetic Algorithm can
be found in Reference 1 and in the comments to the Fortran code (in
src subdirectory for this package). For p > 300, it is
necessary to change the declarative statements in the Fortran code.
- A list with four items:
subsets A popsize x kmax x length(kmin:kmax) 3-dimensional array, giving for each cardinality (dimension 3) and each subset in the final population (dimension 1) the list of variables (referenced by their row/column numbers in matrix mat) in the subset (dimension 2). (For cardinalities smaller than kmax, the extra positions are set to zero). values A popsize x length(kmin:kmax) matrix, giving for each cardinality (columns), the (ordered) criterion values of the popsize (rows) subsets in the final generation. bestvalues A length(kmin:kmax) vector giving the best values of the criterion obtained for each cardinality. If improvement is TRUE, these values result from the final restricted local search algorithm (and may therefore exceed the largest value for that cardinality in values). bestsets A length(kmin:kmax) x kmax matrix, giving, for each cardinality (rows), the variables (referenced by their row/column numbers in matrix mat) in the best k-subset that was found.
1) Cadima, J., Cerdeira, J. Orestes and Minhoto, M. A Comparative Study of Algorithms for Variable Selection in Multivariate Statistics, submitted for publication in American Journal of Mathematical and Management Sciences.
2) Cadima, J. and Jolliffe, I.T. (2001). Variable Selection and the Interpretation of Principal Subspaces, Journal of Agricultural, Biological and Environmental Statistics, Vol. 6, 62-79.
data(swiss) genetic(cor(swiss),3,4,popsize=10,nger=5,iseed=c(2345,3276,1568,3187),criterion="Rv") ## For cardinality k= 4 ## Generation number 4 does not ## have sufficient genetic diversity ## for acceptable levels of consanguinity ## (couples differing by at least 2 genes). ## ## This situation may be overcome by ## reducing the maximum acceptable number ## of clones (maxclone). ## ## ## Best criterion value found so far: 0.955714479 ## ## $subsets ## Var.1 Var.2 Var.3 ## Solution 1 1 2 3 ## Solution 2 1 2 5 ## Solution 3 3 4 6 ## Solution 4 1 4 5 ## Solution 5 3 4 5 ## Solution 6 3 4 5 ## Solution 7 3 4 5 ## Solution 8 2 4 5 ## Solution 9 1 3 4 ## Solution 10 1 3 4 ## ## $values ## Solution 1 Solution 2 Solution 3 Solution 4 Solution 5 Solution 6 ## 0.9141995 0.9098502 0.9034868 0.9023910 0.9020271 0.9020271 ## Solution 7 Solution 8 Solution 9 Solution 10 ## 0.9020271 0.8982510 0.8940945 0.8940945 ## ## $bestvalues ## Card.3 ## 0.9141995 ## ## $bestsets ## Var.1 Var.2 Var.3 ## 1 2 3 ##