vscc: Variable selection for clustering and classification

Description

Performs variable selection under a clustering or classification framework. Automated implementation using model-based clustering is based on teigen version 2.0 and mclust version 4.0; issues *may* arise when using different versions.

Usage

vscc(x, G=1:9, automate = "mclust", initial = NULL, train = NULL, forcereduction = FALSE)

Arguments

x

Data frame or matrix to perform variable selection on.

G

Vector giving the numbers of groups to consider during initialization and/or post-selection analysis. Default is 1:9.

automate

Character string ("teigen", "mclust" (default), or NULL only) indicating which mixture model family to implement as initialization and/or post-selection analysis. If NULL, the function assumes manual operation of the algorithm (meaning an initial clustering vector must be given, and no post-selection analysis is performed).

initial

Optional vector giving the initial clustering.

train

Optional vector of training data (for classification framework).

forcereduction

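Logical indicating whether a reduced variable subset must be chosen. If FALSE (default), the full data set is also considered as a candidate when selecting the 'best' variable subset via total model uncertainty; if TRUE, only the reduced subsets are considered. Not used if automate=NULL.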
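The manual mode described under automate can be sketched as follows. This is a minimal illustration, not part of the package examples: the k-means partition with two groups is an assumption chosen only to supply an initial clustering vector, and the banknote data are borrowed from mclust. Because automate=NULL skips the post-selection analysis, only the selection-related components of the result are inspected.

```r
library(vscc)
library(mclust)  # used here only for the banknote data set

data(banknote)
x <- banknote[, -1]  # drop the Status label column

# Assumption for illustration: a 2-group k-means partition as the initial clustering
set.seed(1)
init <- kmeans(scale(x), centers = 2)$cluster

# Manual operation: an initial clustering is supplied, no mixture model family is fitted
manual <- vscc(x, automate = NULL, initial = init)

# Variable subsets chosen by each relation (linear, quadratic, ...)
lapply(manual$selected, colnames)

# Within-group variance of each variable
manual$wss
```

Any other clustering vector of length nrow(x) could be supplied as initial; k-means is used here only because it needs no additional packages.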

Value

selected: A list containing the subsets of variables selected for each relation. Each set is numbered according to the exponent of the relationship. For instance, vscc_object$selected[[3]] corresponds to the variable subset selected by the cubic relationship.
family: The family used for initialization and/or post-selection analysis. (Same as the user input automate, and can be NULL.)
wss: The within-group variance associated with each variable from the full data set.
topselected: The best variable subset according to the total model uncertainty.
initialrun: Results from the initialization; an object of class teigen or mclust.
bestmodel: Results from the best model on the selected variable subset; an object of class teigen or mclust.
chosenrelation: Numeric indication of the relationship chosen according to total model uncertainty. The number corresponds to the exponent in the relationship: for instance, a value of 4 indicates the quartic relationship. If the value "Full dataset" is given, the unreduced data provide the best model uncertainty; this can be avoided by specifying forcereduction=TRUE in the function call.
uncertainty: Total model uncertainty associated with the best relationship.
allmodelfit: List containing the results (teigen or mclust objects) from the post-selection analysis on each variable subset. Number corresponds to the exponent in the relationship. For instance, vscc_object$allmodelfit[[1]] gives the results from the analysis on the variables selected by the linear relationship.
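As a rough sketch of how the returned components fit together, the code below runs the default automated analysis and then reads off the fields described above. It assumes the banknote data from mclust and the default automate="mclust"; the object name fit is arbitrary, and no particular output is implied.

```r
library(vscc)
library(mclust)  # provides the banknote data and the default model family

data(banknote)
fit <- vscc(banknote[, -1])

# Which relation won, and its total model uncertainty
fit$chosenrelation
fit$uncertainty

# Names of the variables in the best subset
colnames(fit$topselected)

# Number of variables kept by each relation (index = exponent of the relation)
sapply(fit$selected, ncol)

# Forcing a reduced subset rules out the "Full dataset" outcome
fit_reduced <- vscc(banknote[, -1], forcereduction = TRUE)
fit_reduced$chosenrelation
```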

References

See citation("vscc") for the variable selection references. See also citation("teigen") and citation("mclust") if using those families of models via the automate call.

Examples

library(vscc)
library(mclust) # Provides the banknote data and the default model family
data(banknote) # Load data
head(banknote[,-1]) # Show preview of full data set
bankrun <- vscc(banknote[,-1]) # Column 1 (true class) is excluded from selection
head(bankrun$topselected) # Show preview of selected variables
table(banknote[,1], bankrun$initialrun$classification) # Clustering results on full data set
table(banknote[,1], bankrun$bestmodel$classification) # Clustering results on reduced data set