Reduce the dimensionality of a dataset by calculating how important each feature is for inferring the clustering.
find_defining_features(mixdir_obj, X, n_features = Inf, measure = c("JS",
"ARI"), subsample_size = Inf, step_size = Inf, exponential_decay = TRUE,
verbose = FALSE)
mixdir_obj
the result from a call to mixdir(). It needs to have the field
category_prob: a list of lists of named vectors giving the probability
of each possible category for every feature and latent class.
X
the original dataset that was used for clustering.
n_features
the number of dimensions that should be selected. If it is Inf
(the default), all features are returned ordered by importance
(most important first).
measure
The measure used to assess the loss of clustering quality when a
variable is removed. Two measures are implemented: "JS", short for
Jensen-Shannon divergence, compares the original class probabilities
with the new predicted class probabilities (smaller is better); "ARI",
short for adjusted Rand index, compares the overlap of the original and
the predicted classes (requires the mcclust package; 1 is perfect, 0 is
as good as random). A minimal sketch of the Jensen-Shannon divergence
is given after the argument descriptions.
subsample_size
Running this method on the full dataset can be slow. Randomly selecting
a subset of rows from X speeds up the calculation considerably, usually
without hurting the selection performance much (see the usage sketch
after the argument descriptions).
step_size
The method can either remove each feature individually and return the
n features whose removal caused the greatest quality loss
(step_size=Inf), or iteratively remove the least important feature
until the number of remaining features equals n_features (step_size=1).
A smaller step size increases the sensitivity of the selection process,
but takes longer to calculate (see the sketch after the argument
descriptions).
exponential_decay
Boolean or number. An alternative way of specifying how many features
to remove in each step. The default is to always remove the least
important 50% of the features (exponential_decay=2).
verbose
Boolean indicating if status messages should be printed.
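To make the "JS" measure concrete, the symmetric Jensen-Shannon divergence
between two class probability vectors can be written down in a few lines of
R. The helper js_divergence below is purely illustrative, not part of the
mixdir API, and the exact aggregation the package uses over all rows may
differ:

# Illustrative only: symmetric Jensen-Shannon divergence between two
# discrete class probability vectors p and q (each summing to 1).
js_divergence <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
# Identical distributions give 0; larger values mean the predicted class
# probabilities drift further from the original ones.
js_divergence(c(0.7, 0.2, 0.1), c(0.6, 0.3, 0.1))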
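As a rough usage sketch for subsample_size (the value 50 is an arbitrary
choice, and res and mushroom are the objects from the example at the bottom
of this page):

# Hypothetical call: score feature importance on a random subset of
# 50 rows instead of all rows to reduce the runtime.
find_defining_features(res, mushroom[1:100, ], n_features = 3,
                       subsample_size = 50)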
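Similarly, an iterative and more sensitive run with step_size might be
sketched as follows; setting exponential_decay = FALSE alongside
step_size = 1 is an assumption about how the two arguments interact, not
documented behaviour:

# Hypothetical call: drop the single least important feature per
# iteration until only 3 features remain. exponential_decay = FALSE is
# assumed to be needed so that the fixed step_size takes effect instead
# of the default 50% decay scheme.
find_defining_features(res, mushroom[1:100, ], n_features = 3,
                       step_size = 1, exponential_decay = FALSE)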
Iteratively find the variable whose removal least affects the clustering
compared with the original. If n_features is a finite number, the quality
is a single number and reflects how well those n features maintain the
original clustering. If n_features=Inf, the method returns all features
ordered by decreasing importance. The accompanying quality vector contains
the "cumulative" loss if the corresponding variable were removed.
Note that the quality can differ depending on the step size scheme. For
example, if all variables are removed in one step (step_size=Inf and
exponential_decay=FALSE), the quality is not cumulative, but simply the
quality of the clustering excluding the corresponding feature. In that
sense the quality vector should not be taken as a definitive answer, but
only as guidance to see where there are jumps in the quality.
# NOT RUN {
data("mushroom")
res <- mixdir(mushroom[1:100, ], n_latent=20)
find_defining_features(res, mushroom[1:100, ], n_features=3)
find_defining_features(res, mushroom[1:100, ], n_features=Inf)
# }
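Building on the example above, the quality vector can be inspected for
jumps as suggested in the details. This sketch assumes the result is a
list with elements features and quality, which is an assumption about the
return structure rather than something stated on this page:

# Assumed return structure: ordered feature names plus the accompanying
# quality vector (see details above).
all_feats <- find_defining_features(res, mushroom[1:100, ], n_features = Inf)
# A sharp jump in the quality curve hints at how many features are worth
# keeping.
plot(all_feats$quality, type = "b",
     xlab = "Feature rank (most important first)", ylab = "Quality")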