Reduce the dimensionality of a dataset by calculating how important each feature is for inferring the clustering.
find_defining_features(mixdir_obj, X, n_features = Inf, measure = c("JS",
"ARI"), subsample_size = Inf, step_size = Inf, exponential_decay = TRUE,
verbose = FALSE)
mixdir_obj
the result from a call to mixdir(). It needs to have the field
category_prob: a list of lists of named vectors giving the probability
of each possible category for every feature and latent class.
X
the original dataset that was used for clustering.
n_features
the number of dimensions that should be selected. If it is Inf
(the default), all features are returned ordered by importance
(most important first).
measure
The measure used to assess the loss of clustering quality when a
variable is removed. Two measures are implemented: "JS", short for
Jensen-Shannon divergence, compares the original class probabilities
with the new predicted class probabilities (smaller is better); "ARI",
short for adjusted Rand index, compares the overlap of the original and
the predicted classes (requires the mcclust package; 1 is perfect, 0 is
as good as random). A minimal sketch of the Jensen-Shannon divergence
is given after the argument descriptions.
subsample_size
Running this method on the full dataset can be slow. Randomly selecting
a subset of rows from X speeds up the calculation considerably, usually
without hurting the selection performance much (see the usage sketch
after the argument descriptions).
step_size
The method can either remove each feature individually and return the
n features whose removal caused the greatest quality loss
(step_size=Inf), or iteratively remove the least important feature
until the number of remaining features equals n_features (step_size=1).
A smaller step size increases the sensitivity of the selection process,
but takes longer to calculate (see the sketch after the argument
descriptions).
exponential_decay
Boolean or number. An alternative way of specifying how many features
to remove in each step. The default is to always remove the least
important 50% of the features (exponential_decay=2).
verbose
Boolean indicating if status messages should be printed.
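To make the "JS" measure concrete, the symmetric Jensen-Shannon divergence
between two class probability vectors can be written down in a few lines of
R. The helper js_divergence below is purely illustrative, not part of the
mixdir API, and the exact aggregation the package uses over all rows may
differ:

# Illustrative only: symmetric Jensen-Shannon divergence between two
# discrete class probability vectors p and q (each summing to 1).
js_divergence <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
# Identical distributions give 0; larger values mean the predicted class
# probabilities drift further from the original ones.
js_divergence(c(0.7, 0.2, 0.1), c(0.6, 0.3, 0.1))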
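As a rough usage sketch for subsample_size (the value 50 is an arbitrary
choice, and res and mushroom are the objects from the example at the bottom
of this page):

# Hypothetical call: score feature importance on a random subset of
# 50 rows instead of all rows to reduce the runtime.
find_defining_features(res, mushroom[1:100, ], n_features = 3,
                       subsample_size = 50)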
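Similarly, an iterative and more sensitive run with step_size might be
sketched as follows; setting exponential_decay = FALSE alongside
step_size = 1 is an assumption about how the two arguments interact, not
documented behaviour:

# Hypothetical call: drop the single least important feature per
# iteration until only 3 features remain. exponential_decay = FALSE is
# assumed to be needed so that the fixed step_size takes effect instead
# of the default 50% decay scheme.
find_defining_features(res, mushroom[1:100, ], n_features = 3,
                       step_size = 1, exponential_decay = FALSE)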
Iteratively find the variable whose removal least affects the clustering
compared with the original. If n_features is a finite number, the quality
is a single number and reflects how well those n features maintain the
original clustering. If n_features=Inf, the method returns all features
ordered by decreasing importance. The accompanying quality vector contains
the "cumulative" loss if the corresponding variable were removed.
Note that the quality can differ depending on the step size scheme. For
example, if all variables are removed in one step (step_size=Inf and
exponential_decay=FALSE), the quality is not cumulative, but simply the
quality of the clustering excluding the corresponding feature. In that
sense the quality vector should not be taken as a definitive answer, but
only as guidance to see where there are jumps in the quality.
# NOT RUN {
data("mushroom")
res <- mixdir(mushroom[1:100, ], n_latent=20)
find_defining_features(res, mushroom[1:100, ], n_features=3)
find_defining_features(res, mushroom[1:100, ], n_features=Inf)
# }
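Building on the example above, the quality vector can be inspected for
jumps as suggested in the details. This sketch assumes the result is a
list with elements features and quality, which is an assumption about the
return structure rather than something stated on this page:

# Assumed return structure: ordered feature names plus the accompanying
# quality vector (see details above).
all_feats <- find_defining_features(res, mushroom[1:100, ], n_features = Inf)
# A sharp jump in the quality curve hints at how many features are worth
# keeping.
plot(all_feats$quality, type = "b",
     xlab = "Feature rank (most important first)", ylab = "Quality")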