
clustomit(x, num_clusters, cluster_method, similarity = c("jaccard", "rand"), weighted_mean = TRUE, num_reps = 50, num_cores = getOption("mc.cores", 2), ...)
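x
a matrix with n observations (rows) and p features (columns)
cluster_method
the clustering algorithm, given as a character string or a function and resolved with match.fun. The function given should return only clustering labels for each observation in the matrix x.
weighted_mean
logical value (the default is TRUE).
...
additional arguments for the function specified in cluster_method
An object of class clustomit, which contains a named list.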
We first cluster the data set x into num_clusters clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn, along with all of the observations in that cluster. For each omitted cluster, we resample from the remaining observations and cluster the resampled observations into num_clusters - 1 clusters, again using the clustering algorithm specified in cluster_method.
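The omit-and-resample step described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: omit_and_recluster is a hypothetical helper, stats::kmeans stands in for cluster_method, and "stratified" is assumed to mean resampling with replacement within each remaining cluster.

```r
# Hypothetical helper (not part of the documented API): one omit-and-resample
# pass for a single omitted cluster.
omit_and_recluster <- function(x, labels, omit, num_clusters) {
  # Indices of the observations that are NOT in the omitted cluster.
  keep <- which(labels != omit)
  # Assumed stratification: bootstrap within each remaining cluster, so the
  # resample preserves the original cluster sizes.
  idx <- unlist(lapply(split(keep, labels[keep]), function(i) {
    i[sample.int(length(i), replace = TRUE)]
  }), use.names = FALSE)
  # Re-cluster the resampled rows into one fewer cluster.
  resampled <- kmeans(x[idx, , drop = FALSE], centers = num_clusters - 1,
                      nstart = 10)$cluster
  list(idx = idx, original = labels[idx], resampled = resampled)
}

set.seed(42)
# Three well-separated groups of 20 observations in two dimensions.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
labels <- kmeans(x, centers = 3, nstart = 10)$cluster
out <- omit_and_recluster(x, labels, omit = 1, num_clusters = 3)
# Cross-tabulate the original labels against the re-clustered labels.
table(out$original, out$resampled)
```

The two label vectors in out are what a similarity statistic would then compare.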
Next, we compute the similarity between the cluster labels of the original data set and the cluster labels of the bootstrapped sample. We approximate the sampling distribution of the ClustOmit statistic using a stratified, nonparametric bootstrapping scheme and use the apparent variability in the approximated sampling distribution as a diagnostic tool for further evaluation of the proposed clusters. By default, we use the Jaccard similarity coefficient in the calculation of the ClustOmit statistic to provide a clear interpretation of cluster assessment. The technical details of the ClustOmit statistic can be found in our forthcoming publication entitled "Cluster Stability Evaluation of Gene Expression Data."

The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. Their guidelines establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology for evaluating the cluster omission admissibility condition.
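To make the default similarity measure concrete, the following is a minimal sketch of the Jaccard similarity coefficient between two sets of cluster labels, computed from pairwise co-memberships. The function jaccard_labels is a hypothetical name used only for illustration; it is not claimed to match the package's internal computation.

```r
# Hypothetical helper for illustration: Jaccard similarity of two clusterings
# of the same observations, based on which pairs are clustered together.
jaccard_labels <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  # TRUE where a pair of observations shares a cluster in each clustering.
  same1 <- outer(labels1, labels1, "==")
  same2 <- outer(labels2, labels2, "==")
  # Count each unordered pair once.
  pairs <- upper.tri(same1)
  n11 <- sum(same1[pairs] & same2[pairs])    # together in both clusterings
  n10 <- sum(same1[pairs] & !same2[pairs])   # together only in the first
  n01 <- sum(!same1[pairs] & same2[pairs])   # together only in the second
  n11 / (n11 + n10 + n01)
}

# Same partition with the labels swapped.
same_partition <- jaccard_labels(c(1, 1, 2, 2), c(2, 2, 1, 1))
# Partitions that share no co-clustered pair.
disjoint_partition <- jaccard_labels(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Because the coefficient depends only on co-memberships, it is unaffected by how the clusters are numbered. In this sketch the result is NaN when neither clustering places any pair of observations together.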
We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least two arguments: x, the matrix of observations to cluster, and num_clusters, the number of clusters to find. Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
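As a further illustration of these requirements, here is a sketch of a wrapper around hierarchical clustering. The name hclust_wrapper and its linkage argument are hypothetical and chosen only for this example; the wrapper takes x and num_clusters and returns nothing but the label vector.

```r
# Hypothetical wrapper: hierarchical clustering that returns only the cluster
# label of each row of x. Extra arguments in ... are forwarded to dist().
hclust_wrapper <- function(x, num_clusters, linkage = "average", ...) {
  tree <- hclust(dist(x, ...), method = linkage)
  # Cut the dendrogram into the requested number of clusters.
  unname(cutree(tree, k = num_clusters))
}

set.seed(1)
# Two well-separated groups of 15 observations in two dimensions.
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 6), ncol = 2))
labels <- hclust_wrapper(x, num_clusters = 2)
table(labels)
```

A wrapper of this shape could presumably be supplied as cluster_method in the same way as the K-means wrapper in the example.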
Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320
# First, we create a wrapper function for the K-means clustering algorithm
# that returns only the clustering labels for each observation (row) in x.
kmeans_wrapper <- function(x, num_clusters, num_starts = 10, ...) {
  kmeans(x = x, centers = num_clusters, nstart = num_starts, ...)$cluster
}

# For this example, we generate five multivariate normal populations with the
# sim_data function.
x <- sim_data("normal", delta = 1.5, seed = 42)$x

# cluster_method may be given as a character string ...
clustomit_out <- clustomit(x = x, num_clusters = 4,
                           cluster_method = "kmeans_wrapper", num_cores = 1)

# ... or as the function itself.
clustomit_out2 <- clustomit(x = x, num_clusters = 5,
                            cluster_method = kmeans_wrapper, num_cores = 1)