clusterboot is an integrated function that also computes the
clustering itself, using interface functions for various
clustering methods implemented in R (several interface functions are
provided, but you can implement further ones for your favourite
clustering method). See the documentation of the input parameter
clustermethod below.
Quite general clustering methods are possible, i.e., methods that estimate or fix the number of clusters, and methods that produce overlapping clusters or do not assign all cases to clusters (declaring them as "noise" instead). Fuzzy clusterings cannot be processed and have to be transformed to crisp clusterings by the interface function.
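clusterboot's stability assessment is based on the Jaccard similarity between a cluster of the original data set and the most similar cluster found in a resampled data set. Treating clusters as sets of case indices, the statistic can be sketched in a few lines of base R (the helper name jaccard is ours, not part of the package):

```r
# Jaccard similarity between two clusters given as vectors of case
# indices: size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(c(1, 2, 3, 4), c(3, 4, 5, 6))  # 2 shared cases out of 6 distinct: 1/3
```

A cluster with low Jaccard similarity to every cluster of the resampled data in many runs is unstable.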
clusterboot(data, B=100, distances=(class(data)=="dist"),
            bootmethod=if(distances) "boot" else c("boot","noise"),
            bscompare=FALSE, multipleboot=TRUE,
            jittertuning=0.05, noisetuning=c(0.05,4),
            subtuning=floor(nrow(data)/2),
            clustermethod, noisemethod=FALSE, count=TRUE,
            showplots=FALSE, dissolution=0.5,
            recover=0.75, ...)

## S3 method for class 'clboot':
print(x, statistics=c("mean","dissolution","recovery"), ...)

## S3 method for class 'clboot':
plot(x, xlim=c(0,1), breaks=seq(0,1,by=0.05), ...)
Arguments:

data: the data matrix - either an n*p-data matrix (or data frame) or an
n*n-dissimilarity matrix (or dist-object).

B: number of resampling runs for each method in bootmethod.

distances: logical. If TRUE, the data is interpreted as a
dissimilarity matrix. If data is a dist-object,
distances=TRUE automatically; otherwise
distances=FALSE by default, so it has to be set to TRUE manually if
data is a dissimilarity matrix that is not a dist-object.

bootmethod: vector of strings defining the resampling methods to use:
"boot": nonparametric bootstrap (precise behaviour is
controlled by the parameters bscompare and multipleboot).
"subset": selection of random subsets of size subtuning from the data.
"noise": replacement of a certain proportion of the points by random
noise (see noisetuning).
"jitter": addition of random noise to all points (see jittertuning).
"bojit": nonparametric bootstrap first, followed by jittering of the
points.

bscompare: logical. If TRUE, multiple points in the
bootstrap sample are taken into account to compute the Jaccard
similarity to the original clusters (which are represented by their
"bootstrap versions", i.e., the points of the original data set that
also occur in the bootstrap sample); a point drawn more than once then
has multiple influence. If FALSE, every point is counted only once.

multipleboot: logical. If FALSE, all points drawn more
than once in the bootstrap draw are only used once in the bootstrap
samples.

jittertuning: positive number. Tuning for the "jitter"-method. The
noise distribution for jittering is a normal distribution with zero
mean. The covariance matrix has the same eigenvectors as that of the
original data set, but the standard deviations along these directions
are governed by jittertuning (smaller values mean less jittering).

noisetuning: vector of two positive numbers. Tuning for the
"noise"-method. The first component determines the
probability that a point is replaced by noise. Noise is generated by
a uniform distribution on a hyperrectangle along the principal
directions of the original data set, ranging from -noisetuning[2] to
noisetuning[2] times the standard deviation of the data along the
respective direction.

subtuning: integer. Size of the subsets drawn for the "subset"-method.

clustermethod: an interface function (the function itself, not a
character string with its name, has to be passed). This determines the
clustering method; see the information about interface functions below.

noisemethod: logical. If TRUE, the last cluster is
regarded as "noise component", which means that for computing the Jaccard
similarity, it is not treated as a cluster. The noise component of
the original clustering is only compared with the noise components of
the clusterings of the resampled data sets.

count: logical. If TRUE, the resampling runs are counted
on the screen.

showplots: logical. If TRUE, a plot of the first two
dimensions of the resampled data set (or the classical MDS solution
for dissimilarity data) is shown for every resampling run. The last
plot shows the original data set.

dissolution: numeric between 0 and 1. If the Jaccard similarity
between a cluster and the most similar cluster in a resampled data set
is smaller than or equal to this value, the cluster is considered
"dissolved" in that run.

recover: numeric between 0 and 1. If the Jaccard similarity between a
cluster and the most similar cluster in a resampled data set is larger
than this value, the cluster is considered "successfully recovered" in
that run.

...: additional parameters for the clustermethod called by
clusterboot. No effect in print.clboot and plot.clboot.

x: object of class clboot.

statistics: specifies, in print.clboot,
which of the three clusterwise Jaccard
similarity statistics "mean", "dissolution" (number of
times the cluster has been dissolved) and "recovery" (number
of times the cluster has been successfully recovered) are printed.

xlim: transferred to hist.

breaks: transferred to hist.

Value:

clusterboot returns an object of class "clboot", which
is a list with components
result, partition, nc, clustermethod, B, bootmethod,
multipleboot, dissolution, recover, bootresult, bootmean, bootbrd,
bootrecover, jitterresult, jittermean, jitterbrd, jitterrecover,
subsetresult, subsetmean, subsetbrd, subsetrecover, bojitresult,
bojitmean, bojitbrd, bojitrecover, noiseresult, noisemean,
noisebrd, noiserecover.

result: output of the call to clustermethod for the original data set.

partition: the partition produced by the call to clustermethod
(note that this is only meaningful for partitioning clustering methods).

nc: number of clusters in the original data set (including the noise
component if noisemethod=TRUE).

bootresult: matrix of Jaccard similarities for
bootmethod="boot". Rows correspond to clusters in the
original data set. Columns correspond to bootstrap runs.

bootmean: clusterwise means of the bootresult.

bootbrd: clusterwise number of runs in which the cluster was dissolved.

bootrecover: clusterwise number of runs in which the cluster was
successfully recovered.

jitterresult, jittermean, etc.: same as bootresult,
bootmean, etc., but for the other resampling methods.

Details:

While B=100 is recommended, smaller run numbers can give
quite informative results as well if computation times become too high.
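A single run of the "boot" scheme amounts to drawing n cases with replacement and clustering the resampled rows. A minimal base-R sketch of one such resampling run (illustrative only, not clusterboot's internal code):

```r
# One nonparametric bootstrap resample of a cases*variables matrix:
# draw n row indices with replacement and keep the corresponding rows.
set.seed(1)
x <- matrix(rnorm(40), nrow = 20)       # toy data set with 20 cases
idx <- sample(nrow(x), replace = TRUE)  # bootstrap draw of case indices
xboot <- x[idx, , drop = FALSE]         # resampled data set, same size
# with multipleboot=FALSE, cases drawn more than once are used only once:
xunique <- x[unique(idx), , drop = FALSE]
```

The clustering method is then applied to xboot (or xunique), and the resulting clusters are compared to the original ones via the Jaccard similarity.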
Note that the stability of a cluster is assessed, but
stability is not the only important validity criterion - clusters
obtained by very inflexible clustering methods may be stable but not
valid, as discussed in Hennig (2007).
See plotcluster for graphical cluster validation.
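The "noise" scheme replaces a fraction of cases by uniform noise on a hyperrectangle along the principal directions of the data, as described under noisetuning. An illustrative base-R sketch of this idea, with p and r standing in for noisetuning[1] and noisetuning[2] (again, not the package's internal code):

```r
set.seed(2)
x <- matrix(rnorm(100), ncol = 2)        # toy data set, 50 cases
p <- 0.05; r <- 4                        # roles of noisetuning = c(p, r)
pc <- prcomp(x)                          # principal directions of the data
scores <- pc$x
replace <- runif(nrow(x)) < p            # cases to replace by noise
nrep <- sum(replace)
# uniform noise on a hyperrectangle spanning +-r standard deviations
# along each principal direction:
noise <- matrix(runif(nrep * ncol(x), -1, 1), nrow = nrep)
scores[replace, ] <- sweep(noise, 2, r * pc$sdev, `*`)
# rotate back to the original coordinate system:
xnoise <- sweep(scores %*% t(pc$rotation), 2, pc$center, `+`)
```

Cases not selected for replacement are reconstructed exactly; only the selected ones become noise points.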
Information about interface functions for clustering methods:
The following interface functions are currently
implemented (in the present package; note that almost all of these
functions require the specification of some control parameters, so
if you use one of them, look up their common help page
(kmeansCBI) first):

kmeansCBI: an interface to the function kmeans for k-means clustering.
This assumes a cases*variables matrix as input.

hclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with optional noise component. This
function produces a partition and assumes a cases*variables
matrix as input.

hclusttreeCBI: an interface to the function hclust for agglomerative
hierarchical clustering. This function produces a tree (not only a
partition; therefore the number of clusters can be huge!) and assumes
a cases*variables matrix as input.

disthclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with optional noise component. This
function produces a partition and assumes a dissimilarity
matrix as input.

noisemclustCBI: an interface to the functions EMclust and
EMclustN for normal mixture model based
clustering. This assumes a cases*variables matrix as
input. Warning: EMclust and EMclustN often have problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE.

distnoisemclustCBI: an interface to the functions EMclust and
EMclustN for normal mixture model based
clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.
Warning: EMclust and EMclustN often have problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE.

claraCBI: an interface to the functions pam and clara
for partitioning around medoids. This can be used with
cases*variables as well as dissimilarity matrices as input.

pamkCBI: an interface to the function pamk for partitioning around
medoids, where the number of clusters is estimated by the average
silhouette width. This can be used with
cases*variables as well as dissimilarity matrices as input.

trimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a cases*variables matrix as input.

disttrimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.

dbscanCBI: an interface to the function dbscan for density based
clustering. This can be used with
cases*variables as well as dissimilarity matrices as input.

mahalCBI: an interface to the function fixmahal for fixed point
clustering. This assumes a cases*variables matrix as input.

See also: dist, and the interface functions
kmeansCBI, hclustCBI, hclusttreeCBI, disthclustCBI,
noisemclustCBI, distnoisemclustCBI, claraCBI, pamkCBI,
trimkmeansCBI, disttrimkmeansCBI, dbscanCBI, mahalCBI.

Examples:

set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
cf1 <- clusterboot(face,B=5,bootmethod=
        c("boot","noise","jitter"),clustermethod=kmeansCBI,
        k=5)
print(cf1)
plot(cf1)
cf2 <- clusterboot(dist(face),B=5,bootmethod=
        "subset",clustermethod=disthclustCBI,
        k=5, cut="number", method="average", showplots=TRUE)
print(cf2)
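As noted above, further interface functions can be implemented for other clustering methods. In the style of kmeansCBI, an interface function receives the (resampled) data plus method parameters and returns a list with components result, nc, clusterlist, partition and clustermethod. A minimal sketch of such a custom interface (mykmeansCBI is a hypothetical name we introduce here, wrapping stats::kmeans):

```r
# Hypothetical custom interface function in the style of kmeansCBI.
# clusterboot calls it as clustermethod(data, ...) and expects a list
# with: result (full clustering output), nc (number of clusters),
# clusterlist (one logical membership vector per cluster),
# partition (cluster number per case), clustermethod (a label).
mykmeansCBI <- function(data, k, ...) {
  out <- kmeans(as.matrix(data), centers = k, ...)
  part <- out$cluster
  list(result = out,
       nc = k,
       clusterlist = lapply(seq_len(k), function(i) part == i),
       partition = part,
       clustermethod = "kmeans (custom interface)")
}

set.seed(1)
cb <- mykmeansCBI(matrix(rnorm(60), ncol = 2), k = 3)
```

The function itself would then be passed as the clustermethod argument, e.g. clusterboot(data, clustermethod=mykmeansCBI, k=3).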