clusterboot is an integrated function that also computes the
clustering itself, using interface functions for various
clustering methods implemented in R (several interface functions are
provided, but you can implement further ones for your favourite
clustering method). See the documentation of the input parameter
clustermethod below.
Quite general clustering methods are possible, i.e., methods that estimate or fix the number of clusters, and methods that produce overlapping clusters or do not assign all cases to clusters (declaring them as "noise" instead). Fuzzy clusterings cannot be processed and have to be transformed to crisp clusterings by the interface function.
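clusterboot's stability assessment is based on the Jaccard similarity between a cluster of the original data set and the most similar cluster found in a resampled data set. Treating clusters as sets of case indices, the statistic can be sketched in a few lines of base R (the helper name jaccard is ours, not part of the package):

```r
# Jaccard similarity between two clusters given as vectors of case
# indices: size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

jaccard(c(1, 2, 3, 4), c(3, 4, 5, 6))  # 2 shared cases out of 6 distinct: 1/3
```

A cluster with low Jaccard similarity to every cluster of the resampled data in many runs is unstable.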
clusterboot(data, B=100, distances=(class(data)=="dist"),
            bootmethod=if(distances) "boot" else c("boot","noise"),
            bscompare=FALSE, multipleboot=TRUE,
            jittertuning=0.05, noisetuning=c(0.05,4),
            subtuning=floor(nrow(data)/2),
            clustermethod, noisemethod=FALSE, count=TRUE,
            showplots=FALSE, dissolution=0.5,
            recover=0.75, ...)

## S3 method for class 'clboot':
print(x, statistics=c("mean","dissolution","recovery"), ...)

## S3 method for class 'clboot':
plot(x, xlim=c(0,1), breaks=seq(0,1,by=0.05), ...)
Arguments:

data: the data matrix - either an n*p-data matrix (or data frame) or an
n*n-dissimilarity matrix (or dist-object).

B: number of resampling runs for each method in bootmethod.

distances: logical. If TRUE, the data is interpreted as a
dissimilarity matrix. If data is a dist-object,
distances=TRUE automatically; otherwise
distances=FALSE by default, so it has to be set to TRUE manually if
data is a dissimilarity matrix that is not a dist-object.

bootmethod: vector of strings defining the resampling methods to use:
"boot": nonparametric bootstrap (precise behaviour is
controlled by the parameters bscompare and multipleboot).
"subset": selection of random subsets of size subtuning from the data.
"noise": replacement of a certain proportion of the points by random
noise (see noisetuning).
"jitter": addition of random noise to all points (see jittertuning).
"bojit": nonparametric bootstrap first, followed by jittering of the
points.

bscompare: logical. If TRUE, multiple points in the
bootstrap sample are taken into account to compute the Jaccard
similarity to the original clusters (which are represented by their
"bootstrap versions", i.e., the points of the original data set that
also occur in the bootstrap sample); a point drawn more than once then
has multiple influence. If FALSE, every point is counted only once.

multipleboot: logical. If FALSE, all points drawn more
than once in the bootstrap draw are only used once in the bootstrap
samples.

jittertuning: positive number. Tuning for the "jitter"-method. The
noise distribution for jittering is a normal distribution with zero
mean. The covariance matrix has the same eigenvectors as that of the
original data set, but the standard deviations along these directions
are governed by jittertuning (smaller values mean less jittering).

noisetuning: vector of two positive numbers. Tuning for the
"noise"-method. The first component determines the
probability that a point is replaced by noise. Noise is generated by
a uniform distribution on a hyperrectangle along the principal
directions of the original data set, ranging from -noisetuning[2] to
noisetuning[2] times the standard deviation of the data along the
respective direction.

subtuning: integer. Size of the subsets drawn for the "subset"-method.

clustermethod: an interface function (the function itself, not a
character string with its name, has to be passed). This determines the
clustering method; see the information about interface functions below.

noisemethod: logical. If TRUE, the last cluster is
regarded as "noise component", which means that for computing the Jaccard
similarity, it is not treated as a cluster. The noise component of
the original clustering is only compared with the noise components of
the clusterings of the resampled data sets.

count: logical. If TRUE, the resampling runs are counted
on the screen.

showplots: logical. If TRUE, a plot of the first two
dimensions of the resampled data set (or the classical MDS solution
for dissimilarity data) is shown for every resampling run. The last
plot shows the original data set.

dissolution: numeric between 0 and 1. If the Jaccard similarity
between a cluster and the most similar cluster in a resampled data set
is smaller than or equal to this value, the cluster is considered
"dissolved" in that run.

recover: numeric between 0 and 1. If the Jaccard similarity between a
cluster and the most similar cluster in a resampled data set is larger
than this value, the cluster is considered "successfully recovered" in
that run.

...: additional parameters for the clustermethod called by
clusterboot. No effect in print.clboot and plot.clboot.

x: object of class clboot.

statistics: specifies, in print.clboot,
which of the three clusterwise Jaccard
similarity statistics "mean", "dissolution" (number of
times the cluster has been dissolved) and "recovery" (number
of times the cluster has been successfully recovered) are printed.

xlim: transferred to hist.

breaks: transferred to hist.

Value:

clusterboot returns an object of class "clboot", which
is a list with components
result, partition, nc, clustermethod, B, bootmethod,
multipleboot, dissolution, recover, bootresult, bootmean, bootbrd,
bootrecover, jitterresult, jittermean, jitterbrd, jitterrecover,
subsetresult, subsetmean, subsetbrd, subsetrecover, bojitresult,
bojitmean, bojitbrd, bojitrecover, noiseresult, noisemean,
noisebrd, noiserecover.

result: output of the call to clustermethod for the original data set.

partition: the partition produced by the call to clustermethod
(note that this is only meaningful for partitioning clustering methods).

nc: number of clusters in the original data set (including the noise
component if noisemethod=TRUE).

bootresult: matrix of Jaccard similarities for
bootmethod="boot". Rows correspond to clusters in the
original data set. Columns correspond to bootstrap runs.

bootmean: clusterwise means of the bootresult.

bootbrd: clusterwise number of runs in which the cluster was dissolved.

bootrecover: clusterwise number of runs in which the cluster was
successfully recovered.

jitterresult, jittermean, etc.: same as bootresult,
bootmean, etc., but for the other resampling methods.

Details:

While B=100 is recommended, smaller run numbers can give
quite informative results as well if computation times become too high.
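A single run of the "boot" scheme amounts to drawing n cases with replacement and clustering the resampled rows. A minimal base-R sketch of one such resampling run (illustrative only, not clusterboot's internal code):

```r
# One nonparametric bootstrap resample of a cases*variables matrix:
# draw n row indices with replacement and keep the corresponding rows.
set.seed(1)
x <- matrix(rnorm(40), nrow = 20)       # toy data set with 20 cases
idx <- sample(nrow(x), replace = TRUE)  # bootstrap draw of case indices
xboot <- x[idx, , drop = FALSE]         # resampled data set, same size
# with multipleboot=FALSE, cases drawn more than once are used only once:
xunique <- x[unique(idx), , drop = FALSE]
```

The clustering method is then applied to xboot (or xunique), and the resulting clusters are compared to the original ones via the Jaccard similarity.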
Note that the stability of a cluster is assessed, but
stability is not the only important validity criterion - clusters
obtained by very inflexible clustering methods may be stable but not
valid, as discussed in Hennig (2007).
See plotcluster for graphical cluster validation.
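The "noise" scheme replaces a fraction of cases by uniform noise on a hyperrectangle along the principal directions of the data, as described under noisetuning. An illustrative base-R sketch of this idea, with p and r standing in for noisetuning[1] and noisetuning[2] (again, not the package's internal code):

```r
set.seed(2)
x <- matrix(rnorm(100), ncol = 2)        # toy data set, 50 cases
p <- 0.05; r <- 4                        # roles of noisetuning = c(p, r)
pc <- prcomp(x)                          # principal directions of the data
scores <- pc$x
replace <- runif(nrow(x)) < p            # cases to replace by noise
nrep <- sum(replace)
# uniform noise on a hyperrectangle spanning +-r standard deviations
# along each principal direction:
noise <- matrix(runif(nrep * ncol(x), -1, 1), nrow = nrep)
scores[replace, ] <- sweep(noise, 2, r * pc$sdev, `*`)
# rotate back to the original coordinate system:
xnoise <- sweep(scores %*% t(pc$rotation), 2, pc$center, `+`)
```

Cases not selected for replacement are reconstructed exactly; only the selected ones become noise points.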
Information about interface functions for clustering methods:
The following interface functions are currently
implemented (in the present package; note that almost all of these
functions require the specification of some control parameters, so
if you use one of them, look up their common help page
(kmeansCBI) first):

kmeansCBI: an interface to the function kmeans for k-means clustering.
This assumes a cases*variables matrix as input.

hclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with optional noise component. This
function produces a partition and assumes a cases*variables
matrix as input.

hclusttreeCBI: an interface to the function hclust for agglomerative
hierarchical clustering. This function produces a tree (not only a
partition; therefore the number of clusters can be huge!) and assumes
a cases*variables matrix as input.

disthclustCBI: an interface to the function hclust for agglomerative
hierarchical clustering with optional noise component. This
function produces a partition and assumes a dissimilarity
matrix as input.

noisemclustCBI: an interface to the functions EMclust and
EMclustN for normal mixture model based
clustering. This assumes a cases*variables matrix as
input. Warning: EMclust and EMclustN often have problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE.

distnoisemclustCBI: an interface to the functions EMclust and
EMclustN for normal mixture model based
clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.
Warning: EMclust and EMclustN often have problems with multiple
points. It is recommended to use this only together with
multipleboot=FALSE.

claraCBI: an interface to the functions pam and clara
for partitioning around medoids. This can be used with
cases*variables as well as dissimilarity matrices as input.

pamkCBI: an interface to the function pamk for partitioning around
medoids, where the number of clusters is estimated by the average
silhouette width. This can be used with
cases*variables as well as dissimilarity matrices as input.

trimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a cases*variables matrix as input.

disttrimkmeansCBI: an interface to the function trimkmeans for trimmed
k-means clustering. This assumes a dissimilarity matrix as input and
generates a data matrix by multidimensional scaling first.

dbscanCBI: an interface to the function dbscan for density based
clustering. This can be used with
cases*variables as well as dissimilarity matrices as input.

mahalCBI: an interface to the function fixmahal for fixed point
clustering. This assumes a cases*variables matrix as input.

See also: dist, and the interface functions
kmeansCBI, hclustCBI, hclusttreeCBI, disthclustCBI,
noisemclustCBI, distnoisemclustCBI, claraCBI, pamkCBI,
trimkmeansCBI, disttrimkmeansCBI, dbscanCBI, mahalCBI.

Examples:

set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
cf1 <- clusterboot(face,B=5,bootmethod=
        c("boot","noise","jitter"),clustermethod=kmeansCBI,
        k=5)
print(cf1)
plot(cf1)
cf2 <- clusterboot(dist(face),B=5,bootmethod=
        "subset",clustermethod=disthclustCBI,
        k=5, cut="number", method="average", showplots=TRUE)
print(cf2)
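As noted above, further interface functions can be implemented for other clustering methods. In the style of kmeansCBI, an interface function receives the (resampled) data plus method parameters and returns a list with components result, nc, clusterlist, partition and clustermethod. A minimal sketch of such a custom interface (mykmeansCBI is a hypothetical name we introduce here, wrapping stats::kmeans):

```r
# Hypothetical custom interface function in the style of kmeansCBI.
# clusterboot calls it as clustermethod(data, ...) and expects a list
# with: result (full clustering output), nc (number of clusters),
# clusterlist (one logical membership vector per cluster),
# partition (cluster number per case), clustermethod (a label).
mykmeansCBI <- function(data, k, ...) {
  out <- kmeans(as.matrix(data), centers = k, ...)
  part <- out$cluster
  list(result = out,
       nc = k,
       clusterlist = lapply(seq_len(k), function(i) part == i),
       partition = part,
       clustermethod = "kmeans (custom interface)")
}

set.seed(1)
cb <- mykmeansCBI(matrix(rnorm(60), ncol = 2), k = 3)
```

The function itself would then be passed as the clustermethod argument, e.g. clusterboot(data, clustermethod=mykmeansCBI, k=3).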