clusGap()
calculates a goodness of clustering measure, the
“gap” statistic. For each number of clusters x
, after first centering, and then
svd
(aka ‘PCA’)-rotating them when (as by
default) spaceH0 = "scaledPCA"
.
maxSE(f, SE.f)
determines the location of the maximum
of f
, taking a “1-SE rule” into account for the
*SE*
methods. The default method "firstSEmax"
looks for
the smallest "Tibs2001SEmax"
, Tibshirani
et al's recommendation of determining the number of clusters from the
gap statistics and their standard deviations.
clusGap(x, FUNcluster, K.max, B = 100, d.power = 1,
spaceH0 = c("scaledPCA", "original"),
verbose = interactive(), ...)maxSE(f, SE.f,
method = c("firstSEmax", "Tibs2001SEmax", "globalSEmax",
"firstmax", "globalmax"),
SE.factor = 1)
# S3 method for clusGap
print(x, method = "firstSEmax", SE.factor = 1, ...)
# S3 method for clusGap
plot(x, type = "b", xlab = "k", ylab = expression(Gap[k]),
main = NULL, do.arrows = TRUE,
arrowArgs = list(col="red3", length=1/16, angle=90, code=3), ...)
clusGap(..)
returns an object of S3 class "clusGap"
,
basically a list with components
a matrix with K.max
rows and 4 columns, named
"logW", "E.logW", "gap", and "SE.sim",
where gap = E.logW - logW
, and SE.sim
corresponds to
the standard error of gap
, SE.sim[k]=
the clusGap(..)
call
.
the spaceH0
argument (match.arg()
ed).
number of observations, i.e., nrow(x)
.
input B
input function FUNcluster
numeric matrix or data.frame
.
a function
which accepts as first
argument a (data) matrix like x
, second argument, say
list
with a component named (or shortened to)
cluster
which is a vector of length n = nrow(x)
of
integers in 1:k
determining the clustering or grouping of the
n
observations.
the maximum number of clusters to consider, must be at least two.
integer, number of Monte Carlo (“bootstrap”) samples.
a positive integer specifying the power dist
) before
they are summed up to give d.power = 1
,
corresponds to the “historical” R implementation, whereas
d.power = 2
corresponds to what Tibshirani et al had
proposed. This was found by Juan Gonzalez, in 2016-02.
a character
string specifying the
space of the "scaledPCA"
and "original"
use a uniform distribution
in a hyper cube and had been mentioned in the reference;
"original"
been added after a proposal (including code) by
Juan Gonzalez.
integer or logical, determining if “progress” output should be printed. The default prints one bit per bootstrap sample.
(for clusGap()
:) optionally further arguments for
FUNcluster()
, see kmeans
example below.
numeric vector of ‘function values’, of length
numeric vector of length f
.
character string indicating how the “optimal”
number of clusters,
"globalmax"
:simply corresponds to the global maximum,
i.e., is which.max(f)
"firstmax"
:gives the location of the first local maximum.
"Tibs2001SEmax"
:uses the criterion, Tibshirani et
al (2001) proposed: “the smallest
"firstSEmax"
:location of the first SE.factor * SE.f[]
, i.e, within an “f S.E.” range
of that maximum (see also SE.factor
).
This, the default, has been proposed by Martin Maechler in 2012,
when adding clusGap()
to the cluster package, after
having seen the "globalSEmax"
proposal (in code) and read
the "Tibs2001SEmax"
proposal.
"globalSEmax"
:(used in Dudoit and Fridlyand (2002),
supposedly following Tibshirani's proposition):
location of the first SE.factor * SE.f[]
, i.e,
within an “f S.E.” range of that maximum (see also
SE.factor
).
See the examples for a comparison in a simple case.
[When method
contains "SE"
] Determining
the optimal number of clusters, Tibshirani et al. proposed the
“1 S.E.”-rule. Using an SE.factor
arguments with the same meaning as in
plot.default()
, with different default.
logical indicating if (1 SE -)“error bars”
should be drawn, via arrows()
.
a list of arguments passed to arrows()
;
the default, notably angle
and code
, provide a style
matching usual error bars.
This function is originally based on the functions gap
of
former (Bioconductor) package SAGx by Per Broberg,
gapStat()
from former package SLmisc by Matthias Kohl
and ideas from gap()
and its methods of package lga by
Justin Harrington.
The current implementation is by Martin Maechler.
The implementation of spaceH0 = "original"
is based on code
proposed by Juan Gonzalez.
The main result <res>$Tab[,"gap"]
of course is from
bootstrapping aka Monte Carlo simulation and hence random, or
equivalently, depending on the initial random seed (see
set.seed()
).
On the other hand, in our experience, using B = 500
gives
quite precise results such that the gap plot is basically unchanged
after an another run.
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411--423.
Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.
Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). tools:::Rd_expr_doi("10.1186/gb-2002-3-7-research0036")
Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. https://bioconductor.org/packages/3.12/bioc/html/SAGx.html Deprecated and removed from Bioc ca. 2022
silhouette
for a much simpler less sophisticated
goodness of clustering measure.
cluster.stats()
in package fpc for
alternative measures.
### --- maxSE() methods -------------------------------------------
(mets <- eval(formals(maxSE)$method))
fk <- c(2,3,5,4,7,8,5,4)
sk <- c(1,1,2,1,1,3,1,1)/2
## use plot.clusGap():
plot(structure(class="clusGap", list(Tab = cbind(gap=fk, SE.sim=sk))))
## Note that 'firstmax' and 'globalmax' are always at 3 and 6 :
sapply(c(1/4, 1,2,4), function(SEf)
sapply(mets, function(M) maxSE(fk, sk, method = M, SE.factor = SEf)))
### --- clusGap() -------------------------------------------------
## ridiculously nicely separated clusters in 3 D :
x <- rbind(matrix(rnorm(150, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))
## Slightly faster way to use pam (see below)
pam1 <- function(x,k) list(cluster = pam(x,k, cluster.only=TRUE))
## We do not recommend using hier.clustering here, but if you want,
## there is factoextra::hcut () or a cheap version of it
hclusCut <- function(x, k, d.meth = "euclidean", ...)
list(cluster = cutree(hclust(dist(x, method=d.meth), ...), k=k))
## You can manually set it before running this : doExtras <- TRUE # or FALSE
if(!(exists("doExtras") && is.logical(doExtras)))
doExtras <- cluster:::doExtras()
if(doExtras) {
## Note we use B = 60 in the following examples to keep them "speedy".
## ---- rather keep the default B = 500 for your analysis!
## note we can pass 'nstart = 20' to kmeans() :
gskmn <- clusGap(x, FUNcluster = kmeans, nstart = 20, K.max = 8, B = 60)
gskmn #-> its print() method
plot(gskmn, main = "clusGap(., FUNcluster = kmeans, n.start=20, B= 60)")
set.seed(12); system.time(
gsPam0 <- clusGap(x, FUNcluster = pam, K.max = 8, B = 60)
)
set.seed(12); system.time(
gsPam1 <- clusGap(x, FUNcluster = pam1, K.max = 8, B = 60)
)
## and show that it gives the "same":
not.eq <- c("call", "FUNcluster"); n <- names(gsPam0)
eq <- n[!(n %in% not.eq)]
stopifnot(identical(gsPam1[eq], gsPam0[eq]))
print(gsPam1, method="globalSEmax")
print(gsPam1, method="globalmax")
print(gsHc <- clusGap(x, FUNcluster = hclusCut, K.max = 8, B = 60))
}# end {doExtras}
gs.pam.RU <- clusGap(ruspini, FUNcluster = pam1, K.max = 8, B = 60)
gs.pam.RU
plot(gs.pam.RU, main = "Gap statistic for the 'ruspini' data")
mtext("k = 4 is best .. and k = 5 pretty close")
## This takes a minute..
## No clustering ==> k = 1 ("one cluster") should be optimal:
Z <- matrix(rnorm(256*3), 256,3)
gsP.Z <- clusGap(Z, FUNcluster = pam1, K.max = 8, B = 200)
plot(gsP.Z, main = "clusGap() ==> k = 1 cluster is optimal")
gsP.Z
Run the code above in your browser using DataLab