Partitioning around medoids with estimation of number of clusters
This calls the function pam or clara to perform a partitioning around medoids clustering, with the number of clusters estimated by optimum average silhouette width (see pam.object) or by the Calinski-Harabasz index (see calinhara). The Duda-Hart test (see dudahart2) is applied to decide whether there should be more than one cluster (unless 1 is excluded as number of clusters or data are dissimilarities).
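The selection by average silhouette width described above can be sketched directly with pam from the cluster package. This is a minimal illustration under stated assumptions, not the internal code of pamk: the simulated data, the candidate range and the names asw and best_k are made up for the example, and the Duda-Hart step for one cluster is left out.

```r
library(cluster)  # provides pam()

set.seed(42)
# Illustrative data (an assumption): three well-separated 2-d groups of 30 points
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.4), rnorm(30, centers[i, 2], 0.4))
}))

krange <- 2:6
# Average silhouette width of the pam solution for each candidate k
asw <- sapply(krange, function(k) pam(x, k)$silinfo$avg.width)
names(asw) <- krange

# The estimated number of clusters is the k with the largest width
best_k <- krange[which.max(asw)]
print(round(asw, 3))
print(best_k)
```

The average silhouette width is only defined for two or more clusters, which is why the candidate range here starts at 2.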
pamk(data,krange=2:10,criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(data, "dist"), critout=FALSE, ns=10, seed=NULL, ...)
- data: a data matrix or data frame or something that can be coerced into a matrix, or a dissimilarity matrix or object. See pam for more information.
- krange: integer vector. Numbers of clusters which are to be compared by the chosen criterion. Note: average silhouette width and Calinski-Harabasz can't estimate the number of clusters nc=1. If 1 is included, a Duda-Hart test is applied and 1 is estimated if this is not significant.
- criterion: one of "asw", "multiasw" or "ch". Determines whether average silhouette width (as given out by pam or clara, or as computed by distcritmulti if "multiasw" is specified; recommended for large data sets with usepam=FALSE) or Calinski-Harabasz is applied. Note that the original Calinski-Harabasz index is not defined for dissimilarities; if dissimilarity data is run with criterion="ch", the dissimilarity-based generalisation in Hennig and Liao (2013) is used.
- usepam: logical. If TRUE, pam is used, otherwise clara (recommended for large datasets with 2,000 or more observations; dissimilarity matrices cannot be used with clara).
- scaling: either a logical value or a numeric vector of length equal to the number of variables. If scaling is a numeric vector with length equal to the number of variables, then each variable is divided by the corresponding value from scaling. If scaling is TRUE then scaling is done by dividing the (centered) variables by their root-mean-square, and if scaling is FALSE, no scaling is done.
- alpha: numeric between 0 and 1, tuning constant for dudahart2 (only used for 1-cluster test).
- diss: logical flag: if TRUE (the default when data inherits from class "dist"), then data will be considered as a dissimilarity matrix (and the potential number of clusters 1 will be ignored). If FALSE, then data will be considered as a matrix of observations by variables.
- critout: logical. If TRUE, the criterion value is printed out for every number of clusters.
- ns: passed on to distcritmulti if criterion="multiasw".
- seed: passed on to distcritmulti if criterion="multiasw".
- ...: further arguments to be transferred to pam or clara.
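The scaling options and the default for diss can be illustrated with base R alone. The sketch below assumes that the scaling described above corresponds to base R's scale() with centering; this is an assumption made for illustration, not a statement about the internals of pamk. The toy data and the divisors vector are hypothetical. Centering shifts all observations by the same amount, so it does not change the Euclidean distances between them.

```r
# Toy data (an assumption): two variables on very different scales
x <- cbind(height_cm = c(150, 160, 170, 180, 190),
           weight_kg = c(50, 60, 65, 80, 95))

# Logical TRUE: centre each column, then divide by the root-mean-square
# of the centred column (with the n - 1 denominator, its standard deviation)
x_std <- scale(x, center = TRUE, scale = TRUE)

# Numeric vector: centre, then divide column j by divisors[j]
divisors <- c(10, 5)  # hypothetical per-variable divisors
x_div <- scale(x, center = TRUE, scale = divisors)

# The default diss = inherits(data, "dist") is TRUE only for "dist" objects
d <- dist(x_std)
print(inherits(d, "dist"))
print(inherits(x_std, "dist"))
```

Without scaling, the variable with the larger numeric range dominates the distances, which is the usual reason for setting scaling to TRUE or supplying divisors.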
A list with components
- pamobject: The output of the optimal run of the pam or clara function.
- nc: the optimal number of clusters.
- crit: vector of criterion values for numbers of clusters. crit[1] is the p-value of the Duda-Hart test if 1 is in krange and diss=FALSE.
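The components of the returned list can be read as in the following sketch. The simulated data are an assumption made for illustration, and the requireNamespace guard only keeps the snippet from failing where fpc is not installed.

```r
set.seed(1)
# Illustrative data (an assumption): two tight, well-separated 2-d groups
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2))

if (requireNamespace("fpc", quietly = TRUE)) {
  pk <- fpc::pamk(x, krange = 2:5)

  print(pk$nc)                          # estimated number of clusters
  print(table(pk$pamobject$clustering)) # cluster sizes from the selected run
  print(pk$pamobject$medoids)           # medoids of the selected run
  print(pk$crit)                        # criterion values, indexed by number of clusters
}
```

Because 1 is not in krange here, no Duda-Hart test is run and the first entry of crit carries no p-value.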
Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.
Duda, R. O. and Hart, P. E. (1973) Pattern Classification and Scene Analysis. Wiley, New York.
Hennig, C. and Liao, T. (2013) How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, Journal of the Royal Statistical Society, Series C (Applied Statistics), 62, 309-369.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
library(fpc)
options(digits=3)
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
pk1 <- pamk(face,krange=1:5,criterion="asw",critout=TRUE)
pk2 <- pamk(face,krange=1:5,criterion="multiasw",ns=2,critout=TRUE)
# "multiasw" is better for larger data sets, use larger ns then.
pk3 <- pamk(face,krange=1:5,criterion="ch",critout=TRUE)