fpc (version 2.1-6)

kmeansCBI: Interface functions for clustering methods

Description

These functions provide an interface to several clustering methods implemented in R, for use together with the cluster stability assessment in clusterboot (as parameter clustermethod; "CBI" stands for "clusterboot interface"). They can also be useful for computing a clustering without running clusterboot, because some of the functions offer additional features (e.g., normal mixture model based clustering of dissimilarity matrices projected into Euclidean space by MDS, partitioning around medoids with an estimated number of clusters, and noise/outlier identification in hierarchical clustering).

Usage

kmeansCBI(data,krange,k,scaling=FALSE,runs=1,criterion="ch",...)

hclustCBI(data,k,cut="number",method,scaling=TRUE,noisecut=0,...)

hclusttreeCBI(data,minlevel=2,method,scaling=TRUE,...)

disthclustCBI(dmatrix,k,cut="number",method,noisecut=0,...)

noisemclustCBI(data,G,k,emModelNames,nnk,hcmodel=NULL,Vinv=NULL, summary.out=FALSE,...)

distnoisemclustCBI(dmatrix,G,k,emModelNames,nnk, hcmodel=NULL,Vinv=NULL,mdsmethod="classical", mdsdim=4, summary.out=FALSE, points.out=FALSE,...)

claraCBI(data,k,usepam=TRUE,diss=inherits(data,"dist"),...)

pamkCBI(data,krange=2:10,k=NULL,criterion="asw", usepam=TRUE, scaling=TRUE,diss=inherits(data,"dist"),...)

trimkmeansCBI(data,k,scaling=TRUE,trim=0.1,...)

tclustCBI(data,k,trim=0.05,...)

disttrimkmeansCBI(dmatrix,k,scaling=TRUE,trim=0.1, mdsmethod="classical", mdsdim=4,...)

dbscanCBI(data,eps,MinPts,diss=inherits(data,"dist"),...)

mahalCBI(data,clustercut=0.5,...)

mergenormCBI(data, G=NULL, k=NULL, emModelNames=NULL, nnk=0, hcmodel = NULL, Vinv = NULL, mergemethod="bhat", cutoff=0.1,...)

speccCBI(data,k,...)

Arguments

data
a numeric matrix. The data matrix - usually a cases*variables-data matrix. claraCBI, pamkCBI and dbscanCBI work with an n*n-dissimilarity matrix as well, see parameter diss.
dmatrix
a square numeric dissimilarity matrix or a dist-object.
k
numeric, usually integer. In most cases, this is the number of clusters for methods where this is fixed. For hclustCBI and disthclustCBI see parameter cut below. Some methods have a k paramet
scaling
either a logical value or a numeric vector of length equal to the number of variables. If scaling is a numeric vector with length equal to the number of variables, then each variable is divided by the corresponding value from
runs
integer. Number of random initializations from which the k-means algorithm is started.
criterion
"ch" or "asw". Decides whether number of clusters is estimated by the Calinski-Harabasz criterion or by the average silhouette width.
cut
either "level" or "number". This determines how cutree is used to obtain a partition from a hierarchy tree. cut="level" means that the tree is cut at a particular dissimilarity level, cut="number" means t
method
method for hierarchical clustering, see the documentation of hclust.
noisecut
numeric. All clusters of size <= noisecut in the disthclustCBI/hclustCBI-partition are joined and declared as noise/outliers.
minlevel
integer. minlevel=1 means that all clusters in the tree are given out by hclusttreeCBI or disthclusttreeCBI, including one-point clusters (but excluding the cluster with all points). minlevel=2
G
vector of integers. Number of clusters or numbers of clusters used by mclustBIC. If G has more than one entry, the number of clusters is estimated by the BIC.
emModelNames
vector of strings. Models for covariance matrices, see documentation of mclustBIC.
nnk
integer. Tuning constant for NNclean, which is used to estimate the initial noise for noisemclustCBI and distnoisemclustCBI. See parameter k in the
hcmodel
string or NULL. Determines the initialization of the EM-algorithm for mclustBIC. Documented in hc.
Vinv
numeric. See documentation of mclustBIC.
summary.out
logical. If TRUE, the result of summary.mclustBIC is added as component mclustsummary to the output of noisemclustCBI and distnoisemc
mdsmethod
"classical", "kruskal" or "sammon". Determines the multidimensional scaling method to compute Euclidean data from a dissimilarity matrix. See cmdscale, isoM
mdsdim
integer. Dimensionality of MDS solution.
points.out
logical. If TRUE, the matrix of MDS points is added as component points to the output of noisemclustCBI.
usepam
logical. If TRUE, the function pam is used for clustering, otherwise clara. pam is
diss
logical. If TRUE, data will be considered as a dissimilarity matrix. In claraCBI, this requires usepam=TRUE.
krange
vector of integers. Numbers of clusters to be compared.
trim
numeric between 0 and 1. Proportion of data points trimmed, i.e., assigned to noise. See tclust, trimkmeans.
eps
numeric. The radius of the neighborhoods to be considered by dbscan.
MinPts
integer. How many points have to be in a neighborhood so that a point is considered to be a cluster seed? See documentation of dbscan.
clustercut
numeric between 0 and 1. If fixmahal is used for fuzzy clustering, a crisp partition is generated and points with cluster membership values above clustercut are considered as membe
mergemethod
method for merging Gaussians, passed on as method to mergenormals.
cutoff
numeric between 0 and 1, tuning constant for mergenormals.
...
further parameters to be transferred to the original clustering functions (not required).
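
As a rough illustration of the numeric form of the scaling argument described above, the following base-R sketch divides each column of a small made-up matrix by a corresponding value. The matrix x and the vector sc are hypothetical, and the text shown here does not say whether the interface functions also centre the variables, so this demonstrates the division step only.

```r
# Hypothetical data: 4 cases, 2 variables on different scales
x <- cbind(height = c(150, 160, 170, 180),
           weight = c(50, 60, 70, 80))

# Hypothetical scaling vector, one entry per variable
sc <- c(10, 5)

# Divide each column by its corresponding scaling value
x_scaled <- sweep(x, 2, sc, "/")
print(x_scaled)
```
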

Value

All interface functions return a list with the following components (there may be some more, see summary.out and points.out above):
  • result: clustering result, usually a list with the full output of the clustering method (the precise format doesn't matter); whatever you want to use later.
  • nc: number of clusters. If some points don't belong to any cluster but are declared as "noise", nc includes the noise component, and there should be another component nccl, being the number of clusters not including the noise component.
  • clusterlist: a list containing, for each cluster, one logical vector of length n (the number of data points), indicating whether a point is a member of this cluster (TRUE) or not. If a noise component is included, it should always be the last vector in this list.
  • partition: an integer vector of length n, partitioning the data. If the method produces a partition, it should be the clustering. This component is only used for plots, so you could do something like rep(1,n) for non-partitioning methods.
  • clustermethod: a string indicating the clustering method.
The output of some of the functions has further components:
  • nccl: see nc above.
  • nnk: by noisemclustCBI and distnoisemclustCBI, see above.
  • initnoise: logical vector, indicating initially estimated noise by NNclean, called by noisemclustCBI and distnoisemclustCBI.
  • noise: logical. TRUE if points were classified as noise/outliers by disthclustCBI.
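
To make the required components listed above concrete, here is a minimal sketch of a user-written interface function that wraps kmeans from base R and returns a list in that shape. The function name myKmeansCBI and the simulated data are hypothetical and not part of fpc; this is an illustration of the output format, not the package's own implementation.

```r
# Hypothetical interface function; the name myKmeansCBI is made up for
# illustration and is not part of fpc.
myKmeansCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  # Underlying clustering method (stats::kmeans from base R)
  fit <- kmeans(data, centers = k, ...)
  # One logical membership vector of length n per cluster
  clusterlist <- lapply(seq_len(k), function(i) fit$cluster == i)
  list(
    result = fit,                 # full output of the clustering method
    nc = k,                       # number of clusters (no noise component here)
    clusterlist = clusterlist,
    partition = fit$cluster,      # integer vector of length n
    clustermethod = "kmeans (sketch)"
  )
}

# Two well-separated groups of 20 simulated points each
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
out <- myKmeansCBI(x, k = 2)
str(out[c("nc", "clustermethod")])
```

Because kmeans assigns every point to a cluster, there is no noise component in this sketch, so nccl is omitted.
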

Details

All these functions call clustering methods implemented in R to cluster data and return output in the format required by clusterboot. For further details see the help pages of the underlying clustering methods.

See Also

clusterboot, dist, kmeans, kmeansruns, hclust, mclustBIC, pam, pamk, clara, trimkmeans, dbscan, fixmahal

Examples

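library(fpc)  # provides rFace and the CBI interface functions
set.seed(20000)
# Simulate face-shaped benchmark data (n = 50, p = 2)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
# DBSCAN on the raw data
dbs <- dbscanCBI(face,eps=1.5,MinPts=4)
# Average-linkage hierarchical clustering; clusters of size <= 2 become noise
dhc <- disthclustCBI(dist(face),method="average",k=1.5,noisecut=2)
# Cross-tabulate the two partitions
table(dbs$partition,dhc$partition)
# Gaussian mixture with merging of components (mergemethod defaults to "bhat")
mergenormCBI(face,G=10,emModelNames="EEE",nnk=2)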
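
Since the interface functions are meant to be passed to clusterboot as clustermethod, a short sketch of that usage may help. It assumes the fpc package is installed and is skipped otherwise. The choices B = 20 and k = 2 are illustrative only, not recommendations, and the object name cb is hypothetical.

```r
# Assumes the fpc package is installed; the block is skipped otherwise.
if (requireNamespace("fpc", quietly = TRUE)) {
  set.seed(20000)
  face <- fpc::rFace(50, dMoNo = 2, dNoEy = 0, p = 2)
  # B = 20 bootstrap replicates keeps the run short; a real analysis
  # would normally use more. k = 2 is passed on to kmeansCBI.
  cb <- fpc::clusterboot(face, B = 20, bootmethod = "boot",
                         clustermethod = fpc::kmeansCBI,
                         k = 2, count = FALSE)
  # Clusterwise mean Jaccard similarities across the bootstrap replicates
  print(cb$bootmean)
}
```
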
