valstat.object: Cluster validation statistics - object

Description

Objects of class "valstat" store cluster validation statistics for various clustering methods, each run with various numbers of clusters.

Value

A valid valstat object is a list. Its layout depends on the number of clustering methods involved, nmethods, say, i.e., the length of the method component explained below. The first nmethods elements of the valstat list are indexed by number only. Each of these is itself a list whose entries are numbered from 1 to the maxG component defined below. Element [[i]][[j]] refers to the clustering produced by clustering method number i with j clusters. Every such element is a list with components avewithin, mnnd, cvnnd, maxdiameter, widestgap, sindex, minsep, asw, dindex, denscut, highdgap, pearsongamma, withinss, entropy. Further optional components are pamc, kdnorm, kdunif, dmode, aggregated. All these are cluster validation indexes, as follows.

avewithin

average distance within clusters (reweighted so that every observation, rather than every distance, has the same weight).

mnnd

average distance to nnkth nearest neighbour within cluster. (nnk is a parameter of cqcluster.stats, default 2.)

cvnnd

coefficient of variation of dissimilarities to nnkth nearest within-cluster neighbour, measuring uniformity of within-cluster densities, weighted over all clusters, see Sec. 3.7 of Hennig (2017). (nnk is a parameter of cqcluster.stats, default 2.)

maxdiameter

maximum cluster diameter.

widestgap

widest within-cluster gap or average of cluster-wise widest within-cluster gap, depending on parameter averagegap of cqcluster.stats, default FALSE.

sindex

separation index. Based on the distance from every point to the closest point not in the same cluster. The separation index is the mean of the smallest proportion sepprob (parameter of cqcluster.stats, default 0.1) of these distances. See Hennig (2017).

minsep

minimum cluster separation.

asw

average silhouette width. See silhouette.

dindex

this index measures to what extent the density decreases from the cluster mode to the outskirts; I-densdec in Sec. 3.6 of Hennig (2017); low values are good.

denscut

this index measures whether cluster boundaries run through density valleys; I-densbound in Sec. 3.6 of Hennig (2017); low values are good.

highdgap

this measures whether there is a large within-cluster gap with high density on both sides; I-highdgap in Sec. 3.6 of Hennig (2017); low values are good.

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).

withinss

a generalisation of the within-cluster sum of squares (k-means objective function), which is obtained if the dissimilarity matrix d is a Euclidean distance matrix. For general distance measures, this is, summed over clusters, half the sum of the within-cluster squared dissimilarities divided by the cluster size.

entropy

entropy of the distribution of cluster memberships, see Meila (2007).

pamc

average distance to cluster centroid, which is the observation that minimises this average distance.

kdnorm

Kolmogorov distance between distribution of within-cluster Mahalanobis distances and appropriate chi-squared distribution, aggregated over clusters (I am grateful to Agustin Mayo-Iscar for the idea).

kdunif

Kolmogorov distance between distribution of distances to dnnkth nearest within-cluster neighbour and appropriate Gamma-distribution, see Byers and Raftery (1998), aggregated over clusters. dnnk is parameter nnk of distrsimilarity, corresponding to dnnk of clusterbenchstats.

dmode

aggregated density mode index equal to 0.75*dindex+0.25*highdgap before standardisation.
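Some of the simpler indexes above can be reproduced directly from a dissimilarity matrix and a clustering vector. The following base R sketch does this for maxdiameter, minsep, withinss, entropy and pearsongamma on made-up toy data. It is a minimal reading of the verbal definitions, not the code used by cqcluster.stats, which may differ in detail; in particular, the natural logarithm for the entropy and the use of each pair of points once for pearsongamma are assumptions. The names x, clustering and wss_direct are illustrative only.

```r
# Toy data: two well-separated groups in the plane (illustrative only).
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(5, 5), c(5, 6), c(6, 5), c(6, 6))
clustering <- c(1, 1, 1, 2, 2, 2, 2)
d <- as.matrix(dist(x))                 # Euclidean dissimilarity matrix
clusters <- unique(clustering)

# TRUE where two points belong to different clusters.
different <- outer(clustering, clustering, "!=")

# maxdiameter: largest within-cluster dissimilarity.
maxdiameter <- max(sapply(clusters, function(k) {
  idx <- clustering == k
  max(d[idx, idx])
}))

# minsep: smallest dissimilarity between points of different clusters.
minsep <- min(d[different])

# withinss: per cluster, half the sum of squared within-cluster
# dissimilarities (over ordered pairs) divided by the cluster size.
withinss <- sum(sapply(clusters, function(k) {
  idx <- clustering == k
  sum(d[idx, idx]^2) / (2 * sum(idx))
}))

# For comparison: sum of squares around the cluster means.
wss_direct <- sum(sapply(clusters, function(k) {
  xk <- x[clustering == k, , drop = FALSE]
  sum(sweep(xk, 2, colMeans(xk))^2)
}))

# entropy of the cluster membership proportions (natural log assumed).
p <- as.vector(table(clustering)) / length(clustering)
entropy <- -sum(p * log(p))

# pearsongamma: correlation between dissimilarities and a 0-1 indicator
# (0 = same cluster, 1 = different clusters), each pair counted once.
lower <- lower.tri(d)
pearsongamma <- cor(d[lower], as.numeric(different[lower]))
```

Because the toy dissimilarities are Euclidean, withinss and wss_direct agree up to rounding, which illustrates why withinss generalises the k-means objective function.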

Furthermore, a valstat object has the following list components:

maxG

maximum number of clusters.

minG

minimum number of clusters (list entries below that number are empty lists).

method

vector of names (character strings) of clustering CBI-functions, see kmeansCBI.

name

vector of names (character strings) of clustering methods. These can be user-chosen names (see argument methodnames in clusterbenchstats) and may distinguish different methods run by the same CBI-function but with different parameter values such as complete and average linkage for hclustCBI.

statistics

vector of names (character strings) of cluster validation indexes.
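The list layout described above can be illustrated by building a small stand-in by hand and indexing it with [[i]][[j]]. The sketch below is not produced by clusterbenchstats: the index values in asw_vals and minsep_vals are invented, only two of the indexes are included, and the class attribute is deliberately not set. It only shows how the numbered elements and the named components fit together.

```r
# Hypothetical settings: two methods, numbers of clusters 2 and 3.
method <- c("kmeansCBI", "hclustCBI")
name   <- c("kmeans", "average")
minG <- 2
maxG <- 3

# Invented index values: rows = methods, columns = G = 2, 3.
asw_vals    <- matrix(c(0.41, 0.38, 0.52, 0.47), nrow = 2)
minsep_vals <- matrix(c(1.20, 1.10, 0.80, 0.90), nrow = 2)

# First nmethods elements: one list per method, numbered 1..maxG;
# entries below minG are empty lists.
vs <- lapply(seq_along(method), function(i) {
  lapply(seq_len(maxG), function(j) {
    if (j < minG) {
      list()
    } else {
      list(asw = asw_vals[i, j - minG + 1],
           minsep = minsep_vals[i, j - minG + 1])
    }
  })
})

# Named components follow the numbered ones.
vs <- c(vs, list(maxG = maxG, minG = minG, method = method,
                 name = name, statistics = c("asw", "minsep")))

# Index for method number 2 with 3 clusters.
asw_2_3 <- vs[[2]][[3]]$asw

# Table of one index: rows = methods, columns = numbers of clusters.
Gs <- vs$minG:vs$maxG
asw_table <- sapply(Gs, function(j) {
  sapply(seq_along(vs$method), function(i) vs[[i]][[j]]$asw)
})
dimnames(asw_table) <- list(vs$name, paste0("G=", Gs))
```

Looping over vs$minG:vs$maxG rather than 1:vs$maxG avoids the empty list entries below minG.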

GENERATION

These objects are generated as part of the clusterbenchstats-output.

METHODS

The valstat class has methods for the following generic functions: print and plot; see plot.valstat.

References

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282