plot.valstat: Simulation-standardised plot and print of cluster validation statistics

Description

Visualisation and print function for cluster validation output compared to results on simulated random clusterings. The print method can also be used to compute and print an aggregated cluster validation index.

Unlike for many other plot methods, the additional arguments of plot.valstat are essential. print.valstat should make good sense with the defaults, but for computing the aggregate index aggregate and weights need to be set.

Usage

# S3 method for valstat
plot(x,simobject=NULL,statistic="sindex",
                            xlim=NULL,ylim=c(0,1),
                            nmethods=length(x)-5,
                            col=1:nmethods,cex=1,pch=c("c","f","a","n"),
                            simcol=rep(grey(0.7),4),
                         shift=c(-0.1,-1/3,1/3,0.1),include.othernc=NULL,...)

# S3 method for valstat
print(x,statistics=x$statistics,
                          nmethods=length(x)-5,aggregate=FALSE,
                          weights=NULL,digits=2,
                          include.othernc=NULL,...)

Arguments

object of class "valstat", such as sublists stat, qstat, sstat of clusterbenchstats-output.

simobject

list of simulation results as produced by randomclustersim and documented there; typically sublist sim of clusterbenchstats-output.

statistic

one of "avewithin","mnnd","variation", "diameter","gap","sindex","minsep","asw","dindex","denscut", "highdgap","pg","withinss","entropy","pamc","kdnorm","kdunif","dmode"; validation statistic to be plotted.

xlim

passed on to plot. Default is the range of all involved numbers of clusters, minimum minus 0.5 to maximum plus 0.5.

ylim

passed on to plot.

nmethods

integer. Number of clustering methods to involve (these are those from number 1 to nmethods specified in x$name).

col

colours used for the different clustering methods.

cex

passed on to plot.

pch

vector of symbols for random clustering results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn. To be passed on to plot.

simcol

vector of colours used for random clustering results in order stupidkcentroids, stupidkfn, stupidkaven, stupidknn.

shift

numeric vector. Indicates the amount to which the results from stupidkcentroids, stupidkfn, stupidkaven, stupidknn are plotted to the right of their respective number of clusters (negative numbers plot to the left).

include.othernc

this indicates whether methods should be included that estimated their number of clusters themselves and gave a result outside the standard range as given by x$minG and x$maxG. If not NULL, this is a list of integer vectors of length 2. The first number is the number of the clustering method (the order is determined by argument x$name), the second number is the number of clusters for those methods that estimate the number of clusters themselves and estimated a number outside the standard range. Normally what will be used here, if not NULL, is the output parameter cm$othernc of clusterbenchstats, see also cluster.magazine.

statistics

vector of character strings specifying the validation statistics that will be included in the output (unless you want to restrict the output for some reason, the default should be fine.

aggregate

logical. If TRUE, an aggegate validation statistic will be computed as the weighted mean of the involved statistic. This requires weights to be set. In order for this to make sense, values of the validation statistics should be comparable, which is achieved by standardisation in clusterbenchstats. Accordingly, x should be the qstat or sstat-component of the clusterbenchstats-output rather than the stat-component.

weights

vector of numericals. Weights for computation of the aggregate statistic in case that aggregate=TRUE. The order of clustering methods corresponding to the weight vector is given by x$name.

digits

minimal number of significant digits, passed on to print.table.

...

no effect.

Value

print.valstats returns the results table as invisible object.

Details

Whereas print.valstat, at least with aggregate=TRUE makes more sense for the qstat or sstat-component of the clusterbenchstats-output rather than the stat-component, plot.valstat should be run with the stat-component if simobject is specified, because the simulated cluster validity statistics are unstandardised and need to be compared with unstandardised values on the dataset of interest.

References

Hennig, C. (2017) Cluster validation by measurement of clustering characteristics relevant to the user. In C. H. Skiadas (ed.) Proceedings of ASMDA 2017, 501-520, https://arxiv.org/abs/1703.09282

Akhanli, S. and Hennig, C. (2020) Calibrating and aggregating cluster validity indexes for context-adapted comparison of clusterings. On arxiv from February 2020.

Examples

Run this code

# NOT RUN {
  set.seed(20000)
  options(digits=3)
  face <- rFace(10,dMoNo=2,dNoEy=0,p=2)
  clustermethod=c("kmeansCBI","hclustCBI","hclustCBI")
  clustermethodpars <- list()
  clustermethodpars[[2]] <- clustermethodpars[[3]] <- list()
  clustermethodpars[[2]]$method <- "ward.D2"
  clustermethodpars[[3]]$method <- "single"
  methodname <- c("kmeans","ward","single")
  cbs <-  clusterbenchstats(face,G=2:3,clustermethod=clustermethod,
    methodname=methodname,distmethod=rep(FALSE,3),
    clustermethodpars=clustermethodpars,nnruns=2,kmruns=2,fnruns=2,avenruns=2)
  plot(cbs$stat,cbs$sim)
  plot(cbs$stat,cbs$sim,statistic="dindex")
  plot(cbs$stat,cbs$sim,statistic="avewithin")
  print(cbs$sstat,aggregate=TRUE,weights=c(1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0))
# }

Run the code above in your browser using DataLab