otrimlesimg: Adequacy approach for number of clusters for OTRIMLE

Description

otrimlesimg computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE, see otrimle) for a range of values of the number of clusters, both on the original data and on artificial datasets simulated from the model parameters estimated on the original data. The summary methods present and evaluate the results so that a smallest adequate number of clusters can be found: the smallest one for which the value of the density-based cluster quality statistic Q on the original data is compatible with its distribution on the artificial datasets with the same number of clusters. See Hennig and Coretto (2021) for details.

Usage

otrimlesimg(dataset, G=1:6, multicore=TRUE,
  ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
  sim.est.logicd=FALSE, monitor=1)
# S3 method for otrimlesimgdens
summary(object, noisepenalty=0.05, sdcutoff=2, ...)
# S3 method for summary.otrimlesimgdens
print(x, ...)
# S3 method for summary.otrimlesimgdens
plot(x, plot="criterion", penx=NULL,
  peny=NULL, pencex=1, cutoff=TRUE, ylim=NULL, ...)

Arguments

dataset

something that can be coerced into an observations times variables matrix. The dataset.

G

vector of integers (normally starting from 1). Numbers of clusters to be considered.

multicore

logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read warnings there if you intend to use this; it won't work on Windows.

ncores

integer. Number of cores for parallelisation.

erc

A number greater than or equal to one specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. See otrimle.

beta0

A non-negative constant, penalty term for noise, to be passed as beta to otrimle, see documentation there.

simruns

integer. Number of replicate artificial datasets drawn from each model.

sim.est.logicd

logical. If TRUE, the logarithm of the improper constant density logicd, see otrimle, is re-estimated when running otrimle on the artificial datasets. Otherwise the value estimated on the original data is taken as fixed. TRUE requires much longer computation time, but can be seen as generating more realistic variation.

monitor

0 or 1. If 1, progress messages are printed on screen.

noisepenalty

number between 0 and 1. p_0 in Hennig and Coretto (2021); normally small. The method prefers treating a proportion of at most noisepenalty of the points as outliers over adding a cluster.

sdcutoff

numerical. c in formula (7) in Hennig and Coretto (2021). A clustering is treated as adequate if its value of the density-based cluster quality measure Q calibrated (i.e., mean/sd-standardised) by the values on the artificial datasets generated from the estimated model is <=sdcutoff.

plot

"criterion" or "noise", see details.

penx

FALSE, NULL, or numerical. x-coordinate from where the simplicity ordering of clusterings is given (as text in the plot). If FALSE, this is not added to the plot. If NULL, a default guess is made for a good position (which doesn't always work well).

peny

NULL, or numerical. y-coordinate from where the simplicity ordering of clusterings is given (as text in the plot). If NULL, a default guess is made for a good position (which doesn't always work well).

pencex

numeric. Magnification factor (parameter cex to be passed on to legend) for simplicity ordering, see parameter penx.

cutoff

logical. If TRUE, the "criterion"-plot shows the cutoff value below which numbers of clusters are adequate, see details.

ylim

vector of two numericals, range of the y-axis to be passed on to plot. If NULL, the range is chosen automatically (but can be different from the plot default).

object

an object of class 'otrimlesimgdens' obtained from calling otrimlesimg

x

an object of class 'summary.otrimlesimgdens' obtained from calling the summary function on an object of class 'otrimlesimgdens' obtained from calling otrimlesimg.

...

optional parameters to be passed on to plot.
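
The multicore and ncores arguments described above can be set in a portable way along the following lines. This is a minimal sketch in base R, not code taken from the package; the names use_multicore, n_physical and n_cores are illustrative. It relies on the statement above that mclapply-based parallelisation does not work on Windows, and it guards against detectCores() returning NA.

```r
library(parallel)

# mclapply relies on forking, which is not available on Windows
use_multicore <- .Platform$OS.type != "windows"

# detectCores() can return NA; fall back to a single core in that case
n_physical <- detectCores(logical = FALSE)
n_cores <- if (is.na(n_physical)) 1L else max(1L, as.integer(n_physical) - 1L)

# These values could then be passed on, for example:
# otrimlesimg(dataset, multicore = use_multicore, ncores = n_cores)
```

Leaving one core free mirrors the default ncores=detectCores(logical=FALSE)-1 shown in Usage.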

Value

otrimlesimg returns a list of type "otrimlesimgdens" containing the components result, simresult, simruns.

result

output object of otrimleg (list of results on original data) run with the parameters provided to otrimlesimg.

simresult

list of length simruns of output objects of otrimleg for all the simulated artificial datasets.

simruns

input parameter simruns.

summary.otrimlesimgdens returns a list of type "summary.otrimlesimgdens" with components G, simeval, ssimruns, npr, nprdiff, logicd, denscrit, peng, penorder, bestG, sdcutoff, bestresult, cluster.

G

otrimlesimg input parameter G (numbers of clusters).

simeval

list with components denscritmatrix, meandens, sddens, standens, errors, defined below.

ssimruns

otrimlesimg input parameter simruns.

npr

vector of estimated noise proportions on the original data for all numbers of clusters, exproportion[1] from otrimle.

nprdiff

vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data.

logicd

vector of logs of improper constant density values on the original data for all numbers of clusters.

denscrit

vector over all numbers of clusters of density-based cluster quality statistics Q on original data as provided by the measure-component of kerndensmeasure.

peng

vector of simplicity values (see Details) over all numbers of clusters.

penorder

simplicity order of number of clusters.

bestG

best (i.e., most simple adequate) number of clusters.

sdcutoff

input parameter sdcutoff.

bestresult

output of otrimle for the best number of clusters bestG.

cluster

clustering vector for the best number of clusters bestG. 0 corresponds to noise/outliers.

Components of summary.otrimlesimgdens output component simeval:

denscritmatrix

maximum number of clusters times simruns matrix of denscrit-vectors for all clusterings on simulated data.

meandens

vector over numbers of clusters of robust estimator of the mean of denscrit over simulated datasets, computed by scaleTau2.

sddens

vector over numbers of clusters of robust estimator of the standard deviation of denscrit over simulated datasets, computed by scaleTau2.

standens

vector over numbers of clusters of denscrit of the original data standardised by meandens and sddens.

errors

vector over numbers of clusters of numbers of times that otrimle led to an error.

plot.summary.otrimlesimgdens will return the output of par() before anything was changed by the plot function.
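
The relationship between the simeval components and the adequacy check against sdcutoff can be sketched as follows. This is an illustration under stated assumptions, not the package's internal code: all numbers are made up, and median and mad from base R are used as stand-ins for the robust location and scale estimates that the package obtains from scaleTau2.

```r
# Hypothetical Q values on the original data for G = 1, 2, 3
denscrit <- c(0.90, 0.35, 0.30)

# Hypothetical (number of clusters) x simruns matrix of Q on simulated data
denscritmatrix <- rbind(
  c(0.20, 0.25, 0.30, 0.22, 0.28),
  c(0.30, 0.36, 0.33, 0.28, 0.35),
  c(0.31, 0.27, 0.35, 0.29, 0.33)
)

# Stand-in robust location and scale per number of clusters
# (median/mad replace scaleTau2 here to avoid non-base dependencies)
meandens <- apply(denscritmatrix, 1, median)
sddens   <- apply(denscritmatrix, 1, mad)

# Q of the original data, standardised by the simulated distribution
standens <- (denscrit - meandens) / sddens

# A number of clusters is adequate if its standardised Q is <= sdcutoff
sdcutoff <- 2
adequate <- standens <= sdcutoff
```

With these made-up values, one cluster is not adequate (Q on the original data is far above what the fitted one-cluster model produces), whereas two and three clusters are.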

Details

The method is fully described in Hennig and Coretto (2021). Two tuning constants are required for choosing an optimal number of clusters: the smallest percentage of additional noise that the user is willing to trade in for adding another cluster (p_0 in the paper, noisepenalty here), and the critical value for adequacy of the standardised density-based quality measure Q (c in the paper, sdcutoff here). Both are provided to the summary function, which is required to choose the best (simplest adequate) number of clusters.

The plot function plot.summary.otrimlesimgdens can produce two plots. If plot="criterion", the standardised density-based cluster quality measure Q is plotted against the number of clusters. The values for the simulated artificial datasets are shown as points, the values for the original dataset as a line. If cutoff=TRUE, the critical values (see above) are added as red crosses; a number of clusters is adequate if the value for the original data is below the critical value, i.e., Q is not significantly larger than for the artificial datasets generated from the fitted model. Using penx, the ordered numbers of clusters from the simplest to the least simple can also be indicated in the plot, where simplicity is defined as the number of clusters plus the estimated noise proportion divided by noisepenalty, see above. The chosen number of clusters is the simplest adequate one, meaning that a low number of clusters and a low noise proportion are preferred.
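
The simplicity ordering and the choice of the simplest adequate number of clusters can be sketched as follows. This is a hedged illustration, not the package's implementation: the noise proportions and adequacy flags are made up, and the simplicity value is read as G + npr/noisepenalty (smaller meaning simpler), which is an interpretation of the verbal definition above.

```r
G <- 1:3
npr <- c(0.15, 0.03, 0.02)        # hypothetical estimated noise proportions
adequate <- c(FALSE, TRUE, TRUE)  # hypothetical adequacy flags per G
noisepenalty <- 0.05

# Simplicity value: number of clusters plus noise proportion / noisepenalty
peng <- G + npr / noisepenalty

# Numbers of clusters ordered from simplest to least simple
penorder <- G[order(peng)]

# Simplest adequate number of clusters (NA if none is adequate)
bestG <- penorder[adequate[match(penorder, G)]][1]
```

Here the single-cluster solution is penalised for its large noise proportion, so the ordering is 2, 3, 1 and the simplest adequate choice is two clusters.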

If plot="noise", the noise proportion (black) and the simplicity (red) are plotted against the number of clusters.

References

Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996

Coretto, P. and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html

Hennig, C. and P. Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics. https://arxiv.org/abs/2009.00921

Examples

# NOT RUN {
## otrimlesimg is computer intensive, so only a small data subset
## is used for speed.
library(otrimle)
data(banknote)
selectdata <- c(1:30,101:110,117:136,160:161)
set.seed(555566)
x <- banknote[selectdata,5:7]
   
## simruns=2 chosen for speed. This is not recommended in practice. 
obanknote <- otrimlesimg(x,G=1:2,multicore=FALSE,simruns=2,monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
plot(sobanknote,plot="criterion",penx=1.4)
plot(sobanknote,plot="noise",penx=1.4)
plot(x,col=sobanknote$cluster+1,pch=c("N","1","2")[sobanknote$cluster+1])
# }
