otrimlesimg computes Optimally Tuned Robust Improper Maximum Likelihood Clustering (OTRIMLE), see otrimle, for a range of values of the number of clusters, and also for artificial datasets simulated from the model parameters estimated on the original data. The summary-methods present and evaluate the results so that a smallest adequate number of clusters can be found as the smallest one for which the value of the density-based cluster quality statistic Q on the original data is compatible with its distribution on the artificial datasets with the same number of clusters, see Hennig and Coretto (2021) for details.
otrimlesimg(dataset, G=1:6, multicore=TRUE,
    ncores=detectCores(logical=FALSE)-1, erc=20, beta0=0, simruns=20,
    sim.est.logicd=FALSE, monitor=1)

# S3 method for otrimlesimgdens
summary(object, noisepenalty=0.05, sdcutoff=2, ...)

# S3 method for summary.otrimlesimgdens
print(x, ...)

# S3 method for summary.otrimlesimgdens
plot(x, plot="criterion", penx=NULL, peny=NULL, pencex=1,
    cutoff=TRUE, ylim=NULL, ...)
dataset: something that can be coerced into an observations times variables matrix. The dataset.
G: vector of integers (normally starting from 1). Numbers of clusters to be considered.
multicore: logical. If TRUE, parallel computing is used through the function mclapply from package parallel; read the warnings there if you intend to use this. It does not work on Windows.
ncores: integer. Number of cores for parallelisation.
erc: a number greater than or equal to one specifying the maximum allowed ratio between within-cluster covariance matrix eigenvalues. See otrimle.
beta0: a non-negative constant, penalty term for noise, to be passed as beta to otrimle; see the documentation there.
simruns: integer. Number of replicate artificial datasets drawn from each model.
sim.est.logicd: logical. If TRUE, the logarithm of the improper constant density logicd, see otrimle, is re-estimated when running otrimle on the artificial datasets. Otherwise the value estimated on the original data is taken as fixed. TRUE requires much longer computation time, but can be seen as generating more realistic variation.
monitor: 0 or 1. If 1, progress messages are printed on screen.
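Since multicore=TRUE does not work on Windows, and detectCores can return NA when the number of cores is unknown, it may be convenient to compute both settings defensively before calling otrimlesimg. The following is only a minimal sketch; the names use_multicore, n_physical and n_cores are illustrative and not part of the package.

```r
library(parallel)  # ships with R; provides detectCores()

# Forking via mclapply is not available on Windows, so fall back to serial there.
use_multicore <- .Platform$OS.type != "windows"

# detectCores() returns NA if the answer is unknown; treat that as one core.
n_physical <- detectCores(logical = FALSE)
if (is.na(n_physical)) n_physical <- 1L

# Leave one core free, but never request fewer than one.
n_cores <- max(1L, as.integer(n_physical) - 1L)

# These values could then be passed on, for example:
# otrimlesimg(dataset, multicore = use_multicore, ncores = n_cores)
```

On a single-core machine this yields n_cores equal to 1 rather than the 0 that the default expression detectCores(logical=FALSE)-1 would give.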
noisepenalty: number between 0 and 1. p_0 in Hennig and Coretto (2021); normally small. The method prefers treating a proportion of <=noisepenalty of points as outliers to adding a cluster.
sdcutoff: numerical. c in formula (7) in Hennig and Coretto (2021). A clustering is treated as adequate if its value of the density-based cluster quality measure Q, calibrated (i.e., mean/sd-standardised) by the values on the artificial datasets generated from the estimated model, is <=sdcutoff.
plot: "criterion" or "noise", see details.
penx: FALSE, NULL, or numerical. x-coordinate from where the simplicity ordering of clusterings is given (as text in the plot). If FALSE, this is not added to the plot. If NULL, a default guess is made for a good position (which doesn't always work well).
peny: NULL or numerical. y-coordinate from where the simplicity ordering of clusterings is given (as text in the plot). If NULL, a default guess is made for a good position (which doesn't always work well).
pencex: numeric. Magnification factor (parameter cex to be passed on to legend) for the simplicity ordering, see parameter penx.
cutoff: logical. If TRUE, the "criterion"-plot shows the cutoff value below which numbers of clusters are adequate, see details.
object: an object of class 'otrimlesimgdens' obtained from calling otrimlesimg.
x: an object of class 'summary.otrimlesimgdens' obtained from calling the summary function on an object of class 'otrimlesimgdens' obtained from calling otrimlesimg.
...: optional parameters to be passed on to plot.
otrimlesimg returns a list of type "otrimlesimgdens" containing the components result, simresult, simruns.
result: output object of otrimleg (list of results on original data) run with the parameters provided to otrimlesimg.
simresult: list of length simruns of output objects of otrimleg for all the simulated artificial datasets.
simruns: input parameter simruns.
summary.otrimlesimgdens returns a list of type "summary.otrimlesimgdens" with components G, simeval, ssimruns, npr, nprdiff, logicd, denscrit, peng, penorder, bestG, sdcutoff, bestresult, cluster.
G: otrimlesimg input parameter G (numbers of clusters).
simeval: list with components denscrit, meandens, sddens, standens, errors, defined below.
ssimruns: otrimlesimg input parameter simruns.
npr: vector of estimated noise proportions on the original data for all numbers of clusters, exproportion[1] from otrimle.
nprdiff: vector for all numbers of clusters of differences between estimated smallest cluster proportion and noise proportion on the original data.
logicd: vector of logs of improper constant density values on the original data for all numbers of clusters.
denscrit: vector over all numbers of clusters of density-based cluster quality statistics Q on the original data as provided by the measure-component of kerndensmeasure.
peng: vector of simplicity values (see Details) over all numbers of clusters.
penorder: simplicity order of number of clusters.
bestG: best (i.e., most simple adequate) number of clusters.
sdcutoff: input parameter sdcutoff.
bestresult: output of otrimle for the best number of clusters bestG.
cluster: clustering vector for the best number of clusters bestG. 0 corresponds to noise/outliers.
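Because the clustering vector codes noise/outliers as 0, tabulating it needs a little care with labels. The sketch below uses a made-up vector in the documented format rather than real output; cluster_example, best_g, size_labels, sizes and noise_share are illustrative names only.

```r
# Hypothetical clustering vector in the documented format: 0 = noise/outliers.
cluster_example <- c(1, 1, 2, 0, 2, 2, 1, 0, 1, 2)
best_g <- 2  # stands in for the bestG component

# Label level 0 as "noise" and the remaining levels as clusters, so that
# empty categories would still appear in the table.
size_labels <- c("noise", paste("cluster", seq_len(best_g)))
sizes <- table(factor(cluster_example, levels = 0:best_g, labels = size_labels))

# Share of observations classified as noise.
noise_share <- mean(cluster_example == 0)

print(sizes)
```

The same idea of shifting by one to handle the 0 label appears in the example code of this page, where cluster+1 is used to index colours and plotting symbols.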
Components of summary.otrimlesimgdens output component simeval:
denscrit: maximum number of clusters times simruns matrix of denscrit-vectors for all clusterings on simulated data.
meandens: vector over numbers of clusters of robust estimator of the mean of denscrit over simulated datasets, computed by scaleTau2.
sddens: vector over numbers of clusters of robust estimator of the standard deviation of denscrit over simulated datasets, computed by scaleTau2.
standens: vector over numbers of clusters of denscrit of the original data standardised by meandens and sddens.
errors: vector over numbers of clusters of numbers of times that otrimle led to an error.
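To make the standardisation behind standens concrete, here is a rough base-R illustration for one fixed number of clusters. All numbers are made up, and mean() and sd() stand in for the robust estimators that the package computes with scaleTau2, so this is a sketch of the idea and not the package's computation.

```r
# Hypothetical values of Q for one fixed number of clusters (made up).
q_original  <- 0.031                                  # Q on the original data
q_simulated <- c(0.020, 0.024, 0.018, 0.027, 0.022)   # Q on artificial datasets

sdcutoff <- 2  # default of the summary method

# Calibrate Q on the original data by location and scale of the simulated
# values. Assumption: plain mean()/sd() replace the robust estimators here.
q_standardised <- (q_original - mean(q_simulated)) / sd(q_simulated)

# The clustering counts as adequate if the calibrated value is <= sdcutoff.
adequate <- q_standardised <= sdcutoff
```

With these invented numbers the standardised value is roughly 2.5, which exceeds the default sdcutoff of 2, so this number of clusters would not be treated as adequate.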
plot.summary.otrimlesimgdens returns the output of par() as it was before anything was changed by the plot function.
The method is fully described in Hennig and Coretto (2021). Two tuning constants are required for choosing an optimal number of clusters: the smallest proportion of additional noise that the user is willing to trade in for adding another cluster (p_0 in the paper, noisepenalty here) and the critical value (c in the paper, sdcutoff here) for adequacy of the standardised density-based quality measure Q. Both are provided to the summary function, which must be called to choose the best (simplest adequate) number of clusters.
The plot function plot.summary.otrimlesimgdens can produce two plots. If plot="criterion", the standardised density-based cluster quality measure Q is plotted against the number of clusters. The values for the simulated artificial datasets are shown as points, and the values for the original dataset are drawn as a line. If cutoff=TRUE, the critical values (see above) are added as red crosses; a number of clusters is adequate if the value of the original data is below the critical value, i.e., Q is not significantly larger than for the artificial datasets generated from the fitted model. Using penx, the ordered numbers of clusters from the simplest to the least simple can also be indicated in the plot, where simplicity is defined as the number of clusters plus the estimated noise proportion divided by noisepenalty, see above. The chosen number of clusters is the simplest adequate one, meaning that a low number of clusters and a low noise proportion are preferred.
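The selection rule described above can be sketched in a few lines of base R. The numbers are invented, and the simplicity formula is read here as G + npr/noisepenalty, which is an assumption about the wording above rather than a quote of the package source. The variable names mirror the summary components only for readability.

```r
noisepenalty <- 0.05
sdcutoff     <- 2

G        <- 1:4                        # candidate numbers of clusters
npr      <- c(0.15, 0.03, 0.02, 0.01)  # estimated noise proportions (made up)
standens <- c(3.1, 1.2, 0.4, 0.9)      # standardised Q values (made up)

# Simplicity (assumed reading): number of clusters plus noise proportion
# divided by noisepenalty. Smaller values mean simpler clusterings.
peng <- G + npr / noisepenalty

# Numbers of clusters ordered from the simplest to the least simple.
simple_first <- order(peng)
penorder <- G[simple_first]

# A number of clusters is adequate if its standardised Q is <= sdcutoff;
# the chosen one is the first adequate entry in the simplicity ordering.
adequate <- standens <= sdcutoff
bestG <- penorder[adequate[simple_first]][1]
```

Here a single cluster is rejected as inadequate, and among the adequate candidates two clusters come first in the simplicity ordering, so bestG is 2.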
If plot="noise", the noise proportion (black) and the simplicity (red) are plotted against the number of clusters.
Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. Journal of the American Statistical Association, Vol. 111(516), pp. 1648-1659. doi:10.1080/01621459.2015.1100996
Coretto, P. and C. Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. https://jmlr.org/papers/v18/16-382.html
Hennig, C. and P. Coretto (2021). An adequacy approach for deciding the number of clusters for OTRIMLE robust Gaussian mixture based clustering. To appear in Australian and New Zealand Journal of Statistics, https://arxiv.org/abs/2009.00921.
# NOT RUN {
## otrimlesimg is computationally intensive, so only a small data subset
## is used for speed.
library(otrimle)
data(banknote)
selectdata <- c(1:30, 101:110, 117:136, 160:161)
set.seed(555566)
x <- banknote[selectdata, 5:7]
## simruns=2 chosen for speed. This is not recommended in practice.
obanknote <- otrimlesimg(x, G=1:2, multicore=FALSE, simruns=2, monitor=0)
sobanknote <- summary(obanknote)
print(sobanknote)
## Standardised Q, then noise proportion and simplicity, against the number of clusters.
plot(sobanknote, plot="criterion", penx=1.4)
plot(sobanknote, plot="noise", penx=1.4)
## Data plot with the chosen clustering; points labelled 0 (noise) are shown as "N".
plot(x, col=sobanknote$cluster+1, pch=c("N","1","2")[sobanknote$cluster+1])
# }