ms: Mean shift clustering.

Description

Functions for mean shift, iterative mean shift, mean shift clustering, and bandwidth selection for mean shift clustering (based on self-coverage). These are experimental (and not fully documented) functions which implement the techniques presented in Einbeck (2010).

Usage

meanshift(X, x, h)
ms.rep(X, x, h, plotms=1, thresh= 0.00000001, iter=100)
ms(X, h, subset,  thr=0.001, scaled= TRUE, plotms=2, or.labels=NULL, ...)
ms.self.coverage(X, taumin=0.02, taumax=0.5, gridsize=25,
       thr=0.001, scaled=TRUE, draw=1/3,  cluster=FALSE, plot.type="o", 
       or.labels=NULL, print=FALSE, ...)

Arguments

X

Data matrix.

h

Bandwidth.

x

Point from which we wish to shift to the local mean.

subset

Vector specifying a subset of 1:n. This allows the iterative mean shift procedure to be run from a subset of points only (if unspecified, 1:n is used, i.e. each data point serves as a starting point).

scaled

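Logical. If TRUE, the data are scaled before clustering (each variable is divided by the corresponding value reported in scaled.by, see Value).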

taumin,taumax,gridsize

Determine the grid of bandwidths to investigate.

draw

Only cluster centers belonging to this (randomly selected) fraction of the original data cloud are used for the computation of the self-coverage.

thresh, thr, iter

Parameters controlling convergence behavior.

cluster

Logical. If TRUE, distances are always measured to the cluster to which an observation is assigned, rather than to the nearest cluster.

plotms, plot.type, or.labels, ...

Graphical parameters.

print

Logical. If TRUE, coverage values are printed on the screen as soon as they are computed. This is especially helpful if gridsize is large.

Value

For the function ms:
cluster.center

A matrix which gives the coordinates of the estimated density modes (i.e., of the mean-shift based cluster centers).

cluster.label

Assigns each data point to the cluster center to which its mean shift trajectory has converged.

closest.label

Assigns each data point to the closest cluster center in terms of Euclidean distance.

data

The data frame (scaled if scaled=TRUE).

scaled.by

The data were scaled by dividing each variable by the values provided in this vector.

For all other functions, apply names() to the returned object to see its components.

Details

The methods implemented here are not directly related to local principal curves, but share the "mean shift" building block with them.

Cheng (1995) showed that, if the mean shift is computed iteratively, the resulting sequence of local means converges to a mode of the estimated density function. Assigning each data point to the mode to which its sequence has converged turns this into a clustering technique.
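The iteration described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper names mean_shift_step and mean_shift_iterate are hypothetical, a Gaussian kernel is assumed, and the stopping rule (Euclidean step length below tol) is a choice made for this sketch.

```r
# Hypothetical helpers for illustration; these are NOT the package's functions.
# Assumption: Gaussian kernel weights with bandwidth h (scalar or one per column).

# One mean shift step: kernel-weighted local mean of the rows of X around x.
mean_shift_step <- function(X, x, h) {
  X <- as.matrix(X)
  # Squared scaled distance from x to every row of X
  d2 <- colSums(((t(X) - x) / h)^2)
  w <- exp(-0.5 * d2)
  colSums(X * w) / sum(w)
}

# Repeat the step until the shift is shorter than tol (or max_iter is reached).
mean_shift_iterate <- function(X, x, h, tol = 1e-8, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    x_new <- mean_shift_step(X, x, h)
    converged <- sqrt(sum((x_new - x)^2)) < tol
    x <- x_new
    if (converged) break
  }
  x
}

# Two well-separated simulated groups; starting points from each group
# should drift towards the mode of their own group.
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
m1 <- mean_shift_iterate(X, X[1, ], h = 1)
m2 <- mean_shift_iterate(X, X[51, ], h = 1)
```

Running the iteration from every data point and grouping points by the mode they reach gives the clustering; how nearby modes are merged into one cluster is a further design choice not covered by this sketch.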

The concepts of coverage and self-coverage, which were originally introduced in the principal curve context, adapt straightforwardly to this setting.

The goodness-of-fit measure $R_C$ can also be applied in this context. For instance, a value of $R_C=0.8$ means that, after the clustering, the mean absolute residual length has been reduced by $80\%$ (compared to the distances to the overall mean).
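The interpretation above can be made concrete with a small base R sketch. It follows the verbal description only: rc_points is a hypothetical helper, it assumes one fitted center per observation (one row of centers per row of X), and the package's own Rc function may differ in detail.

```r
# Hypothetical helper for illustration; NOT the package's Rc function.
# Assumption: centers has one row per observation (the center assigned to it).
rc_points <- function(X, centers) {
  X <- as.matrix(X)
  centers <- as.matrix(centers)
  # Residual length of each point to its assigned center
  resid_fit <- sqrt(rowSums((X - centers)^2))
  # Residual length of each point to the overall mean
  resid_mean <- sqrt(rowSums(sweep(X, 2, colMeans(X))^2))
  # Proportional reduction in mean absolute residual length
  1 - mean(resid_fit) / mean(resid_mean)
}
```

Under this definition a value of 1 means every point coincides with its center, and a value of 0 means the centers do no better than the overall mean.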

References

Cheng, Y. (1995). Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 790-799.

Einbeck, J. (2010). Bandwidth selection for mean-shift based unsupervised learning techniques: a unified approach via self-coverage. Working paper, Durham University.

Examples

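library(LPCM)   # provides ms, ms.self.coverage, select.self.coverage, coverage, Rc
data(faithful)
# Self-coverage over a grid of bandwidths (use a larger gridsize in practice)
foo <- ms.self.coverage(faithful, gridsize=10, taumin=0.1, taumax=0.5,
    plot.type="o")
# Bandwidths suggested by the self-coverage curve
h <- select.self.coverage(foo)$select
# Mean shift clustering with the first selected bandwidth
fit <- ms(faithful, h=h[1])
# Coverage of the fitted cluster centers
coverage(fit$data, fit$cluster.center)
# Goodness of fit, measured to each point's closest cluster center
Rc(fit$data, fit$cluster.center[fit$closest.label,], type="points")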
