It is generally recommended to standardize X so that each variable has unit variance prior to running the algorithm on the data.
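As a minimal sketch of this standardization step (the data matrix below is illustrative; its layout, with variables as rows and data points as columns, is assumed to match the convention implied by the default bandwidth formula further down):

```r
## Illustrative data: 5 variables (rows) x 100 observations (columns),
## with deliberately different scales across variables.
set.seed(1)
X <- matrix(rnorm(5 * 100, sd = rep(c(1, 2, 5, 10, 20), 100)), nrow = 5)

## Rescale each variable (row) to unit variance.
X_std <- X / apply(X, 1, sd)
apply(X_std, 1, sd)  # all equal to 1
```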
Roughly speaking, larger values of h produce a coarser clustering (i.e. fewer and larger clusters). For sufficiently large values of h, the algorithm produces a unique cluster containing all the data points. Smaller values of h produce a finer clustering (i.e. many small clusters). For sufficiently small values of h, each cluster identified by the algorithm contains exactly one data point.
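A hedged way to see this bandwidth effect in practice is to count the clusters obtained at a few values of h. The clustering function msClustering, its $labels component, and the MeanShift package name are not stated in this section and should be treated as assumptions:

```r
## Hedged sketch: number of clusters as a function of the bandwidth h.
## msClustering() and its $labels component are assumptions, not confirmed
## by this section; X_std is the standardized matrix from the sketch above.
library(MeanShift)

for (h in c(0.05, 0.5, 5)) {
  fit <- msClustering(X_std, h = h)
  cat("h =", h, "->", length(unique(fit$labels)), "clusters\n")
}
```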
If h is not specified in the function call, then h is by default set to the 30th percentile of the empirical distribution of distances between the columns of X, i.e. h=quantile( dist( t( X ) ), 0.3 ).
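This default is straightforward to reproduce directly, reusing the hypothetical standardized matrix from the sketch above:

```r
## Default bandwidth: 30th percentile of the pairwise distances between the
## columns of X (dist() operates on rows, hence the transpose).
h_default <- quantile(dist(t(X_std)), 0.3)
h_default
```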
In their implementation, gaussianKernel and exponentialKernel are rescaled to assign probability of at least 0.99 to the unit interval $[0,1]$. This ensures that all the kernels are roughly on the same scale.
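One way such a rescaling can be derived (the package's exact constants are not stated here, so the following is an illustrative assumption) is to solve for the scale parameter that places probability 0.99 on $[0,1]$, treating each kernel as a distribution on distances $r \geq 0$:

```r
## Illustrative derivation (assumed, not the package's documented constants):
## choose the scale so that the kernel, viewed as a distribution on distances
## r >= 0, puts probability 0.99 on [0, 1].

## Gaussian (half-normal on distances): P(R <= 1) = 2 * pnorm(1 / sigma) - 1.
sigma <- 1 / qnorm((1 + 0.99) / 2)
2 * pnorm(1 / sigma) - 1   # 0.99

## Exponential: P(R <= 1) = 1 - exp(-rate).
rate <- -log(1 - 0.99)
pexp(1, rate = rate)       # 0.99
```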
When using the blurring version of the mean shift algorithm, it is generally recommended to use a compactly supported kernel. In particular, the algorithm is guaranteed to converge in finitely many iterations with the Epanechnikov kernel.
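A hedged sketch of the blurring variant with a compactly supported kernel follows; the function bmsClustering and the kernel name "epanechnikovKernel" are assumptions, not confirmed by this section:

```r
## Hedged sketch: blurring mean shift with the Epanechnikov kernel, which is
## compactly supported, so convergence in finitely many iterations is
## guaranteed. Function and kernel names are assumed.
fit_blur <- bmsClustering(X_std, h = h_default, kernel = "epanechnikovKernel")
```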