cec: Cross-Entropy Clustering

Description

Performs Cross-Entropy Clustering on a data matrix.

Usage

cec(x, centers, type = c("covariance", "fixedr", "spherical", "diagonal",  
"eigenvalues", "all"), iter.max = 25, nstart = 1, param, 
centers.init = c("kmeans++", "random"), card.min = "5%", keep.removed = F, 
interactive = F, readline = T)

Arguments

x

Numeric matrix of data.

centers

Either a matrix of initial centers or the number of initial centers (k, single number or a vector of numbers for variable number of initial centers at each start). In the latter case, initial centers will be generated using a method dependi

type

Type (or types) of clustering (density family). This can be either a single value or a vector of length equal to the number of centers. Possible values are: "covariance", "fixedr", "spherical", "diagonal", "eigenvalues", "all" (default).

iter.max

Maximum number of iterations at each start.

nstart

Number of clusterings to perform (with different initial centers). Only the best clustering (the one with the lowest cost) will be returned. A value greater than one is valid only if the centers argument is a number.

centers.init

Centers initialization method. Possible values are: "kmeans++" (default), "random".

param

Parameter (or parameters) specific to a particular type of clustering. Not all types of clustering require parameter. Types that require parameter: "covariance" (matrix parameter), "fixedr" (numeric parameter), "eigenvalues" (vector parameter). This ca

card.min

Minimal cluster cardinality. If a cluster's cardinality drops below card.min, the cluster is removed. This argument can be either an integer or a string ending with a percent sign (e.g. "5%").

keep.removed

If this parameter is TRUE, removed clusters will be visible in the results as NA rows in the centers matrix (and as NA in the corresponding entries of the list of covariances).

interactive

Interactive mode. If TRUE, the result of clustering will be plotted after every iteration.

readline

Used only in interactive mode. If readline is TRUE, the function waits at each iteration for the user to press Enter before plotting, instead of using the standard "before plotting" wait (par(ask = TRUE)).
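As an illustration of the two forms accepted by card.min, the sketch below converts either form into an absolute point count. The helper resolve_card_min is hypothetical and not part of the CEC package. It assumes that a percentage refers to the number of rows of x and is rounded up, which may differ from what cec does internally.

```r
# Hypothetical helper (not part of the CEC package): turn a card.min value
# into an absolute number of points.
# Assumption: a percentage refers to the number of data rows and is rounded up.
resolve_card_min <- function(card_min, n_points) {
  if (is.character(card_min)) {
    if (!grepl("^[0-9.]+%$", card_min)) {
      stop("card_min must be a number or a string such as \"5%\"")
    }
    percent <- as.numeric(sub("%$", "", card_min))
    return(as.integer(ceiling(n_points * percent / 100)))
  }
  as.integer(card_min)
}

resolve_card_min("5%", 3000)  # 150
resolve_card_min("7%", 3000)  # 210
resolve_card_min(25, 3000)    # 25
```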

Value

Returns an object of class "cec" with available components: "data", "cluster", "probabilities", "centers", "cost.function", "nclusters", "final.cost.function", "final.nclusters", "iterations", "cost", "covariances", "covariances.model", "time".

Details

In the context of implementation, Cross-Entropy Clustering (CEC) aims to partition m points into k clusters so as to minimize the cost function (the energy E of the clustering) by switching points between clusters. The presented method is based on an adapted Hartigan approach, in which clusters whose cardinalities have dropped below a small predefined level are removed.

The energy function E is given by: $$E(Y_1,\mathcal{F}_1;\ldots;Y_k,\mathcal{F}_k) = \sum\limits_{i=1}^{k} p(Y_i) \cdot \left(-\ln(p(Y_i)) + H^{\times}(Y_i\|\mathcal{F}_i)\right)$$ where Y_i denotes the i-th cluster, p(Y_i) is the ratio of the number of points in the i-th cluster to the total number of points, and H^×(Y_i‖F_i) is the value of the cross-entropy, which represents the internal cluster energy function of the data Y_i defined with respect to a certain Gaussian density family F_i, which encodes the type of clustering we consider.
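The energy formula can be evaluated directly for a given labelling. The following is a minimal base-R sketch, not the package's implementation. The function names cross_entropy_all and cec_energy are illustrative, and the internal energy used is the standard closed form for the family of all Gaussian densities, N/2 ln(2πe) + 1/2 ln det(Σ), where N is the data dimension and Σ is the maximum-likelihood covariance of the cluster.

```r
# Internal energy H for the family of all Gaussian densities:
# H = N/2 * log(2*pi*e) + 1/2 * log(det(Sigma)), Sigma = ML covariance.
cross_entropy_all <- function(y) {
  n_dim <- ncol(y)
  centered <- sweep(y, 2, colMeans(y))
  sigma <- crossprod(centered) / nrow(y)  # maximum-likelihood covariance
  log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
  n_dim / 2 * log(2 * pi * exp(1)) + log_det / 2
}

# Energy E of a labelling: sum over clusters of p * (-log(p) + H).
cec_energy <- function(x, labels, h = cross_entropy_all) {
  total <- 0
  for (k in unique(labels)) {
    members <- x[labels == k, , drop = FALSE]
    p <- nrow(members) / nrow(x)
    total <- total + p * (-log(p) + h(members))
  }
  total
}

# Two well-separated Gaussian blobs in the plane.
set.seed(42)
x <- rbind(matrix(rnorm(400), ncol = 2),
           matrix(rnorm(400, mean = 6), ncol = 2))

e_two <- cec_energy(x, rep(1:2, each = 200))  # one cluster per blob
e_one <- cec_energy(x, rep(1, 400))           # everything in one cluster
c(two_clusters = e_two, one_cluster = e_one)
```

For data like this, splitting into the two blobs should give the lower energy: the -ln(p) term penalizes each extra cluster, but the smaller covariance determinants more than compensate.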

The value of the internal energy function H depends only on the covariance matrix of the points in the cluster (computed using the maximum-likelihood method). The following implementations of H are available (each expressed as a type, or model, of the clustering):

"covariance" - Gaussian densities with a fixed given covariance. The shapes of clusters depend on the given covariance matrix (additional parameter).
"fixedr" - Special case of "covariance", where the covariance matrix equals rI for the given r (additional parameter). The clustering will have a tendency to divide data into balls with approximate radius proportional to the square root of r.
"spherical" - Spherical (radial) Gaussian densities (covariance proportional to the identity). Clusters will have a tendency to form balls of arbitrary sizes.
"diagonal" - Gaussian densities with diagonal covariance. Data will form ellipsoids with axes parallel to the coordinate axes.
"eigenvalues" - Gaussian densities with covariance matrix having fixed eigenvalues (additional parameter). The clustering will try to divide the data into ellipsoids with fixed shape rotated by an arbitrary angle.
"all" - All Gaussian densities. Data will form ellipsoids with arbitrary radii.
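To make the families listed above concrete, the sketch below writes out an internal energy H for each type as a function of the cluster's maximum-likelihood covariance s (an N x N matrix). These are closed forms obtained by minimizing the Gaussian cross-entropy over each family. They are given here as an assumption about what the types compute, not as the package's code, and all function names are illustrative.

```r
# Internal energies H as functions of the ML covariance s of a cluster.
# Assumed closed forms; function names are illustrative only.
h_all <- function(s) {
  n <- nrow(s)
  n / 2 * log(2 * pi * exp(1)) + log(det(s)) / 2
}
h_diagonal <- function(s) {
  n <- nrow(s)
  n / 2 * log(2 * pi * exp(1)) + sum(log(diag(s))) / 2
}
h_spherical <- function(s) {
  n <- nrow(s)
  n / 2 * log(2 * pi * exp(1) / n) + n / 2 * log(sum(diag(s)))
}
h_fixedr <- function(s, r) {
  n <- nrow(s)
  n / 2 * log(2 * pi) + n / 2 * log(r) + sum(diag(s)) / (2 * r)
}
h_covariance <- function(s, s0) {
  n <- nrow(s)
  n / 2 * log(2 * pi) + sum(diag(solve(s0, s))) / 2 + log(det(s0)) / 2
}
h_eigenvalues <- function(s, lambda) {
  n <- nrow(s)
  ev <- eigen(s, symmetric = TRUE)$values  # returned in decreasing order
  lambda <- sort(lambda, decreasing = TRUE)
  n / 2 * log(2 * pi) + sum(ev / lambda) / 2 + sum(log(lambda)) / 2
}

s <- matrix(c(2, 0.8, 0.8, 1), nrow = 2)
c(all = h_all(s), diagonal = h_diagonal(s), spherical = h_spherical(s))
```

Because the families are nested (spherical within diagonal within all), the values satisfy h_all(s) <= h_diagonal(s) <= h_spherical(s): a richer family can only fit the cluster at least as well.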

The implementation of the cec function allows clustering types to be mixed.

References

Spurek, P. and Tabor, J. (2014) Cross-Entropy Clustering. Pattern Recognition 47(9), 3046-3059.

Examples


#
#
#
# Cross-Entropy Clustering
#
#
# 

## Example of clustering random dataset of 3 Gaussians using spherical Gaussian densities, 
## 10 random initial centers and 7% as minimal cluster size.

m1 = matrix(rnorm(2000, sd=1), ncol=2)
m2 = matrix(rnorm(2000, mean = 3, sd = 1.5), ncol = 2)
m3 = matrix(rnorm(2000, mean = 3, sd = 1), ncol = 2)
m3[,2] = m3[,2] - 5
m = rbind(m1, m2, m3)
centers = initcenters(m, 10)
par(ask = TRUE)
plot(m, cex = 0.5, pch = 16)
## Initial centers:
Z = cec(m, centers, type="sp", iter.max = -1, card.min="7%")
plot(Z)
## Clustering result:
Z = cec(m, centers, type="sp", iter.max = 100, card.min="7%")
plot(Z)
# Result:
Z
# Cost function:
cec.plot.cost.function(Z)
## Example of clustering mouse-like set.
m = mouseset(n=7000, r.head=2, r.left.ear=1.1, r.right.ear=1.1, left.ear.dist=2.5,
right.ear.dist=2.5, dim=2)
plot(m, cex = 0.5, pch = 16)
centers = initcenters(m, 3)
## Initial centers:
Z = cec(m, centers, type="sp", iter.max = -1, card.min="5%")
plot(Z)
## Clustering result:
Z = cec(m, centers, type="sp", iter.max = 100, nstart=4, card.min="5%")
plot(Z)
# Result:
Z
# Cost function:
cec.plot.cost.function(Z)
## Example of clustering data set "Tset" using "eigenvalues" clustering type.
data(Tset)
plot(Tset, cex = 0.5, pch = 16)
centers = initcenters(Tset, 2)
## Initial centers:
Z <- cec(Tset, 5, type="eigenvalues", param=c(0.02,0.002), iter.max= -1)
plot(Z)
## Clustering result:
Z <- cec(Tset, 5, "eigenvalues", param=c(0.02,0.002), nstart=4)
plot(Z)
# Result:
Z
# Cost function:
cec.plot.cost.function(Z)
