apcluster: Affinity Propagation

Description

Runs affinity propagation clustering for a given similarity matrix.

Usage

apcluster(s, p=NA, q=NA, maxits=1000, convits=100, lam=0.9,
          details=FALSE, nonoise=FALSE, seed=NA)
apclusterLM(s, p=NA, q=NA, maxits=1000, convits=100, lam=0.9,
          details=FALSE, nonoise=FALSE, seed=NA)

Arguments

s

an $l\times l$ similarity matrix

p

input preference; can be a vector that specifies individual preferences for each data point. If scalar, the same value is used for all data points. If NA, exemplar preferences are initialized according to the distribution of non-Inf values in s (see q).

q

if p=NA, exemplar preferences are initialized according to the distribution of non-Inf values in s. If q=NA, exemplar preferences are set to the median of non-Inf values in s. Otherwise, q selects a quantile of these values (e.g. q=0.1 for the 10% quantile, as in the examples below).
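The relation between the preference and the distribution of similarities can be sketched in base R. The helper pref_from_quantile below is hypothetical and not part of the package; it only illustrates the idea of taking the median or a quantile of the non-Inf entries of s. It makes no special provision for the diagonal of s, which the actual implementation may handle differently.

```r
## Hypothetical helper (not part of apcluster): derive a shared
## preference from the non-Inf entries of a similarity matrix.
pref_from_quantile <- function(s, q = NA) {
    vals <- s[is.finite(s)]                 # drop Inf, -Inf and NA entries
    if (is.na(q)) {
        median(vals)                        # q=NA: use the median
    } else {
        unname(quantile(vals, probs = q))   # otherwise: the q-quantile
    }
}

## toy similarity matrix: negative squared distances of four points
x <- c(0, 1, 2, 10)
s <- -outer(x, x, function(a, b) (a - b)^2)

p_median <- pref_from_quantile(s)           # median of the similarities
p_low    <- pref_from_quantile(s, q = 0.1)  # 10% quantile, a lower preference
```

A lower preference makes each point less willing to serve as its own exemplar, which is why smaller values of q tend to produce fewer clusters.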

maxits

maximal number of iterations that should be executed

convits

the algorithm terminates if the exemplars have not changed for convits iterations

lam

damping factor; should be a value in the range [0.5, 1); higher values correspond to heavy damping which may be needed if oscillations occur
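As a rough illustration of what the damping factor does, the sketch below applies the usual damped update, in which each new message is a weighted average of its previous value and the freshly computed one. The function damp is hypothetical and is not taken from the package's internals.

```r
## Hypothetical helper illustrating damping; not the package's internal code.
damp <- function(old, new, lam = 0.9) {
    stopifnot(lam >= 0.5, lam < 1)
    lam * old + (1 - lam) * new     # keep a fraction lam of the old message
}

old <- c(0, 0)
new <- c(1, -1)                     # a freshly computed update
heavy <- damp(old, new, lam = 0.9)  # moves 10% of the way toward 'new'
light <- damp(old, new, lam = 0.5)  # moves 50% of the way toward 'new'
```

With lam close to 1 the messages change slowly between iterations, which suppresses oscillations at the cost of requiring more iterations.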

details

if TRUE, more detailed information about the algorithm's progress is stored in the output object (see APResult)

nonoise

apcluster adds a small amount of noise to s to prevent degenerate cases; if TRUE, this is disabled

seed

for reproducibility, the seed of the random number generator can be set to a fixed value before adding noise (see above); if NA, the seed remains unchanged
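The interplay of nonoise and seed can be pictured with the following base R sketch. The function add_noise and the noise magnitude of 1e-12 are assumptions made purely for illustration; the package chooses its own noise level internally.

```r
## Hypothetical sketch: perturb a similarity matrix with tiny random noise.
## The magnitude 1e-12 is an assumption, not the value used by apcluster.
add_noise <- function(s, nonoise = FALSE, seed = NA) {
    if (nonoise) return(s)                  # noise disabled
    if (!is.na(seed)) set.seed(seed)        # fixed seed: reproducible noise
    s + 1e-12 * matrix(rnorm(length(s)), nrow(s))
}

x <- c(0, 1, 2)
s <- -outer(x, x, function(a, b) (a - b)^2)

n1 <- add_noise(s, seed = 1)
n2 <- add_noise(s, seed = 1)                # same seed, same perturbation
n0 <- add_noise(s, nonoise = TRUE)          # unchanged matrix
```

Fixing the seed makes the perturbation, and therefore the clustering result, repeatable across runs.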

Value

Upon successful completion, the function returns an APResult object.

Details

Affinity Propagation clusters data, using a set of real-valued pairwise data point similarities as input. Clusters are each represented by a cluster center data point (the exemplar). The method is iterative and searches for clusters so as to maximize an objective function, called net similarity.
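The iteration described above can be sketched in a few lines of base R. This is a simplified illustration of the message-passing scheme from Frey and Dueck (2007), not the code of apcluster or apclusterLM: the function ap_sketch, its return value, and the fallback used when no exemplar emerges are all assumptions of this sketch, it adds no noise, and it expects a similarity matrix with finite entries.

```r
## Minimal affinity propagation sketch (illustration only; ap_sketch and
## its return value are hypothetical and not part of the package).
ap_sketch <- function(s, p = median(s[is.finite(s)]),
                      maxits = 1000, convits = 100, lam = 0.9) {
    n <- nrow(s)
    idx <- seq_len(n)
    diag(s) <- p                      # preferences go on the diagonal
    R <- matrix(0, n, n)              # responsibilities
    A <- matrix(0, n, n)              # availabilities
    prev <- rep(FALSE, n)
    stable <- 0
    for (it in seq_len(maxits)) {
        ## r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS <- A + s
        best <- apply(AS, 1, which.max)
        first <- AS[cbind(idx, best)]
        AS[cbind(idx, best)] <- -Inf
        second <- apply(AS, 1, max)
        Rnew <- s - first             # recycled column-wise: row i uses first[i]
        Rnew[cbind(idx, best)] <- s[cbind(idx, best)] - second
        R <- lam * R + (1 - lam) * Rnew

        ## a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k});
        ## a(k,k) = sum of positive r(i',k), i' != k (not clipped)
        Rp <- pmax(R, 0)
        diag(Rp) <- diag(R)
        Anew <- matrix(colSums(Rp), n, n, byrow = TRUE) - Rp
        dA <- diag(Anew)
        Anew <- pmin(Anew, 0)
        diag(Anew) <- dA
        A <- lam * A + (1 - lam) * Anew

        ## stop once the exemplar set is non-empty and unchanged
        ## for convits consecutive iterations
        cur <- diag(A) + diag(R) > 0
        stable <- if (any(cur) && all(cur == prev)) stable + 1 else 0
        prev <- cur
        if (stable >= convits) break
    }
    ex <- which(prev)
    ## fallback of this sketch only: take the single best candidate
    if (length(ex) == 0) ex <- which.max(diag(A) + diag(R))
    ## assign every point to its most similar exemplar
    cl <- ex[apply(s[, ex, drop = FALSE], 1, which.max)]
    cl[ex] <- ex
    list(exemplars = ex, cluster = cl,
         netsim = sum(s[cbind(idx, cl)]), iterations = it)
}

## toy data: two well-separated groups on a line
x <- c(0, 0.1, 0.25, 5, 5.2, 5.3)
s <- -outer(x, x, function(a, b) (a - b)^2)
res <- ap_sketch(s, p = -1)
```

Here res$exemplars holds the indices of the chosen exemplars, res$cluster maps each point to its exemplar, and res$netsim is the net similarity: the summed similarities of points to their exemplars plus the preferences of the exemplars themselves.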

Apart from minor adaptations and optimizations, the implementation of the function apclusterLM is largely analogous to Frey and Dueck's Matlab code (see http://www.psi.toronto.edu/affinitypropagation/). The function apcluster uses the same ideas, but replaces the loops in the computation of responsibilities and availabilities with pure matrix operations. For moderately sized data sets, apcluster is approximately 60% faster than apclusterLM. For large data sets (several thousand samples), apclusterLM (LM = Less Memory) may be advantageous, since it requires less temporary storage. For up to 5000 samples, we recommend using apcluster (on up-to-date systems that are not too tight on memory).

The new argument q allows better control over the number of clusters without knowing the distribution of similarity values. A meaningful range for the parameter p can be determined using the function preferenceRange. Alternatively, if a fixed number of clusters is desired, the function apclusterK is available.

References

http://www.bioinf.jku.at/software/apcluster

Frey, B. J. and Dueck, D. (2007) Clustering by passing messages between data points. Science 315, 972-976.

Examples

library(apcluster)

## create two Gaussian clouds
cl1 <- cbind(rnorm(100,0.2,0.05),rnorm(100,0.8,0.06))
cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05))
x <- rbind(cl1,cl2)

## create similarity matrix
sim <- negDistMat(x, r=2)

## run affinity propagation (p defaults to the median of the similarities)
apres <- apcluster(sim)

## show details of clustering results
show(apres)

## plot clustering result
plot(apres, x)

## run affinity propagation with the preference set to the 10% quantile
## of the similarities; this should lead to a smaller number of clusters
apres <- apcluster(sim, q=0.1)
show(apres)
plot(apres, x)

## now try the same with RBF kernel
sim <- expSimMat(x, r=2)
apres <- apcluster(sim, q=0.2)
show(apres)
plot(apres, x)