similarity: Methods for Computing Similarity Matrices

Description

Compute similarity matrices from a data set.

Usage

negDistMat(x, r=1, ...)
expSimMat(x, r=2, w=1, ...)
linSimMat(x, w=1, ...)
linKernel(x, normalize=FALSE)

Arguments

x

real-valued data matrix; every row is a sample, every column a feature/input dimension

r

exponent (see details below)

w

radius (see details below)

normalize

see details below

...

all other arguments are passed to dist as they are; the default distance is method="euclidean", see di

Value

All functions listed above return square matrices of similarities.

Details

negDistMat creates a square matrix of mutual pairwise similarities of data vectors as negative distances. The argument r (default is 1) is used to transform the resulting distances by computing the r-th power (use r=2 to obtain negative squared distances as in Frey's and Dueck's demos), i.e., given a distance d, the resulting similarity is computed as $s=-d^r$. Internally, the computation of distances is done using dist. All options of this function except diag and upper can be used, especially method which allows for selecting different distance measures.
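The transformation described above can be reproduced in base R for illustration. This is only a sketch of the formula $s=-d^r$ built on dist, not the package's own implementation; the toy data in x are made up for the example.

```r
# Hypothetical toy data: 4 samples (rows), 2 features (columns)
x <- rbind(c(0, 0), c(3, 4), c(6, 8), c(0, 1))

# Pairwise Euclidean distances, expanded to a full square matrix
d <- as.matrix(dist(x, method = "euclidean"))

# Negative r-th power of the distances; r = 2 gives negative squared distances.
# In R, ^ binds tighter than unary minus, so this is -(d^r).
r <- 2
s <- -d^r

s
```

Samples 1 and 2 are at distance 5, so with r=2 their similarity is -25. The diagonal is 0, which is the largest value in the matrix.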

expSimMat computes similarities in a way similar to negDistMat, but the transformation of distances to similarities is done in the following way: $$s=\exp\left(-\left(\frac{d}{w}\right)^r\right)$$ As above, r is an exponent. The parameter w controls the speed of descent. r=2 in conjunction with Euclidean distances corresponds to the well-known Gaussian/RBF kernel, whereas r=1 corresponds to the Laplace kernel. Note that these similarity measures can also be understood as fuzzy equality relations.
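As a rough illustration of the formula above, the same transformation can be written in base R. The data and the choice w=5 are assumptions for the example, and the code is a sketch of the formula rather than the package's implementation.

```r
# Hypothetical toy data: 3 samples, 2 features
x <- rbind(c(0, 0), c(3, 4), c(0, 1))

# Pairwise Euclidean distances as a square matrix
d <- as.matrix(dist(x))

# s = exp(-(d / w)^r); r = 2 with Euclidean distances is the Gaussian/RBF case
r <- 2
w <- 5
s <- exp(-(d / w)^r)

s
```

Samples 1 and 2 are at distance 5, which equals w, so their similarity is exp(-1). Larger values of w make the similarity decay more slowly with distance.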

linSimMat provides another way of transforming distances into similarities by applying the following transformation to a distance d: $$s=\max\left(0,1-\frac{d}{w}\right)$$ Here w corresponds to a maximal radius of interest. Note that this is a fuzzy equality relation with respect to the Lukasiewicz t-norm.
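The truncated linear transformation above can be sketched in base R as follows. The toy data and w=2 are assumptions chosen so that some pairs fall outside the radius; this mirrors the formula only and is not taken from the package source.

```r
# Hypothetical toy data: 3 samples, 2 features
x <- rbind(c(0, 0), c(3, 4), c(0, 1))

# Pairwise Euclidean distances as a square matrix
d <- as.matrix(dist(x))

# s = max(0, 1 - d / w): linear decay, cut off at zero beyond radius w
w <- 2
s <- 1 - d / w
s[s < 0] <- 0

s
```

Samples 1 and 3 are at distance 1, giving a similarity of 0.5, while samples 1 and 2 are at distance 5, beyond the radius w=2, so their similarity is 0.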

Unlike the above three functions, linKernel computes pairwise similarities as scalar products of data vectors, i.e., it corresponds, as the name suggests, to the linear kernel. If normalize=TRUE, the values are scaled to the unit sphere in the following way (for two samples x and y): $$s=\frac{\vec{x}^T\vec{y}}{\|\vec{x}\| \|\vec{y}\|}$$
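For illustration, the linear kernel and its normalized variant can be computed in base R as below. This is a sketch of the formulas under the assumption that no sample is the zero vector (otherwise the normalization divides by zero); the toy data are made up and the code is not the package's implementation.

```r
# Hypothetical toy data: 3 samples, 2 features (no all-zero rows)
x <- rbind(c(1, 0), c(1, 1), c(0, 2))

# Unnormalized linear kernel: all pairwise scalar products of the rows
k <- tcrossprod(x)          # same as x %*% t(x)

# Normalized variant: divide each entry by the product of the two row norms
norms <- sqrt(diag(k))
kn <- k / outer(norms, norms)

kn
```

After normalization every diagonal entry is 1, and each off-diagonal entry is the cosine of the angle between the two samples. Here samples 1 and 3 are orthogonal, so their normalized similarity is 0.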

References

http://www.bioinf.jku.at/software/apcluster

Bodenhofer, U., Kothmeier, A., and Hochreiter, S. (2011) APCluster: an R package for affinity propagation clustering. Bioinformatics 27, 2463-2464. DOI: 10.1093/bioinformatics/btr406.

Frey, B. J. and Dueck, D. (2007) Clustering by passing messages between data points. Science 315, 972-976. DOI: 10.1126/science.1136800.

Micchelli, C. A. (1986) Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx. 2, 11-20.

De Baets, B. and Mesiar, R. (1997) Pseudo-metrics and T-equivalences. J. Fuzzy Math. 5, 471-481.

Examples


## load the package that provides the functions used below
library(apcluster)

## create two Gaussian clouds
cl1 <- cbind(rnorm(100,0.2,0.05),rnorm(100,0.8,0.06))
cl2 <- cbind(rnorm(50,0.7,0.08),rnorm(50,0.3,0.05))
x <- rbind(cl1,cl2)

## create negative distance matrix (default Euclidean)
sim1 <- negDistMat(x)

## compute similarities as negative squared distances
## (in accordance with Frey's and Dueck's demos)
sim2 <- negDistMat(x, r=2)

## compute RBF kernel
sim3 <- expSimMat(x, r=2)
