dist
Matrix Distance/Similarity Computation
These functions compute and return the auto-distance/similarity matrix between either rows or columns of a matrix/data frame, or a list, as well as the cross-distance matrix between two matrices/data frames/lists.
 Keywords
 cluster
Usage
dist(x, y = NULL, method = NULL, ..., diag = FALSE, upper = FALSE, pairwise = FALSE, by_rows = TRUE, convert_similarities = TRUE, auto_convert_data_frames = TRUE)
simil(x, y = NULL, method = NULL, ..., diag = FALSE, upper = FALSE, pairwise = FALSE, by_rows = TRUE, convert_distances = TRUE, auto_convert_data_frames = TRUE)
pr_dist2simil(x)
pr_simil2dist(x)
as.dist(x, FUN = NULL)
as.simil(x, FUN = NULL)
## S3 method for class 'dist'
as.matrix(x, diag = 0, ...)
## S3 method for class 'simil'
as.matrix(x, diag = NA, ...)
Arguments
 x
 For dist and simil, a numeric matrix object, a data frame, or a list. A vector will be converted into a column matrix. For as.simil and as.dist, an object of class dist and simil, respectively, or a numeric matrix. For pr_dist2simil and pr_simil2dist, any numeric vector.
 y
 NULL, or an object similar to x.
 method
 a function, a registry entry, or a mnemonic string referencing the proximity measure. A list of all available measures can be obtained using pr_DB (see examples). The default for dist is "Euclidean", and for simil "correlation".
 diag
 logical value indicating whether the diagonal of the distance/similarity matrix should be printed by print.dist/print.simil. In the context of as.matrix, the value to use on the diagonal representing self-proximities.
 upper
 logical value indicating whether the upper triangle of the distance/similarity matrix should be printed by print.dist/print.simil.
 pairwise
 logical value indicating whether distances should be computed for the pairs of x and y only.
 by_rows
 logical indicating whether proximities should be computed between rows (TRUE) or columns (FALSE).
 convert_similarities, convert_distances
 logical indicating whether distances should be automatically converted into similarities (and the other way round) if needed.
 auto_convert_data_frames
 logical indicating whether data frames should be converted to matrices if all variables are numeric, or all are logical, or all are complex.
 FUN
 optional function to be used by as.dist and as.simil. If NULL, it is looked up in the method registry. If there is none specified there, FUN defaults to pr_simil2dist and pr_dist2simil, respectively.
 ...
 further arguments passed to the proximity function.
Details
The interface is fashioned after dist in package stats, but can
also compute cross-distances, and allows user extensions by means of
a registry of all proximity measures (see pr_DB).
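As a rough illustration of the extension mechanism, the sketch below registers a user-defined measure and then refers to it by name. It assumes the set_entry and delete_entry methods described on the pr_DB help page; the function abs_diff and the entry name "absdiff_sketch" are made up for this example and are not part of the package.

```r
library(proxy)

# Hypothetical measure: sum of absolute differences between two vectors.
abs_diff <- function(x, y) sum(abs(x - y))

# Register it under an illustrative name (not a built-in entry).
pr_DB$set_entry(FUN = abs_diff, names = "absdiff_sketch")

m <- matrix(c(1, 2, 3,
              4, 6, 8,
              0, 0, 1), ncol = 3, byrow = TRUE)

# Refer to the new entry by name ...
d_registry <- proxy::dist(m, method = "absdiff_sketch")
# ... or pass the function directly without registering it.
d_function <- proxy::dist(m, method = abs_diff)

print(d_registry)

# Clean up so the registry is left as it was found.
pr_DB$delete_entry("absdiff_sketch")
```

Registering is mainly useful when the measure should be reachable by a mnemonic string; for one-off use, passing the function as method is likely sufficient.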
Missing values are allowed but are excluded from all computations
involving the rows within which they occur. If some columns are
excluded in calculating a Euclidean, Manhattan, Canberra or
Minkowski distance, the sum is scaled up proportionally to the
number of columns used (compare dist in package stats).
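The scaling rule can be checked by hand against dist in package stats, which this paragraph points to for comparison. The sketch below uses base R only; the matrix m and the variable names are illustrative, and it is an assumption that the same numbers carry over unchanged to the dist function documented here.

```r
# Two rows, four columns; one value is missing in row "a".
m <- rbind(a = c(1, 2, NA, 4),
           b = c(2, 4, 6, 8))

# Columns where both rows are observed.
used <- !is.na(m["a", ]) & !is.na(m["b", ])

# Sum of squared differences over the usable columns only.
partial_sum <- sum((m["a", used] - m["b", used])^2)

# Scale the sum up by (total columns / columns used), then take the root.
scaled <- sqrt(partial_sum * ncol(m) / sum(used))

# Reference value from stats::dist, which documents the same scaling.
reference <- as.numeric(stats::dist(m, method = "euclidean"))

print(c(by_hand = scaled, stats_dist = reference))
```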
Data frames are silently coerced to matrix if all columns are of
(same) mode numeric or logical.
Distance measures can be used with simil, and similarity
measures with dist. In these cases, the result is transformed
accordingly using the specified coercion functions (default:
$pr_simil2dist(s) = 1 - s$ and $pr_dist2simil(d) = 1 / (1 + d)$).
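The default transformations are simple enough to write out. The sketch below re-implements them in base R under the hypothetical names simil_to_dist and dist_to_simil (these are not package functions), reading the first formula as 1 - s for a similarity s and the second as 1 / (1 + d) for a distance d.

```r
# Hypothetical helpers mirroring the documented defaults; not package API.
simil_to_dist <- function(s) 1 - s        # similarity -> distance
dist_to_simil <- function(d) 1 / (1 + d)  # distance -> similarity

s <- c(1, 0.75, 0.5, 0)
d <- c(0, 1, 3)

# A similarity of 1 maps to a distance of 0.
print(simil_to_dist(s))

# A distance of 0 maps to a similarity of 1; larger distances shrink toward 0.
print(dist_to_simil(d))
```

Under these formulas the two maps are not inverses of one another, so converting back and forth generally changes the values.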
Objects of class simil and dist can be converted into one
another using as.dist and as.simil, respectively.
Distance and similarity objects can conveniently be subset
(see examples). Note that duplicate indexes are silently ignored.
Value

Auto-distances/similarities are returned as an object of class
dist/simil, and cross-distances/similarities as an object of class
crossdist/crosssimil.
References
Anderberg, M.R. (1973), Cluster analysis for applications, 359 pp., Academic Press, New York, NY, USA.
Cox, T.F. and Cox, M.A.A. (2001), Multidimensional Scaling, Chapman and Hall.
Sokal, R.R. and Sneath, P.H.A. (1963), Principles of Numerical Taxonomy, W. H. Freeman and Co., San Francisco.
See Also
dist in package stats for compatibility information, and pr_DB for the proximity database.
Examples
### show available proximities
summary(pr_DB)
### get more information about a particular one
pr_DB$get_entry("Jaccard")
### binary data
x <- matrix(sample(c(FALSE, TRUE), 8, replace = TRUE), ncol = 2)
dist(x, method = "Jaccard")
### for real-valued data
dist(x, method = "eJaccard")
### for positive real-valued data
dist(x, method = "fJaccard")
### cross distances
dist(x, x, method = "Jaccard")
### pairwise (diagonal)
dist(x, x, method = "Jaccard",
     pairwise = TRUE)
### this is the same but less efficient
as.matrix(stats::dist(x, method = "binary"))
### numeric data
x <- matrix(rnorm(16), ncol = 4)
## test inheritance of names
rownames(x) <- LETTERS[1:4]
colnames(x) <- letters[1:4]
dist(x)
dist(x, x)
## custom distance function
f <- function(x, y) sum(x * y)
dist(x, f)
## working with lists
z <- unlist(apply(x, 1, list), recursive = FALSE)
(d <- dist(z))
dist(z, z)
## subsetting
d[[1:2]]
subset(d, c(1,3,4))
d[[c(1,2,2)]] # duplicate index gets ignored
## transformations and self-proximities
as.matrix(as.simil(d, function(x) exp(-x)), diag = 1)
## row and column indexes
row.dist(d)
col.dist(d)