Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix;
or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist
and can operate on sparse matrices (in canonical DSM format).
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
terms = NULL, terms2 = terms, skip.missing = FALSE)
M
a dense or sparse matrix representing a scored DSM, or an object of class dsm
M2
an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm. If present, cross-distances between the rows (or columns) of M and those of M2 will be computed.
method
distance or similarity measure to be used (see “Distance Measures” below for details)
p
exponent of the minkowski $L_p$-metric, a numeric value in the range $0 \le p < \infty$
normalized
if TRUE, assume that the row (or column) vectors of M and M2 have been appropriately normalized (depending on the selected distance measure) in order to speed up calculations. This option is often used with the cosine metric, for which vectors must be normalized with respect to the Euclidean norm. It is currently ignored for other distance measures.
byrow
whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE)
convert
if TRUE, similarity measures are automatically converted to distances in an appropriate way (see “Distance Measures” below for details). Note that this is the default setting, so convert=FALSE has to be specified explicitly in order to obtain a similarity matrix.
as.dist
convert the full symmetric distance matrix to a compact object of class dist. This option cannot be used if cross-distances are calculated (with argument M2) or if a similarity measure has been selected (with option convert=FALSE).
terms
a character vector specifying rows of M for which the distance matrix is to be computed (or columns if byrow=FALSE)
terms2
a character vector specifying rows of M2 for which the cross-distance matrix is to be computed (or columns if byrow=FALSE). If only the argument terms is specified, the same set of rows (or columns) will be selected from both M and M2; you can explicitly specify terms2=NULL in order to compute cross-distances for all rows (or columns) of M2.
skip.missing
if TRUE, silently ignore terms not found in M (or in M2). By default (skip.missing=FALSE) an error is raised in this case.
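The selection and cross-distance behaviour described for terms, terms2 and skip.missing can be mimicked in base R. The sketch below is an illustration only: select_rows() and cross_euclidean() are hypothetical helpers written for this example, not functions of the package, and the two small matrices are made-up data.

```r
# Hypothetical helper: pick rows by name, mirroring the documented
# skip.missing semantics (error by default, silent drop if TRUE).
select_rows <- function(M, terms = NULL, skip.missing = FALSE) {
  if (is.null(terms)) return(M)              # NULL selects all rows
  found <- terms %in% rownames(M)
  if (!all(found)) {
    if (!skip.missing) {
      stop("terms not found: ", paste(terms[!found], collapse = ", "))
    }
    terms <- terms[found]                    # silently drop unknown terms
  }
  M[terms, , drop = FALSE]
}

# Hypothetical helper: Euclidean cross-distances between rows of A and rows of B,
# using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
cross_euclidean <- function(A, B) {
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  sqrt(pmax(sq, 0))                          # clamp tiny negative rounding errors
}

# Made-up toy data
M  <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 3, byrow = TRUE,
             dimnames = list(c("cat", "dog", "fish"), c("f1", "f2")))
M2 <- matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE,
             dimnames = list(c("cat", "bird"), c("f1", "f2")))

A <- select_rows(M, c("cat", "dog"))
B <- select_rows(M2, c("cat", "dog"), skip.missing = TRUE)  # "dog" is not in M2
D <- cross_euclidean(A, B)                                  # rows of A x rows of B
```

The result D has one row per selected row of M and one column per selected row of M2, which is why such a cross-distance matrix is in general neither square nor symmetric.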
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE. If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
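To make the difference between a full symmetric matrix and a compact dist object concrete, here is a small base-R illustration using stats::as.dist on a made-up 3 x 3 distance matrix. It does not call the package itself; it only shows what the compact representation stores.

```r
# Made-up symmetric distance matrix with zero diagonal
m <- matrix(c(0, 3, 4,
              3, 0, 5,
              4, 5, 0), nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

d <- as.dist(m)        # keeps only the lower triangle
length(d)              # 3 stored values instead of 9
full <- as.matrix(d)   # expand back to the full symmetric matrix
```

Because only the lower triangle is kept, this representation is meaningful only for symmetric matrices, which fits the restriction that as.dist=TRUE cannot be combined with cross-distances.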
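Distance Measures
Given two DSM vectors $x$ and $y$, the following distance metrics can be computed:
euclidean
The Euclidean distance given by $d_2(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
manhattan
The Manhattan (or “city block”) distance given by $d_1(x, y) = \sum_i |x_i - y_i|$
maximum
The maximum distance given by $d_\infty(x, y) = \max_i |x_i - y_i|$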
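minkowski
The Minkowski distance is a family of metrics determined by a parameter $0 \le p < \infty$, which encompasses the Euclidean, Manhattan and maximum distance as special or limiting cases. Also known as the $L_p$-metric, it is defined by $d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$ for $p \ge 1$ and by $d_p(x, y) = \sum_i |x_i - y_i|^p$ for $0 \le p < 1$. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric $d_2(x, y)$ for $p = 2$ and the Manhattan metric $d_1(x, y)$ for $p = 1$. The setting p=Inf is not allowed. For $p \to \infty$, $d_p(x, y)$ converges to the maximum distance $d_\infty(x, y)$; for $p = 0$, $d_p(x, y)$ corresponds to the Hamming distance, i.e. the number of differences $d_0(x, y) = \#\{ i \mid x_i \ne y_i \}$.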
canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by $\sum_i |x_i - y_i| / (|x_i| + |y_i|)$. Note that dist uses a different formula.
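The relationship between the Minkowski family and its special cases can be checked numerically. The following base-R sketch uses a hypothetical helper minkowski_dist() written for this example (it is not part of the package). It assumes the usual $L_p$ formula with the $1/p$ root for $p \ge 1$, the plain sum of $|x_i - y_i|^p$ for $0 < p < 1$, and a count of differing positions for $p = 0$; the two vectors are made-up data.

```r
# Hypothetical helper, for illustration only (not a wordspace function).
# Assumption: no 1/p root is taken for p < 1; p = 0 counts differing positions.
minkowski_dist <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 0, is.finite(p))
  d <- abs(x - y)
  if (p == 0) return(sum(d != 0))   # Hamming distance
  if (p < 1)  return(sum(d^p))      # 0 < p < 1: sum without root
  sum(d^p)^(1 / p)                  # L_p metric for p >= 1
}

# Made-up vectors
x <- c(1, 0, 2, 0)
y <- c(0, 0, 4, 1)

minkowski_dist(x, y, p = 1)    # same as the Manhattan distance sum(abs(x - y))
minkowski_dist(x, y, p = 2)    # same as the Euclidean distance
minkowski_dist(x, y, p = 0)    # number of positions where x and y differ
max(abs(x - y))                # maximum distance, the limit for large p
minkowski_dist(x, y, p = 99)   # already very close to the maximum distance
```

For $p \ge 1$ the results can be cross-checked against base R, for example with dist(rbind(x, y), method = "minkowski", p = 3).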
In addition, the following similarity measures can be computed and optionally converted to a distance metric:
cosine (default)
The cosine similarity given by $\cos \phi = x^T y / (\|x\|_2 \, \|y\|_2)$. If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance $\phi$, given in degrees ranging from 0 to 180.
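As a worked illustration of the cosine measure, the base-R sketch below computes the cosine similarity of two made-up vectors, converts it to an angle, and shows why the denominator can be skipped for vectors of unit Euclidean length. The helpers cosine_sim() and angular_dist() are hypothetical names for this example, and expressing the angle in degrees is an assumption about how the conversion is reported.

```r
# Hypothetical helpers, for illustration only (not wordspace functions)
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Assumption: the angular distance is the angle between the vectors in degrees
angular_dist <- function(sim) {
  sim <- pmin(pmax(sim, -1), 1)   # guard against rounding outside [-1, 1]
  acos(sim) * 180 / pi
}

# Made-up vectors
x <- c(1, 0)
y <- c(1, 1)

sim <- cosine_sim(x, y)    # 1 / sqrt(2)
phi <- angular_dist(sim)   # approximately 45

# For unit-length vectors the denominator equals 1,
# so the dot product alone gives the cosine similarity
xn <- x / sqrt(sum(x^2))
yn <- y / sqrt(sum(y^2))
sim_normalized <- sum(xn * yn)
```

Unlike the cosine similarity, where larger values mean more similar vectors, the angle grows as the vectors diverge, which makes it usable wherever a distance is expected.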
library(wordspace)  # provides dist.matrix() and the sample data DSM_TermTermMatrix

M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE)                                # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE)            # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE)            # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE)       # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE)      # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE)              # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE)      # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE)       # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3)    # cosine similarity