The results obtained for Euclidean dissimilarity are equivalent to those
returned by the stats::dist() function, but are scaled differently. However,
f_diss is considerably faster (which can be advantageous when computing
dissimilarities for very large matrices). The final scaling of the
dissimilarity scores in f_diss also differs: the number of variables is used
to scale the squared dissimilarity scores. See the examples section for a
comparison between stats::dist() and f_diss.
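This relationship can be checked directly. The following is a minimal sketch
assuming the f_diss() interface of the resemble package (arguments Xr,
diss_method, center and scale); given the scaling described below, the
Euclidean scores returned by f_diss should match those of stats::dist()
divided by the square root of the number of variables:

library(resemble)

set.seed(1)
X <- matrix(rnorm(20 * 10), nrow = 20)   # 20 observations, 10 variables

## scaled Euclidean dissimilarities as computed by f_diss
f_d <- f_diss(Xr = X, diss_method = "euclid", center = FALSE, scale = FALSE)

## stats::dist() distances rescaled by sqrt(p), following the formula below
d_d <- as.matrix(dist(X)) / sqrt(ncol(X))

max(abs(f_d - d_d))   # should be (numerically) zero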
In the case of both the Euclidean and Mahalanobis distances, the scaled
dissimilarity matrix D between observations in a given matrix X is computed
as follows:
\[
d(x_i, x_j)^2 = (x_i - x_j) M^{-1} (x_i - x_j)^{T}
\]
\[
d_{scaled}(x_i, x_j) = \sqrt{\tfrac{1}{p}\, d(x_i, x_j)^2}
\]
where p is the number of variables in X, and M is the identity matrix in the
case of the Euclidean distance and the variance-covariance matrix of X in the
case of the Mahalanobis distance. The Mahalanobis distance can also be viewed
as the Euclidean distance after applying a linear transformation of the
original variables. Such a linear transformation is done by using a
factorization of the inverse covariance matrix as M^-1 = W^T W, where W is
merely the square root of M^-1, which can be found by using a singular value
decomposition.
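As an illustration of this factorization, here is a minimal base R sketch (X,
M_inv, W and the other names are hypothetical, not part of the package) that
reproduces the Mahalanobis distance as a Euclidean distance on linearly
transformed variables:

set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30)      # 30 observations, 5 variables

## factorize the inverse covariance matrix: M^-1 = W^T W
M_inv <- solve(cov(X))
sv <- svd(M_inv)                           # M_inv is symmetric, so u == v
W <- diag(sqrt(sv$d)) %*% t(sv$u)          # "square root" of M^-1

## Mahalanobis distance as a Euclidean distance on the transformed variables
X_t <- X %*% t(W)
d_mahalanobis <- as.matrix(dist(X_t))

## direct quadratic form for one pair of observations; equals d_mahalanobis[1, 2]
dif <- X[1, ] - X[2, ]
sqrt(drop(t(dif) %*% M_inv %*% dif))

Dividing these distances by the square root of the number of variables gives
the scaled dissimilarity defined above.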
Note that when attempting to compute the Mahalanobis distance on a dataset
with highly correlated variables (e.g., spectral variables), the
variance-covariance matrix may be singular and therefore cannot be inverted,
in which case the distance cannot be computed. This is also the case when the
number of observations in the dataset is smaller than the number of
variables. For the computation of the Mahalanobis distance, the factorization
method described above is used.
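A small base R sketch illustrating this caveat (the object names are
hypothetical): when there are fewer observations than variables, the
covariance matrix is rank deficient and its inverse does not exist.

set.seed(1)
X_wide <- matrix(rnorm(10 * 50), nrow = 10)   # 10 observations, 50 variables

## cov(X_wide) has rank at most 9 here, so it is singular
qr(cov(X_wide))$rank     # smaller than ncol(X_wide)
try(solve(cov(X_wide)))  # fails: the system is computationally singular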
The cosine dissimilarity c between two observations x_i and x_j is computed
as follows:
\[
c(x_i, x_j) = \cos^{-1}\left(\frac{\sum_{k=1}^{p} x_{i,k}\, x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^2}\ \sqrt{\sum_{k=1}^{p} x_{j,k}^2}}\right)
\]
where p is the number of variables of the observations.
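A direct translation of this formula into base R (a minimal sketch; cos_diss
is a hypothetical helper name, not part of the package):

## cosine dissimilarity between two observations, following the formula above
cos_diss <- function(xi, xj) {
  acos(sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2))))
}

set.seed(1)
xi <- rnorm(10)
xj <- rnorm(10)
cos_diss(xi, xj)   # dissimilarity in radians, in [0, pi]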
The function does not accept input data containing missing values.
NOTE: The computed distances are divided by the number of variables/columns
in Xr.