
This function computes the distance/dissimilarity between two probability density functions.
distance(
x,
method = "euclidean",
p = NULL,
test.na = TRUE,
unit = "log",
est.prob = NULL,
use.row.names = FALSE,
as.dist.obj = FALSE,
diag = FALSE,
upper = FALSE
)
x: a numeric data.frame or matrix storing probability vectors, or a numeric data.frame or matrix storing counts (if est.prob is specified).
method: a character string indicating the distance measure that should be computed. Default: method = "euclidean".
p: power of the Minkowski distance.
test.na: a boolean value indicating whether input vectors should be tested for NA values. Computations are faster if test.na = FALSE.
unit: a character string specifying the logarithm unit that should be used to compute distances that depend on log computations. Default: unit = "log".
est.prob: method to estimate probabilities from input count vectors (non-probability vectors). Default: est.prob = NULL. Options are:
est.prob = "empirical": the relative frequencies of each vector are computed internally. For example, an input matrix rbind(1:10, 11:20) will be transformed to a probability matrix rbind(1:10 / sum(1:10), 11:20 / sum(11:20)).
use.row.names: a logical value indicating whether the row names of the input matrix shall be used as rownames and colnames of the output distance matrix. Default: use.row.names = FALSE.
as.dist.obj: shall the returned matrix be an object of class dist (see stats::dist)? Default: as.dist.obj = FALSE.
diag: if as.dist.obj = TRUE, this value indicates whether the diagonal of the distance matrix should be printed. Default: diag = FALSE.
upper: if as.dist.obj = TRUE, this value indicates whether the upper triangle of the distance matrix should be printed. Default: upper = FALSE.
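The effect of diag and upper is easiest to see on a plain dist object. The sketch below uses only base R (stats::as.dist) on a made-up symmetric matrix. It illustrates how the two flags change printing and is not taken from the implementation of distance().

```r
# Hypothetical 3 x 3 symmetric distance matrix (values are made up).
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3,
            dimnames = list(paste0("v", 1:3), paste0("v", 1:3)))

# Default printing: lower triangle only, no diagonal.
d_lower <- stats::as.dist(m)
print(d_lower)

# diag = TRUE and upper = TRUE print the full square layout.
d_full <- stats::as.dist(m, diag = TRUE, upper = TRUE)
print(d_full)
```

Both objects store the same three pairwise values; only the printed layout differs.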
The following results are returned depending on the dimension of x:
in case nrow(x) = 2: a single distance value.
in case nrow(x) > 2: a distance matrix storing distance values for all pairwise probability vector comparisons.
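The two return shapes can be mimicked in base R. The helper pairwise_euclidean() below is hypothetical and built on stats::dist. It is only a sketch of the described behaviour for the Euclidean case, not the package's own code.

```r
# Hypothetical helper: returns a single value for two rows,
# and a full symmetric matrix for more than two rows.
pairwise_euclidean <- function(x) {
  x <- as.matrix(x)
  d <- as.matrix(stats::dist(x, method = "euclidean"))
  dimnames(d) <- NULL
  if (nrow(x) == 2L) d[1, 2] else d
}

two_rows   <- rbind(c(0.5, 0.5), c(1, 0))
three_rows <- rbind(two_rows, c(0, 1))

single_value <- pairwise_euclidean(two_rows)    # one number
dist_matrix  <- pairwise_euclidean(three_rows)  # 3 x 3 matrix

print(single_value)
print(dist_matrix)
```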
Here a distance is defined as a quantitative degree of how far two mathematical objects are apart from each other (Cha, 2007).
This function implements the following distance/similarity measures to quantify the distance between probability density functions:
L_p Minkowski family
Euclidean :
Manhattan :
Minkowski :
Chebyshev :
L_1 family
Sorensen :
Gower :
Soergel :
Kulczynski d :
Canberra :
Lorentzian :
Intersection family
Intersection :
Non-Intersection :
Wave Hedges :
Czekanowski :
Motyka :
Kulczynski s :
Tanimoto :
Ruzicka :
Inner Product family
Inner Product :
Harmonic mean :
Cosine :
Kumar-Hassebrook (PCE) :
Jaccard :
Dice :
Squared-chord family
Fidelity :
Bhattacharyya :
Hellinger :
Matusita :
Squared-chord :
Squared L_2 family (X^2 squared family)
Squared Euclidean :
Pearson X^2 :
Neyman X^2 :
Squared X^2 :
Probabilistic Symmetric X^2 :
Divergence :
Clark :
Additive Symmetric X^2 :
Shannon's entropy family
Kullback-Leibler :
Jeffreys :
K divergence :
Topsoe :
Jensen-Shannon :
Jensen difference :
Combinations
Taneja :
Kumar-Johnson :
Avg(L_1, L_n) :
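As an illustration of the first family in the list, the L_p Minkowski measures can be written in a few lines of base R using their textbook definitions. The helper names below are hypothetical and the package's internal implementation may differ; P and Q are assumed to be numeric vectors of equal length.

```r
# Textbook definitions of the L_p Minkowski family (hypothetical helpers).
euclidean <- function(P, Q) sqrt(sum((P - Q)^2))
manhattan <- function(P, Q) sum(abs(P - Q))
minkowski <- function(P, Q, p) sum(abs(P - Q)^p)^(1 / p)
chebyshev <- function(P, Q) max(abs(P - Q))

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Minkowski with p = 2 reduces to Euclidean, with p = 1 to Manhattan.
print(c(euclidean = euclidean(P, Q), minkowski_p2 = minkowski(P, Q, 2)))
print(c(manhattan = manhattan(P, Q), minkowski_p1 = minkowski(P, Q, 1)))
print(c(chebyshev = chebyshev(P, Q)))
```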
In cases where x specifies a count matrix, the argument est.prob can be set to first estimate probability vectors from the input count vectors and then compute the corresponding distance measure based on the estimated probability vectors.
The following probability estimation methods are implemented in this function:
est.prob = "empirical": relative frequencies of counts.
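The empirical estimate amounts to dividing each row of counts by its row sum. A minimal base R sketch of that step, using the count matrix from the argument description (the variable names are illustrative):

```r
# Count matrix with one count vector per row.
counts <- rbind(1:10, 11:20)

# Relative frequencies: each row is divided by its own total.
probs <- counts / rowSums(counts)

# Every row now sums to 1 and can be treated as a probability vector.
print(rowSums(probs))
```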
Sung-Hyuk Cha. (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences 4: 1.
# NOT RUN {
# Simple Examples
# receive a list of implemented probability distance measures
getDistMethods()
## compute the euclidean distance between two probability vectors
distance(rbind(1:10/sum(1:10), 20:29/sum(20:29)), method = "euclidean")
## compute the euclidean distance between all pairwise comparisons of probability vectors
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
distance(ProbMatrix, method = "euclidean")
# compute distance matrix without testing for NA values in the input matrix
distance(ProbMatrix, method = "euclidean", test.na = FALSE)
# alternatively use the row names of the input data for the rownames and colnames
# of the output distance matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29),30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
# Specialized Examples
CountMatrix <- rbind(1:10, 20:29, 30:39)
## estimate probabilities from a count matrix
distance(CountMatrix, method = "euclidean", est.prob = "empirical")
## compute the euclidean distance for count data
## NOTE: some distance measures are only defined for probability values.
distance(CountMatrix, method = "euclidean")
## compute the Kullback-Leibler Divergence with different logarithm bases:
### case: unit = log (Default)
distance(ProbMatrix, method = "kullback-leibler", unit = "log")
### case: unit = log2
distance(ProbMatrix, method = "kullback-leibler", unit = "log2")
### case: unit = log10
distance(ProbMatrix, method = "kullback-leibler", unit = "log10")
# }
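The unit argument only changes the base of the logarithm, so log-based results differ by a constant factor. The sketch below shows this for the Kullback-Leibler divergence in base R, assuming the standard definition sum(P * log(P / Q)) and strictly positive entries. The helper kl_divergence() and its mapping of unit strings to log functions are hypothetical, not the package's code.

```r
# Hypothetical helper: KL divergence with a selectable logarithm base.
# Assumes P and Q are strictly positive and each sums to 1.
kl_divergence <- function(P, Q, unit = "log") {
  log_fun <- switch(unit,
                    log   = log,
                    log2  = log2,
                    log10 = log10,
                    stop("unknown unit: ", unit))
  sum(P * log_fun(P / Q))
}

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

kl_nat  <- kl_divergence(P, Q, unit = "log")    # natural log
kl_bits <- kl_divergence(P, Q, unit = "log2")   # base 2
kl_dec  <- kl_divergence(P, Q, unit = "log10")  # base 10

# The results differ only by the constant factors log(2) and log(10).
print(c(kl_nat, kl_bits * log(2), kl_dec * log(10)))
```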