# distance

##### Flexibly calculate dissimilarity or distance measures

Flexibly calculates distance or dissimilarity measures between a
training set `x` and a fossil or test set `y`. If `y` is not supplied
then the pairwise dissimilarities between samples in the training set,
`x`, are calculated.

- Keywords
- multivariate, methods

##### Usage

```
distance(x, ...)

## S3 method for class 'default':
distance(x, y, method = c("euclidean", "SQeuclidean",
         "chord", "SQchord", "bray", "chi.square",
         "SQchi.square", "information", "chi.distance",
         "manhattan", "kendall", "gower", "alt.gower",
         "mixed"),
         fast = TRUE,
         weights = NULL, R = NULL, ...)

## S3 method for class 'join':
distance(x, ...)
```

##### Arguments

- x
- data frame or matrix containing the training set samples, or an object of class `join`.
- y
- data frame or matrix containing the fossil or test set samples.
- method
- character; which choice of dissimilarity coefficient to use. One of the listed options. See Details below.
- fast
- logical; should fast versions of the dissimilarities be calculated? See Details below.
- weights
- numeric; vector of weights for each descriptor.
- R
- numeric; vector of ranges for each descriptor.
- ...
- arguments passed to other methods

##### Details

A range of dissimilarity coefficients can be used to calculate
dissimilarity between samples. The following are currently available:

`euclidean`

$d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2}$

`SQeuclidean`

$d_{jk} = \sum_i (x_{ij}-x_{ik})^2$

`chord`

$d_{jk} = \sqrt{\sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2}$

`SQchord`

$d_{jk} = \sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2$

`bray`

$d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}$

`chi.square`

$d_{jk} = \sqrt{\sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}}$

`SQchi.square`

$d_{jk} = \sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}$

`information`

$d_{jk} = \sum_i \left(p_{ij}\log\left(\frac{2p_{ij}}{p_{ij} + p_{ik}}\right) + p_{ik}\log\left(\frac{2p_{ik}}{p_{ij} + p_{ik}}\right)\right)$

`chi.distance`

$d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2 / (x_{i+} / x_{++})}$

`manhattan`

$d_{jk} = \sum_i (|x_{ij}-x_{ik}|)$

`kendall`

$d_{jk} = \sum_i \mathrm{MAX}_i - \min(x_{ij}, x_{ik})$

`gower`

$d_{jk} = \sum_i\frac{|p_{ij} - p_{ik}|}{R_i}$

`alt.gower`

$d_{jk} = \sqrt{2\sum_i\frac{|p_{ij} - p_{ik}|}{R_i}}$

where $R_i$ is the range of proportions for descriptor (variable) $i$.

`mixed`

$d_{jk} = \frac{\sum_{i=1}^p w_{i}s_{jki}}{\sum_{i=1}^p w_{i}}$

where $w_i$ is the weight for descriptor $i$ and $s_{jki}$ is the
similarity between samples $j$ and $k$ for descriptor (variable) $i$.
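To make the notation concrete, the following sketch transcribes a few of the simpler formulae into base R for a single pair of samples. The vectors `xj` and `xk` are made-up data, and the code is a direct reading of the formulae, not the package's own implementation.

```r
## Two hypothetical samples (j and k) described by four proportions each
xj <- c(0.50, 0.30, 0.20, 0.00)
xk <- c(0.25, 0.25, 0.25, 0.25)

## Direct transcriptions of some of the formulae listed in Details
euclidean   <- sqrt(sum((xj - xk)^2))
SQeuclidean <- sum((xj - xk)^2)
chord       <- sqrt(sum((sqrt(xj) - sqrt(xk))^2))
SQchord     <- sum((sqrt(xj) - sqrt(xk))^2)
bray        <- sum(abs(xj - xk)) / sum(xj + xk)
manhattan   <- sum(abs(xj - xk))

## Collect the results for inspection
round(c(euclidean = euclidean, SQeuclidean = SQeuclidean,
        chord = chord, SQchord = SQchord,
        bray = bray, manhattan = manhattan), 4)
```

The squared variants are simply the squares of their unsquared counterparts, which is why they appear in pairs in the list of methods.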

Argument `fast` determines whether fast C versions of some of the
dissimilarity coefficients are used. The fast versions make use of
`dist` for `method`s `"euclidean"`, `"SQeuclidean"`, `"chord"`,
`"SQchord"`, and `vegdist` for `method == "bray"`. These fast versions
are used only when `x` alone is supplied, not when `y` is also
supplied. Future versions of `distance` will include fast C versions
of all the dissimilarity coefficients and for cases where `y` is
supplied.
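The four coefficients handled through `dist` are all related to the Euclidean distance, which the sketch below illustrates with base R only. The matrix `x` is made-up data, and how `distance` calls `dist` internally is an assumption here; the code only shows that the formulae in Details can be obtained this way.

```r
## Hypothetical training set: 5 samples, 4 non-negative descriptors
set.seed(1)
x <- matrix(runif(20), nrow = 5,
            dimnames = list(paste0("s", 1:5), paste0("v", 1:4)))

## stats::dist() computes Euclidean distances in compiled code
d_euclidean   <- as.matrix(dist(x))          # "euclidean"
d_SQeuclidean <- as.matrix(dist(x))^2        # "SQeuclidean"
d_chord       <- as.matrix(dist(sqrt(x)))    # "chord"
d_SQchord     <- as.matrix(dist(sqrt(x)))^2  # "SQchord"

## Check one pair against the chord formula written out by hand
manual <- sqrt(sum((sqrt(x["s1", ]) - sqrt(x["s2", ]))^2))
all.equal(d_chord["s1", "s2"], manual)
```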

##### Value

- A matrix of dissimilarities where columns are the samples in `y` and the rows the samples in `x`. If `y` is not provided then a square, symmetric matrix of pairwise sample dissimilarities for the training set `x` is returned. The dissimilarity coefficient used (`method`) is returned as attribute `"method"`.

##### Note

The dissimilarities are calculated in native R code. As such, other
implementations (see See Also below) will be quicker. This is done for
one main reason: it is hoped to allow a user-defined function to be
supplied as argument `method`, allowing user extension of the
available coefficients.

The other advantage of `distance` over other implementations is the
simplicity of calculating only the required pairwise sample
dissimilarities between each fossil sample (`y`) and each training set
sample (`x`). To do this in other implementations, you would need to
merge the two sets of samples, calculate the full dissimilarity matrix
and then subset it to achieve similar results.
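For comparison, the merge, compute and subset workaround described above might look like the following with `dist` from base R. The `train` and `fossil` matrices are made-up data and Euclidean distance is used purely for illustration.

```r
## Hypothetical data: 10 training samples and 3 fossil samples
set.seed(42)
train  <- matrix(runif(40), ncol = 4, dimnames = list(LETTERS[1:10], NULL))
fossil <- matrix(runif(12), ncol = 4, dimnames = list(letters[1:3], NULL))

## 1. merge the two sets, 2. compute the full 13 x 13 matrix,
## 3. keep only the training-by-fossil block
full  <- as.matrix(dist(rbind(train, fossil)))
cross <- full[rownames(train), rownames(fossil)]

dim(full)   # all pairs, most of which are not needed
dim(cross)  # rows are training samples, columns are fossil samples
```

The full matrix also contains the train-by-train and fossil-by-fossil blocks, which are computed and then discarded.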

##### concept

- dissimilarity
- dissimilarity coefficient
- similarity

##### warning

For `method = "mixed"` it is essential that a factor in `x` and `y`
has the same levels in the two data frames. Previous versions of
analogue would work even if this was not the case, which will have
generated incorrect dissimilarities for `method = "mixed"` in cases
where factors for a given species had different levels in `x` and `y`.

`distance` now checks for matching levels for each species (column)
recorded as a factor. If the factor for any individual species has
different levels in `x` and `y`, an error will be issued.
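One way to prepare data so that the levels match is to give each factor column the union of the levels observed in both data frames before calling `distance`. The sketch below uses base R only; the data frames and the column names `sp1` and `sp2` are hypothetical, and whether the union of levels is the right choice depends on what the factor represents.

```r
## Hypothetical training and fossil sets with one factor column each
train  <- data.frame(sp1 = c(0.2, 0.5, 0.3),
                     sp2 = factor(c("a", "b", "a")))
fossil <- data.frame(sp1 = c(0.4, 0.1),
                     sp2 = factor(c("b", "c")))

## Give every factor column the union of the levels found in both sets
is_fac <- vapply(train, is.factor, logical(1))
for (nm in names(train)[is_fac]) {
    lev <- union(levels(train[[nm]]), levels(fossil[[nm]]))
    train[[nm]]  <- factor(train[[nm]],  levels = lev)
    fossil[[nm]] <- factor(fossil[[nm]], levels = lev)
}

## The levels now agree, while the recorded values are unchanged
identical(levels(train$sp2), levels(fossil$sp2))
```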

##### References

Faith, D.P., Minchin, P.R. and Belbin, L. (1987) Compositional
dissimilarity as a robust measure of ecological
distance. *Vegetatio* **69**, 57–68.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A
statistical approach to evaluating distance metrics and analog
assignments for pollen records. *Quaternary Research*
**60**, 356–367.

Kendall, D.G. (1970) A mathematical approach to
seriation. *Philosophical Transactions of the Royal Society of
London - Series B* **269**, 125–135.

Legendre, P. and Legendre, L. (1998) *Numerical Ecology*, 2nd
English Edition. Elsevier Science BV, The Netherlands.

Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative
interpretation of fossil pollen spectra: dissimilarity coefficients and
the method of modern analogues. *Quaternary Research* **23**,
87–108.

Prentice, I.C. (1980) Multidimensional scaling as a research tool in
Quaternary palynology: a review of theory and methods. *Review of
Palaeobotany and Palynology* **31**, 71–104.

##### See Also

`vegdist` in package vegan, `daisy` in package cluster, and `dist` in
package stats provide comparable functionality for the case of missing
`y` and are implemented in compiled code, so will be faster.

##### Examples

```
## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]
## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)
## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")
## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)
## Using distance on an object of class join
dists <- distance(join(train, fossil))
str(dists)
## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors
fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## now fit the mixed coefficient
test3 <- distance(train, fossil, "mixed")
## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges
distance(x1, x2, method = "mixed", R = Rj)
## note this gives 1 - 0.66 (not 0.66 as the answer in
## Legendre & Legendre) as this is expressed as a
## distance whereas Legendre & Legendre describe the
## coefficient as a similarity coefficient
```

*Documentation reproduced from package analogue, version 0.10-0, License: GPL-2*