distance: Flexibly calculate dissimilarity or distance measures

Description

Flexibly calculates distance or dissimilarity measures between a training set x and a fossil or test set y. If y is not supplied then the pairwise dissimilarities between samples in the training set, x, are calculated.

Usage

distance(x, ...)
## S3 method for class 'default':
distance(x, y, method = c("euclidean", "SQeuclidean",
         "chord", "SQchord", "bray", "chi.square",
         "SQchi.square", "information", "chi.distance",
         "manhattan", "kendall", "gower", "alt.gower",
         "mixed"),
         fast = TRUE,
         weights = NULL, R = NULL, ...)
## S3 method for class 'join':
distance(x, ...)

Arguments

x

data frame or matrix containing the training set samples, or an object of class join.

y

data frame or matrix containing the fossil or test set samples.

method

character; which choice of dissimilarity coefficient to use. One of the listed options. See Details below.

fast

logical; should fast versions of the dissimilarities be calculated? See Details below.

weights

numeric; vector of weights for each descriptor.

R

numeric; vector of ranges for each descriptor.

...

arguments passed to other methods

Value

A matrix of dissimilarities where columns are the samples in y and the rows the samples in x. If y is not provided then a square, symmetric matrix of pairwise sample dissimilarities for the training set x is returned.
The dissimilarity coefficient used (method) is returned as attribute "method".
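
The layout of the returned object can be illustrated with a minimal base R sketch. The helper cross_euclidean() below is hypothetical and is not part of analogue; it only mimics the documented shape (rows for samples in x, columns for samples in y, plus a "method" attribute) for the Euclidean case.

```r
## Hypothetical helper (NOT part of analogue): Euclidean distances between
## the rows of x and the rows of y, laid out as described above.
cross_euclidean <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  d <- matrix(0, nrow = nrow(x), ncol = nrow(y),
              dimnames = list(rownames(x), rownames(y)))
  for (j in seq_len(nrow(x))) {
    for (k in seq_len(nrow(y))) {
      d[j, k] <- sqrt(sum((x[j, ] - y[k, ])^2))
    }
  }
  ## record which coefficient was used, as an attribute
  attr(d, "method") <- "euclidean"
  d
}

set.seed(1)
x <- matrix(runif(12), nrow = 4, dimnames = list(LETTERS[1:4], NULL))
y <- matrix(runif(6),  nrow = 2, dimnames = list(letters[1:2], NULL))

d <- cross_euclidean(x, y)
dim(d)             # rows = samples in x, columns = samples in y
attr(d, "method")  # the coefficient used
```

Here x has four samples and y has two, so the result has four rows and two columns.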

concept

dissimilarity
dissimilarity coefficient
similarity

warning

For method = "mixed" it is essential that each factor has the same levels in x and y. Previous versions of analogue would run even when this was not the case, and will have generated incorrect dissimilarities for method = "mixed" where the factor for a given species had different levels in x and y.

distance now checks for matching levels for each species (column) recorded as a factor. If the factor for any individual species has different levels in x and y, an error will be issued.
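
Mismatched levels can be caught before calling distance. The following is a minimal sketch in base R; mismatched_levels() is a hypothetical helper, not a function from analogue, and it only inspects columns that are factors in both data frames.

```r
## Hypothetical pre-flight check (NOT part of analogue): report columns that
## are factors in both data frames but whose levels differ.
mismatched_levels <- function(x, y) {
  shared <- intersect(names(x), names(y))
  bad <- vapply(shared, function(nm) {
    is.factor(x[[nm]]) && is.factor(y[[nm]]) &&
      !identical(levels(x[[nm]]), levels(y[[nm]]))
  }, logical(1))
  shared[bad]
}

train  <- data.frame(a = c(0.1, 0.4, 0.2), b = factor(c("lo", "hi", "mid")))
fossil <- data.frame(a = c(0.3, 0.5),      b = factor(c("lo", "hi")))
mismatched_levels(train, fossil)   # "b": the level sets differ

## One way to repair it: rebuild the factor using the training-set levels
fossil$b <- factor(fossil$b, levels = levels(train$b))
mismatched_levels(train, fossil)   # character(0)
```

Rebuilding the factor with an explicit levels argument keeps the observed values unchanged and only extends the set of permitted levels.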

Details

A range of dissimilarity coefficients can be used to calculate dissimilarity between samples. The following are currently available:

euclidean: $d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2}$
SQeuclidean: $d_{jk} = \sum_i (x_{ij}-x_{ik})^2$
chord: $d_{jk} = \sqrt{\sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2}$
SQchord: $d_{jk} = \sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2$
bray: $d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}$
chi.square: $d_{jk} = \sqrt{\sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}}$
SQchi.square: $d_{jk} = \sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}$
information: $d_{jk} = \sum_i \left(p_{ij}\log\left(\frac{2p_{ij}}{p_{ij} + p_{ik}}\right) + p_{ik}\log\left(\frac{2p_{ik}}{p_{ij} + p_{ik}}\right)\right)$
chi.distance: $d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2 / (x_{i+} / x_{++})}$
manhattan: $d_{jk} = \sum_i (|x_{ij}-x_{ik}|)$
kendall: $d_{jk} = \sum_i MAX_i - \min(x_{ij}, x_{ik})$
gower: $d_{jk} = \sum_i\frac{|p_{ij} - p_{ik}|}{R_i}$
alt.gower: $d_{jk} = \sqrt{2\sum_i\frac{|p_{ij} - p_{ik}|}{R_i}}$, where $R_i$ is the range of proportions for descriptor (variable) $i$
mixed: $d_{jk} = \frac{\sum_{i=1}^p w_{i}s_{jki}}{\sum_{i=1}^p w_{i}}$, where $w_i$ is the weight for descriptor $i$ and $s_{jki}$ is the similarity between samples $j$ and $k$ for descriptor (variable) $i$
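
To make the notation concrete, several of these coefficients can be written out directly in base R for a single pair of samples. This is an illustrative sketch only: the vectors xj and xk are made-up data, and the code follows the formulas as printed rather than the package's own implementation.

```r
## Two hypothetical samples j and k described by five descriptors
xj <- c(0.10, 0.30, 0.00, 0.40, 0.20)
xk <- c(0.20, 0.10, 0.10, 0.40, 0.20)

euclidean    <- sqrt(sum((xj - xk)^2))
SQeuclidean  <- sum((xj - xk)^2)
chord        <- sqrt(sum((sqrt(xj) - sqrt(xk))^2))
SQchord      <- sum((sqrt(xj) - sqrt(xk))^2)
bray         <- sum(abs(xj - xk)) / sum(xj + xk)
manhattan    <- sum(abs(xj - xk))
## note: a descriptor that is zero in BOTH samples would give 0/0 here
SQchi.square <- sum((xj - xk)^2 / (xj + xk))
chi.square   <- sqrt(SQchi.square)

round(c(euclidean = euclidean, SQeuclidean = SQeuclidean,
        chord = chord, SQchord = SQchord, bray = bray,
        manhattan = manhattan, chi.square = chi.square,
        SQchi.square = SQchi.square), 4)
```

Each "SQ" coefficient is simply the square of its counterpart, which is why the pairs share a single sum in the code.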

Argument fast determines whether fast C versions of some of the dissimilarity coefficients are used. The fast versions make use of dist for methods "euclidean", "SQeuclidean", "chord" and "SQchord", and vegdist for method = "bray". These fast versions are used only when x alone is supplied, not when y is also supplied. Future versions of distance will include fast C versions of all the dissimilarity coefficients and for cases where y is supplied.
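
The reason dist can serve the chord-type coefficients is that the chord distance is the Euclidean distance between square-root transformed samples. The sketch below checks that equivalence in base R on made-up data; it illustrates the mathematics and is not a copy of what the package does internally.

```r
set.seed(42)
x <- matrix(runif(30), nrow = 5)   # 5 hypothetical samples, 6 descriptors

## chord distance written out from its definition for every pair of rows
n <- nrow(x)
by_hand <- matrix(0, n, n)
for (j in seq_len(n)) {
  for (k in seq_len(n)) {
    by_hand[j, k] <- sqrt(sum((sqrt(x[j, ]) - sqrt(x[k, ]))^2))
  }
}

## the same quantity from compiled code: Euclidean distance on sqrt(x)
via_dist <- as.matrix(dist(sqrt(x), method = "euclidean"))

all.equal(by_hand, via_dist, check.attributes = FALSE)
```

Squaring either matrix gives the corresponding squared chord distances.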

References

Faith, D.P., Minchin, P.R. and Belbin, L. (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57--68.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356--367.

Kendall, D.G. (1970) A mathematical approach to seriation. Philosophical Transactions of the Royal Society of London - Series B 269, 125--135.

Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English Edition. Elsevier Science BV, The Netherlands.

Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87--108.

Prentice, I.C. (1980) Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71--104.

Examples

## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]

## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)

## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")

## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)

## Using distance on an object of class join
dists <- distance(join(train, fossil))
str(dists)

## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors
fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## now fit the mixed coefficient
test3 <- distance(train, fossil, "mixed")

## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges

distance(x1, x2, method = "mixed", R = Rj)

## note this gives 1 - 0.66 (not 0.66 as in Legendre &
## Legendre) because the result is expressed as a
## distance, whereas Legendre & Legendre describe the
## coefficient as a similarity coefficient
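
The 0.66 figure can be reproduced by hand in base R. The sketch below assumes that all eight descriptors are quantitative, that every descriptor has unit weight, and that a descriptor with a missing value receives weight zero; these assumptions match the Legendre & Legendre example but are not taken from the package source.

```r
x1 <- c(2, 2, NA, 2, 2, 4, 2, 6)
x2 <- c(1, 3, 3, 1, 2, 2, 2, 5)
Rj <- c(1, 4, 2, 4, 1, 3, 2, 5)   # supplied ranges

## per-descriptor similarity for quantitative descriptors (assumed form)
s <- 1 - abs(x1 - x2) / Rj

## weight 0 where a value is missing, 1 otherwise (assumed unit weights)
w <- as.numeric(!is.na(s))

similarity <- sum(w * s, na.rm = TRUE) / sum(w)
round(similarity, 2)       # 0.66, the similarity quoted above
round(1 - similarity, 2)   # 0.34, the same result expressed as a distance
```

Seven descriptors contribute to the sum because the third is missing in x1.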
