analogue (version 0.18.1)

distance: Flexibly calculate dissimilarity or distance measures

Description

Flexibly calculates distance or dissimilarity measures between a training set x and a fossil or test set y. If y is not supplied then the pairwise dissimilarities between samples in the training set, x, are calculated.

Usage

distance(x, ...)

# S3 method for default
distance(x, y, method = "euclidean", weights = NULL, R = NULL,
         dist = FALSE, double.zero = FALSE, ...)

# S3 method for join
distance(x, ...)

oldDistance(x, ...)

# S3 method for default
oldDistance(x, y, method = c("euclidean", "SQeuclidean", "chord", "SQchord",
                             "bray", "chi.square", "SQchi.square",
                             "information", "chi.distance", "manhattan",
                             "kendall", "gower", "alt.gower", "mixed"),
            fast = TRUE, weights = NULL, R = NULL, ...)

# S3 method for join
oldDistance(x, ...)

Arguments

Value

A matrix of dissimilarities where the columns are the samples in y and the rows are the samples in x. If y is not provided then a square, symmetric matrix of pairwise sample dissimilarities for the training set x is returned, unless argument dist is TRUE, in which case an object of class "dist" is returned. See dist.

The dissimilarity coefficient used (method) is returned as attribute "method". Attribute "type" indicates whether the object was computed on a single data matrix ("symmetric") or across two matrices, i.e. the dissimilarities between the rows of two matrices ("asymmetric").
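
The layout of the returned objects can be illustrated without the package. The following is a minimal base-R sketch, not the package's implementation: cross_euclid is a hypothetical helper that evaluates the Euclidean coefficient for every pair of rows, and the data are random dummy values.

```r
# Base-R illustration only; x, y and cross_euclid() are hypothetical names
set.seed(1)
x <- matrix(runif(12), nrow = 4, dimnames = list(LETTERS[1:4], NULL)) # 4 training samples
y <- matrix(runif(6),  nrow = 2, dimnames = list(letters[1:2], NULL)) # 2 test samples

# Euclidean distance between every row of `a` and every row of `b`
cross_euclid <- function(a, b) {
  d <- outer(seq_len(nrow(a)), seq_len(nrow(b)),
             Vectorize(function(j, k) sqrt(sum((a[j, ] - b[k, ])^2))))
  dimnames(d) <- list(rownames(a), rownames(b))
  d
}

d_xy   <- cross_euclid(x, y)  # 4 x 2: rows are x samples, columns are y samples
d_xx   <- cross_euclid(x, x)  # 4 x 4: square and symmetric
d_dist <- as.dist(d_xx)       # compact "dist" object storing the lower triangle
```

The two-matrix result has no "dist" representation because it is not square, which is consistent with the dist argument only applying when y is missing.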

Details

A range of dissimilarity coefficients can be used to calculate dissimilarity between samples. The following are currently available:

euclidean: \(d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2}\)
SQeuclidean: \(d_{jk} = \sum_i (x_{ij}-x_{ik})^2\)
chord: \(d_{jk} = \sqrt{\sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2}\)
SQchord: \(d_{jk} = \sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2\)
bray: \(d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}\)
chi.square: \(d_{jk} = \sqrt{\sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}}\)
SQchi.square: \(d_{jk} = \sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}\)
information: \(d_{jk} = \sum_i \left(p_{ij}\log\left(\frac{2p_{ij}}{p_{ij} + p_{ik}}\right) + p_{ik}\log\left(\frac{2p_{ik}}{p_{ij} + p_{ik}}\right)\right)\)
chi.distance: \(d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2 / (x_{i+} / x_{++})}\)
manhattan: \(d_{jk} = \sum_i |x_{ij}-x_{ik}|\)
kendall: \(d_{jk} = \sum_i \mathrm{MAX}_i - \min(x_{ij}, x_{ik})\)
gower: \(d_{jk} = \sum_i\frac{|p_{ij} - p_{ik}|}{R_i}\)
alt.gower: \(d_{jk} = \sqrt{2\sum_i\frac{|p_{ij} - p_{ik}|}{R_i}}\)
    where \(R_i\) is the range of proportions for descriptor (variable) \(i\)
mixed: \(d_{jk} = \frac{\sum_{i=1}^p w_{i}s_{jki}}{\sum_{i=1}^p w_{i}}\)
    where \(w_i\) is the weight for descriptor \(i\) and \(s_{jki}\) is the similarity between samples \(j\) and \(k\) for descriptor (variable) \(i\)
metric.mixed: as for mixed but with ordinal variables converted to ranks and handled as quantitative variables in Gower's mixed coefficient.
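
To make the notation concrete, a few of the simpler coefficients can be evaluated directly from their definitions. This is a minimal base-R sketch for a single pair of samples; xj and xk are made-up proportion vectors, and the code is a literal transcription of the formulas rather than the package's own code.

```r
# Two hypothetical samples j and k, each described by four proportions
xj <- c(0.50, 0.30, 0.20, 0.00)
xk <- c(0.25, 0.25, 0.25, 0.25)

euclidean   <- sqrt(sum((xj - xk)^2))
SQeuclidean <- sum((xj - xk)^2)
chord       <- sqrt(sum((sqrt(xj) - sqrt(xk))^2))
SQchord     <- sum((sqrt(xj) - sqrt(xk))^2)
bray        <- sum(abs(xj - xk)) / sum(xj + xk)
manhattan   <- sum(abs(xj - xk))

# Collect the results in a named vector for inspection
round(c(euclidean = euclidean, SQeuclidean = SQeuclidean, chord = chord,
        SQchord = SQchord, bray = bray, manhattan = manhattan), 4)
```

Each "SQ" coefficient is simply the square of its counterpart, so it ranks pairs of samples identically while giving larger differences more weight.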

Argument fast of oldDistance determines whether fast C versions of some of the dissimilarity coefficients are used. The fast versions make use of dist for methods "euclidean", "SQeuclidean", "chord", "SQchord", and vegdist for method == "bray". These fast versions are used only when x alone is supplied, not when y is also supplied. Future versions of distance will include fast C versions of all the dissimilarity coefficients and for cases where y is supplied.
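
One way dist can serve the chord family is through a square-root transformation, since the chord formula is the Euclidean formula applied to square-rooted data. The sketch below checks that equivalence in base R; it is an assumption about how such a shortcut could work, not a description of the package internals, and all object names are hypothetical.

```r
set.seed(42)
x <- matrix(runif(15), nrow = 5)  # 5 samples, 3 non-negative variables

# Direct evaluation of the chord formula for every pair of rows
n <- nrow(x)
chord_direct <- matrix(0, n, n)
for (j in seq_len(n)) {
  for (k in seq_len(n)) {
    chord_direct[j, k] <- sqrt(sum((sqrt(x[j, ]) - sqrt(x[k, ]))^2))
  }
}

# Same values from stats::dist() applied to square-root transformed data
chord_fast <- as.matrix(dist(sqrt(x), method = "euclidean"))

# Compare, ignoring the dimnames that as.matrix() adds
all.equal(chord_direct, unname(chord_fast))
```

Squaring either result gives the corresponding "SQchord" values.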

References

Faith, D.P., Minchin, P.R. and Belbin, L. (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57--68.

Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356--367.

Kendall, D.G. (1970) A mathematical approach to seriation. Philosophical Transactions of the Royal Society of London - Series B 269, 125--135.

Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English Edition. Elsevier Science BV, The Netherlands.

Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87--108.

Podani, J. (1999) Extending Gower's General Coefficient of Similarity to Ordinal Characters. Taxon 48, 331--340.

Prentice, I.C. (1980) Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71--104.

See Also

vegdist in package vegan, daisy in package cluster, and dist provide comparable functionality for the case of missing y.

Examples

## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]

## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)

## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")

## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)

## Using distance on an object of class join
dists <- distance(join(train, fossil))
str(dists)

## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors

## fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
## train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## ## now fit the mixed coefficient
## test3 <- distance(train, fossil, "mixed")

## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges

## 1 - distance(x1, x2, method = "mixed", R = Rj)

## note this gives ~0.66 because Legendre & Legendre describe the
## coefficient as a similarity coefficient. Hence we compute
## 1 - Dij here to get the same answer.

## Tortula example from Podani (1999)
data(tortula)
Dij <- distance(tortula[, -1], method = "mixed") # col 1 includes Taxon ID

## Only one ordered factor
data(mite.env, package = "vegan")
Dij <- distance(mite.env, method = "mixed")

## Some variables are constant
data(BCI.env, package = "vegan")
Dij <- distance(BCI.env, method = "mixed")
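
The value of about 0.66 quoted for the Legendre & Legendre (1998) example above can be reproduced by hand from the "mixed" formula. This is a base-R sketch under two assumptions: all eight descriptors are treated as quantitative, so that the per-variable similarity is one minus the absolute difference divided by the range, and a descriptor gets weight 0 when either sample is missing. The names w, s and S are illustrative only.

```r
x1 <- c(2, 2, NA, 2, 2, 4, 2, 6)
x2 <- c(1, 3, 3, 1, 2, 2, 2, 5)
Rj <- c(1, 4, 2, 4, 1, 3, 2, 5)           # supplied ranges

w <- as.numeric(!is.na(x1) & !is.na(x2))  # weight 0 where either value is missing
s <- 1 - abs(x1 - x2) / Rj                # per-variable similarity (quantitative case)
S <- sum(w * s, na.rm = TRUE) / sum(w)    # weighted mean similarity over usable variables
round(S, 2)                               # 0.66
```

The third descriptor is dropped from both the numerator and the denominator, so the average is taken over the seven descriptors observed in both samples.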