
Calculates dissimilarities between samples in a training set x and a fossil or test set y. If y is not supplied then the pairwise dissimilarities between samples in the training set, x, are calculated.

distance(x, ...)
## S3 method for class 'default':
distance(x, y, method = "euclidean", weights = NULL,
         R = NULL, dist = FALSE, ...)
## S3 method for class 'join':
distance(x, ...)
oldDistance(x, ...)
## S3 method for class 'default':
oldDistance(x, y, method = c("euclidean", "SQeuclidean",
            "chord", "SQchord", "bray", "chi.square",
            "SQchi.square", "information", "chi.distance",
            "manhattan", "kendall", "gower", "alt.gower",
            "mixed"),
            fast = TRUE,
            weights = NULL, R = NULL, ...)
## S3 method for class 'join':
oldDistance(x, ...)

x: the training set samples, or an object of class join.
dist: logical; should an object of class "dist" be returned? Ignored if y is supplied.

A matrix of dissimilarities where the columns are the samples in y and the rows the samples in x. If y is not provided then a square, symmetric matrix of pairwise sample dissimilarities for the training set x is returned, unless argument dist is TRUE, in which case an object of class "dist" is returned. See dist. The dissimilarity coefficient used (method) is returned as attribute "method". Attribute "type" indicates whether the object was computed on a single data matrix ("symmetric") or across two matrices (i.e. the dissimilarities between the rows of two matrices; "asymmetric").
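
To make the layout of the returned matrix concrete, here is a minimal base R sketch that computes Euclidean distances between the rows of two matrices by hand. The helper cross_euclid() is hypothetical and is not part of analogue; it only illustrates the rows-are-x, columns-are-y arrangement and the difference between the two-matrix and single-matrix cases.

```r
# Hypothetical helper (not part of analogue): Euclidean distance between
# every row of x and every row of y.
cross_euclid <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  d <- matrix(NA_real_, nrow = nrow(x), ncol = nrow(y),
              dimnames = list(rownames(x), rownames(y)))
  for (j in seq_len(nrow(x))) {
    for (k in seq_len(nrow(y))) {
      d[j, k] <- sqrt(sum((x[j, ] - y[k, ])^2))
    }
  }
  d
}

set.seed(1)
x <- matrix(runif(20), ncol = 4, dimnames = list(LETTERS[1:5], NULL))
y <- matrix(runif(12), ncol = 4, dimnames = list(letters[1:3], NULL))

d_xy <- cross_euclid(x, y)  # 5 x 3: rows are samples in x, columns samples in y
d_xx <- cross_euclid(x, x)  # 5 x 5: square and symmetric, zero diagonal
dim(d_xy)
isSymmetric(d_xx)
```

The two-matrix result is generally rectangular, which is why it cannot be stored as a "dist" object and why the dist argument is ignored when y is supplied.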

For method = "mixed" it is essential that each factor in x and y has the same levels in the two data frames. Previous versions of analogue would work even if this was not the case, which will have generated incorrect dissimilarities for method = "mixed" in cases where the factor for a given species had different levels in x and y. distance now checks for matching levels for each species (column) recorded as a factor. If the factor for any individual species has different levels in x and y, an error is issued.
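
One way to satisfy this requirement before calling distance is to rebuild each factor over the union of the levels seen in the two data frames. The sketch below uses only base R; the column name sp4 and the data are made up for illustration.

```r
# Two data frames in which the same species (hypothetical column "sp4")
# is a factor, but with different observed levels
train  <- data.frame(sp4 = factor(c(1, 2, 3, 4, 1)))
fossil <- data.frame(sp4 = factor(c(2, 3, 2)))

levels(train$sp4)   # "1" "2" "3" "4"
levels(fossil$sp4)  # "2" "3"

# Re-create both factors over the union of their levels so they match
lev <- union(levels(train$sp4), levels(fossil$sp4))
train$sp4  <- factor(train$sp4,  levels = lev)
fossil$sp4 <- factor(fossil$sp4, levels = lev)

identical(levels(train$sp4), levels(fossil$sp4))
```

Rebuilding the factors this way changes only the set of levels, not the recorded values.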

euclidean
$d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2}$
SQeuclidean
$d_{jk} = \sum_i (x_{ij}-x_{ik})^2$
chord
$d_{jk} = \sqrt{\sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2}$
SQchord
$d_{jk} = \sum_i (\sqrt{x_{ij}}-\sqrt{x_{ik}})^2$
bray
$d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}$
chi.square
$d_{jk} = \sqrt{\sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}}$
SQchi.square
$d_{jk} = \sum_i \frac{(x_{ij} - x_{ik})^2}{x_{ij} + x_{ik}}$
information
$d_{jk} = \sum_i \left(p_{ij}\log\left(\frac{2p_{ij}}{p_{ij} + p_{ik}}\right) + p_{ik}\log\left(\frac{2p_{ik}}{p_{ij} + p_{ik}}\right)\right)$
chi.distance
$d_{jk} = \sqrt{\sum_i (x_{ij}-x_{ik})^2 / (x_{i+} / x_{++})}$
manhattan
$d_{jk} = \sum_i |x_{ij}-x_{ik}|$
kendall
$d_{jk} = \sum_i \left(\mathrm{MAX}_i - \min(x_{ij}, x_{ik})\right)$
gower
$d_{jk} = \sum_i \frac{|p_{ij} - p_{ik}|}{R_i}$
alt.gower
$d_{jk} = \sqrt{2\sum_i \frac{|p_{ij} - p_{ik}|}{R_i}}$
where $R_i$ is the range of proportions for descriptor (variable) $i$
mixed
$d_{jk} = \frac{\sum_{i=1}^p w_{i}s_{jki}}{\sum_{i=1}^p w_{i}}$
where $w_i$ is the weight for descriptor $i$ and $s_{jki}$ is the similarity between samples $j$ and $k$ for descriptor (variable) $i$.

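
As a check on how the formulas read, the following base R sketch evaluates a few of the coefficients by hand for two made-up samples. The vectors xj and xk are illustrative assumptions; nothing here calls analogue itself.

```r
# Two hypothetical samples j and k described by four variables (proportions)
xj <- c(0.5, 0.3, 0.2, 0.0)
xk <- c(0.4, 0.4, 0.1, 0.1)

euclidean <- sqrt(sum((xj - xk)^2))
sq_chord  <- sum((sqrt(xj) - sqrt(xk))^2)
chord     <- sqrt(sq_chord)
bray      <- sum(abs(xj - xk)) / sum(xj + xk)
manhattan <- sum(abs(xj - xk))

c(euclidean = euclidean, chord = chord, SQchord = sq_chord,
  bray = bray, manhattan = manhattan)
```

The "SQ" variants are simply the squares of their counterparts, so they rank sample pairs identically while avoiding the square root.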
Argument fast determines whether fast C versions of some of the dissimilarity coefficients are used. The fast versions make use of dist for methods "euclidean", "SQeuclidean", "chord" and "SQchord", and vegdist for method == "bray". These fast versions are used only when x alone is supplied, not when y is also supplied. Future versions of distance will include fast C versions of all the dissimilarity coefficients and for cases where y is supplied.
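
The base R sketch below shows how stats::dist can yield the four Euclidean-family coefficients. It illustrates the relationship between the formulas and is an assumption about the general approach, not a copy of the package's internals.

```r
set.seed(42)
x <- matrix(runif(30), ncol = 5)  # 6 samples, 5 variables

d_euclidean   <- dist(x)            # Euclidean distance
d_sqeuclidean <- dist(x)^2          # squared Euclidean
d_chord       <- dist(sqrt(x))      # chord: Euclidean on square-rooted data
d_sqchord     <- dist(sqrt(x))^2    # squared chord

# Check one pair (samples 1 and 2) against the chord formula
by_hand <- sqrt(sum((sqrt(x[1, ]) - sqrt(x[2, ]))^2))
all.equal(as.matrix(d_chord)[1, 2], by_hand)
```

Because dist only computes pairwise distances within a single matrix, a shortcut of this kind applies naturally to the case where y is not supplied.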

Kendall, D.G. (1970) A mathematical approach to seriation. Philosophical Transactions of the Royal Society of London - Series B 269, 125-135.
Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English Edition. Elsevier Science BV, The Netherlands.
Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87-108.
Prentice, I.C. (1980) Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71-104.

vegdist in package vegan, daisy in package cluster, and dist provide comparable functionality for the case of missing y.

library(analogue)
## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]
## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)
## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")
## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)
## Using distance on an object of class join
dists <- distance(join(train, fossil))
str(dists)
## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors
fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## now fit the mixed coefficient
test3 <- distance(train, fossil, "mixed")
## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges
1 - distance(x1, x2, method = "mixed", R = Rj)
## note this gives ~0.66; Legendre & Legendre describe the
## coefficient as a similarity coefficient, hence we compute
## 1 - Dij here to get the same answer.
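
The ~0.66 figure can be reproduced by hand from the "mixed" formula. The base R sketch below assumes all eight descriptors are quantitative, that the per-descriptor similarity is one minus the absolute difference divided by the supplied range, and that a descriptor with a missing value gets weight zero; these assumptions match the worked example, but the code is an illustration rather than the package's implementation.

```r
x1 <- c(2, 2, NA, 2, 2, 4, 2, 6)
x2 <- c(1, 3, 3, 1, 2, 2, 2, 5)
Rj <- c(1, 4, 2, 4, 1, 3, 2, 5)   # supplied ranges

w <- as.numeric(!is.na(x1) & !is.na(x2))  # weight 0 where a value is missing
s <- 1 - abs(x1 - x2) / Rj                # per-descriptor similarity
S <- sum(w * s, na.rm = TRUE) / sum(w)    # weighted mean similarity
round(S, 2)
```

Seven descriptors contribute, and their similarities sum to about 4.63, giving a similarity of about 0.66.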