vegdist
Dissimilarity Indices for Community Ecologists
The function computes dissimilarity indices that are useful for or
popular with community ecologists. All indices use quantitative data,
although they would be named by the corresponding binary index, but you
can calculate the binary index using an appropriate argument.
If you do not find your favourite index here, you can see if it can
be implemented using designdist.
Gower, Bray-Curtis, Jaccard and
Kulczynski indices are good in detecting underlying
ecological gradients (Faith et al. 1987). Morisita, Horn-Morisita,
Binomial, Cao and Chao
indices should be able to handle different sample sizes (Wolda 1981,
Krebs 1999, Anderson & Millar 2004),
and Mountford (1962) and Raup-Crick indices for presence-absence data should
be able to handle unknown (and variable) sample sizes.
 Keywords
 multivariate
Usage
vegdist(x, method="bray", binary=FALSE, diag=FALSE, upper=FALSE, na.rm = FALSE, ...)
Arguments
 x
 Community data matrix.
 method
 Dissimilarity index, partial match to "manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial", "chao", "cao" or "mahalanobis".
 binary
 Perform presence/absence standardization before analysis using decostand.
 diag
 Compute diagonals.
 upper
 Return only the upper diagonal.
 na.rm
 Pairwise deletion of missing observations when computing dissimilarities.
 ...
 Other parameters. These are ignored, except in method = "gower", which accepts the range.global parameter of decostand.
Details
Jaccard ("jaccard"), Mountford ("mountford"),
Raup-Crick ("raup"), Binomial and Chao indices are discussed
later in this section. The function also finds indices for
presence/absence data by setting binary = TRUE. The following overview
gives first the quantitative version, where $x_{ij}$ and
$x_{ik}$ refer to the quantity of species (column) $i$
at sites (rows) $j$ and $k$. In binary versions $A$ and
$B$ are the numbers of species on compared sites, and $J$ is
the number of species that occur on both compared sites, similarly as
in designdist (many indices produce identical binary
versions):
euclidean

$d_{jk} = \sqrt{\sum_i (x_{ij} - x_{ik})^2}$ 
binary: $\sqrt{A+B-2J}$ 
manhattan

$d_{jk} = \sum_i |x_{ij} - x_{ik}|$ 
binary: $A+B-2J$ 
gower

$d_{jk} = \frac{1}{M} \sum_i \frac{|x_{ij} - x_{ik}|}{\max(x_i) - \min(x_i)}$ 
binary: $(A+B-2J)/M$, 
where $M$ is the number of columns (excluding missing values) 
altGower

$d_{jk} = \frac{1}{NZ} \sum_i |x_{ij} - x_{ik}|$ 
where $NZ$ is the number of non-zero columns excluding double-zeros (Anderson et al. 2006). 
binary: $(A+B-2J)/(A+B-J)$ 
canberra

$d_{jk} = \frac{1}{NZ} \sum_i \frac{|x_{ij} - x_{ik}|}{x_{ij} + x_{ik}}$ 
where $NZ$ is the number of non-zero entries. 
binary: $(A+B-2J)/(A+B-J)$ 
bray

$d_{jk} = \frac{\sum_i |x_{ij} - x_{ik}|}{\sum_i (x_{ij} + x_{ik})}$ 
binary: $(A+B-2J)/(A+B)$ 
kulczynski

$d_{jk} = 1 - \frac{1}{2} \left( \frac{\sum_i \min(x_{ij}, x_{ik})}{\sum_i x_{ij}} + \frac{\sum_i \min(x_{ij}, x_{ik})}{\sum_i x_{ik}} \right)$ 
binary: $1 - (J/A + J/B)/2$ 
morisita

$d_{jk} = 1 - \frac{2 \sum_i x_{ij} x_{ik}}{(\lambda_j + \lambda_k) \sum_i x_{ij} \sum_i x_{ik}}$, where 
$\lambda_j = \frac{\sum_i x_{ij} (x_{ij} - 1)}{\sum_i x_{ij} \left( \sum_i x_{ij} - 1 \right)}$ 
binary: cannot be calculated 
horn

Like morisita, but $\lambda_j = \sum_i x_{ij}^2 / \left( \sum_i x_{ij} \right)^2$

binary: $(A+B-2J)/(A+B)$ 
binomial

$d_{jk} = \sum_i \left[ x_{ij} \log(x_{ij}/n_i) + x_{ik} \log(x_{ik}/n_i) - n_i \log(1/2) \right] / n_i$, 
where $n_i = x_{ij} + x_{ik}$ 
binary: $\log(2) (A+B-2J)$ 
cao

$d_{jk} = \frac{1}{S} \sum_i \left[ \log(n_i/2) - \frac{x_{ij} \log(x_{ik}) + x_{ik} \log(x_{ij})}{n_i} \right]$, 
where $S$ is the number of species in compared sites and $n_i = x_{ij} + x_{ik}$ 
Jaccard index is computed as $2B/(1+B)$, where $B$ is Bray-Curtis dissimilarity.
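The overview formulas can be checked by hand. The following base R sketch is not vegan code: the helper name `bray_curtis` and the abundance vectors are made up for illustration. It computes the quantitative and binary Bray-Curtis dissimilarity for two sites and derives the binary Jaccard value from it as $2B/(1+B)$.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between two abundance vectors
bray_curtis <- function(xj, xk) sum(abs(xj - xk)) / sum(xj + xk)

# Toy abundances of five species at two sites (illustrative values only)
site_j <- c(4, 0, 2, 1, 0)
site_k <- c(1, 3, 2, 0, 0)

# Quantitative version
B_quant <- bray_curtis(site_j, site_k)

# Binary version: presence/absence transformation first, then the same formula
pj <- as.numeric(site_j > 0)
pk <- as.numeric(site_k > 0)
B_bin <- bray_curtis(pj, pk)

# The binary value agrees with (A + B - 2J) / (A + B)
A <- sum(pj); B <- sum(pk); J <- sum(pj * pk)
all.equal(B_bin, (A + B - 2 * J) / (A + B))

# Jaccard from Bray-Curtis; the binary form equals (A + B - 2J) / (A + B - J)
jaccard_bin <- 2 * B_bin / (1 + B_bin)
all.equal(jaccard_bin, (A + B - 2 * J) / (A + B - J))
```

With the vegan package loaded, the same quantities should be obtainable from vegdist with method = "bray" or "jaccard" and binary = TRUE or FALSE.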
Binomial index is derived from Binomial deviance under null hypothesis that the two compared communities are equal. It should be able to handle variable sample sizes. The index does not have a fixed upper limit, but can vary among sites with no shared species. For further discussion, see Anderson & Millar (2004).
Cao index or CYd index (Cao et al. 1997) was suggested as a minimally
biased index for high beta diversity and variable sampling intensity.
Cao index does not have a fixed upper limit, but can vary among sites
with no shared species. The index is intended for count (integer)
data, and it is undefined for zero abundances; these are replaced with
arbitrary value $0.1$ following Cao et al. (1997). Cao et
al. (1997) used $\log_{10}$, but the current function uses
natural logarithms so that the values are approximately $2.30$
times higher than with 10-based logarithms. Anderson & Thompson (2004)
give an alternative formulation of Cao index to highlight its
relationship with Binomial index (above).
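A rough base R sketch of the Cao formula may make the zero replacement and the logarithm base concrete. The helper `cao_sketch` and the data are hypothetical. The sketch assumes that species absent from both sites are skipped, that the remaining zeros are replaced with 0.1 before $n_i$ is formed, and that $S$ counts the species present in at least one of the two sites. These details are a reading of the text above, not a copy of the vegdist internals.

```r
# Hypothetical helper following the Cao formula in the overview.
# Assumptions: double-zero species are skipped, remaining zeros become 0.1,
# and S is the number of species present in at least one of the two sites.
cao_sketch <- function(xj, xk, zero_value = 0.1) {
  keep <- (xj > 0) | (xk > 0)
  xj <- xj[keep]
  xk <- xk[keep]
  xj[xj == 0] <- zero_value
  xk[xk == 0] <- zero_value
  n <- xj + xk
  S <- length(n)
  sum(log(n / 2) - (xj * log(xk) + xk * log(xj)) / n) / S
}

# Toy counts for five species at two sites (illustrative values only)
site_j <- c(12, 0, 3, 7, 0)
site_k <- c(2, 5, 3, 0, 0)

d_nat <- cao_sketch(site_j, site_k)   # natural logarithms
d_log10 <- d_nat / log(10)            # the same index on the log10 scale
log(10)                               # conversion factor, about 2.30
```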
Mountford index is defined as $M = 1/\alpha$ where $\alpha$
is the parameter of Fisher's log-series assuming that the compared
communities are samples from the same community
(cf. fisherfit, fisher.alpha). The index
$M$ is found as the positive root of equation
$\exp(aM) + \exp(bM) = 1 + \exp((a+b-j)M)$, where $j$ is the number of
species occurring in both communities, and $a$ and $b$ are the number
of species in each separate community (so the index uses
presence-absence information). Mountford index is usually
misrepresented in the literature: indeed Mountford (1962) suggested an
approximation to be used as starting value in iterations, but the
proper index is defined as the root of the equation above. The
function vegdist solves $M$ with the Newton method. Please note
that if either $a$ or $b$ is equal to $j$, one of the
communities could be a subset of the other, and the dissimilarity is
$0$, meaning that non-identical objects may be regarded as
similar and the index is non-metric. The Mountford index is in the
range $0 \dots \log(2)$, but the dissimilarities
are divided by $\log(2)$ so that the results will be in
the conventional range $0 \dots 1$.
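As a minimal illustration of the root-finding step, the sketch below solves the Mountford equation for one pair of sites in base R. It uses the bracketing solver uniroot rather than Newton's method, so it is not the vegdist implementation. The helper name `mountford_root` is hypothetical, and the sketch only covers the case $0 < j < \min(a, b)$, where the function is positive just above zero and negative at $\log(2)$.

```r
# Hypothetical helper: positive root M of
#   exp(a*M) + exp(b*M) = 1 + exp((a + b - j)*M)
# Assumes 0 < j < min(a, b); uses uniroot() (bracketing), not Newton's method.
mountford_root <- function(a, b, j) {
  stopifnot(j > 0, j < min(a, b))
  # expm1() form of the same equation; numerically safer close to zero
  f <- function(M) expm1(a * M) + expm1(b * M) - expm1((a + b - j) * M)
  # f > 0 just above zero and f < 0 at log(2), so the root is bracketed
  uniroot(f, lower = 1e-10, upper = log(2), tol = 1e-12)$root
}

M <- mountford_root(a = 10, b = 8, j = 5)
M             # root of the equation for these species counts
M / log(2)    # divided by log(2), giving a value in 0 .. 1
```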
Raup-Crick dissimilarity (method = "raup") is a probabilistic
index based on presence/absence data. It is defined as
$1 - \mathrm{prob}(j)$, and is based on the probability of observing at
least $j$ shared species in the compared communities. The current
function uses analytic result from hypergeometric distribution
(phyper) to find the probabilities. This probability
(and the index) is dependent on the number of species missing in both
sites, and adding all-zero species to the data or removing missing
species from the data will influence the index. The probability (and
the index) may be almost zero or almost one for a wide range of
parameter values. The index is non-metric: two communities with no
shared species may have a dissimilarity slightly below one, and two
identical communities may have dissimilarity slightly above zero. The
index uses equal occurrence probabilities for all species, but Raup
and Crick originally suggested that sampling probabilities should be
proportional to species frequencies (Chase et al. 2011). A simulation
approach with unequal species sampling probabilities is implemented in
the raupcrick function following Chase et al. (2011). The
index can also be used for transposed data to give a probabilistic
dissimilarity index of species co-occurrence (identical to Veech
2013).
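The dependence on species missing from both sites can be seen from the hypergeometric probability itself. The sketch below is an illustration under stated assumptions, not the vegdist code: the helper `raup_sketch` is hypothetical, and it reads the index as the upper-tail probability of at least $j$ shared species. Whether vegdist uses exactly this tail convention is an assumption here.

```r
# Hypothetical helper: probability of at least j shared species when
# site 2 holds b species drawn at random from a pool of n_pool species,
# of which a occur at site 1 (equal occurrence probabilities).
# Assumption: the index is read here as this upper-tail probability.
raup_sketch <- function(j, a, b, n_pool) {
  phyper(j - 1, m = a, n = n_pool - a, k = b, lower.tail = FALSE)
}

d_small_pool <- raup_sketch(j = 3, a = 8, b = 6, n_pool = 20)
# The same two sites after adding 20 all-zero species to the data
d_large_pool <- raup_sketch(j = 3, a = 8, b = 6, n_pool = 40)
c(d_small_pool, d_large_pool)
```

The two values differ although the sites themselves are unchanged, which is the sensitivity to all-zero species described above.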
Chao index tries to take into account the number of unseen species
pairs, similarly as in method = "chao" in specpool.
Function vegdist implements a Jaccard
type index defined as $d_{jk} = 1 - U_j U_k/(U_j + U_k - U_j U_k)$, where
$U_j = C_j/N_j + (N_k - 1)/N_k \times a_1/(2 a_2) \times S_j/N_j$,
and similarly for $U_k$. Here $C_j$ is the total
number of individuals in the species of site $j$ that are shared
with site $k$, $N_j$ is the total number of
individuals at site $j$, $a_1$ (and $a_2$) are
the number of species occurring in site $j$ that have only one
(or two) individuals in site $k$, and $S_j$ is the
total number of individuals in the species present at site $j$
that occur with only one individual in site $k$ (Chao et
al. 2005).
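The definitions above translate fairly directly into base R. The sketch below is a hand calculation for one pair of sites; the helper name `chao_U` and the counts are hypothetical. It deliberately leaves out edge cases (no doubletons, so that the divisor is zero, or a $U$ value above one), so it should not be taken as the vegdist implementation.

```r
# Hypothetical helper for U of the first site relative to the second,
# following the definitions in the text. Edge cases (a2 == 0, U > 1)
# are not handled in this sketch.
chao_U <- function(xj, xk) {
  shared <- xj > 0 & xk > 0
  Cj <- sum(xj[shared])          # individuals at j in species shared with k
  Nj <- sum(xj)                  # total individuals at j
  Nk <- sum(xk)                  # total individuals at k
  a1 <- sum(xj > 0 & xk == 1)    # species at j with one individual at k
  a2 <- sum(xj > 0 & xk == 2)    # species at j with two individuals at k
  Sj <- sum(xj[xk == 1])         # individuals at j in species that are singletons at k
  Cj / Nj + (Nk - 1) / Nk * a1 / (2 * a2) * Sj / Nj
}

# Toy counts for six species at two sites (illustrative values only)
site_j <- c(10, 5, 4, 2, 0, 9)
site_k <- c(8, 6, 2, 1, 4, 0)

Uj <- chao_U(site_j, site_k)
Uk <- chao_U(site_k, site_j)
d_chao <- 1 - Uj * Uk / (Uj + Uk - Uj * Uk)
d_chao
```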
Morisita index can be used with genuine count data (integers) only. Its Horn-Morisita variant is able to handle any abundance data.
Mahalanobis distances are Euclidean distances of a matrix where columns are centred, have unit variance, and are uncorrelated. The index is not commonly used for community data, but it is sometimes used for environmental variables. The calculation is based on transforming data matrix and then using Euclidean distances following Mardia et al. (1979).
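The "transform, then Euclidean" idea can be sketched in base R. The example below whitens the data with a Cholesky factor of the covariance matrix, which is a swapped-in technique chosen for brevity and not necessarily the transformation used inside vegdist. The built-in iris measurements stand in for a table of environmental variables.

```r
# Columns of a built-in data set stand in for environmental variables
x <- as.matrix(iris[, 1:4])

# Centre the columns, then whiten with the Cholesky factor of the covariance
# (illustrative choice; not necessarily what vegdist does internally)
xc <- scale(x, center = TRUE, scale = FALSE)
R <- chol(cov(x))          # cov(x) equals t(R) %*% R
z <- xc %*% solve(R)       # columns of z: zero mean, unit variance, uncorrelated

# Euclidean distances of the transformed data
d_mahal <- dist(z)

# Check one pair against stats::mahalanobis(), which returns squared distances
d12 <- sqrt(mahalanobis(x[1, ], center = x[2, ], cov = cov(x)))
all.equal(as.matrix(d_mahal)[1, 2], d12, check.attributes = FALSE)
```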
Euclidean and Manhattan dissimilarities are not good in gradient separation without proper standardization but are still included for comparison and special needs.
Bray-Curtis and Jaccard indices are rank-order similar, and some
other indices become identical or rank-order similar after some
standardizations, especially with presence/absence transformation or
equalizing site totals with decostand. Jaccard index is
metric, and probably should be preferred instead of the default
Bray-Curtis which is semimetric.
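A three-site toy example shows what "semimetric" means in practice. The helpers `bray` and `jacc` below are hypothetical base R one-liners, with Jaccard derived from Bray-Curtis as $2B/(1+B)$. Because that transformation is monotone, the two indices rank site pairs in the same order, but only Jaccard satisfies the triangle inequality here.

```r
# Three toy sites with two species (illustrative values only)
a <- c(1, 0)
b <- c(0, 1)
cc <- c(1, 1)

# Hypothetical helpers: Bray-Curtis, and Jaccard as 2B / (1 + B)
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)
jacc <- function(x, y) {
  B <- bray(x, y)
  2 * B / (1 + B)
}

# Bray-Curtis: the direct path a-b is longer than the detour via cc
bray(a, b)                   # 1
bray(a, cc) + bray(cc, b)    # 2/3, so the triangle inequality fails

# Jaccard on the same sites satisfies it (with equality in this case)
jacc(a, b)                   # 1
jacc(a, cc) + jacc(cc, b)    # 1
```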
The naming conventions vary. The one adopted here is traditional
rather than truthful to priority. The function finds either
quantitative or binary variants of the indices under the same name,
which correctly may refer only to one of these alternatives. For
instance, the Bray
index is known also as Steinhaus, Czekanowski and Sørensen index.
The quantitative version of Jaccard should probably be called
Ružička index.
The abbreviation "horn" for the Horn-Morisita index is
misleading, since there is a separate Horn index. The abbreviation
will be changed if that index is implemented in vegan.
Value

The function should provide a drop-in replacement for dist and
return a distance object of the same type.
Note
The function is an alternative to dist adding some
ecologically meaningful indices. Both methods should produce similar
types of objects which can be interchanged in any method accepting
either. Manhattan and Euclidean dissimilarities should be identical
in both methods. Canberra index is divided by the number of variables
in vegdist, but not in dist. So these differ by
a constant multiplier, and the alternative in vegdist is in
range (0,1). Function daisy (package
cluster) provides alternative implementation of Gower index that
also can handle mixed data of numeric and class variables. There are
two versions of Gower distance ("gower", "altGower")
which differ in scaling: "gower" divides all distances by the
number of columns (variables) and scales each column to unit range,
but "altGower" omits double-zeros and divides by the number of
pairs with at least one above-zero value, and does not scale columns
(Anderson et al. 2006). You can use decostand to add
range standardization to "altGower" (see Examples). Gower
(1971) suggested omitting double zeros for presences, but it is often
taken as the general feature of the Gower distances. See Examples for
implementing the Anderson et al. (2006) variant of the Gower index.
Most dissimilarity indices in vegdist are designed for
community data, and they will give misleading values if there are
negative data entries. The results may also be misleading or
NA or NaN if there are empty sites. In principle, you
cannot study species composition without species and you should remove
empty sites from community data.
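One way to screen a community matrix before computing dissimilarities is sketched below in base R. The matrix `comm` and its values are made up for illustration. The sketch checks for negative entries and drops sites whose row total is zero.

```r
# Toy community matrix: sites in rows, species in columns (illustrative values)
comm <- matrix(c(3, 0, 1,
                 0, 0, 0,   # empty site
                 2, 5, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("s1", "s2", "s3"), c("sp1", "sp2", "sp3")))

# An empty site compared with another empty site gives 0/0 in the
# Bray-Curtis formula, which R evaluates to NaN
empty <- comm["s2", ]
sum(abs(empty - empty)) / sum(empty + empty)

# Check for negative entries, then drop empty sites
stopifnot(all(comm >= 0))
comm_ok <- comm[rowSums(comm) > 0, , drop = FALSE]
rownames(comm_ok)
```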
References
Anderson, M.J. and Millar, R.B. (2004). Spatial variation and effects of habitat on temperate reef fish assemblages in northeastern New Zealand. Journal of Experimental Marine Biology and Ecology 305, 191-221.
Anderson, M.J., Ellingsen, K.E. and McArdle, B.H. (2006). Multivariate dispersion as a measure of beta diversity. Ecology Letters 9, 683-693.
Anderson, M.J. and Thompson, A.A. (2004). Multivariate control charts for ecological and environmental monitoring. Ecological Applications 14, 1921-1935.
Cao, Y., Williams, W.P. and Bark, A.W. (1997). Similarity measure bias in river benthic Aufwuchs community analysis. Water Environment Research 69, 95-106.
Chao, A., Chazdon, R.L., Colwell, R.K. and Shen, T. (2005). A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters 8, 148-159.
Chase, J.M., Kraft, N.J.B., Smith, K.G., Vellend, M. and Inouye, B.D. (2011). Using null models to disentangle variation in community dissimilarity from variation in $\alpha$-diversity. Ecosphere 2:art24 [doi:10.1890/ES10-00117.1]
Faith, D.P., Minchin, P.R. and Belbin, L. (1987). Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57-68.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27, 623-637.
Krebs, C.J. (1999). Ecological Methodology. Addison Wesley Longman.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate analysis. Academic Press.
Mountford, M.D. (1962). An index of similarity and its application to classification problems. In: P.W. Murphy (ed.), Progress in Soil Zoology, 43-50. Butterworths.
Veech, J.A. (2013). A probabilistic model for analysing species co-occurrence. Global Ecology and Biogeography 22, 252-260.
Wolda, H. (1981). Similarity indices, sample size and diversity. Oecologia 50, 296-302.
See Also
Function designdist can be used for defining your own
dissimilarity index. Alternative dissimilarity functions include
dist in base R, daisy (package cluster), and
dsvdis (package labdsv). Function betadiver
provides indices intended for the analysis of
beta diversity.
Examples
library(vegan)
data(varespec)
# Default Bray-Curtis dissimilarity
vare.dist <- vegdist(varespec)
# Orlóci's Chord distance: range 0 .. sqrt(2)
vare.dist <- vegdist(decostand(varespec, "norm"), "euclidean")
# Anderson et al. (2006) version of Gower
vare.dist <- vegdist(decostand(varespec, "log"), "altGower")
# Range standardization with "altGower" (that excludes double-zeros)
vare.dist <- vegdist(decostand(varespec, "range"), "altGower")