sim: Calculate similarities for binary vegetation data

Description

Calculates one of 56 (dis)similarity measures for binary data. The vegetation data can be supplied in either database (list) or matrix format, and the same holds for the output. If a data.frame with coordinates is given, the geographical distances between plots and the virtual positions of the calculated similarity values between the compared (parental) units are calculated at the same time.

Usage

sim(x, coord=NULL, method = "soer", dn=NULL, normalize = FALSE, 
listin = FALSE, listout = FALSE, ...)

Arguments

x

Vegetation data, either as matrix with rows = plots and columns = species (similarities are calculated between rows!), or as data.frame with first three columns representing plots, species and

coord

A data.frame with two columns containing the coordinate values of the sampling units. If given, it triggers the simultaneous calculation of the geographical distances between the sampling units, the coordinates of virtual centre-points betwee

method

Binary Similarity index (see Details for references and formulae), partial match to "soerensen", "jaccard", "ochiai", "mountford", "whittaker", "lande", "wils

dn

Neighbor definition. A geographic distance represented by a numeric or a two value vector defining a ring around each plot. Only takes effect when coord != NULL. If specified, the output contains only similarities between neighboring plot

normalize

Logical value indicating whether the values for a, b and c which are calculated in the process should be normalized to 100% (per row, which means per plot comparison). If normalize = TRUE an asymmetric index must

listin

If x is given in database (list) format, this must be set to TRUE (there is no automatic detection of the format).

listout

If output is wanted in database format rather than as a dist-object set this to TRUE. Output is automatically given in database-format, when coord is specified.

...

Arguments to other functions
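The two input formats can be illustrated with a small base R sketch. The toy data and the column names (plot, species, occ) are assumptions made for this illustration; they are not prescribed by sim().

```r
# Hypothetical data in database (list) format: one row per occurrence
veg.list <- data.frame(
  plot    = c("p1", "p1", "p1", "p2", "p2", "p3"),
  species = c("sp1", "sp2", "sp3", "sp2", "sp4", "sp1"),
  occ     = 1
)

# Cross-tabulate to matrix format (rows = plots, columns = species)
# and reduce the counts to presence/absence (1/0)
tab <- with(veg.list, table(plot, species))
veg.mat <- 1 * (unclass(tab) > 0)

# Going back: one row per presence in the matrix
idx <- which(veg.mat == 1, arr.ind = TRUE)
veg.back <- data.frame(
  plot    = rownames(veg.mat)[idx[, 1]],
  species = colnames(veg.mat)[idx[, 2]],
  occ     = 1
)
```

Data shaped like veg.mat would be passed with listin = FALSE, data shaped like veg.list with listin = TRUE.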

Value

If listout = FALSE a distance matrix of class dist is returned. If listout = TRUE, a data.frame is returned with 7 columns giving the names of the compared plots in the first two and the calculated similarity measure in the third column. The rest of the columns give the values for a, b, c, and d (in this order). Naming of the first three columns can be changed but defaults to NBX (one of the compared plots), NBY (the other one), used index (the values of the calculated index). If coord != NULL, the following columns are given in addition and the columns a:d shift to the end of the data.frame.
distance

Geographical distance between compared plots

X

For plotting purposes, the x-coordinate of the virtual position of the calculated similarity value in the center between the two compared plots

Y

For plotting purposes, the y-coordinate of the virtual position of the calculated similarity value in the center between the two compared plots

xdist

Geographical distance between compared plots, on the x-axis only

ydist

Geographical distance between compared plots, on the y-axis only
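As a rough illustration of what these additional columns contain, the following base R sketch derives them for all plot pairs of a toy coordinate table. It assumes plain Euclidean distances and absolute per-axis differences; the coordinates are hypothetical, and the internals of sim() may differ.

```r
# Hypothetical plot coordinates
coord <- data.frame(x = c(0, 3, 0), y = c(0, 4, 10),
                    row.names = c("p1", "p2", "p3"))

# All unordered pairs of plots
pairs <- t(combn(rownames(coord), 2))
from <- coord[pairs[, 1], ]
to   <- coord[pairs[, 2], ]

geo <- data.frame(
  NBX      = pairs[, 1],
  NBY      = pairs[, 2],
  distance = sqrt((from$x - to$x)^2 + (from$y - to$y)^2),  # assumed Euclidean
  X        = (from$x + to$x) / 2,   # midpoint between the two plots
  Y        = (from$y + to$y) / 2,
  xdist    = abs(from$x - to$x),    # separation along the x-axis only
  ydist    = abs(from$y - to$y)     # separation along the y-axis only
)
geo
```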

Details

All binary similarity indices are based on the variables a, b and c (or can be expressed in terms of them); some also use d. Here, a is the number of species shared by the two compared plots, b is the number of species found only in one of the compared plots, and c is the number of species found only in the other. d is the number of species that are absent from both compared plots but present elsewhere in the dataset. Indices incorporating d are discussed critically by Legendre & Legendre (1998) and elsewhere. They are called symmetric and suffer from the "double zero" problem because they take into account species that are absent from both compared units. The absence of a species from a sampling site might be due to various factors and does not necessarily reflect differences in the environment. Hence, it is preferable to avoid drawing ecological conclusions from the absence of species at two sites (Legendre & Legendre 1998). The indices presented here come from various sources, as indicated. Comparative reviews can be found in, e.g., Huhta (1979), Wolda (1981), Janson & Vegelius (1981), Shi (1993), Koleff et al. (2003) and Albatineh et al. (2006).
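A minimal base R sketch of the matching components for one pair of plots follows. The toy matrix is hypothetical, and assigning b to the first plot and c to the second is a convention chosen here for illustration.

```r
# Hypothetical presence/absence data: rows = plots, columns = species
veg <- rbind(
  p1 = c(1, 1, 1, 0, 0),
  p2 = c(0, 1, 1, 1, 0),
  p3 = c(1, 0, 0, 0, 1)
)
colnames(veg) <- paste0("sp", 1:5)

# Matching components for two presence/absence vectors
matching <- function(x, y) {
  c(a = sum(x == 1 & y == 1),   # shared species
    b = sum(x == 1 & y == 0),   # only in the first plot
    c = sum(x == 0 & y == 1),   # only in the second plot
    d = sum(x == 0 & y == 0))   # absent from both, present in the dataset
}

m <- matching(veg["p1", ], veg["p2", ])
m

# Two asymmetric indices (ignore d) and one symmetric index (uses d)
soerensen      <- unname(2 * m["a"] / (2 * m["a"] + m["b"] + m["c"]))
jaccard        <- unname(m["a"] / (m["a"] + m["b"] + m["c"]))
simplematching <- unname((m["a"] + m["d"]) / sum(m))
```

Here sp5 occurs only in p3, so it counts towards d for the comparison of p1 and p2 and raises the simple matching value without the two plots sharing anything more.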

The indices differ considerably in their behaviour. For classification purposes and in ecology, Jaccard and Sørensen have been found to give robust and meaningful results (e.g. Janson & Vegelius 1981, Shi 1993). For other purposes, other indices might be better suited. In any case, you are invited to use ternary plots (at least with the asymmetric indices), as suggested by Koleff et al. (2003). The matching components a, b and c can be displayed in a ternary.plot to evaluate the position of the plots in similarity space. When the output is in database format, the matching components are always given, and triax.plot can be used to plot them into a triangle plot. Koleff et al. (2003) used an artificial set of matching components, covering all combinations of values that a, b and c can take from 0 to 100, to display the mathematical behaviour of the indices. An artificial dataset with these properties, together with the values of the asymmetric indices included here, is part of this package (ads.ternaries) and can be used to study the behaviour of the indices prior to analysis. See the details and examples there.
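Ternary plots work on components that sum to a constant per comparison. The sketch below, using hypothetical values, scales a, b and c to 100% per row and shows that a ratio-based index such as jaccard is unaffected by the scaling, whereas a count-based one such as lande is not. It is an illustration only and does not reproduce the internals of sim().

```r
# Hypothetical matching components for two plot comparisons (one per row)
abc <- rbind(c(a = 2,  b = 1,  c = 1),
             c(a = 10, b = 30, c = 10))

# Scale each row so that a + b + c = 100
abc.pct <- 100 * prop.table(abc, margin = 1)

jaccard <- function(m) m[, "a"] / (m[, "a"] + m[, "b"] + m[, "c"])
lande   <- function(m) (m[, "b"] + m[, "c"]) / 2

jaccard(abc); jaccard(abc.pct)   # identical before and after scaling
lande(abc);   lande(abc.pct)     # changes with scaling
```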

If coord is given, the geographic distances between plots/sampling units are calculated automatically, which may be of value when the display or further analysis of distance decay (sensu Tobler 1970, Nekola & White 1999) is in focus. For convenience, the dn argument can be used to tell the function to return only similarities calculated between neighboring plots. Similarities between neighboring plots in an equidistant array are not subject to the problem of autocorrelation because all plots share the same distance (Jurasinski & Beierkuhnlein 2006). Therefore, any variation occurring in the data is most likely caused by environmental differences alone.
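The idea of a neighbor definition can be sketched in base R as follows. The helper is.neighbor is hypothetical, and the reading of dn used here (a single value as a maximum distance, two values as the inner and outer radius of a ring) is an assumption about its semantics rather than a description of the code in sim().

```r
# Hypothetical plot coordinates on a 100 m grid
coord <- cbind(x = c(0, 100, 200, 0), y = c(0, 0, 0, 100))
rownames(coord) <- paste0("p", 1:4)

# Pairwise Euclidean distances as a full matrix
geo.d <- as.matrix(dist(coord))

# Hypothetical helper: which pairs count as neighbors under dn?
is.neighbor <- function(d, dn) {
  if (length(dn) == 1) {
    d > 0 & d <= dn                 # within a maximum distance
  } else {
    d >= min(dn) & d <= max(dn)     # within a ring around each plot
  }
}

# Keep each pair once (upper triangle only)
nb.max  <- is.neighbor(geo.d, 100) & upper.tri(geo.d)
nb.ring <- is.neighbor(geo.d, c(120, 150)) & upper.tri(geo.d)

which(nb.max, arr.ind = TRUE)    # direct grid neighbors
which(nb.ring, arr.ind = TRUE)   # diagonal neighbors only
```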

In the following formulae...

a = number of shared species

b = number of species only found on one of the compared units

c = number of species only found on the other of the compared units

d = number of species not found on the compared plots but in the dataset

N = $a+b+c+d$

with $(n_1 \leq n_2)$...

$n_1$ = number of species of the plot with fewer species $(a+b)$ or $(a+c)$

$n_2$ = number of species of the plot with more species $(a+b)$ or $(a+c)$

Computable asymmetric indices (index name: formula; source):

soerensen: $sim = \frac{2a}{2a + b + c}$; Sørensen (1948)

jaccard: $sim = \frac{a}{a + b + c}$; Jaccard (1912)

ochiai: $sim = \frac{a}{\sqrt{(a+b)(a+c)}}$; Ochiai (1957), Shi (1993)

mountford: $sim = \frac{2a}{a(b+c)+2bc}$; Mountford (1962), Shi (1993)

whittaker: $sim = \frac{a+b+c}{\frac{2a+b+c}{2}}-1$; Whittaker (1960), Magurran (1988)

lande: $sim = \frac{b+c}{2}$; Lande (1996)

wilsonshmida: $sim = \frac{b+c}{2a+b+c}$; Wilson & Shmida (1984)

cocogaston: $sim = \frac{b+c}{a+b+c}$; Colwell & Coddington (1994), Gaston et al. (2001)

magurran: $sim = (2a+b+c)\left(1-\frac{a}{a+b+c}\right)$; Magurran (1988)

harrison: $sim = \frac{\min(b,c)}{\max(b,c)+a}$; Harrison et al. (1992), Koleff et al. (2003)

cody: $sim = 1-\frac{a(2a+b+c)}{2(a+b)(a+c)}$; Cody (1993)

williams: $sim = \frac{\min(b,c)}{a + b + c}$; Williams (1996), Koleff et al. (2003)

williams2: $sim = \frac{bc+1}{\frac{(a+b+c)^2-(a+b+c)}{2}}$; Williams (1996), Koleff et al. (2003)

harte: $sim = 1-\frac{2a}{2a+b+c}$; Harte & Kinzig (1997), Koleff et al. (2003)

simpson: $sim = \frac{\min(b,c)}{\min(b,c)+a}$; Simpson (1949), Koleff et al. (2003)

lennon: $sim = \frac{2|b-c|}{2a+b+c}$; Lennon et al. (2001), Koleff et al. (2003)

weiher: $sim = b+c$; Weiher & Boylen (1994)

ruggiero: $sim = \frac{a}{a+c}$; Ruggiero et al. (1998), Koleff et al. (2003)

lennon2: $sim = 1 - \left[ \frac{\log \left( \frac{2a+b+c}{a+b+c} \right)}{\log 2} \right]$; Lennon et al. (2001), Koleff et al. (2003)

rout1ledge: $sim = \frac{(a+b+c)^2}{(a+b+c)^2-2bc}-1$; Routledge (1977), Magurran (1988)

rout2ledge: too long, see below; Routledge (1977), Wilson & Shmida (1984)

rout3ledge: $sim = e^{rout2ledge}-1$; Routledge (1977)

sokal1: $sim = \frac{a}{a+2(b+c)}$; Sokal & Sneath (1963)

dice: $sim = \frac{a}{\min \left( (b+a),(c+a) \right)}$; Association index of Dice (1945), Wolda (1981)

kulcz1insky: $sim = \frac{a}{b+c}$; Oosting (1956), Southwood (1978)

kulcz2insky: $sim = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)}$; Oosting (1956), Southwood (1978)

mcconnagh: $sim = \frac{a^2-bc}{(a+b)(a+c)}$; Hubalek (1982)

simpson2: $sim = \frac{a}{a+b}$; Simpson (1960), Shi (1993)

legendre2: $sim = \frac{3a}{3a + b + c}$; Legendre & Legendre (1998)

fager: $sim = \frac{a}{\sqrt{n_1 n_2}} - \frac{1}{2\sqrt{n_2}}$; Fager (1957), Shi (1993)

maarel: $sim = \frac{2a - (b+c)}{2a + b + c}$; van der Maarel (1969)

lamont: $sim = \frac{a}{2a + b + c}$; Lamont & Grant (1979)

johnson: $sim = \frac{a}{2b}$; Johnson (1971)

sorgenfrei: $sim = \frac{a^2}{(a+b)(a+c)}$; Sorgenfrei (1959)

johnson2: $sim = \frac{a}{a+b}+\frac{a}{a+c}$; Johnson (1967)
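For a whole presence/absence matrix, the matching components of all plot pairs can be obtained at once with a matrix product. The following base R sketch uses hypothetical data and evaluates a few of the asymmetric formulae; it is meant as an illustration of the formulae and not as a substitute for sim().

```r
# Hypothetical presence/absence data: rows = plots, columns = species
veg <- rbind(
  p1 = c(1, 1, 1, 0, 0),
  p2 = c(0, 1, 1, 1, 0),
  p3 = c(1, 0, 0, 0, 1)
)

a <- tcrossprod(veg)   # a[i, j] = species shared by plots i and j
n <- rowSums(veg)      # species number per plot
b <- n - a             # b[i, j] = n[i] - a[i, j]: species only in plot i
c <- t(b)              # c[i, j] = n[j] - a[i, j]: species only in plot j

soerensen    <- 2 * a / (2 * a + b + c)
jaccard      <- a / (a + b + c)
wilsonshmida <- (b + c) / (2 * a + b + c)
harte        <- 1 - 2 * a / (2 * a + b + c)

# Lower triangle as a dist object, comparable to the listout = FALSE output
as.dist(jaccard)
```

Note that the formulae for harte and wilsonshmida are algebraically identical, and both equal one minus soerensen.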

Computable symmetric indices, including unshared species (index name: formula; source):

manhattan: $sim = \frac{b+c}{a+b+c+d}$; Mean Manhattan, Legendre & Legendre (1998)

simplematching: $sim = \frac{a+d}{a+b+c+d}$; Sokal & Michener (1958)

margaleff: $sim = \frac{a(a+b+c+d)}{(a+b)(a+c)}$; Clifford & Stephenson (1975)

pearson: $sim = \frac{ad-bc}{\sqrt{(a+b)(a+c)(d+b)(d+c)}}$; Phi of Pearson, Gower & Legendre (1986), Yule (1912)

roger: $sim = \frac{a+d}{a+2(b+c)+d}$; Rogers & Tanimoto (1960), Gower & Legendre (1986)

baroni: $sim = \frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}$; Baroni-Urbani & Buser (1976), Wolda (1981)

dennis: $sim = \frac{ad-bc}{\sqrt{(a+b+c+d)(a+b)(a+c)}}$; Holliday et al. (2002), Ellis et al. (1993)

fossum: $sim = \frac{(a+b+c+d)\left(-\frac{a}{2}\right)^2}{(a+b)(a+c)}$; Holliday et al. (2002), Ellis et al. (1993)

gower: $sim = \frac{a-(b+c)+d}{a+b+c+d}$; Gower & Legendre (1986)

legendre: $sim = \frac{a}{a+b+c+d}$; Gower & Legendre (1986), Russell/Rao in Ellis et al. (1993)

sokal2: $sim = \frac{ad}{\sqrt{(a+b)(a+c)(d+b)(d+c)}}$; Sokal & Sneath (1963)

sokal3: $sim = \frac{2a+2d}{a+d+(a+b+c+d)}$; Sokal & Sneath (1963)

sokal4: $sim = \frac{a+d}{b+c}$; Sokal & Sneath (1963)

stiles: $sim = \log\frac{(a+b+c+d) \left( |ad-bc|-\frac{a+b+c+d}{2}\right)^2 }{(a+b)(a+c)(b+d)(c+d)}$; Stiles (1961)

yule: $sim = \frac{ad-bc}{ad+bc}$; Yule & Kendall (1973)

michael: $sim = \frac{4(ad-bc)}{(a+d)^2+(b+c)^2}$; Michael (1920), Shi (1993)

hamann: $sim = \frac{(a+d)-(b+c)}{N}$; Hamann (1961)

forbes: $sim = \frac{aN-2n_2}{Nn_1-2n_2}$; Forbes (1925), Shi (1993)

chisquare: $sim = \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}$; Yule & Kendall (1950)

peirce: $sim = \frac{ad-bc}{(a+c)(b+d)}$; Peirce (1884)

eyraud: $sim = \frac{a-(a+b)(a+c)}{(a+b)(a+c)(b+d)(c+d)}$; Eyraud (1936) in Shi (1993)

euclidean: $sim = \frac{\sqrt{b+c}}{a+b+c+d}$; Mean Euclidean in Ellis et al. (1993)

divergence: $sim = \frac{\sqrt{b+c}}{\sqrt{a+b+c+d}}$; Ellis et al. (1993)

rout2ledge formula (Routledge 1977; Koleff et al. 2003):

$\beta _{R2} = \log(2a + b + c) - \left( {\frac{1}{{2a + b + c}}2a\log 2} \right) - \left( {\frac{1}{{2a + b + c}}((a + b)\log (a + b) + (a + c)\log (a + c))} \right)$
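The rout2ledge formula translates directly into R. The sketch below is an illustration of the formula as written above, using natural logarithms; treating $0 \log 0$ as 0 for an empty plot is an assumption made here and is not stated in the source.

```r
# x * log(x), with the (assumed) convention 0 * log(0) = 0
xlogx <- function(x) ifelse(x > 0, x * log(x), 0)

rout2ledge <- function(a, b, c) {
  tot <- 2 * a + b + c
  log(tot) -
    (2 * a * log(2)) / tot -
    (xlogx(a + b) + xlogx(a + c)) / tot
}

# rout3ledge as given in the table of asymmetric indices
rout3ledge <- function(a, b, c) exp(rout2ledge(a, b, c)) - 1

rout2ledge(5, 0, 0)   # identical plots: 0 (up to rounding error)
rout2ledge(0, 3, 3)   # no shared species, equal richness: log(2)
rout3ledge(0, 3, 3)   # 1 (up to rounding error)
```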

References

Albatineh, A. N., Niewiadomska-Bugaj, M. & Mihalko, D. (2006) On Similarity Indices and Correction for Chance Agreement. Journal of Classification 23: 301-313.

Baroni-Urbani, C. & Buser, M. W. (1976) Similarity of Binary Data. Systematic Zoology 25: 251-259.

Clifford, H. T. & Stephenson, W. (1975) An introduction to numerical classification. Academic Press, New York, San Francisco, London.

Cody, M. L. (1993) Bird diversity components within and between habitats in Australia. - In: Ricklefs, R. E. & Schluter, D. (eds.), Species Diversity in Ecological Communities: historical and geographical perspectives, pp. 147-158, University of Chicago Press, Chicago

Colwell, R. K. & Coddington, J. A. (1994) Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 345: 101-118.

Dice, L. R. (1945) Measures of the amount of ecological association between species. Ecology 26: 297-302.

Ellis, D., Furner-Hines, J. & Willett, P. (1993) Measuring the degree of similarity between objects in text retrieval systems. Perspectives in Information Management 3(2): 128-149.

Fager, E. W. (1957) Determination and analysis of recurrent groups. Ecology 38: 586-595.

Faith, D. P., Minchin, P. R. & Belbin, L. (1987) Compositional dissimilarity as a robust measure of ecological distance. Plant Ecology 69: 57-68.

Gaston, K. J., Rodrigues, A. S. L., van Rensburg, B. J., Koleff, P. & Chown, S. L. (2001) Complementary representation and zones of ecological transition. Ecology Letters 4: 4-9.

Gower, J. C. & Legendre, P. (1986) Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification 3: 5-48.

Hajdu, L. J. (1981) Graphical comparison of resemblance measures in phytosociology. Plant Ecology 48: 47-59.

Harrison, S., Ross, S. J. & Lawton, J. H. (1992) Beta diversity on geographic gradients in Britain. Journal of Animal Ecology 61: 151-158.

Harte, J. & Kinzig, A. (1997) On the implications of species-area relationships for endemism, spatial turnover and food web patterns. Oikos 80.

Holliday, J. D., Hu, C.-Y. & Willett, P. (2002) Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity using 2D Fragment Bit-Strings. Combinatorial Chemistry & High Throughput Screening 5: 155-166.

Hubalek, Z. (1982) Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews of the Cambridge Philosophical Society 57: 669-689.

Huhta, V. (1979) Evaluation of different similarity indices as measures of succession in arthropod communities of the forest floor after clear-cutting. Oecologia 41: 11-23.

Jaccard, P. (1901) Etude comparative de la distribution florale d’une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37: 547-579.

Jaccard, P. (1912) The distribution of the flora of the alpine zone. New Phytologist 11: 37-50.

Johnson, J. G. (1971) A quantitative approach to faunal province analysis. American Journal of Science 270: 257-280.

Johnson, S. C. (1967) Hierarchical clustering schemes. Psychometrika 32: 241-254.

Jurasinski, G. & Beierkuhnlein, C. (2006) Spatial patterns of biodiversity - assessing vegetation using hexagonal grids. Proceedings of the Royal Irish Academy - Biology and Environment 106B: 401-411.

Jurasinski, G. & Beierkuhnlein, C. (submitted) Distance decay and non-stationarity in a semi-arid Mediterranean ecosystem. Journal of Vegetation Science.

Koleff, P., Gaston, K. J. & Lennon, J. J. (2003) Measuring beta diversity for presence-absence data. Journal of Animal Ecology 72: 367-382.

Lamont, B. B. & Grant, K. J. (1979) A comparison of twenty-one measures of site dissimilarity. - In: Orlóci, L., Rao, C. R. & Stiteler, W. M. (eds.), Multivariate Methods in Ecological Work, pp. 101-126, Int. Coop. Publ. House, Fairland, MD

Lande, R. (1996) Statistics and partitioning of species diversity and similarity along multiple communities. Oikos 76: 25-39.

Legendre, P. & Legendre, L. (1998) Numerical Ecology. Elsevier, Amsterdam.

Lennon, J. J., Koleff, P., Greenwood, J. J. D. & Gaston, K. J. (2001) The geographical structure of British bird distributions: diversity, spatial turnover and scale. Journal of Animal Ecology 70: 966-979.

Magurran, A. E. (1988) Ecological Diversity and its Measurement. Chapman & Hall, London.

Mountford, M. D. (1962) An index of similarity and its application to classification problems. - In: Murphy, P. W. (ed.) Progress in Soil Zoology, pp. 43-50, Butterworths

Ochiai, A. (1957) Zoogeographical studies on the soleoid fishes found in Japan and its neighbouring regions. Bulletin of the Japanese Society of Fisheries Science 22(9): 526-530.

Oosting, H. J. (1956) The study of plant communities: an introduction to plant ecology. W. H. Freeman, San Francisco.

Rogers, D. J. & Tanimoto, T. T. (1960) A computer program for classifying plants. Science 132: 1115-1118.

Routledge, R. D. (1977) On Whittaker’s components of diversity. Ecology 58: 1120-1127.

Ruggiero, A., Lawton, J. H. & Blackburn, T. M. (1998) The geographic ranges of mammalian species in South America: spatial patterns in environmental resistance and anisotropy. Journal of Biogeography 25: 1093-1103.

Shi, G. R. (1993) Multivariate data analysis in palaeoecology and palaeobiogeography--a review. Palaeogeography, Palaeoclimatology, Palaeoecology 105: 199-234.

Simpson, E. H. (1949) The measurement of diversity. Nature 163: 688.

Simpson, G. G. (1960) Notes on the measurement of faunal resemblance. American Journal of Science 258-A: 300-311.

Sokal, R. R. & Michener, C. D. (1958) A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38: 1409-1438.

Sokal, R. R. & Sneath, P. H. A. (1963) Principles of numerical taxonomy. W. H. Freeman, San Francisco.

Sørensen, T. (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biologiske Skrifter 5: 1-34.

Sorgenfrei, T. (1959) Molluscan assemblages from the marine middle Miocene of South Jutland and their environments. Danmark Geologiske Undersøgelse. Serie 2 79: 403-408.

Southwood, T. S. (1978) Ecological Methods. Chapman and Hall, London.

Stiles (1961) The association factor in information retrieval. Journal of the Association for Computing Machinery 8: 271-279.

Weiher, E. & Boylen, C. W. (1994) Patterns and prediction of alpha and beta diversity of aquatic plants in Adirondack (New York) lakes. Canadian Journal of Botany-Revue Canadienne De Botanique 72: 1797-1804.

Whittaker, R. H. (1960) Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs 30: 279-338.

Williams, P. H. (1996) Mapping variations in the strength and breadth of biogeographic transition zones using species turnover. Proceedings of the Royal Society of London Series B-Biological Sciences 263: 579-588.

Williams, P. H., Klerk, H. M. & Crowe, T. M. (1999) Interpreting biogeographical boundaries among Afrotropical birds: spatial patterns in richness gradients and species replacement. Journal of Biogeography 26: 459-474.

Wilson, M. V. & Shmida, A. (1984) Measuring beta-diversity with presence-absence data. Journal of Ecology 72: 1055-1064.

Wolda, H. (1981) Similarity indices, sample size and diversity. Oecologia 50: 296-302.

Yule, G. U. & Kendall, M. G. (1973) An introduction to the theory of statistics. Griffin, London.

Yule, G. U. (1912) On the methods of measuring association between two attributes. Journal of the Royal Statistical Society 75(6): 579-642.

Examples

data(abis)
##calculate jaccard similarity and output as dist-object
jacc.dist <- sim(abis.spec, method="jaccard") 

##calculate Whittaker similarity (with prior normalisation) and 
##output as data.frame
whitt.list <- sim(abis.spec, method="whittaker", normalize=TRUE, 
listout=TRUE) 

##calculate similarity from a database list after Harte & Kinzig (1997) 
##and output as dist-object
abis.spec.ls <- liste(abis.spec, splist=TRUE)
hart.dist <- sim(abis.spec.ls, method="harte", listin=TRUE) 

## calculate the geographic distances between sites simultaneously
## and return only similarities calculated between neighboring plots
## in an equidistant array
abis.soer <- sim(abis.spec, coord=abis.env[,1:2], dn=100)

## you can plot this nicely between the original positions of the
## sites (with the size of the dots expressing the number of species
## for the sites, and the value of the Sørensen coefficient in between)
library(geoR)
points.geodata(coord=abis.env[,1:2], data=abis.env$n.spec, 
               cex.min=1, cex.max=5)
points.geodata(coord=abis.soer[,5:6], data=abis.soer$soerensen, 
               cex.min=1, cex.max=5, col="grey50", add=TRUE)
