# dist.dna

##### Pairwise Distances from DNA Sequences

This function computes a matrix of pairwise distances from DNA sequences using a model of DNA evolution. Eleven substitution models (and the raw distance) are currently available.

- Keywords
- multivariate, manip, cluster

##### Usage

```
dist.dna(x, model = "K80", variance = FALSE,
         gamma = FALSE, pairwise.deletion = FALSE,
         base.freq = NULL, as.matrix = FALSE)
```

##### Arguments

- x
a matrix or a list containing the DNA sequences; this must be of class `"DNAbin"` (use `as.DNAbin` if they are stored as character).

- model
a character string specifying the evolutionary model to be used; must be one of `"raw"`, `"N"`, `"TS"`, `"TV"`, `"JC69"`, `"K80"` (the default), `"F81"`, `"K81"`, `"F84"`, `"BH87"`, `"T92"`, `"TN93"`, `"GG95"`, `"logdet"`, `"paralin"`, `"indel"`, or `"indelblock"`.

- variance
a logical indicating whether to compute the variances of the distances; defaults to `FALSE`, so the variances are not computed.

- gamma
a value for the gamma parameter possibly used to apply a correction to the distances (by default no correction is applied).

- pairwise.deletion
a logical indicating whether to delete the sites with missing data in a pairwise way. The default is to delete, for all sequences, every site with missing data in at least one sequence (ignored if `model = "indel"` or `"indelblock"`).

- base.freq
the base frequencies to be used in the computations (if applicable). By default, the base frequencies are computed from the whole set of sequences.

- as.matrix
a logical indicating whether to return the results as a matrix. The default is to return an object of class dist.
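The arguments above can be combined as in the following minimal sketch. It assumes the ape package is installed and uses its bundled `woodmouse` alignment as example data; the object names (`d_k80`, `m_k80`, and so on) are illustrative only.

```r
library(ape)

# Example alignment shipped with ape (an object of class "DNAbin").
data(woodmouse)

# Default call: K80 distances, returned as an object of class "dist".
d_k80 <- dist.dna(woodmouse)

# Same model, returned as a full square numeric matrix.
m_k80 <- dist.dna(woodmouse, model = "K80", as.matrix = TRUE)

# Raw proportions of differing sites, dropping missing data pair by pair.
d_raw <- dist.dna(woodmouse, model = "raw", pairwise.deletion = TRUE)

# Sequences stored as characters must be converted with as.DNAbin first.
x_chr <- rbind(s1 = c("a", "c", "g", "t"),
               s2 = c("a", "c", "g", "c"))
x_bin <- as.DNAbin(x_chr)
d_small <- dist.dna(x_bin, model = "raw")  # one differing site out of four
```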

##### Details

The molecular evolutionary models available through the option `model` have been extensively described in the literature. A brief description is given below; more details can be found in the references.

- `raw`, `N`: This is simply the proportion or the number of sites that differ between each pair of sequences. This may be useful to draw "saturation plots". The options `variance` and `gamma` have no effect, but `pairwise.deletion` can.
- `TS`, `TV`: These are the numbers of transitions and transversions, respectively.
- `JC69`: This model was developed by Jukes and Cantor (1969). It assumes that all substitutions (i.e. the replacement of one base by another) have the same probability. This probability is the same for all sites along the DNA sequence. This last assumption can be relaxed by assuming that the substitution rate varies among sites following a gamma distribution whose parameter must be given by the user. By default, no gamma correction is applied. Another assumption is that the base frequencies are balanced and thus equal to 0.25.
- `K80`: The distance derived by Kimura (1980), sometimes referred to as "Kimura's 2-parameters distance", has the same underlying assumptions as the Jukes-Cantor distance except that two kinds of substitutions are considered: transitions (A <-> G, C <-> T), and transversions (A <-> C, A <-> T, C <-> G, G <-> T). They are assumed to have different probabilities. A transition is the substitution of a purine (A, G) by another one, or the substitution of a pyrimidine (C, T) by another one. A transversion is the substitution of a purine by a pyrimidine, or vice versa. Both transition and transversion rates are the same for all sites along the DNA sequence. Jin and Nei (1990) modified the Kimura model to allow for variation among sites following a gamma distribution. As for the Jukes-Cantor model, the gamma parameter must be given by the user. By default, no gamma correction is applied.
- `F81`: Felsenstein (1981) generalized the Jukes-Cantor model by relaxing the assumption of equal base frequencies. The formulae used in this function were taken from McGuire et al. (1999).
- `K81`: Kimura (1981) generalized his model (Kimura 1980) by assuming different rates for two kinds of transversions: A <-> C and G <-> T on the one hand, and A <-> T and C <-> G on the other. This is what Kimura called his "three substitution types model" (3ST), and is sometimes referred to as "Kimura's 3-parameters distance".
- `F84`: This model generalizes K80 by relaxing the assumption of equal base frequencies. It was first introduced by Felsenstein in 1984 in Phylip, and is fully described by Felsenstein and Churchill (1996). The formulae used in this function were taken from McGuire et al. (1999).
- `BH87`: Barry and Hartigan (1987) developed a distance based on the observed proportions of changes among the four bases. This distance is not symmetric.
- `T92`: Tamura (1992) generalized the Kimura model by relaxing the assumption of equal base frequencies. This is done by taking into account the bias in G+C content in the sequences. The substitution rates are assumed to be the same for all sites along the DNA sequence.
- `TN93`: Tamura and Nei (1993) developed a model which assumes distinct rates for both kinds of transition (A <-> G versus C <-> T), and transversions. The base frequencies are not assumed to be equal and are estimated from the data. A gamma correction of the inter-site variation in substitution rates is possible.
- `GG95`: Galtier and Gouy (1995) introduced a model where the G+C content may change through time. Different rates are assumed for transitions and transversions.
- `logdet`: The Log-Det distance, developed by Lockhart et al. (1994), is related to BH87. However, this distance is symmetric. Formulae from Gu and Li (1996) are used. `dist.logdet` in phangorn uses a different implementation that gives substantially different distances for low-diverging sequences.
- `paralin`: Lake (1994) developed the paralinear distance which can be viewed as another variant of the Barry-Hartigan distance.
- `indel`: This counts the number of sites where there is an insertion/deletion gap in one sequence and not in the other.
- `indelblock`: Same as `indel`, but contiguous gaps are counted as a single unit. Note that the distance between `-A-` and `A--` is 3 because there are three different blocks of gaps, whereas the "indel" distance will be 2.
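To make the transition/transversion distinction concrete, the sketch below classifies the differing sites of two short aligned sequences and applies the standard K80 formula, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. It uses base R only; the helper `k80_by_hand` and the two toy sequences are hypothetical, the sequences are assumed to contain no gaps or ambiguous bases, and this is not the package's own implementation.

```r
# Chemical classes of the four bases (lower case, as in ape's character format).
purines     <- c("a", "g")
pyrimidines <- c("c", "t")

# Hypothetical helper: K80 distance between two aligned, gap-free sequences.
k80_by_hand <- function(s1, s2) {
  stopifnot(length(s1) == length(s2))
  differs <- s1 != s2
  # A transition keeps the base within its chemical class.
  same_class <- (s1 %in% purines & s2 %in% purines) |
                (s1 %in% pyrimidines & s2 %in% pyrimidines)
  P <- mean(differs & same_class)    # proportion of transitions
  Q <- mean(differs & !same_class)   # proportion of transversions
  d <- -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
  c(P = P, Q = Q, K80 = d)
}

# Toy data: 20 sites with two transitions (a/g, t/c) and one transversion (c/a).
s1 <- strsplit("acgtacgtacgtacgtacgt", "")[[1]]
s2 <- strsplit("gcgtacgtaagtacgtacgc", "")[[1]]
res <- k80_by_hand(s1, s2)
res
```

Here P = 0.1 and Q = 0.05, and the corrected distance is larger than the raw proportion of differing sites (0.15), as expected from a model that accounts for multiple substitutions at the same site.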

##### Value

an object of class dist (by default), or a numeric matrix if `as.matrix = TRUE`. If `model = "BH87"`, a numeric matrix is returned because the Barry-Hartigan distance is not symmetric.

If `variance = TRUE`, an attribute called `"variance"` is given to the returned object.
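The return types described above can be inspected as follows. This is a minimal sketch assuming the ape package and its bundled `woodmouse` data are available; the variable names are illustrative.

```r
library(ape)
data(woodmouse)

# variance = TRUE attaches the variances as an attribute of the result.
d <- dist.dna(woodmouse, model = "K80", variance = TRUE)
v <- attr(d, "variance")

# BH87 is not symmetric, so a square numeric matrix is returned.
m_bh <- dist.dna(woodmouse, model = "BH87")

class(d)         # class of the default return value
is.matrix(m_bh)  # whether the BH87 result is a matrix
```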

##### Note

If the sequences are very different, most evolutionary distances are undefined and a non-finite value (Inf or NaN) is returned. You may use `dist.dna(x, model = "raw")` to check whether some values are higher than 0.75.
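The 0.75 threshold can be illustrated with the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the raw proportion of differing sites. The sketch below uses base R only; the function name `jc69` is hypothetical and the values of `p` are made up for illustration.

```r
# JC69 distance as a function of the raw proportion p of differing sites.
jc69 <- function(p) -0.75 * log(1 - 4 * p / 3)

p <- c(0.10, 0.50, 0.75, 0.80)

# The log of a negative number gives NaN with a warning, silenced here.
d <- suppressWarnings(jc69(p))

is.finite(d)  # TRUE TRUE FALSE FALSE: Inf at p = 0.75, NaN above it
```

With real data, a check along the lines of `any(dist.dna(x, model = "raw") > 0.75, na.rm = TRUE)` flags pairs for which such a correction breaks down.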

##### References

Barry, D. and Hartigan, J. A. (1987) Asynchronous distance between
homologous DNA sequences. *Biometrics*, **43**, 261--276.

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a
maximum likelihood approach. *Journal of Molecular Evolution*,
**17**, 368--376.

Felsenstein, J. and Churchill, G. A. (1996) A Hidden Markov model
approach to variation among sites in rate of evolution.
*Molecular Biology and Evolution*, **13**, 93--104.

Galtier, N. and Gouy, M. (1995) Inferring phylogenies from DNA
sequences of unequal base compositions. *Proceedings of the
National Academy of Sciences USA*, **92**, 11317--11321.

Gu, X. and Li, W.-H. (1996) Bias-corrected paralinear and LogDet
distances and tests of molecular clocks and phylogenies under
nonstationary nucleotide frequencies. *Molecular Biology and
Evolution*, **13**, 1375--1383.

Jukes, T. H. and Cantor, C. R. (1969) Evolution of protein
molecules. In *Mammalian Protein Metabolism*, ed. Munro, H. N.,
pp. 21--132, New York: Academic Press.

Kimura, M. (1980) A simple method for estimating evolutionary rates of
base substitutions through comparative studies of nucleotide
sequences. *Journal of Molecular Evolution*, **16**, 111--120.

Kimura, M. (1981) Estimation of evolutionary distances between
homologous nucleotide sequences. *Proceedings of the National
Academy of Sciences USA*, **78**, 454--458.

Jin, L. and Nei, M. (1990) Limitations of the evolutionary parsimony
method of phylogenetic analysis. *Molecular Biology and
Evolution*, **7**, 82--102.

Lake, J. A. (1994) Reconstructing evolutionary trees from DNA and
protein sequences: paralinear distances. *Proceedings of the
National Academy of Sciences USA*, **91**, 1455--1459.

Lockhart, P. J., Steel, M. A., Hendy, M. D. and Penny, D. (1994)
Recovering evolutionary trees under a more realistic model of sequence
evolution. *Molecular Biology and Evolution*, **11**,
605--612.

McGuire, G., Prentice, M. J. and Wright, F. (1999) Improved error
bounds for genetic distances from DNA sequences. *Biometrics*,
**55**, 1064--1070.

Tamura, K. (1992) Estimation of the number of nucleotide substitutions
when there are strong transition-transversion and G + C-content
biases. *Molecular Biology and Evolution*, **9**, 678--687.

Tamura, K. and Nei, M. (1993) Estimation of the number of nucleotide
substitutions in the control region of mitochondrial DNA in humans and
chimpanzees. *Molecular Biology and Evolution*, **10**, 512--526.

##### See Also

`read.GenBank`, `read.dna`, `write.dna`, `DNAbin`, `dist.gene`, `cophenetic.phylo`, `dist`

*Documentation reproduced from package ape, version 5.3, License: GPL (>= 2)*