Learn R Programming

bios2mds (version 1.2.2)

dis: Dissimilarity score

Description

Computes the dissimilarity score between two aligned amino acid sequences, based on substitution matrices.

Usage

dis(seq1, seq2, sub.mat.id = "PAM250", gap = NULL)

Arguments

seq1

a character vector representing a first amino acid sequence.

seq2

a character vector representing a second amino acid sequence.

sub.mat.id

a string of characters indicating the amino acid substitution matrix to be taken into account for calculation. This should be one of "PAM40", "PAM80", "PAM120", "PAM160", "PAM250", "BLOSUM30", "BLOSUM45", "BLOSUM62", "BLOSUM80", "GONNET", "JTT", "JTT_TM" and "PHAT". Default is PAM250. See sub.mat.

gap

a numeric vector of length 2, indicating the penalty for initiating and extending a strand of two gaps (gap ignored, default).

Value

A single numeric value representing the dissimilarity score.

Details

Grishin and Grishin (2002) developped a method to calculate the similarity score with amino acid substitution matrices.

Let s be an amino acid substitution matrix with elements s(a, b), let A be an alignment of n sequences, \(A_{ik}\) is a symbol (amino acid or gap: '-') in the site k of the sequence i. For each pair of sequences i and j from A, the following equations(1), (2), (3) and (4) are calculated as follows:

(1)

\(S_{ij}\), called score per site, is obtained as: $$S_{ij} = \sum_{k \in K_{ij}}s(A_{ik}, A_{jk})/l(K_{ij})$$ where \(K_{ik}\) is the set of sites k such that \(A_{ik}\) \(!=\) '-' and \(A_{jk}\) \(!=\) '-' and \(l(K_{ij})\) is the number of elements in \(K_{ij}\).

(2)

\(T_{ij}\), called average upper limit of the score per site, is obtained as: $$T_{ij} = 0.5 \sum_{k \in K_{ij}}(s(A_{ik}, A_{ik}) + s(A_{jk}, A_{jk}))/l(K_{ij})$$

(3)

\(S^{rand}_{ij}\), called score per site expected from random sequences, is obtained as: $$S^{rand}_{ij} = \sum^{20}_{a=1}\sum^{20}_{b=1}f^i_j(a)f^j_i(b)s(a, b)$$ where \(f^i_j(a)\) is the frequency of amino acid 'a' in \(i^{th}\) protein sequence of A over all sites in \(K_{ij}\).

(4)

\(V_{ij}\), called normalized score (Feng and Doolittle, 1997), is obtained as: $$V_{ij} = \frac{S_{ij} - S^{rand}_{ij}}{T_{ij} - S^{rand}_{ij}}$$

The normalized score \(V_{ij}\) ranges from 0 (for random sequences) to 1 (for identical sequences). However, for very divergent sequences, \(V_{ij}\) can become negative due to statistical errors. In this case, dis attributes 0 to negative scores.

The dissimilarity score \(D_{ij}\) between sequences i and j is obtained from the similarity score as:

D_{ij} = V_{ij} - 1

References

Grishin VN and Grishin NV (2002) Euclidian space and grouping of biological objects. Bioinformatics 18:1523-1534.

Feng DF and Doolittle RF (1997) Converting amino acid alignment scores into measures of evolutionary time: a simulation study of various relationships. J Mol Evol 44:361-370.

Examples

Run this code
# NOT RUN {
# calculating dis between the sequences of CLTR1_HUMAN and CLTR2_HUMAN:
aln <- import.fasta(system.file("msa/human_gpcr.fa", package = "bios2mds"))
dis <- dis(aln$CLTR1_HUMAN, aln$CLTR2_HUMAN)
dis
# }

Run the code above in your browser using DataLab