Weighting schemes for DNA and amino acid sequences.
weight(x, ...)
# S3 method for DNAbin
weight(x, method = "Henikoff", k = 5, ...)
# S3 method for AAbin
weight(x, method = "Henikoff", k = 5, ...)
# S3 method for list
weight(x, method = "Henikoff", k = 5, residues = NULL, gap = "-", ...)
# S3 method for dendrogram
weight(x, method = "Gerstein", ...)
# S3 method for default
weight(x, method = "Henikoff", k = 5, residues = NULL, gap = "-", ...)
a named vector of weights, the sum of which is equal to the total number of sequences (average weight = 1).
x: a list or matrix of sequences (usually a "DNAbin" or "AAbin" object). Alternatively, x can be an object of class "dendrogram" for tree-based weighting.
...: additional arguments to be passed between methods.
method: a character string indicating the weighting method to be used. Currently the only methods available are a modified version of the maximum entropy weighting scheme proposed by Henikoff and Henikoff (1994) (method = "Henikoff") and the tree-based weighting scheme of Gerstein et al. (1994) (method = "Gerstein").
k: integer representing the k-mer size to be used. Defaults to 5. Note that larger values of k may be slow to compute and memory-intensive, since the number of possible k-mers grows exponentially with k.
residues: either NULL (default; emitted residues are automatically detected from the sequences), a case-sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Furthermore, the default setting residues = NULL will not detect rare residues that are not present in the sequences, and thus will not assign them emission probabilities in the model. Specifying the residue alphabet is therefore recommended unless x is a "DNAbin" or "AAbin" object.
gap: the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.
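The caveat about residues = NULL can be illustrated with a small base R sketch. It does not call the package and is not its detection code; the toy sequences and the variable names (seqs, detected, dna_alphabet, missed) are hypothetical. It only shows that an alphabet inferred from the data cannot contain a residue that never occurs in the data.

```r
# Hypothetical toy input: sequences as a list of character vectors
seqs <- list(s1 = c("A", "C", "G", "A"), s2 = c("A", "A", "C", "G"))

# An alphabet inferred from the data contains only residues that occur
detected <- sort(unique(unlist(seqs, use.names = FALSE)))

# The full DNA alphabet, as it could be supplied explicitly
dna_alphabet <- c("A", "C", "G", "T")

# Residues that inference from the data alone would miss
missed <- setdiff(dna_alphabet, detected)

print(detected)
print(missed)
```

Here "T" never appears in the toy sequences, so it is absent from the inferred alphabet. Supplying the alphabet explicitly avoids this.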
Shaun Wilkinson
This is a generic function. If method = "Henikoff", the sequences are weighted using a modified version of the maximum entropy method proposed by Henikoff and Henikoff (1994). In this case the maximum entropy weights are calculated from a k-mer presence/absence matrix instead of an alignment as originally described by Henikoff and Henikoff (1994).
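The idea of weighting from a k-mer presence/absence matrix can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: each k-mer column is treated as a "position" with two possible states (present or absent), the position-based rule of Henikoff and Henikoff (1994) is applied to it, and the result is rescaled so that the weights sum to the number of sequences. The toy sequences, the choice k = 3 and the helper kmers_of() are hypothetical.

```r
# Hypothetical toy sequences (two similar, one divergent)
seqs <- c(s1 = "ACGTACGT", s2 = "ACGTACGA", s3 = "TTTTGGGG")
k <- 3

# Set of distinct k-mers occurring in one sequence
kmers_of <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  unique(substring(s, starts, starts + k - 1))
}
kmer_sets <- lapply(seqs, kmers_of, k = k)
all_kmers <- sort(unique(unlist(kmer_sets)))

# Presence/absence matrix: rows are sequences, columns are k-mers
pa <- t(vapply(kmer_sets, function(ks) all_kmers %in% ks,
               logical(length(all_kmers))))
colnames(pa) <- all_kmers

# Position-based rule applied per column: a sequence receives
# 1 / (number of states in the column * number of sequences sharing its state)
n <- nrow(pa)
raw <- setNames(numeric(n), rownames(pa))
for (j in seq_len(ncol(pa))) {
  col <- pa[, j]
  n_states <- length(unique(col))
  n_same <- ifelse(col, sum(col), sum(!col))
  raw <- raw + 1 / (n_states * n_same)
}

# Rescale so that the weights sum to the number of sequences
w <- raw * n / sum(raw)
print(round(w, 3))
```

In this toy case the divergent sequence s3 receives the largest weight, while the two near-identical sequences share their weight between them.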
If method = "Gerstein", the agglomerative method of Gerstein et al. (1994) is used to weight sequences based on their relatedness as derived from a phylogenetic tree. In this case a dendrogram is first derived using the cluster function in the kmer package.
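The tree-based scheme can be sketched on a base R dendrogram. This is a minimal illustration, not the package's code. It follows the description of the Gerstein et al. scheme in Durbin et al. (1998) as we understand it: each leaf starts with the length of the edge above it, and the length of each internal edge is then shared among the leaves below it in proportion to their current weights. The toy distance matrix and the function name gerstein_weights() are hypothetical, and the tree is built with stats::hclust() rather than the kmer package.

```r
# Hypothetical toy distances between four sequences
d <- as.dist(matrix(c(0, 2, 6, 6,
                      2, 0, 6, 6,
                      6, 6, 0, 4,
                      6, 6, 4, 0),
                    nrow = 4,
                    dimnames = list(letters[1:4], letters[1:4])))
dend <- as.dendrogram(hclust(d, method = "average"))

# Recursive weighting: returns a named vector for the leaves below 'node'.
# At the root the edge length is zero (parent_height defaults to its own height).
gerstein_weights <- function(node, parent_height = attr(node, "height")) {
  edge <- parent_height - attr(node, "height")
  if (is.leaf(node)) {
    # A leaf starts with the length of the edge leading to it
    return(setNames(edge, attr(node, "label")))
  }
  below <- unlist(lapply(seq_along(node), function(i) {
    gerstein_weights(node[[i]], parent_height = attr(node, "height"))
  }))
  # Share this node's edge among its leaves in proportion to their weights
  # (assumes the leaves below do not all have zero weight)
  below + edge * below / sum(below)
}

raw <- gerstein_weights(dend)

# Rescale so that the weights sum to the number of sequences
w <- raw * length(raw) / sum(raw)
print(round(w, 3))
```

In this toy tree the closely related pair a and b each receive a smaller weight than the more distant pair c and d.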
Methods are available for "dendrogram" objects, "DNAbin" and "AAbin" sequence objects (as lists or matrices) and sequences in standard character format provided either as lists or matrices.
For further details on sequence weighting schemes see Durbin et al (1998) chapter 5.8.
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Gerstein M, Sonnhammer ELL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067-1078.
Henikoff S, Henikoff JG (1994) Position-based sequence weights. Journal of Molecular Biology, 243, 574-578.
## weight the sequences in the woodmouse dataset from the ape package
library(ape)
data(woodmouse)
woodmouse.weights <- weight(woodmouse)
woodmouse.weights