estimaterates: Estimate substitution rate matrix.

Description

Estimate a substitution rate matrix whose structure is specified by the user. Partition analysis, gamma rate variation, estimation of the root frequencies of the discrete characters, and clades (or sets of branches) with their own unique rates are also supported. See the arguments below and the vignette for detailed examples.

Usage

estimaterates(usertree = NULL, userphyl = NULL, matchtipstodata = FALSE,
              unobserved = NULL, alphabet = NULL, modelmat = NULL, 
              bgtype = "listofnodes", bg = NULL, partition = NULL, 
              ratevar = FALSE, nocat = 4, reversible = FALSE, numhessian = TRUE, 
              rootprob = NULL, rpvec = NULL, init = 0.9, lowli = 0.001, upli = 100, ...)

Value

All arguments used while calling the estimaterates function are attached in a list. The following components are also returned in the same list:

call: Function call used.
conv: A vector of convergence indicators for the model run. 0 denotes successful convergence.
time: Time taken in seconds.
tree: The phylogenetic tree used.
bg: List of groups of nodes that capture branches following unique substitution rates.
results: List of results including the parameter estimates for each unique entry of modelmat (excluding zeros), standard errors, number of parameters fit, and AIC and BIC values. Furthermore, details from the optimization routine applied are also available. Use ...$results$wop$parsep.
data_red: Unique phyletic patterns observed.
w: List of the number of times each gene phyletic pattern was observed.
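
To illustrate what data_red and w hold, the following base R sketch collapses a small made-up pattern matrix into its unique rows and their counts. The object names and the toy data are assumptions for illustration only; the package may build these components differently internally.

```r
# Toy phyletic patterns: 5 sites (rows) observed on 4 taxa (columns).
patterns <- matrix(c(1, 1, 2, 2,
                     1, 1, 2, 2,
                     2, 2, 2, 2,
                     1, 2, 1, 2,
                     1, 1, 2, 2), ncol = 4, byrow = TRUE)

# One string key per row so that identical patterns can be matched.
key <- apply(patterns, 1, paste, collapse = "")
first <- !duplicated(key)

# Unique patterns (analogous to data_red) and how often each one
# occurs (analogous to w), in order of first appearance.
unique_patterns <- patterns[first, , drop = FALSE]
pattern_counts <- as.vector(table(key)[key[first]])
```

Working with unique patterns and their counts means each distinct pattern only needs to be evaluated once.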

Arguments

usertree

Rooted binary tree of class "phylo". Read in a Newick tree using read.tree() from package ape before passing this argument. The branch lengths must be in expected substitutions per site (but see the lowli and upli arguments). Trees estimated using MrBayes and BEAST, for instance, yield branch lengths on that scale. If only an unrooted tree is available, look into the "root" and "midpoint.root" functions from the APE and phytools packages, respectively.

userphyl

A matrix or data frame of phyletic patterns. Rows represent the discrete character patterns and columns represent the taxa. These data can be numeric or character (but not both).

matchtipstodata

The default is FALSE, which means that the user must ensure that the ordering of the taxa in the data matrix matches the internal ordering of the tip labels of the tree. Set to TRUE if the column names of the data matrix, i.e., the taxa names, are all present in the tip labels of the tree provided, with the same spelling and notation; in that case, the restriction on ordering is not necessary.
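
As a rough sketch of the manual alternative, the columns of the data can be reordered to follow the tip labels before calling the function. The taxon names and values below are hypothetical; with a real tree, the labels would come from the tip.label component of the "phylo" object.

```r
# Hypothetical tip labels, in the tree's internal order.
tip_labels <- c("taxonC", "taxonA", "taxonB")

# Hypothetical data: rows are patterns, columns are taxa in another order.
dat <- data.frame(taxonA = c(1, 2), taxonB = c(1, 1), taxonC = c(2, 2))

# Every column name must be a tip label (same spelling) before reordering.
stopifnot(setequal(colnames(dat), tip_labels))

# Reorder the columns so that they follow the tip label ordering.
dat_ordered <- dat[, tip_labels]
```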

unobserved

A matrix of unobserved phyletic patterns, representing possible sampling or acquisition bias. Each row should be a unique phyletic pattern.
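
For instance, if patterns in which every taxon shares the same state could never have been sampled, the matrix of unobserved patterns might be built as follows. This is a sketch with an assumed number of taxa and a two-state alphabet, not a prescribed recipe.

```r
n_taxa <- 4          # assumed number of tips in the tree
alphabet <- c(1, 2)  # assumed two-state alphabet

# One row per unobserved pattern: all taxa share the same state.
unobserved <- t(sapply(alphabet, function(s) rep(s, n_taxa)))
```

For four taxa this reproduces matrix(c(1, 1, 1, 1, 2, 2, 2, 2), nrow = 2, byrow = TRUE), the form used in the examples.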

alphabet

The set of discrete characters used. May be integer only or character only.

modelmat

Can be one of two options: pre-built or user-specified. For the pre-built matrices, use:

"ER" for an equal rates matrix, e.g., matrix(c(NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA ), nrow = 4, ncol = 4)
"SYM" for a symmetric matrix, e.g., matrix(c(NA, 1, 2, 3, 1, NA, 4, 5, 2, 4, NA, 6, 3, 5, 6, NA ), 4,4)
"ARD" for an all rates different matrix, e.g., matrix(c(NA, 1, 2, 3, 4, NA, 5, 6, 7, 8, NA, 9, 10, 11, 12, NA), 4, 4)
"GTR" for a general time reversible model. This sets reversible to be TRUE, rootprob to be "maxlik", and the rate matrix supplied to the function is symmetric. The estimated rate matrix can be written as the product of a symmetric matrix multiplied by a diagonal matrix (consisting of the estimated root probabilities).
"BD" for a standard birth death matrix, e.g., matrix(c(NA, 1, 0, 0, 2, NA, 1, 0, 0, 2, NA, 1, 0, 0, 2, NA ), 4, 4)
"BDER" for an equal rates birth death matrix, e.g., matrix(c(NA, 1, 0, 0, 1, NA, 1, 0, 0, 1, NA, 1, 0, 0, 1, NA ), 4, 4)
"BDSYM" for a symmetric birth death matrix, e.g., matrix(c(NA, 1, 0, 0, 1, NA, 2, 0, 0, 2, NA, 3, 0, 0, 3, NA ), 4, 4)
"BDARD" for an all rates different birth death matrix, e.g., matrix(c(NA, 1, 0, 0, 4, NA, 2, 0, 0, 5, NA, 3, 0, 0, 6, NA ), 4, 4)

These pre-built matrices are inspired by the APE and DiscML packages.

A square matrix can also be input that consists of integer values denoting the indices of the substitution rates to be estimated. The number of rows and columns corresponds to the total number of discrete states possible in the data. For example, using matrix(c(NA, 1, 2, NA), 2, 2) means that two rates must be estimated corresponding to the entries 1 and 2, respectively. Using matrix(c(NA, 1, 0, NA), 2, 2) means that only one rate must be estimated and the entry 0 corresponds to a substitution that is not permitted (hence, does not need to be estimated). See the examples and vignette for more.
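
The index matrices above can be inspected in base R before fitting. The helper count_rates below is a hypothetical name introduced only for this sketch; it counts the distinct non-zero indices, which is the number of rates such a matrix asks for.

```r
# matrix() fills column by column, so entry [2, 1] is 1 and entry [1, 2] is 2.
two_rates <- matrix(c(NA, 1, 2, NA), 2, 2)
one_rate  <- matrix(c(NA, 1, 0, NA), 2, 2)
all_diff  <- matrix(c(NA, 1, 2, 3, 4, NA, 5, 6, 7, 8, NA, 9, 10, 11, 12, NA), 4, 4)

# Hypothetical helper: number of distinct rate indices, ignoring the
# NA diagonal and the 0 entries (substitutions that are not permitted).
count_rates <- function(m) {
  idx <- m[!is.na(m) & m != 0]
  length(unique(idx))
}

count_rates(two_rates)
count_rates(one_rate)
count_rates(all_diff)
```

Entries that share an index share a single estimated rate, which is how the equal rates and symmetric structures reduce the number of parameters.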

bgtype

Use this to group branches hypothesized to follow the same rates (which differ from those of other branches). To estimate clade-specific insertion and deletion rates, use the option "ancestornodes". If, on the other hand, a group of branches (not forming a clade) is hypothesized to follow the same rates, use the option "listofnodes".

bg

A vector of nodes should be provided if the "ancestornodes" option was chosen for argument "bgtype". If, on the other hand, "listofnodes" was chosen, a list should be provided, with each element of the list being a vector of nodes that delimit the branches following the same rates. See examples and vignette.

partition

A list of vectors (of sites) subject to different evolutionary constraints. For example, supplying list(c(1:2500), c(2501:5000)) means that sites 1 through 2500 follow their own substitution rates distinct from sites 2501 through 5000. These sites correspond to the rows of the data supplied in userphyl. Partition models can be fitted with a common (or unique) gamma distribution for rate variation over all partitions, and/or common root probabilities can be estimated over all partitions.
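
A partition list like the one above can be generated from the number of sites rather than typed by hand. The sketch below assumes 5000 sites split into two equal halves; the final check is a sanity test added here, not a documented requirement of the function.

```r
n_sites <- 5000        # assumed number of rows in userphyl
half <- n_sites / 2

# Two blocks of site indices: 1..2500 and 2501..5000.
partition <- list(seq_len(half), seq(half + 1, n_sites))

# Sanity check (an assumption of this sketch): every site appears in
# exactly one block.
all_sites <- unlist(partition)
stopifnot(!anyDuplicated(all_sites), setequal(all_sites, seq_len(n_sites)))
```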

ratevar

Default option is FALSE. Specifying "discgamma" implements the discrete gamma approximation model of Yang (1994). Even if a partition of sites is specified, this option uses the same alpha parameter for the gamma distribution over all partitions. Specifying "partitionspecificgamma" implements the discrete gamma approximation model in each partition separately. The number of categories can be specified using the "nocat" argument. See examples and vignette.

nocat

The number of categories for the discrete gamma approximation.
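
To give a feel for what the categories represent, here is a minimal base R sketch of a discrete gamma approximation using the mean rate within each of several equal-probability categories. It assumes a gamma distribution with shape and rate both equal to alpha (so the mean rate is 1); the function name is hypothetical, and whether the package uses category means or medians internally is not stated here.

```r
# Hypothetical helper: category rates for a discrete gamma approximation.
discrete_gamma_rates <- function(alpha, ncat = 4) {
  # Cut points that split a Gamma(shape = alpha, rate = alpha) density
  # into ncat categories of equal probability.
  cuts <- qgamma(seq(0, 1, length.out = ncat + 1), shape = alpha, rate = alpha)
  # Mean rate within each category, using the identity
  # x * dgamma(x, alpha, alpha) = dgamma(x, alpha + 1, alpha).
  ncat * diff(pgamma(cuts, shape = alpha + 1, rate = alpha))
}

rates <- discrete_gamma_rates(alpha = 0.5, ncat = 4)
rates
mean(rates)  # the category rates average to 1 by construction
```

Smaller values of alpha spread the category rates further apart, i.e., stronger rate variation among sites.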

reversible

This option forces a model to be reversible, i.e., the flow from state 'a' to 'b' is the same as the flow from state 'b' to 'a'. Only symmetric transition matrices can be specified in modelmat with this option. Inspired by the DiscML package.
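
The notion of equal flow can be checked numerically. The sketch below builds a rate matrix as a symmetric matrix times a diagonal matrix of root probabilities and verifies that the flow between any two states is the same in both directions. All values are made up, and the convention that rows index the starting state is an assumption of this example.

```r
# Symmetric matrix (hypothetical values) and root probabilities.
S <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0), 3, 3, byrow = TRUE)
root_prob <- c(0.5, 0.3, 0.2)

# Off-diagonal rates: symmetric matrix times a diagonal matrix of probabilities.
Q <- S %*% diag(root_prob)
# Diagonal chosen so that each row sums to zero.
diag(Q) <- -rowSums(Q)

# Flow from state i to state j is root_prob[i] * Q[i, j].
flow <- diag(root_prob) %*% Q
isSymmetric(flow)  # equal flow in both directions
```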

numhessian

Set to FALSE if standard errors are not required. This speeds up the algorithm. Default is TRUE. Although the function being used to calculate these errors is reliable, it is rare but possible that the errors are not calculated (due to approximation-associated issues while calculating the Hessian). In this case, try bootstrapping with numhessian=FALSE.

rootprob

Four options are available: "equal", "stationary", "maxlik", and "user".

Option "equal" means that all the discrete characters are given equal weight at the root.
Using "stationary" means that the discrete characters are weighted at the root by the stationary frequencies implied by the substitution rate matrix. Note that these can differ based on the partition. In the case that the "bgtype" argument is also provided, an average of the stationary frequencies of all branch groupings is used.
If "maxlik" is supplied, then the root frequencies are also estimated. These do not differ based on the partition.
If "user" is supplied here, a vector of root frequency parameters can be provided to argument "rpvec".
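
For the "stationary" option, the frequencies implied by a rate matrix can be illustrated with a small linear solve. This is a sketch under the assumption that rows of the rate matrix index the starting state and sum to zero; the helper name and the example rates are hypothetical, and the package's own computation may differ.

```r
# Hypothetical helper: stationary frequencies of a rate matrix Q.
stationary_freq <- function(Q) {
  n <- nrow(Q)
  # Solve freq %*% Q = 0 together with sum(freq) = 1 as one linear system.
  A <- rbind(t(Q), rep(1, n))
  b <- c(rep(0, n), 1)
  qr.solve(A, b)
}

# Two-state example: rate 1 from state 1 to 2, rate 2 from state 2 to 1.
Q <- matrix(c(-1,  1,
               2, -2), 2, 2, byrow = TRUE)
freq <- stationary_freq(Q)
freq
```

The state that is left more slowly (state 1 here) receives the larger stationary frequency.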

rpvec

If option "user" is specified for argument "rootprob", supply a vector of the same length as that provided to argument "alphabet", representing the root frequency parameters.

init

Initial value for the rates. The default value is 0.9.

lowli

Lower limit of the optimization boundaries, for finer control of the optimization problem. The default value is 0.001. This usually suffices if the branch lengths are in expected substitutions per site. However, if branch lengths are in different units, this should be changed accordingly.

upli

Upper limit of the optimization boundaries, for finer control of the optimization problem. The default value is 100. This usually suffices if the branch lengths are in expected substitutions per site. However, if branch lengths are in different units, this should be changed accordingly.
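
Why the limits depend on the branch length units can be seen from a two-state chain, where transition probabilities depend only on the product of rate and branch length. The following sketch uses the standard closed form for a two-state continuous-time Markov chain; the numeric values are hypothetical.

```r
# Probability of being in state 2 after time t when starting in state 1,
# for a two-state chain with rates a (1 -> 2) and b (2 -> 1).
p12 <- function(a, b, t) a / (a + b) * (1 - exp(-(a + b) * t))

# Hypothetical rates and a branch of 0.1 expected substitutions per site.
a <- 0.5
b <- 1.5
t_subs <- 0.1

# The same branch on a scale where the lengths are 1000 times larger.
scale <- 1000
p_original <- p12(a, b, t_subs)
p_rescaled <- p12(a / scale, b / scale, t_subs * scale)

all.equal(p_original, p_rescaled)
```

On the rescaled tree, the rate 0.5 becomes 0.0005, which falls below the default lower limit of 0.001, so the limits would need to be adjusted.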

...

Other arguments to pass to the optimization algorithm "nlminb". For example, control = list(trace = 5) will print progress at every 5th iteration.
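
The effect of such a control list can be tried on a toy problem with nlminb directly. The objective below is a stand-in, not the likelihood used by the package, and passing the default values of init, lowli and upli as start, lower and upper is an assumption made for illustration.

```r
# Toy objective with its minimum at 2 (stands in for a negative log-likelihood).
objective <- function(x) (x - 2)^2

# Box-constrained optimization, printing progress at every 5th iteration.
fit <- nlminb(start = 0.9, objective = objective,
              lower = 0.001, upper = 100,
              control = list(trace = 5))

fit$par          # estimated parameter
fit$convergence  # 0 indicates successful convergence
```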

Author

Utkarsh J. Dang and G. Brian Golding

utkarshdang@cunet.carleton.ca

Details

See vignette for detailed examples.

References

Paradis, E., Claude, J. & Strimmer, K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290. R package version 3.2.

Tane Kim and Weilong Hao 2014. DiscML: An R package for estimating evolutionary rates of discrete characters using maximum likelihood. R package version 1.0.1.

Examples


library(markophylo)
##############
data(simdata1) #example data for a 2-state continuous time markov chain model.
#Now, plot example tree.
ape::plot.phylo(simdata1$tree, edge.width = 2, show.tip.label = FALSE, no.margin = TRUE) 
ape::nodelabels(frame = "circle", cex = 0.7)
ape::tiplabels(frame = "circle", cex = 0.7)
print(simdata1$Q) #substitution matrix used to simulate example data
print(table(simdata1$data)) #states and frequencies
model1 <- estimaterates(usertree = simdata1$tree, userphyl = simdata1$data, 
                        alphabet = c(1, 2), rootprob = "equal", 
                        modelmat = matrix(c(NA, 1, 2, NA), 2, 2))
print(model1)
####
# \donttest{
#If the data is known to contain sampling bias such that certain phyletic
#patterns are not observed, then these unobserved data can be corrected for
#easily. First, let's create a filtered version of the data following which
#a correction will be applied within the function. Here, any patterns with all
#ones or twos are filtered out.
filterall1 <- which(apply(simdata1$data, MARGIN = 1, FUN = 
                            function(x) isTRUE(all.equal(as.vector(x), c(1, 1, 1, 1)))))
filterall2 <- which(apply(simdata1$data, MARGIN = 1, FUN = 
                            function(x) isTRUE(all.equal(as.vector(x), c(2, 2, 2, 2)))))
filteredsimdata1 <- simdata1$data[-c(filterall1, filterall2), ]
model1_f_corrected <- estimaterates(usertree = simdata1$tree, userphyl = filteredsimdata1, 
                          unobserved = matrix(c(1, 1, 1, 1, 2, 2, 2, 2), nrow = 2,  byrow = TRUE), 
                          alphabet = c(1, 2), rootprob = "equal", 
                          modelmat = matrix(c(NA, 1, 2, NA), 2, 2))
print(model1_f_corrected)
##############
data(simdata2)
print(simdata2$Q)
#While simulating the data found in simdata2, the clade with node 7 as its
#most recent common ancestor (MRCA) was constrained to have twice the 
#substitution rates as the rest of the branches in the tree.
print(table(simdata2$data))
model2 <- estimaterates(usertree = simdata2$tree, userphyl = simdata2$data, 
                        alphabet = c(1, 2), bgtype = "ancestornodes", bg = c(7),
                        rootprob = "equal", modelmat = matrix(c(NA, 1, 2, NA), 2, 2))
print(model2)
plottree(model2, colors=c("blue", "darkgreen"), edge.width = 2, show.tip.label = FALSE, 
         no.margin = TRUE)
ape::nodelabels(frame = "circle", cex = 0.7)
ape::tiplabels(frame = "circle", cex = 0.7)
##############
#Nucleotide data was simulated such that the first half of sites followed
#substitution rates different from the other half of sites. Data was simulated
#in the two partitions with rates 0.33 and 0.99.
data(simdata3)
print(dim(simdata3$data))
print(table(simdata3$data))
model3 <- estimaterates(usertree = simdata3$tree, userphyl = simdata3$data, 
                        alphabet = c("a", "c", "g", "t"), rootprob = "equal", 
                        partition = list(c(1:2500), c(2501:5000)), 
                        modelmat = matrix(c(NA, 1, 1, 1, 1, NA, 1, 1, 
                                            1, 1, NA, 1, 1, 1, 1, NA), 4, 4))
print(model3)
# }
#More examples in the vignette.
