ouchModel: Estimate parameters under a (multivariate) OU model of evolution

Description

The ouchModel function uses maximum likelihood to fit parameters of a multivariate OU model evolving on the phylogeny. The user is recommended to install the suggested package PCMBaseCpp which significantly speeds up the calculations (see Details).

Usage

ouchModel(phyltree, mData, regimes = NULL, regimes.times = NULL, 
root.regime = NULL, predictors = NULL, M.error = NULL, Atype = "Invertible", 
Syytype = "UpperTri", diagA = "Positive", estimate.root.state = FALSE, 
parameter_signs = NULL, start_point_for_optim = NULL, parscale = NULL, 
min_bl = 0.0003, maxiter = c(10,100))

Value

FinalFound: The point where the search procedure stopped. See Details for the description.
MaxLikFound: The point with the highest likelihood found by the search procedure, if it is the same as the final point then this field equals "Same as final found".

Arguments

phyltree

The phylogeny in phylo format. The tree can be obtained from e.g. a nexus file by the read.nexus() function from the ape package. The "standard" ape node indexing is assumed: for a tree with $n$ tips, the tips should have indices 1:n and the root index n+1. The root.edge field is ignored.

mData

A matrix with the rows corresponding to the tip species while the columns correspond to the traits. The rows should be named by species
(field phyltree$tip.label), if not, then a warning is thrown and the order of the species is assumed to be the same as the order in which the species are in the phylogeny (i.e. correspond to the node indices 1:n, where $n$ is the number of tips). The columns should be named by traits, otherwise a warning is thrown and generic names are generated.

regimes

A vector or list of regimes. If vector then each entry corresponds to each of the branches of phyltree, i.e. to each row of phyltree$edge. If list then each list entry corresponds to a tip node and is a vector for regimes on that lineage. If NULL, then a constant regime is assumed on the whole tree.

regimes.times

A list of vectors for each tree node, it starts with $0$ and ends with the current time of the species. In between are the times where the regimes (niches) changed. If NULL then each branch is considered to be a regime.

root.regime

The regime at the root of the tree. If not given, then it is taken as the regime that is present on the root's daughter lineages and is the most frequent one in the regimes vector. If more than one regime has the same maximum frequency, then alphabetically first one of the maximum ones is taken.

predictors

A vector giving the numbers of the columns from data which are to be considered predictor ones, i.e. conditioned on in the program output. If not provided, then none will be treated as predictors.

M.error

An optional measurement error covariance structure. The measurement errors between species are assumed independent. The program tries to recognize the structure of the passed matrix and accepts the following possibilities :

a single number that is a common measurement error for all tips and species,
a m element vector with each value corresponding to a variable, measurement errors are independent between variables and each species is assumed to have the same measurement errors,
a m x m ((number of variables) x (number of variables)) matrix, all species will have the same measurement error,
a list of length n (number of species), each list element is the covariance structure for the appropriate (numbering according to tree) species, either a single number (each variable has same variance), vector (of length m for each variable), or m x m matrix, the order of the list has to correspond to the order of the nodes in the phyltree object,
NULL no measurement error.

From version 2.0.0 of mvSLOUCH it is impossible to pass a single joint measurement error matrix for all the species and traits.

Atype

What class does the A matrix in the multivariate OUOU model belong to, possible values : "SingleValueDiagonal", "Diagonal", "UpperTri", "LowerTri", "Symmetric", "SymmetricPositiveDefinite",
"DecomposablePositive", "DecomposableNegative",
"DecomposableReal", "Invertible", "TwoByTwo", "Any"

Syytype

What class does the Syy matrix in the multivariate OUBM model belong to, possible values : "SingleValueDiagonal", "Diagonal", "UpperTri", "LowerTri", "Symmetric", "Any"

diagA

Whether the values on A's diagonal are to be "Positive", "Negative" or sign allowed to vary, NULL. However, setting this to a non-NULL value might be detrimental to the optimization process if Atype is "SymmetricPositiveDefinite". In this case the diagonal is always guaranteed to be positive.

estimate.root.state

Should the root state be estimate TRUE (not recommended) or set at the optimum FALSE (recommended). Root state estimation is usually unreliable hence if fossil measurements are available prediction based on them and the estimated model will probably be more accurate. If there is only one regime, then estimation of the root state separately is impossible and will not be allowed.

parameter_signs

WARNING: ONLY use this option if you understand what you are doing! This option is still in an experimental stage so some setups might not work (please report). A list allowing the user to control whether specific entries for each model parameter should be positive, negative, zero or set to a specific (other) value. The entries of the list have to be named, the admissible names are "signsA" (for A matrix), "signsSyy" (for Syy matrix) and "signsmPsi" (for mPsi matrix) and "signsvY0" (for vY0 matrix). Any other entry in this list will be ignored. Each entry of the list has to be a matrix of appropriate size, i.e. of the size of the parameter to which it corresponds. Inside this matrix the possible values are "+" if the given entry is to be positive, "-" if the given entry is to be negative, x, where x is a number, if the entry is to be set to specified value or NA if the entry is to be freely estimated. See Details for an example, further description and important warnings!

start_point_for_optim

A named list with starting parameters for of the parameters for be optimized by optim(), in this case A and Syy. One may provide both or only one of them. Make sure that the parameter is consistent with the other parameter restrictions as no check is done and this can result in undefined behaviour. For example one may provide this as (provided dimensions and other parameter restrictions agree)

start_point_for_optim=list(A=rbind(c(2,0),(0,4)), 
Syy=rbind(c(1,0.5),c(0,2))).

parscale

A vector to calculate the parscale argument for optim. It is a named vector with 3 entries, e.g.
c("parscale_A"=3,"logparscale_A"=5,"logparscale_other"=1). The entry parscale_A is the scale for entries of the A matrix, logparscale_A is the scale for entries of the A matrix that are optimized over on the logarithmic scale, e.g. if eigenvalues are assumed to be positive, then optimization is done over log(eigenvalue) for A's eigendecomposition and logparscale_other is the scale for entries other then of A that are done on the logarithmic scale (e.g. Syy's diagonal, or other entries indicated as positive via parameter_signs). If not provided (or if a name of the vector is misspelled), then made equal to the example value provided above. For other elements, then mentioned above, that are optimized over by optim(), 1 is used for optim()'s parscale. It is advised that the user experiments with a couple of different values and reads optim's man page.

min_bl

Value to which PCMBase's PCMBase.Threshold.Skip.Singular should be set. It indicates that branches of length shorter than min_bl should be skipped in likelihood calculations. Short branches can result in singular covariance matrices for the transition density along a branch. The user should adjust this value if a lot of warnings are raised by PCMBase about singularities during the likelihood calculations. However, this does not concern tip branches-these cannot be skipped and hence should be long enough so that numerical issues are not raised.

maxiter

The maximum number of iterations for different components of the estimation algorithm. A vector of two integers. The first is the number of iterations for phylogenetic GLS evaluations, i.e. conditional on the other parameters, the regime optima and perhaps initial state are estimated by a phylogenetic GLS procedure. After this the other parameters are optimized over by optim(). This first entry controls the number of iterations of this procedure. Finally, the second is the value of maxiter passed to optim().

Author

Krzysztof Bartoszek

Warning

The estimation can take a long time and should be repeated a couple of times so that it is run from different starting positions. The function can produce (a lot of) warnings and errors during the search procedure, this is nothing to worry about.

Details

The likelihood calculations are done by the PCMBase package. However, there is a C++ backend, PCMBaseCpp. If it is not available, then the likelihood is calculated slower using pure R. However, with the calculations in C++ up to a 100-fold increase in speed is possible (more realistically 10-20 times). The PCMBaseCpp package is available from https://github.com/venelin/PCMBaseCpp.

This function estimates the parameters of the following multivariate SDE, $$dY(t) = -A(Y-\Psi(t))dt + \Sigma dW(t), Y(0)=Y_{0}$$ on a phylogenetic tree. It uses a numerical optimization over A (parametrized by its eigenvalues and eigenvectors or its QR decomposition) and S (parametrized by its values) and conditional on A and S estimates the values of Psi corresponding to the different regimes by a GLS estimate. Y(0) is assumed to be equal to the root value of Psi (unless estimate.root.state=TRUE), then Y(0) is estimated is estimated by least squares).

The setting Atype="Any" means that one assumes the matrix A is eigendecomposable. If the estimation algorithm hits a defective A, then it sets the log-likelihood at the minimum value and will try to get out of this dip.

The function parameter parameter_signs is special in the sense that it can give the user great control over the estimation procedure but can also make the output very inconsistent with what the user provides. If we have two traits, then an EXAMPLE setting of this can be:
parameter_signs=list(signsA=rbind(c("+","-"),c(0,"+")),
signsSyy=rbind(c(NA,0),c(0,NA)). This means that A is upper triangular with positive values on the diagonal and a negative value on the off-diagonal, Syy is diagonal and A is also diagonal. It is advisable to set now Atype="Any" and Syytype="Any" (see further description).

If the given model parameter is to be estimated by a generalized least squares (currently mPsi and vY0), then the sign specifications are ignored. However, it is possible to set specific values. Furthermore, the package does not check (for A and Syy) if the specifications here agree with the Atype, Syytype and diagA. The settings in signsA and signsSyy will override the other settings. Hence, it is up to the user to make sure that the settings of signsA and signsSyy are consistent with Atype, Syytype and diagA. It is advisable to use signsA with "+" on the diagonal and have diagA=NULL. The diagonal of Syy is forced to be positive (unless "-" is used on the diagonal of signsSyy but this is strongly discouraged) so it is advisable to keep NA on the diagonal of signsSyy and not put there "+" there. Hence, in particular using the signs mechanism result in a wrong class of the matrix
(e.g. Atype="SymmetricPositiveDefinite", but after corrections for the provided entries in signsA one obtains a non-symmetric A with complex, negative-real-part eigenvalues). Lastly, using signsA and signsSyy can result in a wrong amount of dof and in turn incorrect AICc and BIC values. What the code does is subtracts the amount of fixed values in signsA and signsSyy from the amount of free parameters used to estimate A and Syy. For example if one sets
Atype="SingleValueDiagonal" (estimated by one free parameter) but specified two off-diagonal values, then the amount of dofs from A will be -1!! The ONLY fail-safe way to use this is to set Atype="Any" (if signsA used) and Syytype="Any" (if signsSyy used). If using Syytype="Any" and signsSyy the it is strongly advisable to set the entries either below or above the diagonal ofSyy to 0. The reason is that $\Sigma_{yy}\Sigma_{yy}^{T}$ enters the likelihood and not the given value of $\Sigma_{yy}$. Hence, having values below (or respectively above) the diagonal results in an overparameterized model. The package has the option of mixing different matrix types with specifying values in it but this is only for advanced users who need to dig into the code to see what the dof should be and if it is possible to find a correspondence between the parametrization and settings. If entries of mPsi and vY0 are pre-specified, then the dof are correctly adjusted for this.

The found point is described by a list containing four fields. The first field
HeuristicSearchPointFinalFind is the parametrization of the model parameters at the considered point with the value of the log-likelihood. The field ParamsInModel is the point estimate of the parameters of the SDE. The field ParamSummary are different composite (evaluated at the tree's height) and summary statistics, The field phylhalflife are the eigenvalues, eigenvectors and phylogenetic half lives associated with the A matrix, expmtA is $\exp(-A*(tree height))$, mPsi.rotated is each of the regime effects multiplied by $(1-\exp(-A*(tree height)))$, cov.matrix is the trait vector covariance matrix at the tree's height, corr.matrix is the trait vector correlation matrix at the tree's height, trait.regression is a list consisting of regression coefficients when taking each trait in turn and calculating its conditional expectation on all of the other trait, stationary.cov.matrix is the stationary covariance matrix of process if it exists (i.e. the eigenvalues have positive real part), stationary.corr.matrix is the stationary correlation matrix of process if it exists (i.e. the eigenvalues have positive real part), StS the infinitesimal covariance matrix $\Sigma_{yy}\Sigma_{yy}^{T}$, LogLik the log-likelihood, dof the degrees of freedom, m2loglik is $-2$log--likelihood, aic is the Akaike information criterion, aic.c is the Akaike information criterion corrected for small sample size, sic is the Schwarz information criterion, bic is the Bayesian information criterion (which is the same as the Schwarz information criterion) and RSS is the residual sum of squares. The field RSS_non_phylogenetic is a residual sum of squares calculated without correcting for the phylogeny--induced between species correlations, while the extension conditional_on_predictors indicates that we consider the RSS for the variables labelled as responses conditioned on the remaining variables. The R2_phylaverage field is R2, where the alternative model is the phylogenetically weighted sample average (see OU_phylreg).

From version 2.0.0 of mvSLOUCH the data has to be passed as a matrix. To underline this the data parameter's name has been changed to mData.

From version 2.0.0 of mvSLOUCH the parameter calcCI has been removed. The package now offers the possibility of bootstrap confidence intervals, see function parametric.bootstrap.

References

Bartoszek, K. and Pienaar, J. and Mostad. P. and Andersson, S. and Hansen, T. F. (2012) A phylogenetic comparative method for studying multivariate adaptation. Journal of Theoretical Biology 314:204-215.

Bartoszek, K. and Tredgett Clarke J. and Fuentes-Gonzalez, J. and Mitov, V. and Pienaar, J. and Piwczynski, M. and Puchalka, R. and Spalik, K. and Voje, K. L. (2024) Fast mvSLOUCH: Multivariate Ornstein-Uhlenbeck-based models of trait evolution on large phylogenies, Methods in Ecology and Evolution 15(9):1507-1515.

Butler, M.A. and A.A. King (2004) Phylogenetic comparative analysis: a modeling approach for adaptive evolution. American Naturalist 164:683-695.

Hansen, T.F. (1997) Stabilizing selection and the comparative analysis of adaptation. Evolution 51:1341-1351.

Mitov, V. and Bartoszek, K. and Asimomitis, G. and Stadler, T. (2020) Fast likelihood calculation for multivariate Gaussian phylogenetic models with shifts Theoretical Population Biology 131:66-78.

Pienaar et al (in prep) An overview of comparative methods for testing adaptation to external environments.

Examples

Run this code

RNGversion(min(as.character(getRversion()),"3.6.1"))
set.seed(12345, kind = "Mersenne-Twister", normal.kind = "Inversion")
### We will first simulate a small phylogenetic tree using functions from ape.
### For simulating the tree one could also use alternative functions, e.g. sim.bd.taxa 
### from the TreeSim package
phyltree<-ape::rtree(5)

## The line below is not necessary but advisable for speed
phyltree<-phyltree_paths(phyltree)

### Define a vector of regimes.
regimes<-c("small","small","large","small","small","large","large","large")

### Define SDE parameters to be able to simulate data under the OUOU model.
## 3D model
## OUOUparameters<-list(vY0=matrix(c(1,-1,0.5),nrow=3,ncol=1),
## A=rbind(c(9,0,0),c(0,5,0),c(0,0,1)),mPsi=cbind("small"=c(1,-1,0.5),"large"=c(-1,1,0.5)),
## Syy=rbind(c(1,0.25,0.3),c(0,1,0.2),c(0,0,1)))
## 2D model used to reduce running time on CRAN
OUOUparameters<-list(vY0=matrix(c(1,-1),nrow=2,ncol=1),
A=rbind(c(9,0),c(0,5)),mPsi=cbind("small"=c(1,-1),"large"=c(-1,1)),
Syy=rbind(c(1,0.25),c(0,1)))

### Now simulate the data.
OUOUdata<-simulOUCHProcPhylTree(phyltree,OUOUparameters,regimes,NULL)
OUOUdata<-OUOUdata[phyltree$tip.label,,drop=FALSE]

### Try to recover the parameters of the OUOU model.
### maxiter here set to minimal working possibility, in reality it should be larger
### e.g. default of c(10,100)
### Also the Atype and Syytype variables should be changed, here set as simplest
### for speed of evaluation, e.g. Atype="DecomposablePositive", Syytype="UpperTri"
OUOUestim<-ouchModel(phyltree,OUOUdata,regimes,Atype="SingleValueDiagonal",
Syytype="SingleValueDiagonal",diagA="Positive",maxiter=c(1,1))
RNGversion(as.character(getRversion()))

if (FALSE)  ##It takes too long to run this
### And finally bootstrap with particular interest in the evolutionary regression
OUOUbootstrap<-parametric.bootstrap(estimated.model=OUOUestim,phyltree=phyltree,
values.to.bootstrap=c("evolutionary.regression"),regimes=regimes,root.regime="small",
M.error=NULL,predictors=c(2),kY=NULL,numboot=5,Atype=NULL,Syytype=NULL,diagA=NULL)

Run the code above in your browser using DataLab