fit_hbd_model_on_grid: Fit a homogenous birth-death model on a discrete time grid.

Description

Given an ultrametric timetree, fit a homogenous birth-death (HBD) model in which speciation and extinction rates (\(\lambda\) and \(mu\)) are defined on a fixed grid of discrete time points and assumed to vary polynomially between grid points. ``Homogenous'' refers to the assumption that, at any given moment in time, all lineages exhibit the same speciation/extinction rates (in the literature this is sometimes referred to simply as ``birth-death model''). Every HBD model is defined based on the values that \(\lambda\) and \(\mu\) take over time as well as the sampling fraction \(\rho\) (fraction of extant species sampled). This function estimates the values of \(\lambda\) and \(\mu\) at each grid point by maximizing the likelihood (Morlon et al. 2011) of the timetree under the resulting HBD model.

Usage

fit_hbd_model_on_grid(tree, 
                      oldest_age        = NULL,
                      age0              = 0,
                      age_grid          = NULL,
                      min_lambda        = 0,
                      max_lambda        = +Inf,
                      min_mu            = 0,
                      max_mu            = +Inf,
                      min_rho0          = 1e-10,
                      max_rho0          = 1,
                      guess_lambda      = NULL,
                      guess_mu          = NULL,
                      guess_rho0        = 1,
                      fixed_lambda      = NULL,
                      fixed_mu          = NULL,
                      fixed_rho0        = NULL,
                      const_lambda      = FALSE,
                      const_mu          = FALSE,
                      splines_degree    = 1,
                      condition         = "auto",
                      relative_dt       = 1e-3,
                      Ntrials           = 1,
                      Nthreads          = 1,
                      max_model_runtime = NULL,
                      fit_control       = list())

Arguments

tree

An ultrametric timetree of class "phylo", representing the time-calibrated reconstructed phylogeny of a set of extant species.

oldest_age

Strictly positive numeric, specifying the oldest time before present (``age'') to consider when calculating the likelihood. If this is equal to or greater than the root age, then oldest_age is taken as the stem age, and the classical formula by Morlon et al. (2011) is used. If oldest_age is less than the root age, the tree is split into multiple subtrees at that age by treating every edge crossing that age as the stem of a subtree, and each subtree is considered an independent realization of the HBD model stemming at that age. This can be useful for avoiding points in the tree close to the root, where estimation uncertainty is generally higher. If oldest_age==NULL, it is automatically set to the root age.

age0

Non-negative numeric, specifying the youngest age (time before present) to consider for fitting, and with respect to which rho is defined. If age0>0, then rho0 refers to the sampling fraction at age age0, i.e. the fraction of lineages extant at age0 that are included in the tree. See below for more details.

age_grid

Numeric vector, listing ages in ascending order, on which \(\lambda\) and \(\mu\) are allowed to vary independently. This grid must cover at least the age range from age0 to oldest_age. If NULL or of length <=1 (regardless of value), then \(\lambda\) and \(\mu\) are assumed to be time-independent.

min_lambda

Numeric vector of length Ngrid (=max(1,length(age_grid))), or a single numeric, specifying lower bounds for the fitted \(\lambda\) at each point in the age grid. If a single numeric, the same lower bound applies at all ages.

max_lambda

Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted \(\lambda\) at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use +Inf to omit upper bounds.

min_mu

Numeric vector of length Ngrid, or a single numeric, specifying lower bounds for the fitted \(\mu\) at each point in the age grid. If a single numeric, the same lower bound applies at all ages.

max_mu

Numeric vector of length Ngrid, or a single numeric, specifying upper bounds for the fitted \(\mu\) at each point in the age grid. If a single numeric, the same upper bound applies at all ages. Use +Inf to omit upper bounds.

min_rho0

Numeric, specifying a lower bound for the fitted sampling fraction \(\rho\) (fraction of extant species included in the tree).

max_rho0

Numeric, specifying an upper bound for the fitted sampling fraction \(\rho\).

guess_lambda

Initial guess for \(\lambda\) at each age-grid point. Either NULL (an initial guess will be computed automatically), or a single numeric (guessing the same \(\lambda\) at all ages) or a numeric vector of size Ngrid specifying a separate guess for \(\lambda\) at each age-grid point. To omit an initial guess for some but not all age-grid points, set their guess values to NA. Guess values are ignored for non-fitted (i.e., fixed) parameters.

guess_mu

Initial guess for \(\mu\) at each age-grid point. Either NULL (an initial guess will be computed automatically), or a single numeric (guessing the same \(\mu\) at all ages) or a numeric vector of size Ngrid specifying a separate guess for \(\mu\) at each age-grid point. To omit an initial guess for some but not all age-grid points, set their guess values to NA. Guess values are ignored for non-fitted (i.e., fixed) parameters.

guess_rho0

Numeric, specifying an initial guess for the sampling fraction \(\rho\) at age0. Setting this to NULL or NA is the same as setting it to 1.

fixed_lambda

Optional fixed (i.e. non-fitted) \(\lambda\) values on one or more age-grid points. Either NULL (\(\lambda\) is not fixed anywhere), or a single numeric (\(\lambda\) fixed to the same value at all grid points) or a numeric vector of size Ngrid (\(\lambda\) fixed on one or more age-grid points, use NA for non-fixed values).

fixed_mu

Optional fixed (i.e. non-fitted) \(\mu\) values on one or more age-grid points. Either NULL (\(\mu\) is not fixed anywhere), or a single numeric (\(\mu\) fixed to the same value at all grid points) or a numeric vector of size Ngrid (\(\mu\) fixed on one or more age-grid points, use NA for non-fixed values).

fixed_rho0

Numeric between 0 and 1, optionallly specifying a fixed value for the sampling fraction \(\rho\). If NULL or NA, the sampling fraction \(\rho\) is estimated, however note that this may not always be meaningful (Stadler 2009, Stadler 2013).

const_lambda

Logical, specifying whether \(\lambda\) should be assumed constant across the grid, i.e. time-independent. Setting const_lambda=TRUE reduces the number of free (i.e., independently fitted) parameters. If \(\lambda\) is fixed on some grid points (i.e. via fixed_lambda), then only the non-fixed lambdas are assumed to be identical to one another.

const_mu

Logical, specifying whether \(\mu\) should be assumed constant across the grid, i.e. time-independent. Setting const_mu=TRUE reduces the number of free (i.e., independently fitted) parameters. If \(\mu\) is fixed on some grid points (i.e. via fixed_mu), then only the non-fixed mus are assumed to be identical to one another.

splines_degree

Integer between 0 and 3 (inclusive), specifying the polynomial degree of \(\lambda\) and \(\mu\) between age-grid points. If 0, then \(\lambda\) and \(\mu\) are considered piecewise constant, if 1 then \(\lambda\) and \(\mu\) are considered piecewise linear, if 2 or 3 then \(\lambda\) and \(\mu\) are considered to be splines of degree 2 or 3, respectively. The splines_degree influences the analytical properties of the curve, e.g. splines_degree==1 guarantees a continuous curve, splines_degree==2 guarantees a continuous curve and continuous derivative, and so on. A degree of 0 is generally not recommended.

condition

Character, either "crown", "stem", "none" or "auto", specifying on what to condition the likelihood. If "crown", the likelihood is conditioned on the survival of the two daughter lineages branching off at the root. If "stem", the likelihood is conditioned on the survival of the stem lineage. Note that "crown" really only makes sense when oldest_age is equal to the root age, while "stem" is recommended if oldest_age differs from the root age. "none" is generally not recommended. If "auto", the condition is chosen according to the recommendations mentioned earlier.

relative_dt

Strictly positive numeric (unitless), specifying the maximum relative time step allowed for integration over time, when calculating the likelihood. Smaller values increase integration accuracy but increase computation time. Typical values are 0.0001-0.001. The default is usually sufficient.

Ntrials

Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing Ntrials reduces the risk of reaching a non-global local maximum in the fitting objective.

Nthreads

Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform.

max_model_runtime

Optional numeric, specifying the maximum number of seconds to allow for each evaluation of the likelihood function. Use this to abort fitting trials leading to parameter regions where the likelihood takes a long time to evaluate (these are often unlikely parameter regions).

fit_control

Named list containing options for the nlminb optimization routine, such as iter.max, eval.max or rel.tol. For a complete list of options and default values see the documentation of nlminb in the stats package.

Value

A list with the following elements:

success

Logical, indicating whether model fitting succeeded. If FALSE, the returned list will include an additional ``error'' element (character) providing a description of the error; in that case all other return variables may be undefined.

objective_value

The maximized fitting objective. Currently, only maximum-likelihood estimation is implemented, and hence this will always be the maximized log-likelihood.

objective_name

The name of the objective that was maximized during fitting. Currently, only maximum-likelihood estimation is implemented, and hence this will always be ``loglikelihood''.

loglikelihood

The log-likelihood of the fitted model for the given timetree.

fitted_lambda

Numeric vector of size Ngrid, listing fitted or fixed speciation rates \(\lambda\) at each age-grid point. Between grid points the fitted \(\lambda\) should be interpreted as a piecewise polynomial function (natural spline) of degree splines_degree; to evaluate this function at arbitrary ages use the castor routine evaluate_spline.

fitted_mu

Numeric vector of size Ngrid, listing fitted or fixed extinction rates \(\mu\) at each age-grid point. Between grid points the fitted \(\mu\) should be interpreted as a piecewise polynomial function (natural spline) of degree splines_degree; to evaluate this function at arbitrary ages use the castor routine evaluate_spline.

fitted_rho

Numeric, specifying the fitted or fixed sampling fraction \(\rho\).

guess_lambda

Numeric vector of size Ngrid, specifying the initial guess for \(\lambda\) at each age-grid point.

guess_mu

Numeric vector of size Ngrid, specifying the initial guess for \(\mu\) at each age-grid point.

guess_rho0

Numeric, specifying the initial guess for \(\rho\).

age_grid

The age-grid on which \(\lambda\) and \(\mu\) are defined. This will be the same as the provided age_grid, unless the latter was NULL or of length <=1.

NFP

Integer, number of free (i.e., independently) fitted parameters. If none of the \(\lambda\), \(\mu\) and \(\rho\) were fixed, and const_lambda=FALSE and const_mu=FALSE, then NFP will be equal to 2*Ngrid+1.

AIC

The Akaike Information Criterion for the fitted model, defined as \(2k-2\log(L)\), where \(k\) is the number of fitted parameters and \(L\) is the maximized likelihood.

BIC

The Bayesian information criterion for the fitted model, defined as \(\log(n)k-2\log(L)\), where \(k\) is the number of fitted parameters, \(n\) is the number of data points (number of branching times), and \(L\) is the maximized likelihood.

converged

Logical, specifying whether the maximum likelihood was reached after convergence of the optimization algorithm. Note that in some cases the maximum likelihood may have been achieved by an optimization path that did not yet converge (in which case it's advisable to increase iter.max and/or eval.max).

Niterations

Integer, specifying the number of iterations performed during the optimization path that yielded the maximum likelihood.

Nevaluations

Integer, specifying the number of likelihood evaluations performed during the optimization path that yielded the maximum likelihood.

Details

Warning: Unless well-justified constraints are imposed on either \(\lambda\) and/or \(\mu\) and \(\rho\), it is generally impossible to reliably estimate \(\lambda\) and \(\mu\) from extant timetrees alone (Louca and Pennell, 2019). This routine (and any other software that claims to estimate \(\lambda\) and \(\mu\) solely from extant timetrees) should thus be used with great suspicion. If your only source of information is an extant timetree, and you have no a priori information on how \(\lambda\) or \(\mu\) might have looked like, you should consider using the more appropriate routines fit_hbd_pdr_on_grid and fit_hbd_psr_on_grid instead.

If age0>0, the input tree is essentially trimmed at age0 (omitting anything younger than age0), and the various variables are fitted to this new (shorter) tree, with time shifted appropriately. For example, the fitted rho0 is thus the sampling fraction at age0, i.e. the fraction of lineages extant at age0 that are represented in the timetree.

It is generally advised to provide as much information to the function fit_hbd_model_on_grid as possible, including reasonable lower and upper bounds (min_lambda, max_lambda, min_mu, max_mu, min_rho0 and max_rho0) and a reasonable parameter guess (guess_lambda, guess_mu and guess_rho0). It is also important that the age_grid is sufficiently fine to capture the expected major variations of \(\lambda\) and \(\mu\) over time, but keep in mind the serious risk of overfitting when age_grid is too fine and/or the tree is too small.

References

T. Stadler (2009). On incomplete sampling under birth-death models and connections to the sampling-based coalescent. Journal of Theoretical Biology. 261:58-66.

T. Stadler (2013). How can we improve accuracy of macroevolutionary rate estimates? Systematic Biology. 62:321-329.

H. Morlon, T. L. Parsons, J. B. Plotkin (2011). Reconciling molecular phylogenies with the fossil record. Proceedings of the National Academy of Sciences. 108:16327-16332.

S. Louca et al. (2018). Bacterial diversification through geological time. Nature Ecology & Evolution. 2:1458-1467.

Examples

Run this code

# NOT RUN {
# Generate a random tree with exponentially varying lambda & mu
Ntips     = 10000
rho       = 0.5 # sampling fraction
time_grid = seq(from=0, to=100, by=0.01)
lambdas   = 2*exp(0.1*time_grid)
mus       = 1.5*exp(0.09*time_grid)
sim       = generate_random_tree( parameters  = list(rarefaction=rho), 
                                  max_tips    = Ntips/rho,
                                  coalescent  = TRUE,
                                  added_rates_times     = time_grid,
                                  added_birth_rates_pc  = lambdas,
                                  added_death_rates_pc  = mus)
tree = sim$tree
root_age = castor::get_tree_span(tree)$max_distance
cat(sprintf("Tree has %d tips, spans %g Myr\n",length(tree$tip.label),root_age))


# Fit mu on grid
# Assume that lambda & rho are known
Ngrid     = 5
age_grid  = seq(from=0,to=root_age,length.out=Ngrid)
fit = fit_hbd_model_on_grid(tree, 
          age_grid    = age_grid,
          max_mu      = 100,
          fixed_lambda= approx(x=time_grid,y=lambdas,xout=sim$final_time-age_grid)$y,
          fixed_rho0   = rho,
          condition   = "crown",
          Ntrials     = 10,	# perform 10 fitting trials
          Nthreads    = 2,	# use two CPUs
          max_model_runtime = 1) 	# limit model evaluation to 1 second
if(!fit$success){
  cat(sprintf("ERROR: Fitting failed: %s\n",fit$error))
}else{
  cat(sprintf("Fitting succeeded:\nLoglikelihood=%g\n",fit$loglikelihood))
  
  # plot fitted & true mu
  plot( x     = fit$age_grid,
        y     = fit$fitted_mu,
        main  = 'Fitted & true mu',
        xlab  = 'age',
        ylab  = 'mu',
        type  = 'b',
        col   = 'red',
        xlim  = c(root_age,0))
  lines(x     = sim$final_time-time_grid,
        y     = mus,
        type  = 'l',
        col   = 'blue');
        
  # get fitted mu as a function of age
  mu_fun = approxfun(x=fit$age_grid, y=fit$fitted_mu)
}
# }

Run the code above in your browser using DataLab