fit_tree_model: Fit a cladogenic model to an existing tree.

Description

Fit the parameters of a tree generation model to an existing phylogenetic tree; branch lengths are assumed to be in time units. The fitted model is a stochastic cladogenic process in which speciations (births) and extinctions (deaths) are Poisson processes, as simulated by the function generate_random_tree. The birth and death rates of tips can each be constant or power-law functions of the number of extant tips. For example, $$ B = I + F\cdot N^E, $$ where $B$ is the birth rate, $I$ is the intercept, $F$ is the power-law factor, $N$ is the current number of extant tips and $E$ is the power-law exponent. Each of the parameters I, F, E can be fixed or fitted.

Fitting can be performed via maximum-likelihood estimation, based on the waiting times between subsequent speciation and/or extinction events represented in the tree. Alternatively, fitting can be performed using least-squares estimation, based on the number of lineages represented in the tree over time ("diversity-vs-time" curve, a.k.a. "lineages-through-time"" curve). Note that the birth and death rates are NOT per-capita rates, they are absolute rates of species appearance and disappearance per time.

Usage

fit_tree_model( tree, 
                parameters          = list(),
                first_guess         = list(),
                min_age             = 0,
                max_age             = 0,
                age_centile         = NULL,
                Ntrials             = 1,
                Nthreads            = 1,
                coalescent          = FALSE,
                discovery_fraction  = NULL,
                fit_control         = list(),
                min_R2              = -Inf,
                min_wR2             = -Inf,
                grid_size           = 100,
                max_model_runtime   = NULL,
                objective           = 'LL')

Arguments

tree

A phylogenetic tree, in which branch lengths are assumed to be in time units. The tree may be a coalescent tree (i.e. only include extant clades) or a tree including extinct clades; the tree type influences what type of models can be fitted with each method.

parameters

A named list specifying fixed and/or unknown birth-death model parameters, with one or more of the following elements:

birth_rate_intercept: Non-negative number. The intercept of the Poissonian rate at which new species (tips) are added. In units 1/time.
birth_rate_factor: Non-negative number. The power-law factor of the Poissonian rate at which new species (tips) are added. In units 1/time.
birth_rate_exponent: Numeric. The power-law exponent of the Poissonian rate at which new species (tips) are added. Unitless.
death_rate_intercept: Non-negative number. The intercept of the Poissonian rate at which extant species (tips) go extinct. In units 1/time.
death_rate_factor: Non-negative number. The power-law factor of the Poissonian rate at which extant species (tips) go extinct. In units 1/time.
death_rate_exponent: Numeric. The power-law exponent of the Poissonian rate at which extant species (tips) go extinct. Unitless.
resolution: Numeric. Resolution at which the tree was collapsed (i.e. every node of age smaller than this resolution replaced by a single tip). In units time. A resolution of 0 means the tree was not collapsed.
rarefaction: Numeric. Species sampling fraction, i.e. fraction of extant species represented (as tips) in the tree. A rarefaction of 1, for example, implies that the tree is complete, i.e. includes all extant species. Rarefaction is assumed to have occurred after collapsing.
extant_diversity: The current total extant diversity, regardless of the rarefaction and resolution of the tree at hand. For example, if resolution==0 and rarefaction==0.5 and the tree has 1000 tips, then extant_diversity should be 2000. If resolution is fixed at 0 and rarefaction is also fixed, this can be left NULL and will be inferred automatically by the function.

Each of the above elements can also be NULL, in which case the parameter is fitted. Elements can also be vectors of size 2 (specifying constraint intervals), in which case the parameters are fitted and constrained within the intervals specified. For example, to fit death_rate_factor while constraining it to the interval [1,2], set its value to c(1,2).

first_guess

A named list (with entries named as in parameters) specifying starting values for any of the fitted model parameters. Note that if Ntrials>1, then start values may be randomly modified in all but the first trial. For any parameters missing from first_guess, initial values are always randomly chosen. first_guess can also be NULL.

min_age

Numeric. Minimum distance from the tree crown, for a node/tip to be considered in the fitting. If <=0 or NULL, this constraint is ignored. Use this option to omit most recent nodes.

max_age

Numeric. Maximum distance from the tree crown, for a node/tip to be considered in the fitting. If <=0 or NULL, this constraint is ignored. Use this option to omit old nodes, e.g. with highly uncertain placements.

age_centile

Numeric within 0 and 1. Fraction of youngest nodes/tips to consider for the fitting. This can be used as an alternative to max_age. E.g. if set to 0.6, then the 60% youngest nodes/tips are considered. Either age_centile or max_age must be non-NULL, but not both.

Ntrials

Integer. Number of fitting attempts to perform, each time using randomly varied start values for fitted parameters. The returned fitted parameter values will be taken from the trial with greatest achieved fit objective. A larger number of trials will decrease the chance of hitting a local non-global optimum during fitting.

Nthreads

Number of threads to use for parallel execution of multiple fitting trials. On Windows, this option has no effect because Windows does not support forks.

coalescent

Logical, specifying whether the input tree is a coalescent tree (and thus the coalescent version of the model should be fitted). Only available if objective=='R2'.

discovery_fraction

Function handle, mapping age to the fraction of discovered lineages in a tree. That is, discovery_fraction(tau) is the probability that a lineage at age tau, that has an extant descendant today, will be represented (discovered) in the coalescent tree. In particular, discovery_fraction(0) equals the fraction of extant lineages represented in the tree. If this is provided, then parameters$rarefaction is fixed to 1, and discovery_fraction is applied after simulation. Only relevant if coalescent==TRUE. Experimental, so leave this NULL if you don't know what it means.

fit_control

Named list containing options for the stats::nlminb optimization routine, such as eval.max (max number of evaluations), iter.max (max number of iterations) and rel.tol (relative tolerance for convergence).

min_R2

Minimum coefficient of determination of the diversity curve (clade counts vs time) of the model when compared to the input tree, for a fitted model to be accepted. For example, if set to 0.5 then only fit trials achieving an R2 of at least 0.5 will be considered. Set this to -Inf to not filter fitted models based on the R2.

min_wR2

Similar to min_R2, but applying to the weighted R2, where squared-error weights are proportional to the inverse squared diversities.

grid_size

Integer. Number of equidistant time points to consider when calculating the R2 of a model's diversity-vs-time curve.

max_model_runtime

Numeric. Maximum runtime (in seconds) allowed for each model evaluation during fitting. Use this to escape from badly parameterized models during fitting (this will likely cause the affected fitting trial to fail). If NULL or <=0, this option is ignored.

objective

Character. Objective function to optimize during fitting. Can be either "LL" (log-likelihood of waiting times between speciation events and between extinction events), "R2" (coefficient of determination of diversity-vs-time curve), "wR2" (weighted R2, where weights of squared errors are proportional to the inverse diversities observed in the tree) or "lR2" (logarithmic R2, i.e. R2 calculated for the logarithm of the diversity-vs-time curve). Note that "wR2" will weight errors at lower diversities more strongly than "R2".

Value

A named list with the following elements:

success

Logical, indicating whether the fitting was successful.

objective_value

Numeric. The achieved maximum value of the objective function (log-likelihood, R2 or weighted R2).

parameters

A named list listing all model parameters (fixed and fitted).

start_parameters

A named list listing the start values of all model parameters. In the case of multiple fitting trials, this will list the initial (non-randomized) guess.

Numeric. The achieved coefficient of determination of the fitted model, based on the diversity-vs-time curve.

wR2

Numeric. The achieved weighted coefficient of determination of the fitted model, based on the diversity-vs-time curve. Weights of squared errors are proportional to the inverse squared diversities observed in the tree.

lR2

Numeric. The achieved coefficient of determination of the fitted model on a log axis, i.e. based on the logarithm of the diversity-vs-time curve.

Nspeciations

Integer. Number of speciation events (=nodes) considered during fitting. This only includes speciations visible in the tree.

Nextinctions

Integer. Number of extinction events (=non-crown tips) considered during fitting. This only includes extinctions visible in the tree, i.e. tips whose distance from the root is lower than the maximum.

grid_times

Numeric vector. Time points considered for the diversity-vs-time curve. Times will be constrained between min_age and max_age if these were specified.

tree_diversities

Number of lineages represented in the tree through time, calculated for each of grid_times.

model_diversities

Number of lineages through time as predicted by the model (in the deterministic limit), calculated for each of grid_times. If coalescent==TRUE then these are the number of lineages expected to be represented in the coalescent tree (this may be lower than the actual number of extant clades at any given time point, if the model includes extinctions).

fitted_parameter_names

Character vector, listing the names of fitted (i.e. non-fixed) parameters.

locally_fitted_parameters

Named list of numeric vectors, listing the fitted values for each parameter and for each fitting trial. For example, if birth_rate_factor was fitted, then locally_fitted_parameters$birth_rate_factor will be a numeric vector of size Ntrials (or less, if some trials failed or omitted), listing the locally-optimized values of the parameter for each considered fitting trial. Mainly useful for diagnostic purposes.

objective

Character. The name of the objective function used for fitting ("LL", "R2" or "wR2").

Ntips

The number of tips in the input tree.

Nnodes

The number of nodes in the input tree.

min_age

The minimum age of nodes/tips considered during fitting.

max_age

The maximum age of nodes/tips considered during fitting.

age_centile

Numeric or NULL, equal to the age_centile specified as input to the function.

Examples

Run this code

# NOT RUN {
# Generate a tree using a simple speciation model
parameters = list(birth_rate_intercept  = 1, 
                  birth_rate_factor     = 0,
                  birth_rate_exponent   = 0,
                  death_rate_intercept  = 0,
                  death_rate_factor     = 0,
                  death_rate_exponent   = 0,
                  resolution            = 0,
                  rarefaction           = 1)
tree = generate_random_tree(parameters, max_tips=100)

# Fit model to the tree
fitting_parameters = parameters
fitting_parameters$birth_rate_intercept = NULL # fit only this parameter
fitting = fit_tree_model(tree,fitting_parameters)

# compare fitted to true value
T = parameters$birth_rate_intercept
F = fitting$parameters$birth_rate_intercept
cat(sprintf("birth_rate_intercept: true=%g, fitted=%g\n",T,F))
# }

Run the code above in your browser using DataLab

Description

Usage

Arguments

Value

See Also

Examples