Fit the parameters of a tree generation model to an existing phylogenetic tree; branch lengths are assumed to be in time units. The fitted model is a stochastic cladogenic process in which speciations (births) and extinctions (deaths) are Poisson processes, as simulated by the function generate_random_tree
. The birth and death rates of tips can each be constant or power-law functions of the number of extant tips. For example,
$$
B = I + F\cdot N^E,
$$
where \(B\) is the birth rate, \(I\) is the intercept, \(F\) is the power-law factor, \(N\) is the current number of extant tips and \(E\) is the power-law exponent. Each of the parameters I, F, E can be fixed or fitted.
Fitting can be performed via maximum-likelihood estimation, based on the waiting times between subsequent speciation and/or extinction events represented in the tree. Alternatively, fitting can be performed using least-squares estimation, based on the number of lineages represented in the tree over time ("diversity-vs-time" curve, a.k.a. "lineages-through-time"" curve). Note that the birth and death rates are NOT per-capita rates, they are absolute rates of species appearance and disappearance per time.
fit_tree_model( tree,
parameters = list(),
first_guess = list(),
min_age = 0,
max_age = 0,
age_centile = NULL,
Ntrials = 1,
Nthreads = 1,
coalescent = FALSE,
discovery_fraction = NULL,
fit_control = list(),
min_R2 = -Inf,
min_wR2 = -Inf,
grid_size = 100,
max_model_runtime = NULL,
objective = 'LL')
A phylogenetic tree, in which branch lengths are assumed to be in time units. The tree may be a coalescent tree (i.e. only include extant clades) or a tree including extinct clades; the tree type influences what type of models can be fitted with each method.
A named list specifying fixed and/or unknown birth-death model parameters, with one or more of the following elements:
birth_rate_intercept
: Non-negative number. The intercept of the Poissonian rate at which new species (tips) are added. In units 1/time.
birth_rate_factor
:
Non-negative number. The power-law factor of the Poissonian rate at which new species (tips) are added. In units 1/time.
birth_rate_exponent
:
Numeric. The power-law exponent of the Poissonian rate at which new species (tips) are added. Unitless.
death_rate_intercept
:
Non-negative number. The intercept of the Poissonian rate at which extant species (tips) go extinct. In units 1/time.
death_rate_factor
:
Non-negative number. The power-law factor of the Poissonian rate at which extant species (tips) go extinct. In units 1/time.
death_rate_exponent
:
Numeric. The power-law exponent of the Poissonian rate at which extant species (tips) go extinct. Unitless.
resolution
:
Numeric. Resolution at which the tree was collapsed (i.e. every node of age smaller than this resolution replaced by a single tip). In units time. A resolution of 0 means the tree was not collapsed.
rarefaction
:
Numeric. Species sampling fraction, i.e. fraction of extant species represented (as tips) in the tree. A rarefaction of 1, for example, implies that the tree is complete, i.e. includes all extant species. Rarefaction is assumed to have occurred after collapsing.
extant_diversity
: The current total extant diversity, regardless of the rarefaction and resolution of the tree at hand. For example, if resolution==0
and rarefaction==0.5
and the tree has 1000 tips, then extant_diversity
should be 2000
. If resolution
is fixed at 0 and rarefaction
is also fixed, this can be left NULL
and will be inferred automatically by the function.
Each of the above elements can also be NULL
, in which case the parameter is fitted. Elements can also be vectors of size 2 (specifying constraint intervals), in which case the parameters are fitted and constrained within the intervals specified. For example, to fit death_rate_factor
while constraining it to the interval [1,2], set its value to c(1,2)
.
A named list (with entries named as in parameters
) specifying starting values for any of the fitted model parameters. Note that if Ntrials>1
, then start values may be randomly modified in all but the first trial. For any parameters missing from first_guess
, initial values are always randomly chosen. first_guess
can also be NULL
.
Numeric. Minimum distance from the tree crown, for a node/tip to be considered in the fitting. If <=0 or NULL
, this constraint is ignored. Use this option to omit most recent nodes.
Numeric. Maximum distance from the tree crown, for a node/tip to be considered in the fitting. If <=0 or NULL
, this constraint is ignored. Use this option to omit old nodes, e.g. with highly uncertain placements.
Numeric within 0 and 1. Fraction of youngest nodes/tips to consider for the fitting. This can be used as an alternative to max_age. E.g. if set to 0.6, then the 60% youngest nodes/tips are considered. Either age_centile
or max_age
must be non-NULL, but not both.
Integer. Number of fitting attempts to perform, each time using randomly varied start values for fitted parameters. The returned fitted parameter values will be taken from the trial with greatest achieved fit objective. A larger number of trials will decrease the chance of hitting a local non-global optimum during fitting.
Number of threads to use for parallel execution of multiple fitting trials. On Windows, this option has no effect because Windows does not support forks.
Logical, specifying whether the input tree is a coalescent tree (and thus the coalescent version of the model should be fitted). Only available if objective=='R2'
.
Function handle, mapping age to the fraction of discovered lineages in a tree. That is, discovery_fraction(tau)
is the probability that a lineage at age tau
, that has an extant descendant today, will be represented (discovered) in the coalescent tree. In particular, discovery_fraction(0)
equals the fraction of extant lineages represented in the tree. If this is provided, then parameters$rarefaction
is fixed to 1, and discovery_fraction
is applied after simulation. Only relevant if coalescent==TRUE
. Experimental, so leave this NULL
if you don't know what it means.
Named list containing options for the stats::nlminb
optimization routine, such as eval.max
(max number of evaluations), iter.max
(max number of iterations) and rel.tol
(relative tolerance for convergence).
Minimum coefficient of determination of the diversity curve (clade counts vs time) of the model when compared to the input tree, for a fitted model to be accepted. For example, if set to 0.5 then only fit trials achieving an R2 of at least 0.5 will be considered. Set this to -Inf
to not filter fitted models based on the R2.
Similar to min_R2
, but applying to the weighted R2, where squared-error weights are proportional to the inverse squared diversities.
Integer. Number of equidistant time points to consider when calculating the R2 of a model's diversity-vs-time curve.
Numeric. Maximum runtime (in seconds) allowed for each model evaluation during fitting. Use this to escape from badly parameterized models during fitting (this will likely cause the affected fitting trial to fail). If NULL
or <=0, this option is ignored.
Character. Objective function to optimize during fitting. Can be either "LL" (log-likelihood of waiting times between speciation events and between extinction events), "R2" (coefficient of determination of diversity-vs-time curve), "wR2" (weighted R2, where weights of squared errors are proportional to the inverse diversities observed in the tree) or "lR2" (logarithmic R2, i.e. R2 calculated for the logarithm of the diversity-vs-time curve). Note that "wR2" will weight errors at lower diversities more strongly than "R2".
A named list with the following elements:
Logical, indicating whether the fitting was successful.
Numeric. The achieved maximum value of the objective function (log-likelihood, R2 or weighted R2).
A named list listing all model parameters (fixed and fitted).
A named list listing the start values of all model parameters. In the case of multiple fitting trials, this will list the initial (non-randomized) guess.
Numeric. The achieved coefficient of determination of the fitted model, based on the diversity-vs-time curve.
Numeric. The achieved weighted coefficient of determination of the fitted model, based on the diversity-vs-time curve. Weights of squared errors are proportional to the inverse squared diversities observed in the tree.
Numeric. The achieved coefficient of determination of the fitted model on a log axis, i.e. based on the logarithm of the diversity-vs-time curve.
Integer. Number of speciation events (=nodes) considered during fitting. This only includes speciations visible in the tree.
Integer. Number of extinction events (=non-crown tips) considered during fitting. This only includes extinctions visible in the tree, i.e. tips whose distance from the root is lower than the maximum.
Numeric vector. Time points considered for the diversity-vs-time curve. Times will be constrained between min_age
and max_age
if these were specified.
Number of lineages represented in the tree through time, calculated for each of grid_times
.
Number of lineages through time as predicted by the model (in the deterministic limit), calculated for each of grid_times
. If coalescent==TRUE
then these are the number of lineages expected to be represented in the coalescent tree (this may be lower than the actual number of extant clades at any given time point, if the model includes extinctions).
Character vector, listing the names of fitted (i.e. non-fixed) parameters.
Named list of numeric vectors, listing the fitted values for each parameter and for each fitting trial. For example, if birth_rate_factor
was fitted, then locally_fitted_parameters$birth_rate_factor
will be a numeric vector of size Ntrials
(or less, if some trials failed or omitted), listing the locally-optimized values of the parameter for each considered fitting trial. Mainly useful for diagnostic purposes.
Character. The name of the objective function used for fitting ("LL", "R2" or "wR2").
The number of tips in the input tree.
The number of nodes in the input tree.
The minimum age of nodes/tips considered during fitting.
The maximum age of nodes/tips considered during fitting.
Numeric or NULL
, equal to the age_centile
specified as input to the function.
generate_random_tree
,
simulate_diversification_model
reconstruct_past_diversification
# NOT RUN {
# Generate a tree using a simple speciation model
parameters = list(birth_rate_intercept = 1,
birth_rate_factor = 0,
birth_rate_exponent = 0,
death_rate_intercept = 0,
death_rate_factor = 0,
death_rate_exponent = 0,
resolution = 0,
rarefaction = 1)
tree = generate_random_tree(parameters, max_tips=100)
# Fit model to the tree
fitting_parameters = parameters
fitting_parameters$birth_rate_intercept = NULL # fit only this parameter
fitting = fit_tree_model(tree,fitting_parameters)
# compare fitted to true value
T = parameters$birth_rate_intercept
F = fitting$parameters$birth_rate_intercept
cat(sprintf("birth_rate_intercept: true=%g, fitted=%g\n",T,F))
# }
Run the code above in your browser using DataLab