Given a rooted phylogenetic tree and geographic coordinates (latitudes & longitudes) for its tips, this function estimates the diffusivity of a Spherical Brownian Motion (SBM) model for the evolution of geographic location along lineages (Perrin 1928; Brillinger 2012), assuming that the diffusivity varies linearly over time. Estimation is done via maximum-likelihood and using independent contrasts between sister lineages. This function is designed to estimate the diffusivity over time, by fitting two parameters defining the diffusivity as a linear function of time. For fitting more general functional forms see fit_sbm_parametric
.
fit_sbm_linear(tree,
tip_latitudes,
tip_longitudes,
radius,
planar_approximation = FALSE,
only_basal_tip_pairs = FALSE,
only_distant_tip_pairs= FALSE,
min_MRCA_time = 0,
max_MRCA_age = Inf,
time1 = 0,
time2 = NULL,
Ntrials = 1,
Nthreads = 1,
Nbootstraps = 0,
Ntrials_per_bootstrap = NULL,
Nsignificance = 0,
NQQ = 0,
fit_control = list(),
verbose = FALSE,
verbose_prefix = "")
A rooted tree of class "phylo". The root is assumed to be the unique node with no incoming edge. Edge lengths are assumed to represent time intervals or a similarly interpretable phylogenetic distance.
Numeric vector of length Ntips, listing latitudes of tips in decimal degrees (from -90 to 90). The order of entries must correspond to the order of tips in the tree (i.e., as listed in tree$tip.label
).
Numeric vector of length Ntips, listing longitudes of tips in decimal degrees (from -180 to 180). The order of entries must correspond to the order of tips in the tree (i.e., as listed in tree$tip.label
).
Strictly positive numeric, specifying the radius of the sphere. For Earth, the mean radius is 6371 km.
Logical, specifying whether to estimate the diffusivity based on a planar approximation of the SBM model, i.e. by assuming that geographic distances between tips are as if tips are distributed on a 2D cartesian plane. This approximation is only accurate if geographical distances between tips are small compared to the sphere's radius.
Logical, specifying whether to only compare immediate sister tips, i.e., tips connected through a single parental node.
Logical, specifying whether to only compare tips at distinct geographic locations.
Numeric, specifying the minimum allowed time (distance from root) of the most recent common ancestor (MRCA) of sister tips considered in the fitting. In other words, an independent contrast is only considered if the two sister tips' MRCA has at least this distance from the root. Set min_MRCA_time=0
to disable this filter.
Numeric, specifying the maximum allowed age (distance from youngest tip) of the MRCA of sister tips considered in the fitting. In other words, an independent contrast is only considered if the two sister tips' MRCA has at most this age (time to present). Set max_MRCA_age=Inf
to disable this filter.
Optional numeric, specifying the first time point at which to estimate the diffusivity. By default this is set to root (i.e., time 0).
Optional numeric, specifying the first time point at which to estimate the diffusivity. By default this is set to the present day (i.e., the maximum distance of any tip from the root).
Integer, specifying the number of independent fitting trials to perform, each starting from a random choice of model parameters. Increasing Ntrials
reduces the risk of reaching a non-global local maximum in the fitting objective.
Integer, specifying the number of parallel threads to use for performing multiple fitting trials simultaneously. This should generally not exceed the number of available CPUs on your machine. Parallel computing is not available on the Windows platform.
Integer, specifying the number of parametric bootstraps to perform for estimating standard errors and confidence intervals of estimated model parameters. Set to 0 for no bootstrapping.
Integer, specifying the number of fitting trials to perform for each bootstrap sampling. If NULL
, this is set equal to max(1,Ntrials)
. Decreasing Ntrials_per_bootstrap
will reduce computation time, at the expense of potentially inflating the estimated confidence intervals; in some cases (e.g., for very large trees) this may be useful if fitting takes a long time and confidence intervals are very narrow anyway. Only relevant if Nbootstraps>0
.
Integer, specifying the number of simulations to perform under a const-diffusivity model for assessing the statistical significance of the fitted slope. Set to 0 to not calculate the significance of the slope.
Integer, optional number of simulations to perform for creating QQ plots of the theoretically expected distribution of geodistances vs. the empirical distribution of geodistances (across independent contrasts). The resolution of the returned QQ plot will be equal to the number of independent contrasts used for fitting. If <=0, no QQ plots will be calculated.
Named list containing options for the nlminb
optimization routine, such as iter.max
, eval.max
or rel.tol
. For a complete list of options and default values see the documentation of nlminb
in the stats
package.
Logical, specifying whether to print progress reports and warnings to the screen. Note that errors always cause a return of the function (see return values success
and error
).
Character, specifying the line prefix for printing progress reports to the screen.
A list with the following elements:
Logical, indicating whether the fitting was successful. If FALSE
, then an additional return variable, error
, will contain a description of the error; in that case all other return variables may be undefined.
The maximized fitting objective. Currently, only maximum-likelihood estimation is implemented, and hence this will always be the maximized log-likelihood.
The name of the objective that was maximized during fitting. Currently, only maximum-likelihood estimation is implemented, and hence this will always be ``loglikelihood''.
Numeric vector of size 2, listing the two time points at which the diffusivity was estimated (time1
and time2
).
Numeric vector of size 2, listing the fitted diffusivity at time1
and time2
. The fitted model assumes that the diffusivity varied linearly between those two time points.
The log-likelihood of the fitted linear model for the given data.
Integer, number of fitted (i.e., non-fixed) model parameters. Will always be 2.
Integer, number of independent contrasts used for fitting.
The Akaike Information Criterion for the fitted model, defined as \(2k-2\log(L)\), where \(k\) is the number of fitted parameters and \(L\) is the maximized likelihood.
The Bayesian information criterion for the fitted model, defined as \(\log(n)k-2\log(L)\), where \(k\) is the number of fitted parameters, \(n\) is the number of data points (number of independent contrasts), and \(L\) is the maximized likelihood.
Logical, specifying whether the maximum likelihood was reached after convergence of the optimization algorithm. Note that in some cases the maximum likelihood may have been achieved by an optimization path that did not yet converge (in which case it's advisable to increase iter.max
and/or eval.max
).
Integer, specifying the number of iterations performed during the optimization path that yielded the maximum likelihood.
Integer, specifying the number of likelihood evaluations performed during the optimization path that yielded the maximum likelihood.
Numeric vector of size Ntrials
, listing the initial objective values (e.g., loglikelihoods) for each fitting trial, i.e. at the start parameter values.
Numeric vector of size Ntrials
, listing the final maximized objective values (e.g., loglikelihoods) for each fitting trial.
Integer vector of size Ntrials
, listing the number of start attempts for each fitting trial, until a starting point with valid likelihood was found.
Integer vector of size Ntrials
, listing the number of iterations needed for each fitting trial.
Integer vector of size Ntrials
, listing the number of likelihood evaluations needed for each fitting trial.
Numeric vector of size NP, estimated standard error of the parameters, based on parametric bootstrapping. Only returned if Nbootstraps>0
.
Numeric vector of size 2, lower bound of the 50% confidence interval (25-75% percentile) for the fitted diffusivity at the root and present, based on parametric bootstrapping. Only returned if Nbootstraps>0
.
Numeric vector of size 2, upper bound of the 50% confidence interval for the fitted diffusivity at the root and present, based on parametric bootstrapping. Only returned if Nbootstraps>0
.
Numeric vector of size 2, lower bound of the 95% confidence interval (2.5-97.5% percentile) for the fitted diffusivity at the root and present, based on parametric bootstrapping. Only returned if Nbootstraps>0
.
Numeric vector of size 2, upper bound of the 95% confidence interval for the fitted diffusivity at the root and present, based on parametric bootstrapping. Only returned if Nbootstraps>0
.
Numeric between 0 and 1, estimated consistency of the data with the fitted model. See the documentation of fit_sbm_const
for an explanation. Only returned if Nbootstraps>0
.
Numeric between 0 and 1, estimate statistical significance of the fitted linear slope. Only returned if Nsignificance>0
.
Numeric matrix of size Ncontrasts x 2, listing the computed QQ-plot. The first column lists quantiles of geodistances in the original dataset, the 2nd column lists quantiles of hypothetical geodistances simulated based on the fitted model.
This function is essentially a wrapper for the more general function fit_sbm_parametric
, with the addition that it can estimate the statistical significance of the fitted linear slope.
The statistical significance of the slope is the probability that a constant-diffusivity SBM model would generate data that would yield a fitted linear slope equal to or greater than the one fitted to the original data; the significance is estimated by simulating Nsignificance
constant-diffusivity models and then fitting a linear-diffusivity model. The constant diffusivity assumed in these simulations is the maximum-likelihood diffusivity fitted internally using fit_sbm_const
.
Note that estimation of diffusivity at older times is only possible if the timetree includes extinct tips or tips sampled at older times (e.g., as is often the case in viral phylogenies). If tips are only sampled once at present-day, i.e. the timetree is ultrametric, reliable diffusivity estimates can only be achieved near present times.
For short expected transition distances this function uses the approximation formula by Ghosh et al. (2012) to calculate the probability density of geographical transitions along edges. For longer expected transition distances the function uses a truncated approximation of the series representation of SBM transition densities (Perrin 1928).
If edge.length
is missing from one of the input trees, each edge in the tree is assumed to have length 1. The tree may include multifurcations as well as monofurcations, however multifurcations are internally expanded into bifurcations by adding dummy nodes.
F. Perrin (1928). Etude mathematique du mouvement Brownien de rotation. 45:1-51.
D. R. Brillinger (2012). A particle migrating randomly on a sphere. in Selected Works of David Brillinger. Springer.
A. Ghosh, J. Samuel, S. Sinha (2012). A Gaussian for diffusion on the sphere. Europhysics Letters. 98:30003.
# NOT RUN {
# generate a random tree, keeping extinct lineages
tree_params = list(birth_rate_factor=1, death_rate_factor=0.95)
tree = generate_random_tree(tree_params,max_tips=1000,coalescent=FALSE)$tree
# calculate max distance of any tip from the root
max_time = get_tree_span(tree)$max_distance
# simulate time-dependent SBM on the tree
# we assume that diffusivity varies linearly with time
# in this example we measure distances in Earth radii
radius = 1
diffusivity_functor = function(times, params){
return(params[1] + (times/max_time)*(params[2]-params[1]))
}
true_params = c(1, 2)
time_grid = seq(0,max_time,length.out=2)
simulation = simulate_sbm(tree,
radius = radius,
diffusivity = diffusivity_functor(time_grid,true_params),
time_grid = time_grid)
# fit time-independent SBM to get a rough estimate
fit_const = fit_sbm_const(tree,simulation$tip_latitudes,simulation$tip_longitudes,radius=radius)
# fit SBM model with linearly varying diffusivity
fit = fit_sbm_linear(tree,
simulation$tip_latitudes,
simulation$tip_longitudes,
radius = radius,
Ntrials = 10)
# compare fitted & true params
print(true_params)
print(fit$diffusivities)
# }
Run the code above in your browser using DataLab