ou_haltlost: Handling missing data and lost traits in Ornstein-Uhlenbeck processes

Description

ou_haltlost and ou_zaplost handles lost traits and missing data. Each of them wraps the function oupar and returns a new function that accepts the same arguments and output the same form of result, but takes into account lost traits and missing data. dou_haltlost and dou_zaplost wraps the Jacobian function oujac, and hou_haltlost and hou_zaplost wraps the Hessian function ouhess.

Usage

ou_haltlost(parfn)
dou_haltlost(jacfn)
hou_haltlost(hessfn)
ou_zaplost(parfn)
dou_zaplost(jacfn)
hou_zaplost(hessfn)

Value

ou_haltlost and ou_zaplost returns a wrapped versions of `parfn`, which accepts the same arguments and outputs in the same format. dou_haltlost and dou_zaplost, analogously, wraps jacfn.

hou_zaplost and hou_zaplost wraps hessfn.

Arguments

parfn: A function that maps from the user-parametrisation to the underlying Gaussian parameters. Each of them returns a vector of concatenated $(\Phi, w, V')$, where $V'$ is the lower triangular part of $V$, and accepts four arguments: a vector of parameters whose length is specified by the pardims argument to the glinv_gauss function, the branch length leading to the currently processing node, a vector of factors with three levels indicating which dimensions are missing or lost in the mother of the current node, and a vector of factors with the same three levels indicating missingness of the current node.
jacfn: A function that accepts the same arguments as parfn and returns the Jacobian of parfn.
hessfn: A function that accepts the same arguments as parfns and returns a list of three 3D arrays, named Phi, w, V respectively inside the list. ((hessfn)(...))$Phi[m,i,j] contains the cross second-order partial derivative of $\Phi_m$ (here we treat the matrix $\Phi$ as a column-major-flattened vector) with respect to the $i$-th and$j$-th parameters in the joint $(H,\theta,\Sigma_x)$ vector, and ((hessfn)(...))$w[m,i,j] and ((hessfn)(...))$V[m,i,j] analogously contains second-order derivative of $w_m$ and $V'_m$.

Details

What is missing traits and lost traits

A `missing' trait refers to a trait value whose data is missing due to data collection problems. Fundamentally, they evolves in the same manner as other traits. An NA entry in the data is deemed `missing'. On the other hand, a lost trait is a trait dimension which had ceased to exists during the evolutionary process. An NaN entry in the data indicates a `lost' trait.

Each nodes has their own missing-ness tags

Each trait dimension of each nodes, either internal or tip, are tagged with one of the three labels: MISSING, LOST, and OK. If the data contains an NA in the $p$-th dimension of the $i$-th tip then $X_pi$ is tagged MISSING. No other tags of any other nodes and dimensions are changed in the case of missing-ness. On the other hands, the $p$-th dimension of any node $j$, regardless of whether or not it is an internal node or a tips, is tagged LOST if and only if the $p$-th dimension of all tips inside the clade started at $j$ are NaN. Any entry that is neither tagged LOST nor MISSING are tagged OK.

This corresponds to the biological intuition that, if a value is missing only due to data collection problems, the missingness should not influence the random walk process way up the phylogenetic tree; and this is obviously not true if the trait had ceased to exists instead.

Handling of missing data and lost traits

ou_haltlost and ou_zaplost handles missing data in the same way: they simply marginalises the unobserved dimensions in the joint Gaussian distributions of tip data.

For lost traits, ou_haltlost assumes the followings:

In the entire branch leading to the earliest node $j$ whose $p$-th dimension is tagged LOST, the lost trait dimension does not evolve at all.
In the entire same branch, the magnitude of the $p$-th dimension at $j$'s mother node has no influence on other dimensions, in any instantaneous moments during the evolution in the branch, neither through the linear combination with the drift matrix nor the Wiener process covariance; in other words, the SDE governing the non-lost dimensions' random walk is invariant of $j$'s mother nodes' $p$-th dimension.

Therefore, ou_haltlost first set the $p$-th row and column of both of $H_j$ and the $p$-th row of $Sigma_x$ to zero and marginalise out the degenerate Gaussian dimension.

On the other hands, ou_zaplost does not assume the lost trait to stop evolving immediately at moment when the branch leading to $j$ starts, but, instead, simply marginalise out the lost, non-degenerate Gaussian dimensions. This method is the same as the one that is used in the PCMBase package.

Usage in combination with parameter restrictions

Without paramter restriction, the following is an example usage in a call to the glinv function. It constructs a glinv model object which is capable of handling missing data and lost traits.


        mod.full = glinv(tree, x0, my_data,
                         parfns  = haltlost(oupar),
                         pardims = nparams_ou(k),
                         parjacs = dhaltlost(oujac),
                         parhess = hhaltlost(ouhess))

Note that we have the same naming convention that functions wrappers whose nams have prefix d wraps the Jacobians, while prefix d wraps the Hessians.

If parameter restriction is needed, then *ou_*lost should called before any reparameterisation/restriction functions because it expects the passed-in function parfn to accept the full $H$ matrix, rather than only the diagonal or lower-triangular part of it. Example:


        f = haltlost(oupar)
        g = dhaltlost(oujac)
        h = hhaltlost(oujac)
        mod.full = glinv(tree, x0, my_data,
                         parfns  = ou_spdH(f),
                         pardims = nparams_ou_spdH(k),
                         parjacs = dou_spdH(g),
                         parhess = ou_spdH(h,g))