gfpca_twoStep: Generalized functional principal component analysis

Description

Function for applying FPCA to data from different exponential family distributions. It is used in the FPCA step for registering functional data and is called by register_fpca when fpca_type = "two-step".

The method implements the two-step approach of Gertheiss et al. (2017) and is based on the approach of Hall et al. (2008) to estimate functional principal components.

The number of functional principal components (FPCs) can either be specified directly (argument npc) or chosen based on the explained share of variance (argument npc_criterion). In the latter case, the overall variance in the data Y is approximated by the variance represented by the smoothed covariance surface estimated with cov_hall. Note that the eigenvalue decomposition of this covariance surface sometimes leads to a long tail of subordinate FPCs with small eigenvalues. Such subordinate dimensions often seem to represent phase rather than amplitude variation, and can be cut off by specifying the second element of argument npc_criterion.
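
The selection rule described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's internal code: the helper name choose_npc_sketch and the example eigenvalues are hypothetical, the eigenvalues are assumed to be nonnegative and sorted in decreasing order, and keeping at least one FPC is an assumption of this sketch.

```r
# Hypothetical helper (not part of registr): choose the number of FPCs from
# eigenvalues that are nonnegative and sorted in decreasing order.
choose_npc_sketch <- function(evalues, npc_criterion) {
  shares <- evalues / sum(evalues)
  # First element: smallest number of FPCs whose cumulative share of
  # variance reaches the targeted share.
  npc <- which(cumsum(shares) >= npc_criterion[1])[1]
  if (length(npc_criterion) == 2) {
    # Second element: keep only FPCs whose individual share reaches the
    # cut-off, even if the targeted share is then not reached.
    npc_cut <- sum(shares >= npc_criterion[2])
    # Assumption of this sketch: always keep at least one FPC.
    npc <- max(1, min(npc, npc_cut))
  }
  npc
}

# Made-up eigenvalues with a long tail of small values
evalues <- c(6, 2.5, 1, 0.3, 0.1, 0.06, 0.04)

npc_share_only  <- choose_npc_sketch(evalues, 0.9)             # 3 FPCs
npc_with_cutoff <- choose_npc_sketch(evalues, c(0.995, 0.02))  # 4 FPCs
```

In the second call, six FPCs would be needed to reach a share of 99.5%, but the fifth and sixth each explain less than 2% of the variation, so the cut-off stops the selection at four.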

This function is an adaptation of Jan Gertheiss's implementation for Gertheiss et al. (2017), with a focus on higher memory (RAM) efficiency in large data settings.

Usage

gfpca_twoStep(
  Y,
  family = "gaussian",
  npc = NULL,
  npc_criterion = NULL,
  Kt = 8,
  t_min = NULL,
  t_max = NULL,
  row_obj = NULL,
  index_significantDigits = 4L,
  estimation_accuracy = "high",
  start_params = NULL,
  periodic = FALSE,
  verbose = 1,
  ...
)

Value

An object of class fpca containing:

fpca_type: Information that FPCA was performed with the 'two-step' approach, in contrast to registr::fpca_gauss or registr::bfpca.
t_vec: Time vector over which the mean mu was evaluated. The resolution can be specified by setting index_significantDigits.
knots: Cutpoints for B-spline basis used to rebuild alpha.
efunctions: \(D \times npc\) matrix of estimated FPC basis functions.
evalues: Estimated variance of the FPC scores.
evalues_sum: Sum of all (nonnegative) eigenvalues of the smoothed covariance surface estimated with cov_hall. Can be used as an approximation for the total variance present in Y to compute the shares of explained variance of the FPC scores.
npc: Number of FPCs.
scores: \(I \times npc\) matrix of estimated FPC scores.
alpha: Estimated population-level mean.
mu: Estimated population-level mean. Same value as alpha but included for compatibility with refund.shiny package.
subject_coefs: Always NA but included for full consistency with fpca_gauss and bfpca.
Yhat: FPC approximation of subject-specific means, before applying the response function.
Y: The observed data.
family: binomial, for compatibility with refund.shiny package.
gamm4_theta: Estimated parameters of the mixed model.
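
As a sketch of how evalues and evalues_sum can be combined into shares of explained variance, consider the following base R snippet. The object fpca_mock is a hypothetical stand-in that only mimics the documented field names, and all numbers in it are made up. With a real fit, the same arithmetic would be applied to the returned fpca object.

```r
# Hypothetical stand-in for a returned fpca object (values are made up)
fpca_mock <- list(
  evalues     = c(6, 2.5),  # variances of the retained FPC scores
  evalues_sum = 10,         # approximate total variance in Y
  npc         = 2
)

# Approximate share of variance explained by each retained FPC
explained_share <- fpca_mock$evalues / fpca_mock$evalues_sum

# Cumulative share explained by the first 1, ..., npc FPCs
cumulative_share <- cumsum(explained_share)
```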

Arguments

Y: Data frame. Should have the columns id, value, index.
family: One of c("gaussian","binomial","gamma","poisson"). Poisson data are rounded before performing the GFPCA to ensure integer data, see Details section below. Defaults to "gaussian".
npc, npc_criterion: The number of functional principal components (FPCs) has to be specified either directly as npc or based on their explained share of variance. In the latter case, npc_criterion can either be set to (i) a share between 0 and 1, or (ii) a vector with two elements comprising the targeted explained share of variance and a cut-off scree plot criterion, both between 0 and 1. As an example for the latter, npc_criterion = c(0.9,0.02) tries to choose a number of FPCs that explains at least 90% of variation, but only includes FPCs that explain at least 2% of variation (even if this means 90% explained variation is not reached).
Kt: Number of B-spline basis functions used to estimate mean functions and functional principal components. Default is 8.
t_min: Minimum value to be evaluated on the time domain.
t_max: Maximum value to be evaluated on the time domain.
row_obj: If NULL, the function cleans the data and calculates row indices. Keep this NULL if you are using standalone register function.
index_significantDigits: Positive integer >= 2, stating the number of significant digits to which the index grid should be rounded. Coarsening the index grid is necessary since otherwise the covariance surface matrix explodes in size in the presence of too many unique index values (which is always the case after some registration step). Defaults to 4. Set to NULL to prevent rounding.
estimation_accuracy: One of c("high","low"). When set to "low", the mixed model estimation step in lme4 is performed with lower accuracy, reducing computation time. Defaults to "high".
start_params: Optional start values for gamm4. Not used if npc_criterion is specified.
periodic: Only included for full consistency with fpca_gauss and bfpca. If TRUE, returns the knots vector for periodic B-spline basis functions. Defaults to FALSE. This parameter does not change the results of the two-step GFPCA.
verbose: Can be set to integers between 0 and 4 to control the level of detail of the printed diagnostic messages. Higher numbers lead to more detailed messages. Defaults to 1.
...: Additional arguments passed to cov_hall.
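
To illustrate why coarsening the index grid (argument index_significantDigits) matters, the following base R sketch compares the number of unique index values before and after rounding. It uses signif() as a stand-in for the rounding step, which is an assumption: the package's internal rounding may differ in detail. The simulated index values are made up.

```r
set.seed(1)

# Made-up index values, e.g. as they might look after a registration step
index_raw <- runif(1000)

# Round to 2 significant digits (stand-in for index_significantDigits = 2)
index_coarse <- signif(index_raw, digits = 2)

n_raw    <- length(unique(index_raw))
n_coarse <- length(unique(index_coarse))

# A covariance surface on a grid of n unique values has n^2 entries,
# so fewer unique index values means a much smaller matrix
surface_entries <- c(raw = n_raw^2, coarse = n_coarse^2)
```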

Author

Alexander Bauer alexander.bauer@stat.uni-muenchen.de, based on the work of Jan Gertheiss

Details

For family = "poisson" the values in Y are rounded before performing the GFPCA to ensure integer data. This is done to ensure reasonable computation times. Computation times tend to explode when estimating the underlying high-dimensional mixed model with continuous Poisson data based on the gamm4 package.

If negative eigenvalues are present, the respective eigenfunctions are dropped and not considered further.
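
The handling of negative eigenvalues can be sketched with base R's eigen(). The 2 x 2 matrix below is a made-up, deliberately indefinite stand-in for a smoothed covariance surface, and the variable names are hypothetical. The snippet only illustrates the dropping step, not the package's full computation.

```r
# Made-up symmetric matrix with one negative eigenvalue (5 and -1)
cov_sketch <- matrix(c(2, 3,
                       3, 2), nrow = 2, byrow = TRUE)

eig <- eigen(cov_sketch, symmetric = TRUE)

# Keep only components with nonnegative eigenvalues
keep            <- eig$values >= 0
evalues_kept    <- eig$values[keep]
efunctions_kept <- eig$vectors[, keep, drop = FALSE]
```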

References

Gertheiss, J., Goldsmith, J., & Staicu, A. M. (2017). A note on modeling sparse exponential-family functional response curves. Computational Statistics & Data Analysis, 105, 46-52.

Hall, P., Müller, H. G., & Yao, F. (2008). Modelling sparse generalized longitudinal observations with latent Gaussian processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 703-723.

Examples

library(registr)

# example data shipped with the package
data(growth_incomplete)

# estimate 2 FPCs
fpca_obj <- gfpca_twoStep(Y = growth_incomplete, npc = 2, family = "gaussian")
plot(fpca_obj)

# estimate npc adaptively, to explain 90% of the overall variation
fpca_obj2 <- gfpca_twoStep(Y = growth_incomplete, npc_criterion = 0.9, family = "gaussian")
plot(fpca_obj2, plot_FPCs = 1:2)