Estimation of functional linear mixed models (FLMMs) for irregularly or sparsely
sampled data based on functional principal component analysis (FPCA).
The implemented models are special cases of the general FLMM
$$Y_i(t_{ij}) = \mu(t_{ij},x_i) + z_i^T U(t_{ij}) + \epsilon_i(t_{ij}), i = 1,...,n, j = 1,...,D_i,$$
with \(Y_i(t_{ij})\) the value of the response of curve \(i\) at observation point
\(t_{ij}\), \(\mu(t_{ij},x_i)\) is a mean function, which may depend on covariates
\(x_i = (x_{i1},\ldots,x_{ip})^T\). \(z_i\) is a covariate vector,
which is multiplied with the vector of functional random effects \(U(t_{ij})\).
\(\epsilon_i(t_{ij})\) is independent and identically distributed white noise
measurement error with homoscedastic, constant variance. For more details, see references below.
The current implementation can be used to fit four special cases
of the above general FLMM:
a model for independent functional data (e.g. longitudinal data), for which \(z_i^T U(t_{ij})\) only consists of a smooth curve-specific deviation (smooth error curve)
a model for correlated functional data with one functional random intercept (fRI) for one grouping variable in addition to a smooth curve-specific error
a model for correlated functional data with two crossed fRIs for two grouping variables in addition to a smooth curve-specific error
a model for correlated functional data with two nested fRIs for two grouping variables in addition to a smooth curve-specific error.
sparseFLMM(
curve_info,
use_RI = FALSE,
use_simple = FALSE,
method = "fREML",
use_bam = TRUE,
bs = "ps",
d_grid = 100,
min_grid = 0,
max_grid = 1,
my_grid = NULL,
bf_mean = 8,
bf_covariates = 8,
m_mean = c(2, 3),
covariate = FALSE,
num_covariates,
covariate_form,
interaction,
which_interaction = matrix(NA),
save_model_mean = FALSE,
para_estim_mean = FALSE,
para_estim_mean_nc = 0,
bf_covs,
m_covs,
use_whole = FALSE,
use_tri = FALSE,
use_tri_constr = TRUE,
use_tri_constr_weights = FALSE,
np = TRUE,
mp = TRUE,
use_discrete_cov = FALSE,
para_estim_cov = FALSE,
para_estim_cov_nc = 0,
var_level = 0.95,
N_B = NA,
N_C = NA,
N_E = NA,
use_famm = FALSE,
use_bam_famm = TRUE,
bs_int_famm = list(bs = "ps", k = 8, m = c(2, 3)),
bs_y_famm = list(bs = "ps", k = 8, m = c(2, 3)),
save_model_famm = FALSE,
use_discrete_famm = FALSE,
para_estim_famm = FALSE,
para_estim_famm_nc = 0,
nested = FALSE
)
The function returns a list of two elements: time_all
and results
.
time_all
contains the total system.time() for calling function sparseFLMM()
.
results
is a list of all estimates, including:
mean_hat
: includes the components of the estimated mean function.
mean_pred
contains effects of dummy covariates or metric covariates with a linear effect (varying coefficients).
mean_pred_smooth
contains effects of metric covariates with a smooth effect.
intercept
is the estimated intercept, which is part of \(f_0(t_{ij})\).
For each auto-covariance smoothing alternative X
(use_whole
, use_tri
,
use_tri_constr
, use_tri_constr_weights
):
cov_hat_X
: includes
sigmasq
: the estimated error variance
sigmasq_int
: the integral of the estimated error variance over the domain
grid_mat_B/C/E
: the estimated auto-covariance(s) evaluated on the pre-specified grid
sp
: the smoothing parameter(s) for smoothing the auto-covariance(s)
time_cov_estim
: the time for the smoothing the auto-covariance(s) only
time_cov_pred_grid
: the time for evaluating the estimated auto-covariance(s) on the pre-specified grid.
time_cov_X
: the total system.time() for the auto-covariance estimation
fpc_hat_X
: including
phi_B/C/E_hat_grid
: the estimated rescaled eigenfunctions evaluated on the pre-specified grid
nu_B/C/E_hat
: the estimated rescaled eigenvalues
N_B/C/E
: the estimated truncation numbers, i.e., number of FPCs which are chosen
total_var
: the estimated total variance
var_explained
: the estimated explained variance
xi_B/C/E_hat
: the predicted FPC weights (scores).
time_fpc_X
: the total system.time() for the eigen decompositions
and prediction on the FPC weights (scores)
If use_famm = TRUE
, the list results
additionally contains:
fpc_famm_hat_X
: including
intercept
: the estimated intercept, which is part of \(f_0(t_{ij})\)
residuals
: the residuals of the FAMM estimation
xi_B/C/E_hat_famm
: the predicted basis weights
famm_predict_B/C/E
: the predicted functional processes evaluated on the pre-specified grid
famm_cb_mean
: the re-estimated functional intercept \(f_0(t_{ij})\)
famm_cb_covariate.1
, famm_cb_covariate.1
, etc: possible re-estimated covariate effects
famm_cb_inter_1_2
, famm_cb_inter_1_3
, etc: possible interaction effects
time_fpc_famm_X
: the total system.time() for the FAMM estimation.
The unique identification numbers for the levels of the grouping variables and curves are
renumbered for convenience during estimation from 1 in ascending order.
The original identification numbers are returned in the list results
:
n_orig
: curve levels as they entered the estimation
subject_orig
: levels of the first grouping variable as they entered the estimation
word_orig
: levels of the second grouping variable (if existent) as they entered the estimation
my_grid
: pre-specified grid.
data table in which each row represents a single observation
point. curve_info
needs to contain the following columns:
y_vec
(numeric): the response values for each observation point
t
(numeric): the observations point locations, i.e. \(t_{ij}\)
n_long
(integer): unique identification number for each curve
subject_long
(integer): unique identification number for each
level of the first grouping variable
(e.g. speakers for the phonetics data in the example below).
In the case of independent functions, subject_long
should be set equal to n_long
.
For models with two crossed functional random intercepts, the data table additionally needs to have columns:
word_long
(integer): unique identification number for each level of
the second grouping variable (e.g. words for the phonetics data in the example below)
combi_long
(integer): number of the repetition of the combination of the
corresponding level of the first and of the second grouping variable.
For models with two nested functional random intercepts, the data table additionally needs to have columns: #'
word_long
(integer): unique identification number for each level
of the second grouping variable (e.g. phases of a randomized controled
trial). Note that the nested model is only implemented for two levels in the
second grouping variable.
combi_long
(integer): number of the repetition of the combination of the
corresponding level of the first and of the second grouping variable.
For models with covariates as part of the mean function \(\mu(t_{ij},x_i)\),
the covariate values (numeric) need to be in
separate columns with names: covariate.1
, covariate.2
, etc.
TRUE to specify a model with one functional random intercept
for the first grouping variable (subject_long
) and a smooth random error curve.
Defaults to FALSE
, which specifies a model with crossed functional random
intercepts for the
first and second grouping variable and a smooth error curve.
TRUE
to specify a model with only a smooth random error function,
use_RI
should then also be set to TRUE
. Defaults to FALSE
.
estimation method for gam
or bam
, see mgcv
for more details.
Defaults to "fREML"
.
TRUE
to use function bam instead of function gam
(syntax is the same, bam is faster for large data sets).
bam
is recommended and set as default.
spline basis function type for the estimation of the mean function and
the auto-covariance, see s
and te
for more details.
Defaults to penalized B-splines, i.e. bs = "ps"
. This choice is recommended as others have not been
tested yet.
pre-specified grid length for equidistant grid on which the mean, the auto-covariance surfaces, the eigenfunctions
and the functional random effects are evaluated. NOTE: the length of the grid can be important for computation time (approx. quadratic influence).
Defaults to d_grid = 100
.
minimum value of equidistant grid (should approx. correspond to minimum value of time interval). Defaults to min_grid = 0
.
maximum value of equidistant grid (should approx. correspond to maximum value of time interval). Defaults to max_grid = 1
.
optional evaluation grid, which can be specified and used instead of d_grid
, min_grid
, max_grid
.
NOTE: the grid should be equidistant.
basis dimension (number of basis functions) used for the functional intercept
\(f_0(t_{ij})\) in the mean estimation via bam/gam
. Defaults to bf_mean = 8
.
basis dimension (number of basis functions) used for the functional effects
of covariates in the mean estimation via bam/gam
. Defaults to bf_covariates = 8
.
NOTE: in the current implementation, the same basis dimension for all covariates is used.
order of the penalty for this term in bam/gam
of mean estimation, for bs = "ps"
spline and penalty order,
defaults to m_mean = c(2, 3)
, i.e., cubic B-splines with third order difference penalty, see s
for details.
TRUE
to estimate covariate effects (as part of the mean function).
number of covariates that are included in the model. NOTE: not number of effects in case interactions of covariates are specified.
vector with entries for each covariate that specify the form in which the
respective covariate enters the mean function. Possible forms are "by"
for varying-coefficient \((f(t_{ij})*covariate)\), which is
possible for dummy coded covariates and metric covariates and
"smooth"
for smooth effect in t and in covariate \((f(t_{ij}, covariate))\), which is only
possible for metric covariates! NOTE: metric covariates should be centered such that the global functional intercept \(f_0(t_{ij})\) can be interpreted as
global mean function and the effect can be interpreted as difference from the global mean.
TRUE
to estimate interaction effects of covariates, which interactions, see which_interaction
(below).
Interactions are possible for dummy-coded covariates that act as varying coefficients.
symmetric matrix that specifies which interactions should be considered in case covariate = TRUE
and interaction = TRUE
.
Entry which_interaction
[k, l] specifies that the interaction between covariate.k
and covariate.l
is modeled (example below).
NOTE: entries are redundant, which_interaction
[l, k] should be set to the
same as which_interaction
[k, l] (symmetric). Defaults to which_interaction = matrix(NA)
which should be specified when interaction = FALSE
.
TRUE
to give out gam/bam
object (attention: can be large!), defaults to FALSE
.
TRUE
to parallelize mean estimation (only possible using bam
), defaults to FALSE
.
number of cores for parallelization of mean estimation (only possible using bam
,
only active if para_estim_mean = TRUE
). Defaults to 0.
vector of marginal basis dimensions (number of basis functions) used for covariance estimation via bam/gam
for each functional random effect (including the smooth error curve).
In the case of multiple grouping variables, the first entry corresponds to the first grouping variable,
the second vector entry corresponds to the second grouping variable, and the third to the smooth error curve.
list of marginal orders of the penalty for bam/gam
for covariance estimation, for bs = "ps"
marginal spline and
penalty order. As only symmetric surfaces are considered: same for both directions.
For crossed fRIs: list of three vectors, e.g. m_covs = list(c(2, 3), c(2, 3), c(2, 3))
, where first and second entry correspond to first
and second grouping variable, respectively and third entry corresponds to smooth error.
For one fRI: list of two vectors, e.g. m_covs = list(c(2, 3), c(2, 3))
, where first entry corresponds to (first) grouping variable
and second entry corresponds to smooth error.
For independent curves: list of one vector, e.g. m_covs = list(c(2,3))
corresponding to smooth error.
TRUE
to estimate the whole auto-covariance surfaces without symmetry constraint.
Defaults to FALSE
as is much slower than use_tri_constr
and use_tri_constr_weights
. For more details, see references below.
TRUE
to estimate only the upper triangle of the auto-covariance surfaces without
symmetry constraint. Defaults to FALSE
and not recommended. For more details, see references below.
TRUE
to estimate only the upper triangle of the auto-covariance surfaces with symmetry constraint using the
smooth class 'symm'
. Defaults to TRUE
. For more details, see references below.
TRUE
to estimate only the upper triangle of the auto-covariances with symmetry constraint, using the
smooth class 'symm'
and weights of 0.5 on the diagonal to use the same weights as for estimating the whole auto-covariance surfaces.
Defaults to FALSE
. For more details, see references below.
TRUE
to use 'normal parameterization' for a tensor product smooth, see te
for more details.
Defaults to TRUE
.
FALSE
to use Kronecker product penalty instead of Kronecker sum penalty
with only one smoothing parameter (use_whole = TRUE
and use_tri = TRUE
), for details see te
.
For use_tri_constr = TRUE
and use_tri_constr_weights = TRUE
, only one smoothing parameter is estimated anyway.
Defaults to TRUE
.
TRUE
to further speed up the auto-covariance computation by discretization of
covariates for storage and efficiency reasons, includes parallelization controlled by para_estim_cov_nc
(below),
see bam
for more details. Defaults to FALSE
.
TRUE
to parallelize auto-covariance estimation (only possible using bam
), defaults to FALSE
.
number of cores (if use_discrete_cov = FALSE
) or number of threads (if use_discrete_cov = TRUE
)
for parallelization of auto-covariance estimation (only possible using bam
, only active if para_estim_cov = TRUE
).
Defaults to 0.
pre-specified level of explained variance used for the choice of the number of the functional principal
components (FPCs). Alternatively, a specific number of FPCs can be specified (see below). Defaults to var_level = 0.95
.
number of components for B (fRI for first grouping variable) to keep, overrides var_level
if not NA
.
number of components for C (fRI for second grouping variable) to keep, overrides var_level
if not NA
.
number of components for E (smooth error) to keep, overrides var_level
if not NA
.
TRUE
to embed the model into the framework of functional additive mixed
models (FAMMs) using re-estimation of the mean function together with the prediction of the FPC weights (scores).
This allows for point-wise confidence bands for the covariate effects. Defaults to FALSE
.
TRUE
to use function bam
instead of function gam
in FAMM estimation
(reduces computation time for large data sets),
highly recommended. Defaults to TRUE
.
specification of the estimation of the functional intercept \(f_0(t_{ij})\)
(as part of the mean function), see pffr
for details.
Defaults to bs_int = list(bs = "ps", k = 8, m = c(2, 3))
, where
bs
: type of basis functions, k
: number of basis functions, m
: order of the spline and order of the penalty.
specification of the estimation of the covariates effects (as part of the mean function), see pffr
for details.
Defaults to bs_y_famm = list(bs = "ps", k = 8, m = c(2, 3))
, where
bs
: type of basis functions, k
: number of basis functions, m
: order of the spline and order of the penalty.
TRUE
to give out the FAMM model object (attention: can be very large!).
Defaults to FALSE.
TRUE
to further speed up the fpc-famm computation by discretization of
# covariates for storage and efficiency reasons, includes parallelization controlled by para_estim_famm_nc
(below),
see bam
for more details. Defaults to FALSE
.
TRUE
to parallelize FAMM estimation. Defaults to FALSE
.
number of cores (if use_discrete_famm = FALSE
) or number of
threads (if use_discrete_famm = TRUE
)
for parallelization of FAMM estimation (only possible using bam
, only active if para_estim_famm = TRUE
).
Defaults to 0.
TRUE
to specify a model with nested functional random
intercepts for the first and second grouping variable and a smooth error
curve. Defaults to FALSE
.
Jona Cederbaum
The code can handle irregularly and possibly sparsely sampled
data. Of course, it can also be used to analyze regular grid data,
but as it is especially designed for the irregular case and there may
be a more efficient way to analyze regular grid data.
The mean function is of the form $$\mu(t_{ij},x_i) = f_0(t_{ij}) +
\sum_{k=1}^r f_k(t_{ij},x_{ik}),$$ where \(f_0(t_{ij})\) is a functional
intercept.
Currently implemented are effects of dummy-coded and metric covariates which act as
varying-coefficients of the
form \(f_k(t_{ij})*x_{ik}\) and smooth effects of metric covariates (smooth in t and in the covariate)
of the form \(f(t_{ij}, x_{ik})\). NOTE: metric covariates should be centered such that the global functional intercept can be interpreted as global mean function and
the effect can be interpreted as difference from the global mean. Interaction effects of dummy-coded
covariates acting as varying coefficients are possible.
The estimation consists of four main steps:
estimation of the smooth mean function (including covariate effects) under independence assumption using splines.
estimation of the smooth auto-covariances of the functional random effects. A fast bivariate symmetric smoother implemented in the smooth class 'symm' can be used to speed up estimation (see below).
eigen decomposition of the estimated auto-covariances, which are evaluated on a pre-specified equidistant grid. This yields estimated eigenvalues and eigenfunctions, which are rescaled to ensure orthonormality with respect to the L2-scalar product.
prediction of the functional principal component weights (scores) yielding predictions for the functional random effects.
The estimation of the mean function and auto-covariance functions is based on package mgcv.
The functional principal component weights (scores) are predicted as best (linear)
unbiased predictors. In addition, this implementation allows to embed the model
in the general framework of functional additive mixed models (FAMM) based on package refund, which allows for the construction of
point-wise confidence bands for covariate effects (in the mean function) conditional on the FPCA.
Note that the estimation as FAMM may be computationally expensive as the model
is re-estimated in a mixed model framework.
The four special cases of the general FLMM (two nested fRIs, two crossed fRIs, one fRI, independent curves) are implemented as follows:
In the special case with two nested fRIs, three random processes B, C,
and E are considered, where B is the fRI for the first grouping variable (e.
g. patient in a random controlled trial), C denotes the fRI for the second
grouping variable (e.g. individual specific effect in the follow-up) and E
denotes the smooth error. For this special
case, arguments use_RI
and use_simple
are both set to
FALSE
and argument nested
is set to TRUE
. Note that
this implementation only allows for a simple before/after study design.
In the special case with two crossed fRIs, three
random processes B, C, and E are considered, where B is the
fRI for the first grouping variable (e.g. speakers in the phonetics example below),
C denotes the fRI for the second grouping variable
(e.g. target words in the phonetics example below) and
E denotes the smooth error. For this special case, arguments
use_RI
and use_simple
are both set to FALSE
.
In the special case with only one fRI, only B and E are considered
and the number of levels for the second grouping variable is to zero.
For this special case, argument use_RI
is set to TRUE
and argument
use_simple
is set to FALSE
.
The special case with independent curves is internally seen as a special case
of the model with one fRI for the first grouping variable, with the number of
levels for this grouping variable corresponding to the number of curves.
Thus, for each level of the first grouping variable there is one curve.
Therefore, for the special case of independent curves, the estimation returns an
estimate for the auto-covariance of B (instead of E) and all corresponding results are indicated with '_B'
, although they
correspond to the smooth error. For this special case, arguments
use_RI
and use_simple
are both set to TRUE
.
Cederbaum, Pouplier, Hoole, Greven (2016): Functional Linear Mixed Models for Irregularly or Sparsely Sampled Data. Statistical Modelling, 16(1), 67-88.
Cederbaum, Scheipl, Greven (2016): Fast symmetric additive covariance smoothing. Submitted on arXiv.
Scheipl, F., Staicu, A.-M. and Greven, S. (2015): Functional Additive Mixed Models, Journal of Computational and Graphical Statistics, 24(2), 477-501.
if (FALSE) {
# subset of acoustic data (very small subset, no meaningful results can be expected and
# FAMM estimation does not work for this subset example. For FAMM estimation, see below.)
data("acoustic_subset")
acoustic_results <- sparseFLMM(curve_info = acoustic_subset, use_RI = FALSE, use_simple = FALSE,
method = "fREML", use_bam = TRUE, bs = "ps", d_grid = 100, min_grid = 0,
max_grid = 1, my_grid = NULL, bf_mean = 8, bf_covariates = 8, m_mean = c(2,3),
covariate = TRUE, num_covariates = 4, covariate_form = rep("by", 4),
interaction = TRUE,
which_interaction = matrix(c(FALSE, TRUE, TRUE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
byrow = TRUE, nrow = 4, ncol = 4),
save_model_mean = FALSE, para_estim_mean = FALSE, para_estim_mean_nc = 0,
bf_covs = c(5, 5, 5), m_covs = list(c(2, 3), c(2, 3), c(2, 3)),
use_whole = FALSE, use_tri = FALSE, use_tri_constr = TRUE,
use_tri_constr_weights = FALSE, np = TRUE, mp = TRUE,
use_discrete_cov = FALSE,
para_estim_cov = FALSE, para_estim_cov_nc = 5,
var_level = 0.95, N_B = NA, N_C = NA, N_E = NA,
use_famm = FALSE, use_bam_famm = TRUE,
bs_int_famm = list(bs = "ps", k = 8, m = c(2, 3)),
bs_y_famm = list(bs = "ps", k = 8, m = c(2, 3)),
save_model_famm = FALSE, use_discrete_famm = FALSE,
para_estim_famm = FALSE, para_estim_famm_nc = 0)}
if (FALSE) {
# whole data set with estimation in the FAMM framework
data("acoustic")
acoustic_results <- sparseFLMM(curve_info = acoustic, use_RI = FALSE, use_simple = FALSE,
method = "fREML", use_bam = TRUE, bs = "ps", d_grid = 100, min_grid = 0,
max_grid = 1, my_grid = NULL, bf_mean = 8, bf_covariates = 8, m_mean = c(2,3),
covariate = TRUE, num_covariates = 4, covariate_form = rep("by", 4),
interaction = TRUE,
which_interaction = matrix(c(FALSE, TRUE, TRUE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
byrow = TRUE, nrow = 4, ncol = 4),
save_model_mean = FALSE, para_estim_mean = FALSE, para_estim_mean_nc = 0,
bf_covs = c(5, 5, 5), m_covs = list(c(2, 3), c(2, 3), c(2, 3)),
use_whole = FALSE, use_tri = FALSE, use_tri_constr = TRUE,
use_tri_constr_weights = FALSE, np = TRUE, mp = TRUE,
use_discrete_cov = FALSE,
para_estim_cov = TRUE, para_estim_cov_nc = 5,
var_level = 0.95, N_B = NA, N_C = NA, N_E = NA,
use_famm = TRUE, use_bam_famm = TRUE,
bs_int_famm = list(bs = "ps", k = 8, m = c(2, 3)),
bs_y_famm = list(bs = "ps", k = 8, m = c(2, 3)),
save_model_famm = FALSE, use_discrete_famm = FALSE,
para_estim_famm = TRUE, para_estim_famm_nc = 5)}
Run the code above in your browser using DataLab