pffr: Penalized flexible functional regression

Description

Implements additive regression for functional and scalar covariates and functional responses. This function is a wrapper for mgcv's gam and its siblings to fit models of the general form $E(Y_i(t)) = g(\mu(t) + \int X_i(s)\beta(s,t)ds + f(z_{1i}, t) + f(z_{2i}) + z_{3i} \beta_3(t) + \dots )$ with a functional (but not necessarily continuous) response $Y(t)$, response function $g$, (optional) smooth intercept $\mu(t)$, (multiple) functional covariates $X(t)$ and scalar covariates $z_1$, $z_2$, etc.

Usage

pffr(
  formula,
  yind,
  data = NULL,
  ydata = NULL,
  algorithm = NA,
  method = "REML",
  tensortype = c("ti", "t2"),
  bs.yindex = list(bs = "ps", k = 5, m = c(2, 1)),
  bs.int = list(bs = "ps", k = 20, m = c(2, 1)),
  ...
)

Arguments

formula

a formula with special terms as for gam, with additional special terms ff(), sff(), ffpc(), pcre() and c().

yind

a vector with length equal to the number of columns of the matrix of functional responses giving the vector of evaluation points $(t_1, \dots ,t_{G})$. If not supplied, yind is set to 1:ncol(<response>).

data

an (optional) data.frame containing the data. Can also be a named list for regular data. Functional covariates have to be supplied as <no. of observations> by <no. of evaluations> matrices, i.e. each row is one functional observation.

ydata

an (optional) data.frame supplying functional responses that are not observed on a regular grid. See Details.

algorithm

the name of the function used to estimate the model. Defaults to gam if the matrix of functional responses has less than 2e5 data points and to bam if not. 'gamm', 'gamm4' and 'jagam' are valid options as well. See Details for 'gamm4' and 'jagam'.

method

Defaults to "REML"-estimation, including of unknown scale. If algorithm="bam", the default is switched to "fREML". See gam and bam for details.

tensortype

which typ of tensor product splines to use. One of "ti" or "t2", defaults to ti. t2-type terms do not enforce the more suitable special constraints for functional regression, see Details.

bs.yindex

a named (!) list giving the parameters for spline bases on the index of the functional response. Defaults to list(bs="ps", k=5, m=c(2, 1)), i.e. 5 cubic B-splines bases with first order difference penalty.

bs.int

a named (!) list giving the parameters for the spline basis for the global functional intercept. Defaults to list(bs="ps", k=20, m=c(2, 1)), i.e. 20 cubic B-splines bases with first order difference penalty.

...

additional arguments that are valid for gam, bam, 'gamm4' or 'jagam'. subset is not implemented.

Value

A fitted pffr-object, which is a gam-object with some additional information in an pffr-entry. If algorithm is "gamm" or "gamm4", only the $gam part of the returned list is modified in this way. Available methods/functions to postprocess fitted models: summary.pffr, plot.pffr, coef.pffr, fitted.pffr, residuals.pffr, predict.pffr, model.matrix.pffr, qq.pffr, pffr.check. If algorithm is "jagam", only the location of the model file and the usual jagam-object are returned, you have to run the sampler yourself.

Details

The routine can estimate

linear functional effects of scalar (numeric or factor) covariates that vary smoothly over $t$ (e.g. $z_{1i} \beta_1(t)$, specified as ~z1),
nonlinear, and possibly multivariate functional effects of (one or multiple) scalar covariates $z$ that vary smoothly over the index $t$ of $Y(t)$ (e.g. $f(z_{2i}, t)$, specified in the formula simply as ~s(z2))
(nonlinear) effects of scalar covariates that are constant over $t$ (e.g. $f(z_{3i})$, specified as ~c(s(z3)), or $\beta_3 z_{3i}$, specified as ~c(z3)),
function-on-function regression terms (e.g. $\int X_i(s)\beta(s,t)ds$, specified as ~ff(X, yindex=t, xindex=s), see ff). Terms given by sff and ffpc provide nonlinear and FPC-based effects of functional covariates, respectively.
concurrent effects of functional covariates X measured on the same grid as the response are specified as follows: ~s(x) for a smooth, index-varying effect $f(X(t),t)$, ~x for a linear index-varying effect $X(t)\beta(t)$, ~c(s(x)) for a constant nonlinear effect $f(X(t))$, ~c(x) for a constant linear effect $X(t)\beta$.
Smooth functional random intercepts $b_{0g(i)}(t)$ for a grouping variable g with levels $g(i)$ can be specified via ~s(g, bs="re")), functional random slopes $u_i b_{1g(i)}(t)$ in a numeric variable u via ~s(g, u, bs="re")). Scheipl, Staicu, Greven (2013) contains code examples for modeling correlated functional random intercepts using mrf-terms.

Use the c()-notation to denote model terms that are constant over the index of the functional response.

Internally, univariate smooth terms without a c()-wrapper are expanded into bivariate smooth terms in the original covariate and the index of the functional response. Bivariate smooth terms (s(), te() or t2()) without a c()-wrapper are expanded into trivariate smooth terms in the original covariates and the index of the functional response. Linear terms for scalar covariates or categorical covariates are expanded into varying coefficient terms, varying smoothly over the index of the functional response. For factor variables, a separate smooth function with its own smoothing parameter is estimated for each level of the factor. The marginal spline basis used for the index of the the functional response is specified via the global argument bs.yindex. If necessary, this can be overriden for any specific term by supplying a bs.yindex-argument to that term in the formula, e.g. ~s(x, bs.yindex=list(bs="tp", k=7)) would yield a tensor product spline over x and the index of the response in which the marginal basis for the index of the response are 7 cubic thin-plate spline functions (overriding the global default for the basis and penalty on the index of the response given by the global bs.yindex-argument). Use ~-1 + c(1) + ... to specify a model with only a constant and no functional intercept.

The functional covariates have to be supplied as a $n$ by <no. of evaluations> matrices, i.e. each row is one functional observation. For data on a regular grid, the functional response is supplied in the same format, i.e. as a matrix-valued entry in data, which can contain missing values.

If the functional responses are sparse or irregular (i.e., not evaluated on the same evaluation points across all observations), the ydata-argument can be used to specify the responses: ydata must be a data.frame with 3 columns called '.obs', '.index', '.value' which specify which curve the point belongs to ('.obs'=$i$), at which $t$ it was observed ('.index'=$t$), and the observed value ('.value'=$Y_i(t)$). Note that the vector of unique sorted entries in ydata$.obs must be equal to rownames(data) to ensure the correct association of entries in ydata to the corresponding rows of data. For both regular and irregular functional responses, the model is then fitted with the data in long format, i.e., for data on a grid the rows of the matrix of the functional response evaluations $Y_i(t)$ are stacked into one long vector and the covariates are expanded/repeated correspondingly. This means the models get quite big fairly fast, since the effective number of rows in the design matrix is number of observations times number of evaluations of $Y(t)$ per observation.

Note that pffr does not use mgcv's default identifiability constraints (i.e., $\sum_{i,t} \hat f(z_i, x_i, t) = 0$ or $\sum_{i,t} \hat f(x_i, t) = 0$) for tensor product terms whose marginals include the index $t$ of the functional response. Instead, $\sum_i \hat f(z_i, x_i, t) = 0$ for all $t$ is enforced, so that effects varying over $t$ can be interpreted as local deviations from the global functional intercept. This is achieved by using ti-terms with a suitably modified mc-argument. Note that this is not possible if algorithm='gamm4' since only t2-type terms can then be used and these modified constraints are not available for t2. We recommend using centered scalar covariates for terms like $z \beta(t)$ (~z) and centered functional covariates with $\sum_i X_i(t) = 0$ for all $t$ in ff-terms so that the global functional intercept can be interpreted as the global mean function.

The family-argument can be used to specify all of the response distributions and link functions described in family.mgcv. Note that family = "gaulss" is treated in a special way: Users can supply the formula for the variance by supplying a special argument varformula, but this is not modified in the way that the formula-argument is but handed over to the fitter directly, so this is for expert use only. If varformula is not given, pffr will use the parameters from argument bs.int to define a spline basis along the index of the response, i.e., a smooth variance function over $t$ for responses $Y(t)$.

References

Ivanescu, A., Staicu, A.-M., Scheipl, F. and Greven, S. (2015). Penalized function-on-function regression. Computational Statistics, 30(2):539--568. https://biostats.bepress.com/jhubiostat/paper254/

Scheipl, F., Staicu, A.-M. and Greven, S. (2015). Functional Additive Mixed Models. Journal of Computational & Graphical Statistics, 24(2): 477--501. https://arxiv.org/abs/1207.5947

F. Scheipl, J. Gertheiss, S. Greven (2016): Generalized Functional Additive Mixed Models, Electronic Journal of Statistics, 10(1), 1455--1492. https://projecteuclid.org/euclid.ejs/1464710238/

Examples

Run this code

# NOT RUN {
###############################################################################
# univariate model:
# Y(t) = f(t)  + \int X1(s)\beta(s,t)ds + eps
set.seed(2121)
data1 <- pffrSim(scenario="ff", n=40)
t <- attr(data1, "yindex")
s <- attr(data1, "xindex")
m1 <- pffr(Y ~ ff(X1, xind=s), yind=t, data=data1)
summary(m1)
plot(m1, pers=TRUE, pages=1)

# }
# NOT RUN {
###############################################################################
# multivariate model:
# E(Y(t)) = \beta_0(t)  + \int X1(s)\beta_1(s,t)ds + xlin \beta_3(t) +
#        f_1(xte1, xte2) + f_2(xsmoo, t) + \beta_4 xconst
data2 <- pffrSim(scenario="all", n=200)
t <- attr(data2, "yindex")
s <- attr(data2, "xindex")
m2 <- pffr(Y ~  ff(X1, xind=s) + #linear function-on-function
                xlin  +  #varying coefficient term
                c(te(xte1, xte2)) + #bivariate smooth term in xte1 & xte2, const. over Y-index
                s(xsmoo) + #smooth effect of xsmoo varying over Y-index
                c(xconst), # linear effect of xconst constant over Y-index
        yind=t,
        data=data2)
summary(m2)
plot(m2, pers=TRUE)
str(coef(m2))
# convenience functions:
preddata <- pffrSim(scenario="all", n=20)
str(predict(m2, newdata=preddata))
str(predict(m2, type="terms"))
cm2 <- coef(m2)
cm2$pterms
str(cm2$smterms, 2)
str(cm2$smterms[["s(xsmoo)"]]$coef)

#############################################################################
# sparse data (80% missing on a regular grid):
set.seed(88182004)
data3 <- pffrSim(scenario=c("int", "smoo"), n=100, propmissing=0.8)
t <- attr(data3, "yindex")
m3.sparse <- pffr(Y ~ s(xsmoo), data=data3$data, ydata=data3$ydata, yind=t)
summary(m3.sparse)
plot(m3.sparse, pers=TRUE, pages=1)
# }

Run the code above in your browser using DataLab