refund (version 0.1-5)

pffr: Penalized function-on-function regression

Description

Implements additive regression for functional and scalar covariates and functional responses. This function is a wrapper for mgcv's gam and its siblings to fit models of the general form $E(Y_i(t)) = g(\mu(t) + \int X_i(s)\beta(s,t)ds + f(z_{1i}, t) + f(z_{2i}) + z_{3i} \beta_3(t) + \dots)$ with a functional (but not necessarily continuous) response $Y(t)$, response function $g$, (optional) smooth intercept $\mu(t)$, (multiple) functional covariates $X(t)$ and scalar covariates $z_1$, $z_2$, etc.

Usage

pffr(formula, yind, fitter = NA, method = "REML",
  bsy.default = list(bs = "ps", m = c(2, 1)), ...)

Arguments

formula
a formula with special terms as for gam, plus the additional special terms ff() and c()
yind
a vector with length equal to the number of columns of the matrix of functional responses, giving the vector of evaluation points $(t_1, \dots, t_{G})$. If formula contains an ff-term which specifies yind, this argument can be omitted.
fitter
the name of the function used to estimate the model. Defaults to gam if the matrix of functional responses has fewer than 2e5 data points and to bam otherwise.
method
Defaults to "REML" estimation of smoothing parameters, including estimation of the unknown scale parameter. See gam for details.
bsy.default
a named (!) list giving the parameters for spline bases on the index of the functional response. Defaults to list(bs="ps", m=c(2, 1)), i.e. a cubic B-spline basis with first order difference penalty. Only the arguments bs, k, and m are used.
...
additional arguments that are valid for gam or bam. The arguments weights, subset, and offset are not yet implemented!

Value

  • a fitted pffr-object, which is a gam-object with some additional information in a pffr entry. If fitter is "gamm" or "gamm4", only the $gam part of the returned list is modified in this way.

Details

The routine can estimate
  1. (nonlinear, and possibly multivariate) effects of (one or multiple) scalar covariates $z$ that vary smoothly over the index $t$ of $Y(t)$ (e.g. $f(z_{1i}, t)$, specified in the formula simply as ~s(z1)),
  2. (nonlinear) effects of scalar covariates that are constant over $t$ (e.g. $f(z_{2i})$, specified as ~c(s(z2)), or $\beta_2 z_{2i}$, specified as ~c(z2)),
  3. linear functional effects of scalar (numeric or factor) covariates that vary smoothly over $t$ (e.g. $z_{3i} \beta_3(t)$, specified as ~z3),
  4. function-on-function regression terms (e.g. $\int X_i(s)\beta(s,t)ds$, specified as ~ff(X, yindex=t, xindex=s), see ff).
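The four term types above can be combined in a single model formula. A minimal sketch, where the response matrix Y, the scalar covariates z1, z2, z3, and the functional covariate X are all hypothetical placeholder names:

```r
# Sketch of a pffr formula combining the four term types above.
# All variable names (Y, z1, z2, z3, X) are hypothetical placeholders.
form <- Y ~ s(z1) +               # 1. smooth effect of z1, varying over t
  c(s(z2)) +                      # 2. smooth effect of z2, constant over t
  z3 +                            # 3. linear effect z3 * beta3(t)
  ff(X, yind = t, xind = s)       # 4. functional effect: int X(s) beta(s,t) ds

# The formula object can be inspected before fitting:
all.vars(form)
```

Note that creating the formula does not evaluate any of the special terms; they are only interpreted once the formula is passed to pffr.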
Use the c()-notation to denote model terms that are constant over the index of the functional response. Internally, univariate smooth terms without a c()-wrapper are expanded into bivariate smooth terms in the original covariate and the index of the functional response. Bivariate smooth terms (s(), te() or t2()) without a c()-wrapper are expanded into trivariate smooth terms in the original covariates and the index of the functional response. Linear terms for scalar or categorical covariates are expanded into varying-coefficient terms that vary smoothly over the index of the functional response. Use ~-1 + c(1) + ... to specify a model with a constant instead of a functional intercept.

The functional response and functional covariates have to be supplied as n by (number of evaluation points) matrices, i.e. each row is one functional observation. The model is then fitted with the data in long format: the rows of the matrix of functional response evaluations $Y_i(t)$ are stacked into one long vector and the covariates are expanded/repeated correspondingly. This means the models get quite big fairly fast, since the effective number of rows in the design matrix is the number of observations times the number of evaluations of $Y(t)$ per observation.

Note that only the default identifiability constraints as implemented in gam.side are used, i.e. $\sum_{i,t} \hat f(z_i, t) = 0$. (gam.side may impose other identifiability constraints as well, but these do not give interpretable results.) These constraints may not be well suited for models with multiple functional effects varying across $t$. Centering estimated effects $\hat f(z_i, t)$ (~s(z)) across observations so that $\sum_i \hat f(z_i, t) = 0$ for all $t$, i.e., transforming $\hat f(z_i, t) \rightarrow \hat f(z_i, t) - n^{-1} \sum_i \hat f(z_i, t)$ and adding the mean function $n^{-1} \sum_i \hat f(z_i, t)$ to the global functional intercept, can improve the interpretability of results.
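The stacking into long format described above can be illustrated with base R alone; this is only a toy illustration of the data layout (dimensions and the row-wise stacking order are assumptions for exposition, not pffr internals):

```r
# Toy illustration of the long-format expansion described above:
# an n x G response matrix is stacked into one long vector and the
# scalar covariates are repeated accordingly. Dimensions are made up.
n <- 4; G <- 3                      # 4 curves, each evaluated at 3 points
Y    <- matrix(rnorm(n * G), n, G)  # response matrix, one curve per row
z    <- rnorm(n)                    # one scalar covariate value per curve
yind <- seq(0, 1, length.out = G)   # evaluation points t_1, ..., t_G

long <- data.frame(
  y    = as.vector(t(Y)),           # stack the rows of Y into one vector
  z    = rep(z, each = G),          # repeat the covariate for each t
  tvec = rep(yind, times = n)       # evaluation point for each long row
)
nrow(long)                          # n * G rows in the expanded data
```

The design matrix for the fit then has n * G rows, which is why pffr models grow quickly with the number of evaluation points.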
We also recommend using centered scalar covariates for terms like $z \beta(t)$ (~z) and centered functional covariates with $\sum_i X_i(t) = 0$ for all $t$ in ff-terms.
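Centering a functional covariate column-wise, so that $\sum_i X_i(t) = 0$ for all $t$, is a one-liner in base R (the matrix X here is a hypothetical toy example):

```r
# Center a functional covariate matrix X (curves in rows) so that each
# column, i.e. each evaluation point t, has mean zero across observations.
X  <- matrix(rnorm(20), nrow = 5)         # toy 5 x 4 covariate matrix
Xc <- scale(X, center = TRUE, scale = FALSE)
colMeans(Xc)                              # all (numerically) zero
```

The centered matrix Xc can then be supplied to an ff-term in place of X.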

References

Ivanescu, A. E., Staicu, A.-M., Greven, S., Scheipl, F., and Crainiceanu, C. M. (2011). Penalized function-on-function regression. Submitted.

Examples

data(DTI, package="refund")
DTI$case <- factor(DTI$case)
DTI$ID <- factor(DTI$ID)
DTI$visit <- factor(DTI$visit)
rcst.complete <- apply(DTI, 1, function(x) !any(is.na(x)))
DTIcomplete <- DTI[rcst.complete,]

## split into test and training data
set.seed(2213213)
trainind <- sort(sample(1:sum(rcst.complete), 150))
train <- DTIcomplete[trainind,]
test <-  DTIcomplete[-trainind,]
# drop subjects not in the train data
test <- test[test$ID %in% unique(train$ID),]

rcstind <- 1:55
# constant random intercepts for ID,
# smooth effect of pasat varying over index of rcst
# constant effect of visit
# m1 adds a functional effect for cca
summary(m0 <- pffr(rcst ~ c(s(ID, bs="re")) + s(pasat) + c(visit),
                yind=rcstind, data=train))
summary(m1 <- pffr(rcst ~ c(s(ID, bs="re")) + s(pasat) +
                c(visit) + ff(cca, yind=rcstind),
                yind=rcstind, data=train))
AIC(m0, m1)
plot(m1, pers=TRUE, pages=1, ticktype="detailed")
str(coefs.m1 <- coef(m1))

# get predicted trajectories for test data
pr.m0 <- predict(m0, newdata=test)
pr.m1 <- predict(m1, newdata=test)
# (Riemann-sum approximation of the) integrated squared prediction error:
(IMSE <- c(m0 = sum((pr.m0-test$rcst)^2),
                    m1 = sum((pr.m1-test$rcst)^2)))
layout(t(1:3))
matplot(t(test$rcst), type="l", lty=1, col=rgb(0,0,0,.1), main="observed rcst",
          ylim=range(test$rcst, pr.m0, pr.m1))
matplot(t(pr.m0),type="l", lty=1, col=rgb(0,0,0,.1), main="predicted: m0",
          ylim=range(test$rcst, pr.m0, pr.m1))
matplot(t(pr.m1),type="l", lty=1, col=rgb(0,0,0,.1), main="predicted: m1",
          ylim=range(test$rcst, pr.m0, pr.m1))
