bigssa: Fits Smoothing Spline ANOVA Models

Description

Given a real-valued response vector $y = {y_{i}}_{n \times 1}$ , a Smoothing Spline Anova (SSA) has the form $y_{i} = η (x_{i}) + e_{i}$ where $y_{i}$ is the $i$ -th observation's respone, $x_{i} = (x_{i 1}, \dots, x_{i p})$ is the $i$ -th observation's nonparametric predictor vector, $η$ is an unknown smooth function relating the response and nonparametric predictors, and $e_{i} \sim N (0, σ^{2})$ is iid Gaussian error. Function can fit additive models, and also allows for 2-way and 3-way interactions between any number of predictors (see Details and Examples).

Usage

bigssa(formula,data=NULL,type=NULL,nknots=NULL,rparm=NA,
       lambdas=NULL,skip.iter=TRUE,se.fit=FALSE,rseed=1234,
       gcvopts=NULL,knotcheck=TRUE,gammas=NULL,weights=NULL,
       random=NULL,remlalg=c("FS","NR","EM","none"),remliter=500,
       remltol=10^-4,remltau=NULL)

Arguments

formula

An object of class "formula": a symbolic description of the model to be fitted (see Details and Examples for more information).

data

Optional data frame, list, or environment containing the variables in formula. Or an object of class "makessa", which is output from makessa.

type

List of smoothing spline types for predictors in formula (see Details). Options include type="cub" for cubic, type="cub0" for another cubic, type="per" for cubic periodic, type="tps" for cubic thin-plate, type="ord" for ordinal, and type="nom" for nominal.

nknots

Two possible options: (a) scalar giving total number of random knots to sample, or (b) vector indexing which rows of data to use as knots.

rparm

List of rounding parameters for each predictor. See Details.

lambdas

Vector of global smoothing parameters to try. Default lambdas=10^-c(9:0).

skip.iter

Logical indicating whether to skip the iterative smoothing parameter update. Using skip.iter=FALSE should provide a more optimal solution, but the fitting time may be substantially longer. See Skip Iteration section.

se.fit

Logical indicating if the standard errors of the fitted values should be estimated.

rseed

Random seed for knot sampling. Input is ignored if nknots is an input vector of knot indices. Set rseed=NULL to obtain a different knot sample each time, or set rseed to any positive integer to use a different seed than the default.

gcvopts

Control parameters for optimization. List with 3 elements: (a) maxit: maximum number of algorithm iterations, (b) gcvtol: covergence tolerance for iterative GCV update, and (c) alpha: tuning parameter for GCV minimization. Default: gcvopts=list(maxit=5,gcvtol=10^-5,alpha=1)

knotcheck

If TRUE, only unique knots are used (for stability).

gammas

List of initial smoothing parameters for each predictor. See Details.

weights

Vector of positive weights for fitting (default is vector of ones).

random

Adds random effects to model (see Random Effects section).

remlalg

REML algorithm for estimating variance components (see Random Effects section). Input is ignored if random=NULL.

remliter

Maximum number of iterations for REML estimation of variance components. Input is ignored if random=NULL.

remltol

Convergence tolerance for REML estimation of variance components. Input is ignored if random=NULL.

remltau

Initial estimate of variance parameters for REML estimation of variance components. Input is ignored if random=NULL.

Value

fitted.values

Vector of fitted values corresponding to the original data points in xvars (if rparm=NA) or the rounded data points in xunique (if rparm is used).

se.fit

Vector of standard errors of fitted.values (if input se.fit=TRUE).

yvar

Response vector.

xvars

List of predictors.

type

Type of smoothing spline that was used for each predictor.

yunique

Mean of yvar for unique points after rounding (if rparm is used).

xunique

Unique rows of xvars after rounding (if rparm is used).

sigma

Estimated error standard deviation, i.e., $\hat{σ}$ .

ndf

Data frame with two elements: n is total sample size, and df is effective degrees of freedom of fit model (trace of smoothing matrix).

info

Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of fit model (assuming Gaussian error).

modelspec

List containing specifics of fit model (needed for prediction).

converged

Convergence status: converged=TRUE if iterative update converged, converged=FALSE if iterative update failed to converge, and converged=NA if option skip.iter=TRUE was used.

tnames

Names of the terms in model.

random

Random effects formula (same as input).

tau

Variance parameters such that sigma*sqrt(tau) gives standard deviation of random effects (if !is.null(random)).

blup

Best linear unbiased predictors (if !is.null(random)).

call

Called model in input formula.

Warnings

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting.

When using rounding parameters, output fitted.values corresponds to unique rounded predictor scores in output xunique. Use predict.bigssa function to get fitted values for full yvar vector.

Computational Details

To estimate $η$ I minimize the penalized least-squares functional $\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - η (x_{i}))}^{2} + λ J (η)$ where $J (\cdot)$ is a nonnegative penalty functional quantifying the roughness of $η$ and $λ > 0$ is a smoothing parameter controlling the trade-off between fitting and smoothing the data. Note that for $p > 1$ nonparametric predictors, there are additional $θ_{k}$ smoothing parameters embedded in $J$ .

The penalized least squares functioncal can be rewritten as $‖ y - K d - J_{θ} c ‖^{2} + n λ c^{'} Q_{θ} c$ where $K = {ϕ (x_{i})}_{n \times m}$ is the null (parametric) space basis function matrix, $J_{θ} = \sum_{k = 1}^{s} θ_{k} J_{k}$ with $J_{k} = {ρ_{k} (x_{i}, x_{h}^{*})}_{n \times q}$ denoting the $k$ -th contrast space basis funciton matrix, $Q_{θ} = \sum_{k = 1}^{s} θ_{k} Q_{k}$ with $Q_{k} = {ρ_{k} (x_{g}^{*}, x_{h}^{*})}_{q \times q}$ denoting the $k$ -th penalty matrix, and $d = (d_{0}, \dots, d_{m})^{'}$ and $c = (c_{1}, \dots, c_{q})^{'}$ are the unknown basis function coefficients. The optimal smoothing parameters are chosen by minimizing the GCV score (see bigspline).

Note that this function uses the efficient SSA reparameterization described in Helwig (2013) and Helwig and Ma (2015); using is parameterization, there is one unique smoothing parameter per predictor ( $γ_{j}$ ), and these $γ_{j}$ parameters determine the structure of the $θ_{k}$ parameters in the tensor product space. To evaluate the GCV score, this function uses the improved (scalable) SSA algorithm discussed in Helwig (2013) and Helwig and Ma (2015).

Skip Iteration

For $p > 1$ predictors, initial values for the $γ_{j}$ parameters (that determine the structure of the $θ_{k}$ parameters) are estimated using the smart starting algorithm described in Helwig (2013) and Helwig and Ma (2015).

Default use of this function (skip.iter=TRUE) fixes the $γ_{j}$ parameters afer the smart start, and then finds the global smoothing parameter $λ$ (among the input lambdas) that minimizes the GCV score. This approach typically produces a solution very similar to the more optimal solution using skip.iter=FALSE.

Setting skip.iter=FALSE uses the same smart starting algorithm as setting skip.iter=TRUE. However, instead of fixing the $γ_{j}$ parameters afer the smart start, using skip.iter=FALSE iterates between estimating the optimal $λ$ and the optimal $γ_{j}$ parameters. The R function nlm is used to minimize the GCV score with respect to the $γ_{j}$ parameters, which can be time consuming for models with many predictors and/or a large number of knots.

Random Effects

The input random adds random effects to the model assuming a variance components structure. Both nested and crossed random effects are supported. In all cases, the random effects are assumed to be indepedent zero-mean Gaussian variables with the variance depending on group membership.

Random effects are distinguished by vertical bars ("|"), which separate expressions for design matrices (left) from group factors (right). For example, the syntax ~1|group includes a random intercept for each level of group, whereas the syntax ~1+x|group includes both a random intercept and a random slope for each level of group. For crossed random effects, parentheses are needed to distinguish different terms, e.g., ~(1|group1)+(1|group2) includes a random intercept for each level of group1 and a random intercept for each level of group2, where both group1 and group2 are factors. For nested random effects, the syntax ~group|subject can be used, where both group and subject are factors such that the levels of subject are nested within those of group.

The input remlalg determines the REML algorithm used to estimate the variance components. Setting remlalg="FS" uses a Fisher Scoring algorithm (default). Setting remlalg="NR" uses a Newton-Raphson algorithm. Setting remlalg="EM" uses an Expectation Maximization algorithm. Use remlalg="none" to fit a model with known variance components (entered through remltau).

The input remliter sets the maximum number of iterations for the REML estimation. The input remltol sets the convergence tolerance for the REML estimation, which is determined via relative change in the REML log-likelihood. The input remltau sets the initial estimates of variance parameters; default is remltau = rep(1,ntau) where ntau is the number of variance components.

Details

The formula syntax is similar to that used in lm and many other R regression functions. Use y~x to predict the response y from the predictor x. Use y~x1+x2 to fit an additive model of the predictors x1 and x2, and use y~x1*x2 to fit an interaction model. The syntax y~x1*x2 includes the interaction and main effects, whereas the syntax y~x1:x2 is not supported. See Computational Details for specifics about how nonparametric effects are estimated.

See bigspline for definitions of type="cub", type="cub0", and type="per" splines, which can handle one-dimensional predictors. See Appendix of Helwig and Ma (2015) for information about type="tps" and type="nom" splines. Note that type="tps" can handle one-, two-, or three-dimensional predictors. I recommend using type="cub" if the predictor scores have no extreme outliers; when outliers are present, type="tps" may produce a better result.

Using the rounding parameter input rparm can greatly speed-up and stabilize the fitting for large samples. For typical cases, I recommend using rparm=0.01 for cubic and periodic splines, but smaller rounding parameters may be needed for particularly jagged functions. For thin-plate splines, the data are NOT transformed to the interval [0,1] before fitting, so the rounding parameter should be on the raw data scale. Also, for type="tps" you can enter one rounding parameter for each predictor dimension. Use rparm=1 for ordinal and nominal splines.

References

Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer.

Helwig, N. E. (2013). Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. (2016). Efficient estimation of variance components in nonparametric mixed-effects models with large samples. Statistics and Computing, 26, 1319-1336.

Helwig, N. E. (2017). Regression with ordered predictors via ordinal smoothing splines. Frontiers in Applied Mathematics and Statistics, 3(15), 1-13.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics, 24, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, 9, 433-444.

Examples

Run this code

# NOT RUN {
##########   EXAMPLE 1   ##########

# define univariate function and data
set.seed(773)
myfun <- function(x){ sin(2*pi*x) }
x <- runif(500)
y <- myfun(x) + rnorm(500)

# cubic, periodic, and thin-plate spline models with 20 knots
cubmod <- bigssa(y~x,type="cub",nknots=20,se.fit=TRUE)
cubmod
permod <- bigssa(y~x,type="per",nknots=20,se.fit=TRUE)
permod
tpsmod <- bigssa(y~x,type="tps",nknots=20,se.fit=TRUE)
tpsmod


##########   EXAMPLE 2   ##########

# function with two continuous predictors
set.seed(773)
myfun <- function(x1v,x2v){sin(2*pi*x1v)+log(x2v+.1)+cos(pi*(x1v-x2v))}
x1v <- runif(500)
x2v <- runif(500)
y <- myfun(x1v,x2v) + rnorm(500)

# cubic splines with 50 randomly selected knots
intmod <- bigssa(y~x1v*x2v,type=list(x1v="cub",x2v="cub"),nknots=50)
intmod
crossprod( myfun(x1v,x2v) - intmod$fitted.values )/500

# fit additive model (with same knots)
addmod <- bigssa(y~x1v+x2v,type=list(x1v="cub",x2v="cub"),nknots=50)
addmod
crossprod( myfun(x1v,x2v) - addmod$fitted.values )/500


##########   EXAMPLE 3   ##########

# function with two continuous and one nominal predictor (3 levels)
set.seed(773)
myfun <- function(x1v,x2v,x3v){
  fval <- rep(0,length(x1v))
  xmeans <- c(-1,0,1)
  for(j in 1:3){
    idx <- which(x3v==letters[j])
    fval[idx] <- xmeans[j]
  }
  fval[idx] <- fval[idx] + cos(4*pi*(x1v[idx]))
  fval <- (fval + sin(3*pi*x1v*x2v+pi)) / sqrt(2)
}
x1v <- runif(500)
x2v <- runif(500)
x3v <- sample(letters[1:3],500,replace=TRUE)
y <- myfun(x1v,x2v,x3v) + rnorm(500)

# 3-way interaction with 50 knots
cuimod <- bigssa(y~x1v*x2v*x3v,type=list(x1v="cub",x2v="cub",x3v="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - cuimod$fitted.values )/500

# fit correct interaction model with 50 knots
cubmod <- bigssa(y~x1v*x2v+x1v*x3v,type=list(x1v="cub",x2v="cub",x3v="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - cubmod$fitted.values )/500

# fit model using 2-dimensional thin-plate and nominal
x1new <- cbind(x1v,x2v)
x2new <- x3v
tpsmod <- bigssa(y~x1new*x2new,type=list(x1new="tps",x2new="nom"),nknots=50)
crossprod( myfun(x1v,x2v,x3v) - tpsmod$fitted.values )/500


##########   EXAMPLE 4   ##########

# function with four continuous predictors
set.seed(773)
myfun <- function(x1v,x2v,x3v,x4v){
  sin(2*pi*x1v) + log(x2v+.1) + x3v*cos(pi*(x4v))
  }
x1v <- runif(500)
x2v <- runif(500)
x3v <- runif(500)
x4v <- runif(500)
y <- myfun(x1v,x2v,x3v,x4v) + rnorm(500)

# fit cubic spline model with x3v*x4v interaction
cubmod <- bigssa(y~x1v+x2v+x3v*x4v,type=list(x1v="cub",x2v="cub",x3v="cub",x4v="cub"),nknots=50)
crossprod( myfun(x1v,x2v,x3v,x4v) - cubmod$fitted.values )/500

# }

Run the code above in your browser using DataLab

Last chance! 50% off unlimited learning