rgam: Fit reluctant generalized additive model

Description

Fits a reluctant generalized additive model (RGAM) for an entire regularization path indexed by the parameter lambda. Fits linear, logistic, Poisson and Cox regression models. RGAM is a three-step algorithm: Step 1 fits the lasso and computes residuals, Step 2 constructs the non-linear features, and Step 3 fits a lasso of the response on both the linear and non-linear features.

Usage

rgam(x, y, lambda = NULL, lambda.min.ratio = ifelse(nrow(x) < ncol(x),
  0.01, 1e-04), standardize = TRUE, family = c("gaussian", "binomial",
  "poisson", "cox"), offset = NULL, init_nz, removeLin = TRUE,
  nfolds = 5, foldid = NULL, df = 4, gamma, tol = 0.01,
  parallel = FALSE, verbose = TRUE)

Arguments

Input matrix, of dimension nobs x nvars; each row is an observation vector.

Response variable. Quantitative for family = "gaussian" or family = "poisson" (non-negative counts). For family="binomial", should be a numeric vector consisting of 0s and 1s. For family="cox", y should be a two-column matrix with columns named 'time' and 'status'. The latter is a binary variable, with '1' indicating death, and '0' indicating right-censored.

lambda

A user-supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence; supplying a value of lambda overrides this.

lambda.min.ratio

Smallest value for lambda as a fraction of the largest lambda value. The default depends on the sample size nobs relative to the number of variables nvars. If nobs > nvars, the default is 0.0001, close to zero. If nobs < nvars, the default is 0.01.

standardize

If TRUE (default), the columns of the input matrix are standardized before the algorithm is run. See details section for more information.

family

Response type. Either "gaussian" (default) for linear regression, "binomial" for logistic regression, "poisson" for Poisson regression or "cox" for Cox regression.

offset

A vector of length nobs. Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is NULL. If supplied, then values must also be supplied to the predict function.

init_nz

A vector specifying which features we must include when computing the non-linear features. Default is to construct non-linear features for all given features.

removeLin

When constructing the non-linear features, do we remove the linear component from them? Default is TRUE.

nfolds

Number of folds for CV in Step 1 (default is 5). Although nfolds can be as large as the sample size (leave-one-out CV), it is not recommended for large datasets. Smallest value allowable is nfolds = 3.

foldid

An optional vector of values between 1 and nfolds identifying what fold each observation is in. If supplied, nfolds can be missing.

Degrees of freedom for the non-linear fit in Step 2. Default is 4.

gamma

Scale factor for non-linear features (vs. original features), to be between 0 and 1. Default is 0.8 if init_nz = c(), 0.6 otherwise.

tol

Parameter to be passed to smooth.spline: a tolerance for same-ness or uniqueness of the x values. Default is 0.01. See smooth.spline documentation for more details.

parallel

If TRUE, the cv.glmnet() call in Step 1 is parallelized. Must register parallel before hand, such as doMC or others. Default is FALSE.

verbose

If TRUE (default), model-fitting is tracked with a progress bar.

Value

An object of class "rgam".

full_glmfit

The glmnet object resulting from Step 3: fitting a glmnet model for the response against the linear & non-linear features.

spline_fit

List of spline fits for residual against each response. Needed for predicting on new data.

lin_comp_fit

If removeLin = TRUE, a list of coefficients for simple linear regression of non-linear feature on original feature. Needed for predicting on new data.

init_nz

Column indices for the features which we allow to have non-linear relationship with the response.

step1_nz

Indices of features which CV in Step 1 chose.

removeLin

Did we remove the linear components when constructing the non-linear features? Needed for predicting on new data.

mxf

Means of the features (both linear and non-linear).

sxf

Scale factors of the features (both linear and non-linear).

feat

Column indices of the non-zero features for each value of lambda.

linfeat

Column indices of the non-zero linear components for each value of lambda.

nonlinfeat

Column indices of the non-zero non-linear components for each value of lambda.

nzero_feat

The number of non-zero features for each value of lambda.

nzero_lin

The number of non-zero linear components for each value of lambda.

nzero_nonlin

The number of non-zero non-linear components for each value of lambda.

lambda

The actual sequence of lambda values used.

The number of features in the original data matrix.

family

Response type.

call

The call that produced this object.

Details

If there are variables which the user definitely wants to compute non-linear versions for in Step 2 of the algorithm, they can be specified as a vector for the init_nz option. The algorithm will compute non-linear versions for these features as well as the features suggested by Step 1 of the algorithm.

If standardize = TRUE, the standard deviation of the linear and non-linear features would be 1 and gamma respectively. If standardize = FALSE, linear features will remain on their original scale while non-linear features would have standard deviation gamma times the mean standard deviation of the linear features.

For family="gaussian", rgam standardizes y to have unit variance (using 1/n rather than 1/(n-1) formula).

Examples

Run this code

# NOT RUN {
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)

fit <- rgam(x, y)

# construct non-linear features for only those selected by Step 1
fit <- rgam(x, y, init_nz = c())

# specify scale factor gamma and degrees of freedom
fit <- rgam(x, y, gamma = 1, df = 6)

# binomial family
bin_y <- ifelse(y > 0, 1, 0)
fit2 <- rgam(x, bin_y, family = "binomial")

# Poisson family
poi_y <- rpois(n, exp(x %*% beta))
fit3 <- rgam(x, poi_y, family = "poisson")
# Poisson with offset
offset <- rnorm(n)
fit3 <- rgam(x, poi_y, family = "poisson", offset = offset)

# }

Run the code above in your browser using DataLab