Fits a reluctant generalized additive model (RGAM) for an entire regularization
path indexed by the parameter lambda
. Fits linear, logistic, Poisson
and Cox regression models. RGAM is a three-step algorithm: Step 1 fits the
lasso and computes residuals, Step 2 constructs the non-linear features, and
Step 3 fits a lasso of the response on both the linear and non-linear features.
rgam(x, y, lambda = NULL, lambda.min.ratio = ifelse(nrow(x) < ncol(x),
0.01, 1e-04), standardize = TRUE, family = c("gaussian", "binomial",
"poisson", "cox"), offset = NULL, init_nz, removeLin = TRUE,
nfolds = 5, foldid = NULL, df = 4, gamma, tol = 0.01,
parallel = FALSE, verbose = TRUE)
Input matrix, of dimension nobs x nvars
; each row is
an observation vector.
Response variable. Quantitative for family = "gaussian"
or
family = "poisson"
(non-negative counts). For family="binomial"
,
should be a numeric vector consisting of 0s and 1s. For family="cox"
,
y should be a two-column matrix with columns named 'time' and 'status'.
The latter is a binary variable, with '1' indicating death, and '0'
indicating right-censored.
A user-supplied lambda
sequence. Typical usage is to
have the program compute its own lambda
sequence; supplying a value of
lambda overrides this.
Smallest value for lambda as a fraction of the largest lambda value. The default depends on the sample size nobs relative to the number of variables nvars. If nobs > nvars, the default is 0.0001, close to zero. If nobs < nvars, the default is 0.01.
If TRUE
(default), the columns of the input matrix
are standardized before the algorithm is run. See details section for more
information.
Response type. Either "gaussian"
(default) for linear
regression, "binomial"
for logistic regression, "poisson"
for
Poisson regression or "cox"
for Cox regression.
A vector of length nobs. Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is NULL. If supplied, then values must also be supplied to the predict function.
A vector specifying which features we must include when computing the non-linear features. Default is to construct non-linear features for all given features.
When constructing the non-linear features, do we remove
the linear component from them? Default is TRUE
.
Number of folds for CV in Step 1 (default is 5). Although
nfolds
can be as large as the sample size (leave-one-out CV), it is
not recommended for large datasets. Smallest value allowable is nfolds = 3
.
An optional vector of values between 1 and nfolds
identifying what fold each observation is in. If supplied, nfolds
can
be missing.
Degrees of freedom for the non-linear fit in Step 2. Default is 4.
Scale factor for non-linear features (vs. original features), to
be between 0 and 1. Default is 0.8 if init_nz = c()
, 0.6 otherwise.
Parameter to be passed to smooth.spline
: a tolerance for
same-ness or uniqueness of the x values. Default is 0.01. See
smooth.spline
documentation for more details.
If TRUE, the cv.glmnet()
call in Step 1 is
parallelized. Must register parallel before hand, such as doMC or others.
Default is FALSE.
If TRUE
(default), model-fitting is tracked with a
progress bar.
An object of class "rgam"
.
The glmnet object resulting from Step 3: fitting a glmnet
model for the response against the linear & non-linear features.
List of spline fits for residual against each response. Needed for predicting on new data.
If removeLin = TRUE
, a list of coefficients for
simple linear regression of non-linear feature on original feature. Needed
for predicting on new data.
Column indices for the features which we allow to have non-linear relationship with the response.
Indices of features which CV in Step 1 chose.
Did we remove the linear components when constructing the non-linear features? Needed for predicting on new data.
Means of the features (both linear and non-linear).
Scale factors of the features (both linear and non-linear).
Column indices of the non-zero features for each value of
lambda
.
Column indices of the non-zero linear components for each value of
lambda
.
Column indices of the non-zero non-linear components for each value
of lambda
.
The number of non-zero features for each value of
lambda
.
The number of non-zero linear components for each value of
lambda
.
The number of non-zero non-linear components for each value
of lambda
.
The actual sequence of lambda
values used.
The number of features in the original data matrix.
Response type.
The call that produced this object.
If there are variables which the user definitely wants to compute non-linear
versions for in Step 2 of the algorithm, they can be specified as a vector for
the init_nz
option. The algorithm will compute non-linear versions for
these features as well as the features suggested by Step 1 of the algorithm.
If standardize = TRUE
, the standard deviation of the linear and
non-linear features would be 1 and gamma respectively. If
standardize = FALSE
, linear features will remain on their original
scale while non-linear features would have standard deviation gamma times
the mean standard deviation of the linear features.
For family="gaussian"
, rgam
standardizes y
to have unit
variance (using 1/n
rather than 1/(n-1)
formula).
# NOT RUN {
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)
fit <- rgam(x, y)
# construct non-linear features for only those selected by Step 1
fit <- rgam(x, y, init_nz = c())
# specify scale factor gamma and degrees of freedom
fit <- rgam(x, y, gamma = 1, df = 6)
# binomial family
bin_y <- ifelse(y > 0, 1, 0)
fit2 <- rgam(x, bin_y, family = "binomial")
# Poisson family
poi_y <- rpois(n, exp(x %*% beta))
fit3 <- rgam(x, poi_y, family = "poisson")
# Poisson with offset
offset <- rnorm(n)
fit3 <- rgam(x, poi_y, family = "poisson", offset = offset)
# }
Run the code above in your browser using DataLab