dglars: dgLARS solution curve for GLM

Description

The dglars function estimates the solution curve implicitly defined by the dgLARS method for logistic and Poisson regression models.

Usage

dglars(formula, family = c("binomial", "poisson"), data,
       subset, contrast = NULL, control = list())
dglars.fit(X, y, family = c("binomial", "poisson"),
           control = list())

Arguments

formula

an object of class "formula": a symbolic description of the model to be fitted.

family

a description of the error distribution used in the model (see below for more details).

data

an optional data frame, list or environment (or object coercible by 'as.data.frame' to a data frame) containing the variables in the model. If not found in 'data', the variables are taken from 'environment(formula)'.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

contrast

an optional list. See the 'contrasts.arg' of 'model.matrix.default'.

control

a list of control parameters. See 'Details'.

X

design matrix of dimension $n\times p$.

y

response vector.

Value

dglars returns an object with S3 class "dglars", i.e. a list containing the following components:

call

the call that produced this object.

family

a description of the error distribution used in the model.

np

the number of points of the dgLARS solution curve.

beta

the $(p+1)\times np$ matrix corresponding to the dgLARS solution curve.

ru

the matrix of Rao's score test statistics of the variables included in the final model. This component is reported only if the predictor-corrector algorithm is used.

dev

the np-dimensional vector of the deviance corresponding to the values of the tuning parameter $\gamma$.

df

the sequence of the number of nonzero coefficients for each value of the tuning parameter $\gamma$.

g

the sequence of $\gamma$ values used to compute the solution curve.

X

the design matrix used.

y

the response vector used.

action

an np-dimensional character vector showing how the active set changes at each value of the tuning parameter $\gamma$.

conv

an integer value used to encode the warnings and the errors related to the algorithm used to compute the solution curve.

control

the list of control parameters used to compute the dgLARS solution curve.
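As a minimal sketch of how the returned components might be inspected, the block below fits a small simulated logistic model and prints a few of the components listed above. The simulated data and the helper vector `components` are illustrative assumptions, and the fit is only attempted if the dglars package is installed.

```r
# Names of a few components described in the Value section
components <- c("np", "beta", "g", "df", "dev", "conv")

if (requireNamespace("dglars", quietly = TRUE)) {
  set.seed(123)
  n <- 100
  p <- 5
  X <- matrix(rnorm(n * p), n, p)                       # simulated design matrix (assumption)
  y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1])) # simulated binary response

  fit <- dglars::dglars.fit(X, y, family = "binomial")

  fit$np          # number of points on the solution curve
  dim(fit$beta)   # documented as (p + 1) x np
  head(fit$g)     # first values of the tuning parameter
  head(fit$df)    # number of nonzero coefficients along the curve
  fit$conv        # integer code for warnings and errors
  str(fit[components], max.level = 1)
}
```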

Details

The dglars function implements the differential geometric generalization of the least angle regression method (Efron et al., 2004) proposed in Augugliaro et al. (2013). The current version of the package can be used to estimate the solution curve for a logistic regression model (family = "binomial") and for a Poisson regression model (family = "poisson").

dglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason, we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when $p>n$.
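One way to prepare the inputs for dglars.fit is sketched below: the design matrix is built once with model.matrix and then passed directly. The data frame and its column names are hypothetical, and dropping the intercept column is an assumption based on the documented dimensions ($n\times p$ for X, $(p+1)\times np$ for beta).

```r
set.seed(1)

# Hypothetical data with two numeric predictors and one factor
dataset <- data.frame(x1 = rnorm(20), x2 = rnorm(20),
                      grp = factor(rep(c("a", "b"), 10)))
dataset$y <- rbinom(20, 1, 0.5)

# Build the design matrix once; factors are expanded into dummy columns
mm <- model.matrix(y ~ ., data = dataset)

# Assumption: dglars.fit handles the intercept itself, so drop that column
X <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
y <- dataset$y

# Fit only if the package is available
if (requireNamespace("dglars", quietly = TRUE)) {
  fit <- dglars::dglars.fit(X, y, family = "binomial")
}
```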

The dgLARS solution curve can be estimated using two different algorithms: the predictor-corrector method and the cyclic coordinate descent method (see below for more details about the control parameter algorithm).

The first algorithm is based on two steps. In the first step, called the predictor step, an approximation of the point that lies on the solution curve is computed. If the control parameter dg_max is equal to zero, this step also computes an approximation of the optimal step size using a generalization of the method proposed in Efron et al. (2004). The optimal step size is defined as the reduction of the tuning parameter, denoted by $d\gamma$, such that at $\gamma-d\gamma$ there is a change in the active set. In the second step, called the corrector step, a Newton-Raphson algorithm is used to correct the approximation to the solution point computed in the previous step.

The main problem of this algorithm is that the number of arithmetic operations required to compute the approximation of the point that lies on the solution curve scales as the cube of the number of variables, which makes it cumbersome in a high-dimensional setting. To overcome this problem, the second algorithm computes the dgLARS solution curve using an adaptive version of the cyclic coordinate descent method proposed in Friedman et al. (2010).
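The two-step idea can be illustrated on a toy problem. The sketch below is a generic predictor-corrector path follower for a scalar curve implicitly defined by an equation; it is not the dgLARS implementation, and the equation, the fixed step size and all names (`curve_eq`, `curve_dx`, `g_seq`, `x_path`) are assumptions made for illustration.

```r
# Toy curve: x(g) solves curve_eq(x, g) = x^3 + x - g = 0
# (hypothetical equation, not the dgLARS estimating equations)
curve_eq <- function(x, g) x^3 + x - g
curve_dx <- function(x) 3 * x^2 + 1      # derivative of curve_eq with respect to x

g_seq <- seq(2, 0, by = -0.25)           # decreasing tuning parameter, fixed step
x_path <- numeric(length(g_seq))
x_path[1] <- 1                           # exact solution at g = 2

for (k in 2:length(g_seq)) {
  dg <- g_seq[k - 1] - g_seq[k]

  # Predictor step: first-order approximation along the curve,
  # using dx/dg = 1 / curve_dx(x) from implicit differentiation
  x_new <- x_path[k - 1] - dg / curve_dx(x_path[k - 1])

  # Corrector step: Newton-Raphson at the new value of g
  for (it in 1:50) {
    step <- curve_eq(x_new, g_seq[k]) / curve_dx(x_new)
    x_new <- x_new - step
    if (abs(step) < 1e-12) break
  }
  x_path[k] <- x_new
}

# Largest violation of the defining equation along the computed path
residual <- max(abs(curve_eq(x_path, g_seq)))
```

Because the predictor already lands close to the curve, the corrector typically needs only a few Newton-Raphson iterations per point.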

The control argument is a list that can supply any of several components, among them algorithm (the method used to compute the solution curve, e.g. "ccd" for cyclic coordinate descent) and dg_max (see the predictor step above).
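A short sketch of passing a control list follows. Only the component algorithm with the value "ccd" is taken from the examples on this page; the simulated data are an assumption, and the comparison runs only if the dglars package is installed.

```r
# Control list selecting the cyclic coordinate descent method
ctrl <- list(algorithm = "ccd")

if (requireNamespace("dglars", quietly = TRUE)) {
  set.seed(123)
  X <- matrix(rnorm(100 * 10), 100, 10)                  # simulated design matrix (assumption)
  y <- rbinom(100, 1, binomial()$linkinv(1 + 2 * X[, 1]))

  fit_default <- dglars::dglars.fit(X, y, family = "binomial")
  fit_ccd <- dglars::dglars.fit(X, y, family = "binomial", control = ctrl)

  # Number of points on each computed solution curve
  c(default = fit_default$np, ccd = fit_ccd$np)
}
```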

References

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society, Series B, Vol. 75(3), 471-498.

Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceedings of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.

Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.

Friedman J., Hastie T. and Tibshirani R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.

Examples


library(dglars)

#############################
# Logistic regression model #
#############################

set.seed(123)

# low dimensional setting
n <- 100
p <- 10
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
                              control = list(algorithm = "ccd")))

dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data = dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
                          control = list(algorithm = "ccd"), data = dataset))

# high dimensional setting
n <- 100
p <- 1000
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
                              control = list(algorithm = "ccd")))

dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data = dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
                          control = list(algorithm = "ccd"), data = dataset))
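
The examples above cover only the logistic model. A Poisson counterpart might look like the sketch below, which mirrors the low-dimensional setting; the coefficient values are an assumption chosen to keep the simulated counts moderate, and the fit is attempted only if the dglars package is installed.

```r
############################
# Poisson regression model #
############################

set.seed(123)

# low dimensional setting
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
b <- c(1, 0.5)                      # assumed intercept and slope
eta <- b[1] + X[, 1] * b[2]
mu <- poisson()$linkinv(eta)        # inverse log link: exp(eta)
y <- rpois(n, mu)

if (requireNamespace("dglars", quietly = TRUE)) {
  fit <- dglars::dglars.fit(X, y, family = "poisson")
  fit_ccd <- dglars::dglars.fit(X, y, family = "poisson",
                                control = list(algorithm = "ccd"))
}
```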
