dglars: dgLARS solution curve for GLM

Description

The dglars function estimates the solution curve implicitly defined by the dgLARS method for logistic and Poisson regression models.

Usage

dglars(formula, family = c("binomial", "poisson"), data, 
subset, contrast = NULL, control = list())
dglars.fit(X, y, family = c("binomial", "poisson"), 
control = list())

Arguments

formula

an object of class "formula": a symbolic description of the model to be fitted.

family

a description of the error distribution used in the model (see below for more details).

data

an optional data frame, list or environment (or object coercible by 'as.data.frame' to a data frame) containing the variables in the model. If not found in 'data', the variables are taken from 'environment(formula)'.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

contrast

an optional list. See the 'contrasts.arg' of 'model.matrix.default'.

control

a list of control parameters. See 'Details'.

X

design matrix of dimension $n\times p$.

y

response vector.

Value

dglars returns an object of S3 class "dglars", i.e. a list containing the following components:

call

the call that produced this object;

family

a description of the error distribution used in the model;

np

the number of points of the dgLARS solution curve;

beta

the $(p+1)\times np$ matrix corresponding to the dgLARS solution curve;

ru

the matrix of Rao's score test statistics of the variables included in the final model. This component is reported only if the predictor-corrector algorithm is used;

dev

the np-dimensional vector of deviances corresponding to the values of the tuning parameter $\gamma$;

df

the sequence of the number of nonzero coefficients for each value of the tuning parameter $\gamma$;

g

the sequence of $\gamma$ values used to compute the solution curve;

X

the design matrix used;

y

the response vector used;

action

an np-dimensional character vector showing how the active set changes at each value of the tuning parameter $\gamma$;

conv

an integer value used to encode the warnings and the errors related to the algorithm used to compute the solution curve. The values returned are:

0: convergence of the algorithm has been achieved,
1: problem related to the predictor-corrector method: error in predictor step,
2: problem related to the predictor-corrector method: error in corrector step,
3: maximum number of iterations has been reached,
4: error in dynamic memory allocation;

control

the list of control parameters used to compute the dgLARS solution curve.
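As a small illustration of how the conv component might be checked after a fit, the sketch below maps the integer codes listed above to short messages. The helper name describe_conv and the message wording are hypothetical; they are not part of the dglars package.

```r
# Hypothetical helper (not part of dglars): translate the integer stored in
# the 'conv' component into a short message, following the codes listed above.
conv_messages <- c(
  "0" = "convergence achieved",
  "1" = "predictor-corrector method: error in predictor step",
  "2" = "predictor-corrector method: error in corrector step",
  "3" = "maximum number of iterations reached",
  "4" = "error in dynamic memory allocation"
)

describe_conv <- function(conv) {
  msg <- conv_messages[as.character(conv)]
  if (is.na(msg)) stop("unknown conv code: ", conv)
  unname(msg)
}

describe_conv(0)
describe_conv(3)
```

With a fitted object this would be called as describe_conv(fit$conv).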

Details

The dglars function implements the differential geometric generalization of the least angle regression method (Efron et al., 2004) proposed in Augugliaro et al. (2013). The current version of the package can be used to estimate the solution curve for a logistic regression model (family = "binomial") and for a Poisson regression model (family = "poisson").

dglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when $p > n$.
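The following sketch shows one way to prepare the design matrix once and pass it to dglars.fit. It assumes that dglars.fit expects X without an intercept column (the (p+1)-row beta matrix suggests the intercept is handled internally); the data are simulated purely for illustration, and the call to the package is skipped when dglars is not installed.

```r
# Simulated data, for illustration only
set.seed(1)
dat <- data.frame(y = rbinom(50, 1, 0.5), x1 = rnorm(50), x2 = rnorm(50))

# Build the design matrix once with base R and drop the intercept column
# (assumption: dglars.fit adds the intercept itself)
X <- model.matrix(y ~ ., data = dat)[, -1, drop = FALSE]
y <- dat$y

# Fit only if the package is available
if (requireNamespace("dglars", quietly = TRUE)) {
  fit <- dglars::dglars.fit(X, y, family = "binomial")
}
```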

The dgLARS solution curve can be estimated using two different algorithms: the predictor-corrector method and the cyclic coordinate descent method (see below for more details about the control parameter algorithm). The first algorithm is based on two steps. In the first step, called the predictor step, an approximation of the point that lies on the solution curve is computed. If the control parameter dg_max is equal to zero, this step also computes an approximation of the optimal step size using a generalization of the method proposed in Efron et al. (2004). The optimal step size is defined as the reduction of the tuning parameter, denoted by $d\gamma$, such that at $\gamma-d\gamma$ there is a change in the active set. In the second step, called the corrector step, a Newton-Raphson algorithm is used to correct the approximation computed in the predictor step. The main drawback of this algorithm is that the number of arithmetic operations required to compute the approximation of the point on the solution curve scales as the cube of the number of variables, which makes it cumbersome in a high-dimensional setting. To overcome this problem, the second algorithm computes the dgLARS solution curve using an adaptive version of the cyclic coordinate descent method proposed in Friedman et al. (2010).
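To make the predictor/corrector idea concrete, here is a toy sketch of generic predictor-corrector path following on a scalar curve. It is not the dgLARS system of equations: the curve F(x, g) = x^3 + x - g = 0, the function names and the fixed step size are all assumptions chosen for illustration. Only the argument names nNR and NReps are borrowed from the control parameters of this page.

```r
# Toy curve (assumption, not dgLARS): F(x, g) = x^3 + x - g = 0,
# followed from g = 2 (where x = 1 exactly) down to g = 0.
F  <- function(x, g) x^3 + x - g
Fx <- function(x) 3 * x^2 + 1          # partial derivative of F w.r.t. x

follow_curve <- function(g_start = 2, g_end = 0, dg = 0.25,
                         nNR = 50, NReps = 1e-06) {
  g <- g_start
  x <- 1                               # exact solution at g = 2
  path <- data.frame(g = g, x = x)
  while (g > g_end) {
    step  <- min(dg, g - g_end)        # do not step past g_end
    g_new <- g - step
    # Predictor step: Euler approximation, dx/dg = 1 / Fx(x)
    x_new <- x - step / Fx(x)
    # Corrector step: Newton-Raphson at the fixed value g_new
    for (i in seq_len(nNR)) {
      delta <- F(x_new, g_new) / Fx(x_new)
      x_new <- x_new - delta
      if (abs(delta) < NReps) break
    }
    g <- g_new
    x <- x_new
    path <- rbind(path, data.frame(g = g, x = x))
  }
  path
}

path <- follow_curve()
path
```

Each row of path is a point on the toy curve: the predictor gives a cheap first guess and the Newton-Raphson loop pulls it back onto the curve before the tuning parameter is reduced again.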

The control argument is a list that can supply any of the following components:

algorithm

a string to specify the algorithm used to fit the dgLARS solution curve. If algorithm = "pc" (default) the predictor-corrector method is used, while the cyclic coordinate descent method is used if algorithm = "ccd";

method

a string to specify the method used to define the dgLARS solution curve. If method = "dgLASSO" (default) the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLAR", the differential geometric generalization of the least angle regression method is used;

nv

control parameter for the pc algorithm. An integer value belonging to the interval $[1, \min(n,p)]$ used to specify the maximum number of variables included in the final model. Default is nv = min(n-1,p);

np

control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to $50 \cdot \min(n-1,p)$ (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter $\gamma$;

g0

control parameter for the pc/ccd algorithm. Sets the smallest value for the tuning parameter $\gamma$. Default is g0 = ifelse(p;

dg_max

control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. With dg_max = 0 (default), the predictor-corrector algorithm computes an approximation of the optimal step size (see Augugliaro et al. (accepted) for more details);

nNR

control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 50;

NReps

control parameter for the pc algorithm. A non-negative value used to define the convergence of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct

control parameter for the pc algorithm. When one of the following conditions is satisfied

i. the Newton-Raphson algorithm does not converge,

ii. there exists a non-active variable such that, at the solution point, the absolute value of the corresponding Rao's score test statistic is greater than $\gamma + eps$,

then the step size ($d\gamma$) is reduced by setting $d\gamma = cf \cdot d\gamma$ and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials of the corrector step. Default is ncrct = 50;

cf

control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval $[0,1]$ used to reduce the step size as previously described. Default is cf = 0.5;

nccd

control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is nccd = 1.0e+05;

eps

control parameter for the pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the dgLARS solution curve, namely

i. when algorithm = "pc", eps is used

a. to identify a variable that will be included in the active set, i.e. when the absolute value of the corresponding Rao's score test statistic belongs to $[\gamma - eps, \gamma + eps]$;

b. as previously described, to establish if the corrector step must be repeated;

c. to define the convergence of the algorithm, i.e. when the current value of the tuning parameter belongs to the interval $[g0 - eps, g0 + eps]$;

ii. when algorithm = "ccd", eps is used to define the convergence of a single solution point, i.e. each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
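The sketch below assembles two control lists using only the component names documented in this section, and shows how the step size shrinks when the corrector step is repeated with cf = 0.5. The object names (ctrl_pc, ctrl_ccd, dg_trials) and the starting step size of 1 are illustrative assumptions. The simulated data mirror the Examples section, and the fits are skipped when dglars is not installed.

```r
# Control list for the predictor-corrector algorithm
ctrl_pc <- list(
  algorithm = "pc",     # predictor-corrector method
  method    = "dgLAR",  # instead of the default "dgLASSO"
  nNR       = 50,       # max Newton-Raphson iterations in the corrector step
  NReps     = 1e-06,    # Newton-Raphson convergence threshold
  ncrct     = 50,       # max number of trials of the corrector step
  cf        = 0.5,      # contraction factor for the step size
  eps       = 1e-05
)

# Control list for the cyclic coordinate descent algorithm
ctrl_ccd <- list(algorithm = "ccd", np = 100, nccd = 1e+05, eps = 1e-05)

# Step sizes tried after 0, 1, 2, 3 repeated corrector steps: each repeat
# multiplies the step size by cf (the starting value 1 is arbitrary)
dg_trials <- 1 * ctrl_pc$cf^(0:3)
dg_trials

if (requireNamespace("dglars", quietly = TRUE)) {
  set.seed(123)
  X <- matrix(rnorm(100 * 10), 100, 10)
  y <- rbinom(100, 1, binomial()$linkinv(1 + 2 * X[, 1]))
  fit_pc  <- dglars::dglars.fit(X, y, family = "binomial", control = ctrl_pc)
  fit_ccd <- dglars::dglars.fit(X, y, family = "binomial", control = ctrl_ccd)
}
```

Components left out of the list keep their defaults, so only the values that differ from the defaults need to be supplied.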

References

Augugliaro L., Mineo A.M. and Wit E.C. (2014) dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol. 59(8), 1-40. http://www.jstatsoft.org/v59/i08/.

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B, Vol. 75(3), 471-498.

Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceedings of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.

Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.

Friedman J., Hastie T. and Tibshirani R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.

Examples


#############################
# Logistic regression model #
#############################

library(dglars)
set.seed(123)

# low dimensional setting
n <- 100
p <- 10
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
                              control = list(algorithm = "ccd")))

dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data = dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
                          control = list(algorithm = "ccd"), data = dataset))

# high dimensional setting
n <- 100
p <- 1000
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
                              control = list(algorithm = "ccd")))

dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data = dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
                          control = list(algorithm = "ccd"), data = dataset))
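After fitting, the components described in the Value section can be inspected directly. The sketch below is one possible way to tabulate the path. The helper summarise_path is hypothetical (not part of dglars) and assumes that g, df and dev have the same length. The mock object holds made-up numbers so the helper can be run without the package.

```r
# Hypothetical helper: one row per point of the solution curve, built from
# the components g, df and dev of a fitted object
summarise_path <- function(fit) {
  data.frame(g = fit$g, df = fit$df, dev = fit$dev)
}

# Mock object with made-up values, only to show the shape of the output
mock_fit <- list(g = c(0.9, 0.5, 0.1), df = c(1, 2, 3), dev = c(120, 95, 80))
path_table <- summarise_path(mock_fit)
path_table

# With the package installed, the same helper can be applied to a real fit
if (requireNamespace("dglars", quietly = TRUE)) {
  set.seed(123)
  X <- matrix(rnorm(100 * 10), 100, 10)
  y <- rbinom(100, 1, binomial()$linkinv(1 + 2 * X[, 1]))
  fit <- dglars::dglars.fit(X, y, family = "binomial")
  print(head(summarise_path(fit)))
  print(fit$conv)  # 0 indicates that convergence has been achieved
}
```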
