The dglars function is used to estimate the solution curve implicitly defined by the dgLARS method for the logistic and Poisson regression models.
dglars(formula, family = c("binomial", "poisson"), data,
       subset, contrast = NULL, control = list())

dglars.fit(X, y, family = c("binomial", "poisson"),
           control = list())

dglars returns an object with S3 class "dglars", i.e. a list containing the following components:
an np dimensional vector of the deviances corresponding to the values of the tuning parameter $\gamma$;

an np dimensional vector of characters showing how the active set changes at each value of the tuning parameter $\gamma$;

The dglars function implements the differential geometric generalization of the least angle regression method (Efron et al., 2004) proposed
in Augugliaro et al. (2013). The current version of the package can be used to estimate the solution curve for a logistic regression model
(family = "binomial") and for a Poisson regression model (family = "poisson"). dglars.fit is the workhorse function: it is more efficient when the design matrix has already been computed. For this reason we suggest using this function
when the dgLARS method is applied in a high-dimensional setting, i.e. when $p > n$.
The dgLARS solution curve can be estimated using two different algorithms, i.e. the predictor-corrector method and the cyclic coordinate descent method (see below for
more details about the control parameter algorithm). The first algorithm is based on two steps. In the first step, called predictor step, an approximation of the point
that lies on the solution curve is computed. If the control parameter dg_max is equal to zero, an approximation of the optimal
step size is also computed in this step, using a generalization of the method proposed in Efron et al. (2004). The optimal step size is defined as the reduction of the tuning parameter, denoted by
$d\gamma$, such that at $\gamma-d\gamma$ there is a change in the active set. In the second step, called corrector step, a Newton-Raphson algorithm is used to
correct the approximation of the solution point computed in the previous step. The main drawback of this algorithm is that the number of arithmetic operations required to
compute the approximation of a point on the solution curve scales as the cube of the number of variables, which makes the algorithm cumbersome in a high-dimensional
setting. To overcome this problem, the second algorithm computes the dgLARS solution curve using an adaptive version of the cyclic coordinate descent method proposed
in Friedman et al. (2010).
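The trade-off described above can be sketched as follows; this is a minimal illustrative example, assuming the dglars package is installed, with purely simulated data:

```r
# Illustrative sketch: selecting the two algorithms via 'control'
# (assumes the dglars package is installed; data are simulated).
library(dglars)
set.seed(1)
n <- 50
p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(X[, 1]))
# predictor-corrector method (default)
fit_pc <- dglars.fit(X, y, family = "binomial",
                     control = list(algorithm = "pc"))
# cyclic coordinate descent method, preferable when p >> n
fit_ccd <- dglars.fit(X, y, family = "binomial",
                      control = list(algorithm = "ccd"))
```

In a genuinely high-dimensional setting the coordinate descent version avoids the cubic cost of the corrector step.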
The control argument is a list that can supply any of the following components:
algorithm: if algorithm = "pc" (default), the predictor-corrector method is used, while the cyclic coordinate descent method is used if algorithm = "ccd";
method: if method = "dgLASSO" (default), the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLAR", the differential geometric generalization of the least angle regression method is used;
nv: pc algorithm. An integer value in the interval $[1, \min(n, p)]$ used to specify the maximum number of variables included in the final model. Default is nv = min(n-1, p);
np: pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve, i.e. the number of values of the tuning parameter $\gamma$. For the predictor-corrector algorithm np is set to $50 \cdot \min(n-1, p)$ (default), while for the cyclic coordinate descent method it is set to 100 (default);
g0: pc/ccd algorithm. Sets the smallest value of the tuning parameter $\gamma$. Default is g0 = ifelse(p;
dg_max: pc algorithm. A non-negative value used to specify the maximum length of the step size. Setting dg_max = 0 (default), the predictor-corrector algorithm computes an approximation of the optimal step size (see Augugliaro et al. (2013) for more details);
nNR: pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 50;
NReps: pc algorithm. A non-negative value used to define the convergence of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;
ncrct: pc algorithm. When one of the following conditions is satisfied:
i. …
ii. … eps
then the step size ($d\gamma$) is reduced as $d\gamma = cf \cdot d\gamma$ and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials of the corrector step. Default is ncrct = 50;
cf: pc algorithm. The contraction factor is a real value in the interval $[0, 1]$ used to reduce the step size as previously described. Default is cf = 0.5;
nccd: ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is 1.0e+05;
eps: pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the dgLARS solution curve, namely:
i. if algorithm = "pc", eps is used:
a. …
b. …
c. …
ii. if algorithm = "ccd", eps is used to define the convergence of a single solution point, i.e. each inner coordinate-descent loop continues until the maximum change in Rao's score test statistic, after any coefficient update, is less than eps.
Default is eps = 1.0e-05.
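The control components above can be combined in a single list. A hedged sketch follows, assuming the dglars package is installed; the specific values are illustrative, not recommendations:

```r
# Illustrative sketch: tuning several control parameters at once
# (assumes the dglars package is installed; values are arbitrary).
library(dglars)
ctrl <- list(algorithm = "pc",      # predictor-corrector method
             method    = "dgLASSO",
             nNR       = 100,       # more Newton-Raphson iterations
             NReps     = 1.0e-08,   # tighter convergence tolerance
             cf        = 0.25)      # stronger step-size contraction
set.seed(2)
n <- 40
p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(0.5 * X[, 1]))
fit <- dglars.fit(X, y, family = "poisson", control = ctrl)
```

Unspecified components keep their defaults, so only the parameters being tuned need to appear in the list.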
Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.
Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceeding of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.
Friedman J., Hastie T. and Tibshirani R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.
See also the coef.dglars, plot.dglars, print.dglars and summary.dglars methods.
#############################
# Logistic regression model #
#############################
set.seed(123)
# low dimensional setting
n <- 100
p <- 10
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
control = list(algorithm = "ccd")))
dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data=dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
control = list(algorithm = "ccd"), data = dataset))
# high dimensional setting
n <- 100
p <- 1000
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
system.time(fit <- dglars.fit(X, y, family = "binomial"))
system.time(fit <- dglars.fit(X, y, family = "binomial",
control = list(algorithm = "ccd")))
dataset <- data.frame(x = X, y = y)
rm(X, y)
system.time(fit <- dglars(y ~ ., family = "binomial", data=dataset))
system.time(fit <- dglars(y ~ ., family = "binomial",
control = list(algorithm = "ccd"), data = dataset))