
dglars (version 1.0.5)

cvdglars: Cross-validation deviance for dgLARS

Description

Uses the $k$-fold cross-validation deviance to estimate the solution point of the dgLARS solution curve.

Usage

cvdglars(formula, family = c("binomial", "poisson"), data, subset, contrast = NULL, control = list())
cvdglars.fit(X, y, family = c("binomial", "poisson"), control = list())

Arguments

formula
an object of class "formula": a symbolic description of the model to be fitted.
family
a description of the error distribution used in the model (see below for more details).
data
an optional data frame, list or environment (or object coercible by 'as.data.frame' to a data frame) containing the variables in the model. If not found in 'data', the variables are taken from 'environment(formula)'.
subset
an optional vector specifying a subset of observations to be used in the fitting process.
contrast
an optional list. See the 'contrasts.arg' of 'model.matrix.default'.
control
a list of control parameters. See 'Details'.
X
design matrix of dimension $n\times p$.
y
response vector.
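
As a minimal sketch of how these arguments fit together, the formula interface can be called on a data frame as below. The simulated data and the column names x1, ..., x5 are illustrative assumptions, and the call is guarded so that it runs only when the dglars package is installed.

```r
# Simulated data; the column names x1..x5 are illustrative only.
set.seed(1)
n <- 100
dat <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * dat$x1))

if (requireNamespace("dglars", quietly = TRUE)) {
  # Formula interface: the design matrix is built from `dat` internally.
  fit_cv <- dglars::cvdglars(y ~ ., family = "binomial", data = dat)
  print(fit_cv$var_cv)  # names of the variables selected by cross-validation
}
```

When the design matrix is already available, the same model can be passed to cvdglars.fit through the X and y arguments instead.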

Value

cvdglars returns an object with S3 class "cvdglars", i.e. a list containing the following components:
call
the call that produced this object;
family
a description of the error distribution used in the model;
formula_cv
an object of class "formula" used to describe the model estimated by cross-validation (available only for the cvdglars() method);
var_cv
a character vector with the names of the variables selected by cross-validation;
beta
the vector of the coefficients estimated by cross-validation;
dev_m
a vector of length ng used to store the mean cross-validation deviance;
dev_v
a vector of length ng used to store the variance of the mean cross-validation deviance;
g0
the smallest value for the tuning parameter;
g_hat
the value of the tuning parameter corresponding to the minimum of the cross-validation deviance;
g_max
the value of the tuning parameter corresponding to the starting point of the dgLARS solution curve;
X
the used design matrix;
y
the used response vector;
conv
an integer value used to encode the warnings and the errors related to the algorithm used to compute the dgLARS solution curve. The values returned are:
0
convergence of the algorithm has been achieved,
1
problems related with the predictor-corrector method: error in predictor step,
2
problems related with the predictor-corrector method: error in corrector step,
3
maximum number of iterations has been reached,
4
error in dynamic memory allocation;
control
the list of control parameters used to compute the cross-validation deviance.
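
Since conv is returned as a bare integer, it can be convenient to translate it into the corresponding message before relying on the other components. The helper below, conv_message(), is hypothetical and not part of dglars; it only restates the code table given above.

```r
# Hypothetical helper (not part of dglars): map the `conv` code to its
# documented meaning.
conv_message <- function(conv) {
  msgs <- c(
    "0" = "convergence of the algorithm has been achieved",
    "1" = "predictor-corrector method: error in predictor step",
    "2" = "predictor-corrector method: error in corrector step",
    "3" = "maximum number of iterations has been reached",
    "4" = "error in dynamic memory allocation"
  )
  key <- as.character(conv)
  if (!key %in% names(msgs)) stop("unknown conv code: ", key)
  unname(msgs[key])
}

conv_message(0)
conv_message(3)
```

With a fitted object, a call such as conv_message(fit_cv$conv) would give the status of the run, assuming fit_cv was returned by cvdglars or cvdglars.fit.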

Details

The cvdglars function runs dglars nfold+1 times. The deviance of each run is stored, and its average and standard deviation over the folds are computed.
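
The idea of storing a held-out deviance per fold and then summarising it can be sketched in base R. This is not the dglars implementation: glm() is used as a stand-in for a single model, and the names fold_label, binom_dev and dev_fold are illustrative assumptions.

```r
set.seed(42)
n <- 120
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.5 * x))
dat <- data.frame(x = x, y = y)

nfold <- 10
# Balanced random assignment of each observation to one of nfold groups.
fold_label <- sample(rep(seq_len(nfold), length.out = n))

# Binomial deviance of held-out responses given predicted probabilities.
binom_dev <- function(y, mu) {
  -2 * sum(y * log(mu) + (1 - y) * log(1 - mu))
}

dev_fold <- numeric(nfold)
for (k in seq_len(nfold)) {
  train <- dat[fold_label != k, ]
  test  <- dat[fold_label == k, ]
  fit_k <- glm(y ~ x, family = binomial(), data = train)  # stand-in for dglars
  mu_k  <- predict(fit_k, newdata = test, type = "response")
  dev_fold[k] <- binom_dev(test$y, mu_k)
}

dev_mean <- mean(dev_fold)  # average held-out deviance over the folds
dev_sd   <- sd(dev_fold)    # its standard deviation over the folds
```

In cvdglars the same summary is computed for each of the ng values of the tuning parameter rather than for a single model.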

cvdglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when p>n.
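
A minimal sketch of the matrix interface in a setting with p>n follows. The dimensions and the simulated response are illustrative assumptions, and the call is guarded so that it runs only when the dglars package is installed.

```r
set.seed(7)
n <- 40
p <- 60   # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))

if (requireNamespace("dglars", quietly = TRUE)) {
  # Matrix interface: X is passed as-is, so no formula has to be processed.
  fit_cv <- dglars::cvdglars.fit(X, y, family = "binomial",
                                 control = list(nfold = 5))
  print(fit_cv$g_hat)  # tuning parameter minimising the CV deviance
}
```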

The control argument is a list that can supply any of the following components:

algorithm
a string to specify the algorithm used to fit the dgLARS solution curve. If algorithm = "pc" (default) the predictor-corrector method is used while the cyclic coordinate descent method is used if algorithm = "ccd";

method
a string to specify the method used to define the dgLARS solution curve. If method = "dgLASSO" (default) the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLAR", the differential geometric generalization of the least angle regression method is computed;

nfold
a non-negative integer used to specify the number of folds. Although nfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. Default is nfold = 10;

foldid
an $n$-dimensional vector of integers, between 1 and $n$, used to define the folds for the cross-validation. By default foldid is randomly generated;

ng
number of values of the tuning parameter used to compute the cross-validation deviance. Default is ng = 100;

nv
control parameter for the pc algorithm. An integer value belonging to the interval $[1, \min(n,p)]$ (default is nv = min(n-1,p)) used to specify the maximum number of variables included in the final model;

np
control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to $50 \cdot \min(n-1,p)$ (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter $\gamma$;

g0
control parameter for the pc/ccd algorithm. Set the smallest value for the tuning parameter $\gamma$. Default is g0 = ifelse(p;

dg_max
control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. When dg_max = 0 (default), the predictor-corrector algorithm uses the optimal step size (see Augugliaro et al. (accepted) for more details) to approximate the value of the tuning parameter corresponding to the inclusion/exclusion of a variable from the model;

nNR
control parameter for the pc algorithm. A non negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 50;

NReps
control parameter for the pc algorithm. A non negative value used to define the convergence criterion of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct
control parameter for the pc algorithm. When one of the following conditions is satisfied
i. the Newton-Raphson algorithm does not converge;
ii. there exists a non-active variable such that, at the solution point, the absolute value of the corresponding Rao's score test statistic is greater than $\gamma + \texttt{eps}$;

then the step size ($d\gamma$) is reduced by $d\gamma = cf \cdot d\gamma$ and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials of the corrector step. Default is ncrct = 50;

cf
control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval $[0,1]$ used to reduce the step size as previously described. Default is cf = 0.5;

nccd
control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is nccd = 1.0e+05;

eps
control parameter for the pc/ccd algorithm. The meaning of this parameter is related to the algorithm used to estimate the dgLARS solution curve, namely
i. when algorithm = "pc", eps is used
   a. to identify a variable that will be included in the active set, i.e. when the absolute value of the corresponding Rao's score test statistic belongs to $[\gamma - \texttt{eps}, \gamma + \texttt{eps}]$;
   b. as previously described, to establish if the corrector step must be repeated;
   c. to define the convergence of the algorithm, i.e. the actual value of the tuning parameter belongs to the interval $[\texttt{g0} - \texttt{eps}, \texttt{g0} + \texttt{eps}]$;
ii. when algorithm = "ccd", eps is used to define the convergence of a single solution point, i.e. each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
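
The components above are passed together as a single list. The sketch below builds such a list for a small simulated problem; the chosen values are illustrative, and foldid is assumed here to be a random permutation of the observation indices 1, ..., n, which is one reading of "integers between 1 and n" that should be checked against the installed version. The fitting call is guarded so that it runs only when dglars is installed.

```r
set.seed(2024)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))

# Assumption: foldid is a permutation of the observation indices 1..n.
ctrl <- list(
  algorithm = "pc",       # predictor-corrector method
  method    = "dgLASSO",  # differential geometric LASSO
  nfold     = 5,          # number of folds
  foldid    = sample(n),  # fixed here so that the folds are reproducible
  ng        = 50          # number of tuning parameter values
)

if (requireNamespace("dglars", quietly = TRUE)) {
  fit_cv <- dglars::cvdglars.fit(X, y, family = "binomial", control = ctrl)
  print(length(fit_cv$dev_m))  # documented to have length ng
}
```

Fixing foldid under a seed makes repeated runs comparable, since by default the folds are regenerated at random on every call.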

References

Augugliaro L., Mineo A.M. and Wit E.C. (2014) dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. http://www.jstatsoft.org/v59/i08/.

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.

Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceedings of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.

See Also

coef.cvdglars, print.cvdglars, plot.cvdglars methods

Examples

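###########################
# Logistic regression model

library(dglars)

set.seed(123)

# Simulate n observations on p predictors; only the first one is active
n <- 100
p <- 10
X <- matrix(rnorm(n*p), n, p)
b <- 1:2                        # intercept and slope of the true model
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)

# Choose the tuning parameter by cross-validation
fit_cv <- cvdglars.fit(X, y, family = "binomial")

# Refit the solution curve, stopping at the cross-validated value g_hat
fit <- dglars.fit(X, y, family = "binomial", control = list(g0=fit_cv$g_hat))
fit_cv
fit$beta[,fit$np]               # coefficients at the last point of the curve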