cvdglars: Cross-validation deviance for dgLARS

Description

Uses the $k$-fold cross-validation deviance to estimate the solution point of the dgLARS solution curve.

Usage

cvdglars(formula, family = c("binomial", "poisson"), data, 
subset, contrast = NULL, control = list())
cvdglars.fit(X, y, family = c("binomial", "poisson"), 
control = list())

Arguments

formula

an object of class "formula": a symbolic description of the model to be fitted.

family

a description of the error distribution used in the model (see below for more details).

data

an optional data frame, list or environment (or object coercible by 'as.data.frame' to a data frame) containing the variables in the model. If not found in 'data', the variables are taken from 'environment(formula)'.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

contrast

an optional list. See the 'contrasts.arg' of 'model.matrix.default'.

control

a list of control parameters. See 'Details'.

X

design matrix of dimension $n\times p$.

y

response vector.

Value

call

the call that produced this object;

family

a description of the error distribution used in the model;

formula_cv

an object of class "formula" used to describe the model estimated by cross-validation (available only for the cvdglars() method);

var_cv

a character vector with the name of variables selected by cross-validation;

beta

the vector of the coefficients estimated by cross-validation;

dev_m

a vector of length ng used to store the mean cross-validation deviance;

dev_v

a vector of length ng used to store the variance of the mean cross-validation deviance;

g0

the smallest value for the tuning parameter;

g_hat

the value of the tuning parameter corresponding to the minimum of the cross-validation deviance;

g_max

the value of the tuning parameter corresponding to the starting point of the dgLARS solution curve;

X

the used design matrix;

y

the used response vector;

conv

an integer value used to encode the warnings and the errors related to the algorithm used to compute the dgLARS solution curve. The values returned are:

0: convergence of the algorithm has been achieved,
1: problems related with the predictor-corrector method: error in predictor step,
2: problems related with the predictor-corrector method: error in corrector step,
3: maximum number of iterations has been reached,
4: error in dynamic memory allocation;

control

the list of control parameters used to compute the cross-validation deviance.
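The conv codes listed above can be translated into readable messages with a small lookup. The sketch below is only illustrative: describe_conv and conv_messages are hypothetical names that are not part of the dglars package, and the message strings simply paraphrase the list above.

```r
# Hypothetical helper (not part of dglars): map a 'conv' code to a message.
conv_messages <- c(
  "0" = "convergence achieved",
  "1" = "predictor-corrector method: error in predictor step",
  "2" = "predictor-corrector method: error in corrector step",
  "3" = "maximum number of iterations reached",
  "4" = "error in dynamic memory allocation"
)

describe_conv <- function(conv) {
  msg <- conv_messages[as.character(conv)]
  # Codes outside 0..4 are not documented, so they are reported as unknown
  if (is.na(msg)) "unknown code" else unname(msg)
}

describe_conv(0)
describe_conv(3)
```

With a fitted object such as fit_cv from the Examples section, the call would be describe_conv(fit_cv$conv).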

Details

The cvdglars function runs dglars nfold+1 times. The deviance is stored for each fold, and its average and standard deviation over the folds are computed.
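The bookkeeping described above can be sketched in base R. This is not the package's implementation: as a stand-in for the dgLARS solution curve, the candidate models are nested logistic regressions fitted with glm.fit (model j uses the first j columns of X), and dividing the between-fold variance by the number of folds is an assumption about how the variance of the mean is obtained. The names fold, binom_dev, dev and best are introduced here for illustration only.

```r
set.seed(1)
n <- 120; p <- 4; nfold <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(0.5 + X[, 1]))

# Balanced random assignment of the observations to the folds
fold <- sample(rep(seq_len(nfold), length.out = n))

# Binomial deviance of held-out responses given predicted probabilities
binom_dev <- function(y, mu) {
  mu <- pmin(pmax(mu, 1e-10), 1 - 1e-10)
  -2 * sum(y * log(mu) + (1 - y) * log(1 - mu))
}

# dev[k, j + 1]: deviance on fold k of the model using the first j predictors
dev <- matrix(NA_real_, nfold, p + 1)
for (k in seq_len(nfold)) {
  test <- fold == k
  for (j in 0:p) {
    Xj <- cbind(Intercept = rep(1, n), X[, seq_len(j), drop = FALSE])
    fit <- glm.fit(Xj[!test, , drop = FALSE], y[!test], family = binomial())
    mu <- plogis(drop(Xj[test, , drop = FALSE] %*% fit$coefficients))
    dev[k, j + 1] <- binom_dev(y[test], mu)
  }
}

dev_m <- colMeans(dev)              # mean cross-validation deviance
dev_v <- apply(dev, 2, var) / nfold # variance of that mean (assumed form)
best  <- which.min(dev_m) - 1       # number of predictors minimising dev_m
```

In cvdglars the candidates are indexed by the tuning parameter rather than by the number of predictors, and the minimiser of the mean deviance is what the returned g_hat reports.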

cvdglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when $p > n$.
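A minimal sketch of preparing the design matrix once and passing it to cvdglars.fit is given below. The data frame dat and its columns are made up for illustration. Dropping the intercept column is an assumption, based on the Examples section, where X contains only the predictors. The call to cvdglars.fit is guarded so that the sketch also runs when the dglars package is not installed.

```r
set.seed(123)
dat <- data.frame(
  y  = rbinom(50, 1, 0.5),
  x1 = rnorm(50),
  x2 = rnorm(50),
  g  = factor(sample(c("a", "b"), 50, replace = TRUE))
)

# Build the design matrix once; the intercept column is dropped (assumption)
X <- model.matrix(y ~ x1 + x2 + g, data = dat)[, -1, drop = FALSE]
y <- dat$y
dim(X)

# Only attempted when dglars is available
if (requireNamespace("dglars", quietly = TRUE)) {
  fit_cv <- dglars::cvdglars.fit(X, y, family = "binomial")
  print(fit_cv$g_hat)
}
```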

The control argument is a list that can supply any of the following components:

algorithm

a string to specify the algorithm used to fit the dgLARS solution curve. If algorithm = "pc" (default) the predictor-corrector method is used, while the cyclic coordinate descent method is used if algorithm = "ccd";

method

a string to specify the method used to define the dgLARS solution curve. If method = "dgLASSO" (default) the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLAR", the differential geometric generalization of the least angle regression method is computed;

nfold

a non-negative integer used to specify the number of folds. Although nfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. Default is nfold = 10;

foldid

an $n$-dimensional vector of integers, between 1 and $n$, used to define the folds for the cross-validation. By default foldid is randomly generated;

ng

number of values of the tuning parameter used to compute the cross-validation deviance. Default is ng = 100;

nv

control parameter for the pc algorithm. An integer value belonging to the interval $[1, \min(n,p)]$ (default is nv = min(n-1,p)) used to specify the maximum number of variables included in the final model;

np

control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to $50 \cdot \min(n-1,p)$ (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter $\gamma$;
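As a small worked example of the defaults documented so far, the sketch below evaluates nv and np for a high-dimensional problem and collects them in a control list. The sizes n and p are made up; the component names and default formulas are the ones given above.

```r
n <- 20    # made-up sample size
p <- 100   # made-up number of predictors, so that p > n

nv <- min(n - 1, p)        # documented default for nv: 19
np <- 50 * min(n - 1, p)   # documented default for np with algorithm = "pc": 950

control <- list(
  algorithm = "pc",
  method    = "dgLASSO",
  nfold     = 10,
  ng        = 100,
  nv        = nv,
  np        = np
)
str(control)
```

Such a list would be passed as cvdglars.fit(X, y, family = "binomial", control = control); components that are left out keep their defaults.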

g0

control parameter for the pc/ccd algorithm. Set the smallest value for the tuning parameter $\gamma$. Default is g0 = ifelse(p;

dg_max

control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. With dg_max = 0 (default), the predictor-corrector algorithm uses the optimal step size (see Augugliaro et al. (accepted) for more details) to approximate the value of the tuning parameter corresponding to the inclusion/exclusion of a variable from the model;

nNR

control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 50;

NReps

control parameter for the pc algorithm. A non-negative value used to define the convergence criterion of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct

control parameter for the pc algorithm. When one of the following conditions is satisfied

i. the Newton-Raphson algorithm does not converge,

ii. there exists a non-active variable such that, at the solution point, the absolute value of the corresponding Rao's score test statistic is greater than $\gamma + \code{eps}$,

then the step size ($d\gamma$) is reduced by setting $d\gamma = cf \cdot d\gamma$ and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials of the corrector step. Default is ncrct = 50;

cf

control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval $[0,1]$ used to reduce the step size as previously described. Default is cf = 0.5;
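The interplay between ncrct and cf can be illustrated with a toy loop. This is only a sketch: corrector_ok is a hypothetical stand-in for the real corrector step and simply "succeeds" once the step is no longer than 0.1, and the starting step size is made up.

```r
cf    <- 0.5   # documented default contraction factor
ncrct <- 50    # documented default maximum number of corrector trials
dg    <- 1.0   # made-up initial step size

# Hypothetical stand-in for the corrector step: TRUE means it succeeded
corrector_ok <- function(dg) dg <= 0.1

trials <- 0
while (!corrector_ok(dg) && trials < ncrct) {
  dg <- cf * dg          # contract the step size
  trials <- trials + 1   # count the repeated corrector step
}

dg       # 0.0625
trials   # 4
```

With cf = 0.5 the step is halved at each trial (1, 0.5, 0.25, 0.125, 0.0625), so the toy corrector succeeds after four contractions, well within the ncrct limit.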

nccd

control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is nccd = 1.0e+05;

eps

control parameter for the pc/ccd algorithm. The meaning of this parameter is related to the algorithm used to estimate the dgLARS solution curve, namely

i. when algorithm = "pc", eps is used

a. to identify a variable that will be included in the active set, i.e. when the absolute value of the corresponding Rao's score test statistic belongs to $[\gamma-\code{eps},\gamma+\code{eps}]$;

b. as previously described, to establish if the corrector step must be repeated;

c. to define the convergence of the algorithm, i.e. the current value of the tuning parameter belongs to the interval $[\code{g0}-\code{eps},\code{g0}+\code{eps}]$;

ii. when algorithm = "ccd", eps is used to define the convergence of a single solution point, i.e. each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
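The three uses of eps under algorithm = "pc" amount to simple interval checks, sketched below in base R. All numbers are made up for illustration, and the variable names (rao, active, enter, repeat_corrector, converged) are hypothetical and do not correspond to objects exposed by the package.

```r
eps   <- 1.0e-05   # documented default
gamma <- 0.8       # current value of the tuning parameter (made up)
g0    <- 0.05      # smallest value of the tuning parameter (made up)

# Made-up Rao's score test statistics; x1 is assumed to be active already
rao        <- c(x1 = 0.8, x2 = -0.800004, x3 = 0.3, x4 = 0.79)
active     <- "x1"
non_active <- setdiff(names(rao), active)

# (a) non-active variables whose |statistic| lies in [gamma - eps, gamma + eps]
enter <- non_active[abs(abs(rao[non_active]) - gamma) <= eps]

# (b) the corrector step is repeated if a non-active |statistic| exceeds gamma + eps
repeat_corrector <- any(abs(rao[non_active]) > gamma + eps)

# (c) the algorithm has converged when gamma lies in [g0 - eps, g0 + eps]
converged <- abs(gamma - g0) <= eps

enter              # "x2"
repeat_corrector   # FALSE
converged          # FALSE
```

Here only x2 is within eps of the current $\gamma$ in absolute value, so it is the only candidate for inclusion; no statistic exceeds $\gamma + \code{eps}$, and $\gamma$ is still far from g0.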

References

Augugliaro L., Mineo A.M. and Wit E.C. (2014) dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. http://www.jstatsoft.org/v59/i08/.

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.

Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceedings of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.

Examples

###########################
# Logistic regression model

set.seed(123)

n <- 100
p <- 10
X <- matrix(rnorm(n*p), n, p)
b <- 1:2
eta <- b[1] + X[,1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
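# Select the tuning parameter by cross-validation (default nfold = 10)
fit_cv <- cvdglars.fit(X, y, family = "binomial")
# Refit the solution curve on the full data, stopping at the selected value g_hat
fit <- dglars.fit(X, y, family = "binomial", control = list(g0=fit_cv$g_hat))
fit_cv
# Coefficients at the last point of the curve
fit$beta[,fit$np]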