Uses the \(k\)-fold cross-validation deviance to estimate the solution point of the dgLARS solution curve.
cvdglars(formula, family = gaussian, g, unpenalized,
b_wght, data, subset, contrasts = NULL, control = list())

cvdglars.fit(X, y, family = gaussian, g, unpenalized,
b_wght, control = list())
cvdglars returns an object with S3 class “cvdglars”, i.e. a list containing the following components:
the call that produced this object;
the formula used, returned only if the model is fitted by cvdglars;
a description of the error distribution used in the model;
a character vector with the name of variables selected by cross-validation;
the vector of the coefficients estimated by cross-validation;
the cross-validation estimate of the dispersion parameter;
a vector of length ng used to store the mean cross-validation deviance;
a vector of length ng used to store the variance of the mean cross-validation deviance;
the value of the tuning parameter corresponding to the minimum of the cross-validation deviance;
the smallest value for the tuning parameter;
the value of the tuning parameter corresponding to the starting point of the dgLARS solution curve;
the design matrix used;
the response vector used;
the vector of weights used to compute the adaptive dgLARS method;
an integer value used to encode the warnings and the errors related to the algorithm used to fit the dgLARS solution curve. The values returned are:
0: convergence of the algorithm has been achieved,
1: problem with the predictor-corrector method: error in the predictor step,
2: problem with the predictor-corrector method: error in the corrector step,
3: maximum number of iterations has been reached,
4: error in dynamic memory allocation;
the list of control parameters used to compute the cross-validation deviance.
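The convergence codes listed above can be turned into readable messages with a small lookup. The sketch below is only an illustration: conv_message is a hypothetical helper, not part of dglars, and it takes the integer code as a plain argument so that it makes no assumption about how the component is named in the returned object.

```r
# Hypothetical helper (not part of dglars): map a convergence code,
# as described in this section, to a short message.
conv_message <- function(code) {
  msgs <- c(
    "0" = "convergence of the algorithm has been achieved",
    "1" = "predictor-corrector method: error in predictor step",
    "2" = "predictor-corrector method: error in corrector step",
    "3" = "maximum number of iterations has been reached",
    "4" = "error in dynamic memory allocation"
  )
  key <- as.character(code)
  if (!key %in% names(msgs)) stop("unknown convergence code: ", key)
  unname(msgs[key])
}

conv_message(3)
```

Any code other than 0 suggests that the fitted curve should be inspected before the cross-validated estimates are used.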
an object of class “formula”: a symbolic description of the model to be fitted. When the binomial family is used, the response can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”.
a description of the error distribution and link function used to specify the model. This can be a character string naming a family function or the result of a call to a family function (see family for details). By default the gaussian family with identity link function is used.
argument available only for the ccd algorithm. When the ccd algorithm is used to fit the dgLARS model, this argument can be used to specify the values of the tuning parameter.
a vector used to specify the unpenalized estimators; unpenalized can be a vector of integers or characters identifying the predictors with unpenalized estimators.
a vector, with length equal to the number of columns of the matrix X, used to compute the weights used in the adaptive dgLARS method. b_wght is used to specify the initial estimates of the parameter vector.
an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.
an optional vector specifying a subset of observations to be used in the fitting process.
an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.
a list of control parameters. See ‘Details’.
design matrix of dimension \(n\times p\).
response vector. When the binomial family is used, this argument can be a vector with entries 0 (failure) or 1 (success). Alternatively, the response can be a matrix where the first column is the number of “successes” and the second column is the number of “failures”.
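As a rough illustration of the two response formats described for the binomial family, the base R sketch below builds a 0/1 vector and then an equivalent two-column matrix of successes and failures. The grouping into three blocks of four trials and all variable names are assumptions made for the example only.

```r
set.seed(1)

# Format 1: one entry per observation, 0 = failure, 1 = success
y01 <- rbinom(12, size = 1, prob = 0.4)

# Illustrative grouping: 3 groups of 4 binary trials each
grp <- rep(1:3, each = 4)

# Format 2: first column = number of successes,
#           second column = number of failures
successes <- tapply(y01, grp, sum)
failures  <- tapply(1 - y01, grp, sum)
y_mat <- cbind(successes, failures)
y_mat
```

With the matrix format, each row of the response would be paired with one row of the design matrix, so the covariates are assumed to be constant within a group.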
Luigi Augugliaro
Maintainer: Luigi Augugliaro luigi.augugliaro@unipa.it
The cvdglars function runs dglars nfold+1 times. The deviance is stored, and its average and standard deviation over the folds are computed.
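The bookkeeping behind a k-fold cross-validation deviance can be sketched in base R. Note that this is not the dgLARS algorithm: an ordinary glm fit with a single fixed model stands in for the solution curve, and the names fold, fold_deviance and dev_k are assumptions introduced for the illustration. The fold labels used here are not meant to describe the format of the foldid control parameter.

```r
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + 2 * x))
dat <- data.frame(y = y, x = x)

nfold <- 5
# Balanced random fold labels (illustrative, values in 1..nfold)
fold <- sample(rep(seq_len(nfold), length.out = n))

# Binomial deviance on the held-out fold k, model fitted on the rest
fold_deviance <- function(k) {
  fit <- glm(y ~ x, family = binomial, data = dat[fold != k, ])
  test <- dat[fold == k, ]
  mu <- predict(fit, newdata = test, type = "response")
  mu <- pmin(pmax(mu, 1e-12), 1 - 1e-12)  # guard against log(0)
  -2 * sum(test$y * log(mu) + (1 - test$y) * log(1 - mu))
}

dev_k <- vapply(seq_len(nfold), fold_deviance, numeric(1))

# Average held-out deviance and its spread over the folds
c(mean = mean(dev_k), sd = sd(dev_k))
```

In cvdglars the same kind of summary is computed for each of the ng values of the tuning parameter, and the value with the smallest mean deviance is selected.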
cvdglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when p>n.
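A minimal sketch of this workflow is given below: the design matrix is built once with model.matrix and then handed to cvdglars.fit. The simulated data frame and its column names are assumptions. Dropping the intercept column is also an assumption, based on the package example in which X contains no constant column. The call is guarded so that the snippet still runs when dglars is not installed.

```r
set.seed(123)
n <- 60

# Illustrative data: two numeric predictors and one factor
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  f = factor(rep(c("a", "b"), length.out = n)))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1))

# Build the design matrix once and drop the intercept column
# (assumption: the fitting routine handles the intercept itself)
X <- model.matrix(y ~ ., data = dat)[, -1, drop = FALSE]
y <- dat$y

if (requireNamespace("dglars", quietly = TRUE)) {
  fit_cv <- dglars::cvdglars.fit(X, y, family = binomial)
  print(fit_cv)
}
```

Building X once avoids repeating the formula processing when the same matrix is reused across several fits.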
The control argument is a list that can supply any of the following components:
algorithm: a string specifying the algorithm used to compute the solution curve. The predictor-corrector algorithm is used when algorithm = "pc" (default), while the cyclic coordinate descent method is used when algorithm = "ccd";
method: a string specifying the kind of solution curve. If method = "dgLASSO" (default), the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLARS", the differential geometric generalization of the least angle regression method is used;
nfold: a non-negative integer used to specify the number of folds. Although nfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. Default is nfold = 10;
foldid: an \(n\)-dimensional vector of integers, between 1 and \(n\), used to define the folds for the cross-validation. By default foldid is randomly generated;
ng: number of values of the tuning parameter used to compute the cross-validation deviance. Default is ng = 100;
nv: control parameter for the pc algorithm. An integer value belonging to the interval \([1, \min(n,p)]\) (default is nv = min(n-1,p)) used to specify the maximum number of variables included in the final model;
np: control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to \(50 \cdot \min(n-1,p)\) (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter \(\gamma\);
g0: control parameter for the pc/ccd algorithm. Sets the smallest value for the tuning parameter \(\gamma\). Default is g0 = ifelse(p<n, 1.0e-06, 0.05);
dg_max: control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. With dg_max = 0 (default), the predictor-corrector algorithm uses the optimal step size (see Augugliaro et al. (2013) for more details) to approximate the value of the tuning parameter corresponding to the inclusion/exclusion of a variable from the model;
nNR: control parameter for the pc algorithm. A non-negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 200;
NReps: control parameter for the pc algorithm. A non-negative value used to define the convergence criterion of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;
ncrct: control parameter for the pc algorithm. When the Newton-Raphson algorithm does not converge, the step size (\(d\gamma\)) is reduced by \(d\gamma = cf \cdot d\gamma\) and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials for the corrector step. Default is ncrct = 50;
cf: control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval \([0,1]\) used to reduce the step size as previously described. Default is cf = 0.5;
nccd: control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is 1.0e+05.
eps: control parameter for the pc/ccd algorithm. The meaning of this parameter is related to the algorithm used to estimate the solution curve:
i. if algorithm = "pc" then eps is used
a. to identify a variable that will be included in the active set (the absolute value of the corresponding Rao's score test statistic belongs to \([\gamma - \mathrm{eps}, \gamma + \mathrm{eps}]\));
b. to establish if the corrector step must be repeated;
c. to define the convergence of the algorithm, i.e., the current value of the tuning parameter belongs to the interval \([\mathrm{g0} - \mathrm{eps}, \mathrm{g0} + \mathrm{eps}]\);
ii. if algorithm = "ccd" then eps is used to define the convergence for a single solution point, i.e., each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.
Default is eps = 1.0e-05.
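The components above are passed together as a named list. The following sketch sets a few of them explicitly, using only names and values documented in this section; the simulated data and the choice of nfold = 5 and ng = 50 are assumptions made for the example. Components that are left out keep their defaults. The call is guarded so that the snippet still runs when dglars is not installed.

```r
# Control list built from components documented above
ctrl <- list(
  algorithm = "pc",       # predictor-corrector algorithm (default)
  method    = "dgLASSO",  # kind of solution curve (default)
  nfold     = 5L,         # number of folds
  ng        = 50L,        # number of tuning-parameter values
  cf        = 0.5,        # contraction factor for the step size
  eps       = 1.0e-05     # tolerance described under eps
)

# Illustrative logistic regression data
set.seed(123)
n <- 100
p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))

if (requireNamespace("dglars", quietly = TRUE)) {
  fit_cv <- dglars::cvdglars.fit(X, y, family = binomial, control = ctrl)
  print(fit_cv)
}
```

A smaller nfold or ng shortens the run, at the price of a noisier or coarser estimate of the cross-validation deviance.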
Augugliaro L., Mineo A.M. and Wit E.C. (2014) <doi:10.18637/jss.v059.i08> dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. https://www.jstatsoft.org/v59/i08/.
Augugliaro L., Mineo A.M. and Wit E.C. (2013) <doi:10.1111/rssb.12000> dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.
coef.cvdglars, print.cvdglars, plot.cvdglars methods
###########################
# Logistic regression model
# y ~ Binomial
set.seed(123)
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
b <- 1:2
eta <- b[1] + X[, 1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
fit_cv <- cvdglars.fit(X, y, family = binomial)
fit_cv