Fit a lasso regression and use the Bayesian Information Criterion (BIC)
to select a subset of covariates.
Can deal with very large sparse data matrices.
Intended for binary response only (option family = "binomial"
is forced).
Depends on the glmnet and relax.glmnet functions from the package glmnet.
lasso_bic(x, y, maxp = 50, path = TRUE, betaPos = TRUE, ...)
Input matrix, of dimension nobs x nvars. Each row is an observation
vector. Can be in sparse matrix format (inheriting from class
"sparseMatrix" as in package Matrix).
Binary response variable, numeric.
A limit on how many relaxed coefficients are allowed.
Default is 50; the glmnet default is 'n-3', where 'n' is the sample size.
Since glmnet does not do step-size optimization, the Newton
algorithm can get stuck and fail to converge, especially with relaxed fits. With path=TRUE,
each relaxed fit on a particular set of variables is computed pathwise using the original sequence
of lambda values (with a zero attached to the end). Default is path=TRUE.
Should the covariates selected by the procedure be
positively associated with the outcome? Default is TRUE.
Other arguments that can be passed to glmnet from package glmnet,
other than family, maxp and path.
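As a minimal sketch of the sparse input format accepted for x, the following converts a dense 0/1 matrix into an object inheriting from "sparseMatrix". It assumes the Matrix package is installed; the object names dense and sparse are illustrative only.

```r
library(Matrix)   # assumption: the Matrix package is available

set.seed(15)
# Dense 0/1 matrix: 100 observations (rows), 20 covariates (columns)
dense <- matrix(rbinom(100 * 20, 1, 0.2), nrow = 100, ncol = 20)
colnames(dense) <- paste0("drugs", seq_len(ncol(dense)))

# Sparse storage: only the non-zero entries are kept
sparse <- Matrix(dense, sparse = TRUE)

# Check that the object inherits from class "sparseMatrix"
methods::is(sparse, "sparseMatrix")
```

An object such as sparse can then be supplied as x in place of the dense matrix.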
An object with S3 class "log.lasso".
Numeric vector of regression coefficients in the lasso.
In the lasso_bic function, the regression coefficients are UNPENALIZED.
Length equal to nvars.
Character vector, names of the variable(s) selected with the
lasso-bic approach.
If betaPos = TRUE, this set is the covariates with a positive regression
coefficient in beta.
Otherwise, this set is the covariates with a non-zero regression coefficient in beta.
Covariates are ordered according to the p-values (two-sided if betaPos = FALSE,
one-sided if betaPos = TRUE) in the classical multiple logistic regression
model that minimizes the BIC.
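The ordering rule can be illustrated with base R. This is a sketch under stated assumptions, not the package's internal code: the set active is hypothetical, and the one-sided p-value is assumed to be the upper-tail probability of the Wald z statistic (a test for a positive coefficient).

```r
set.seed(15)
drugs <- matrix(rbinom(100 * 20, 1, 0.2), nrow = 100, ncol = 20)
colnames(drugs) <- paste0("drugs", seq_len(ncol(drugs)))
ae <- rbinom(100, 1, 0.3)

# Hypothetical set of covariates retained by the BIC-optimal model
active <- c("drugs1", "drugs5", "drugs12")

# Classical (unpenalized) multiple logistic regression on that set
dat <- data.frame(ae = ae, drugs[, active, drop = FALSE])
fit <- glm(ae ~ ., data = dat, family = binomial())

# Wald z statistics of the covariates (intercept excluded)
z <- coef(summary(fit))[active, "z value"]

p_two_sided <- 2 * pnorm(-abs(z))            # as with betaPos = FALSE
p_one_sided <- pnorm(z, lower.tail = FALSE)  # as with betaPos = TRUE (assumed upper tail)

# Covariate names ordered by increasing p-value
names(sort(p_two_sided))
names(sort(p_one_sided))
```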
For each tested penalisation parameter \(\lambda\), a standard version of the BIC is computed: $$BIC_\lambda = -2 l_\lambda + df(\lambda) \ln(N)$$ where \(l_\lambda\) is the log-likelihood of the non-penalised multiple logistic regression model that includes the set of covariates with a non-zero coefficient in the penalised regression coefficient vector associated with \(\lambda\), \(df(\lambda)\) is the number of covariates in that set, and \(N\) is the number of observations. The optimal set of covariates according to this approach is the one associated with the classical multiple logistic regression model that minimizes the BIC.
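A minimal base R sketch of this criterion for a single candidate set. The set active is hypothetical (in practice it would come from one value of \(\lambda\) on the lasso path), and all object names are illustrative. Note that stats::BIC also counts the intercept as a parameter, so it exceeds the formula above by \(\ln(N)\); when every candidate model includes an intercept, this constant does not change which model minimizes the criterion.

```r
set.seed(15)
N <- 100
drugs <- matrix(rbinom(N * 20, 1, 0.2), nrow = N, ncol = 20)
colnames(drugs) <- paste0("drugs", seq_len(ncol(drugs)))
ae <- rbinom(N, 1, 0.3)

# Hypothetical set of covariates with non-zero lasso coefficients at some lambda
active <- c("drugs1", "drugs5", "drugs12")

# Non-penalised multiple logistic regression restricted to that set
dat <- data.frame(ae = ae, drugs[, active, drop = FALSE])
fit <- glm(ae ~ ., data = dat, family = binomial())

l_lambda   <- as.numeric(logLik(fit))  # log-likelihood of the unpenalised fit
df_lambda  <- length(active)           # number of covariates in the set
bic_lambda <- -2 * l_lambda + df_lambda * log(N)

# Difference with stats::BIC, which also counts the intercept
BIC(fit) - bic_lambda
```

Repeating this computation for the active set at each \(\lambda\) and keeping the set with the smallest value reproduces the selection rule described above.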
set.seed(15)
# Simulated binary covariates: 100 observations, 20 variables
drugs <- matrix(rbinom(100 * 20, 1, 0.2), nrow = 100, ncol = 20)
colnames(drugs) <- paste0("drugs", 1:ncol(drugs))
# Simulated binary response
ae <- rbinom(100, 1, 0.3)
lb <- lasso_bic(x = drugs, y = ae, maxp = 20)