lasso_bic: fit a lasso regression and use standard BIC for variable selection

Description

Fit a lasso regression and use the Bayesian Information Criterion (BIC) to select a subset of covariates. Can handle very large sparse data matrices. Intended for binary response only (option family = "binomial" is forced). Depends on the glmnet and relax.glmnet functions from the package glmnet.

Usage

lasso_bic(x, y, maxp = 50, path = TRUE, betaPos = TRUE, ...)

Arguments

x

Input matrix, of dimension nobs x nvars. Each row is an observation vector. Can be in sparse matrix format (inherit from class "sparseMatrix" as in package Matrix).

y

Binary response variable, numeric.

maxp

Upper limit on the number of relaxed coefficients. Default is 50; the default in glmnet is 'n-3', where 'n' is the sample size.

path

Since glmnet does not do stepsize optimization, the Newton algorithm can get stuck and not converge, especially with relaxed fits. With path=TRUE, each relaxed fit on a particular set of variables is computed pathwise using the original sequence of lambda values (with a zero attached to the end). Default is path=TRUE.

betaPos

Should the covariates selected by the procedure be positively associated with the outcome? Default is TRUE.

…

Other arguments that can be passed to glmnet from package glmnet other than family, maxp and path.

Value

An object with S3 class "log.lasso".

beta

Numeric vector of regression coefficients in the lasso. In the lasso_bic function, the regression coefficients are UNPENALIZED. Length equal to nvars.

selected_variables

Character vector, names of the variable(s) selected with the lasso-bic approach. If betaPos = TRUE, this set is the covariates with a positive regression coefficient in beta. Otherwise, this set is the covariates with a non-zero regression coefficient in beta. Covariates are ordered according to their p-values (two-sided if betaPos = FALSE, one-sided if betaPos = TRUE) in the classical multiple logistic regression model that minimizes the BIC.
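The ordering rule described for selected_variables can be illustrated with base R alone. This is a minimal sketch, not the package's internal code: it assumes Wald z statistics from glm, and the object names (fit, z, p_one, p_two, pos_ordered, all_ordered) are hypothetical. The first five simulated covariates stand in for the set retained by the BIC.

```r
# Simulated data, as in the Examples section
set.seed(15)
drugs <- matrix(rbinom(100 * 20, 1, 0.2), nrow = 100, ncol = 20)
colnames(drugs) <- paste0("drugs", 1:ncol(drugs))
ae <- rbinom(100, 1, 0.3)

# Classical (unpenalized) logistic regression on an assumed retained set
fit <- glm(ae ~ ., data = data.frame(ae = ae, drugs[, 1:5]), family = binomial)

# Wald z statistics, intercept dropped
z <- coef(summary(fit))[-1, "z value"]

# Two-sided p-values (betaPos = FALSE analogue)
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)
# One-sided p-values for a positive association (betaPos = TRUE analogue)
p_one <- pnorm(z, lower.tail = FALSE)

# betaPos = TRUE analogue: keep positive coefficients, order by one-sided p-value
pos <- names(z)[z > 0]
pos_ordered <- pos[order(p_one[pos])]

# betaPos = FALSE analogue: keep all covariates, order by two-sided p-value
all_ordered <- names(z)[order(p_two)]
```

Whether lasso_bic computes its p-values exactly this way is an assumption; the sketch only shows how a one-sided versus two-sided ordering can differ.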

Details

For each tested penalisation parameter $\lambda$, a standard version of the BIC is implemented:

$$BIC_\lambda = -2 l_\lambda + df(\lambda) \times \ln(N)$$

where $l_\lambda$ is the log-likelihood of the non-penalized multiple logistic regression model that includes the set of covariates with a non-zero coefficient in the penalised regression coefficient vector associated with $\lambda$, $df(\lambda)$ is the number of covariates with a non-zero coefficient in that vector, and $N$ is the number of observations. The optimal set of covariates according to this approach is the one associated with the classical multiple logistic regression model that minimizes the BIC.
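The BIC formula above can be reproduced by hand with base R. The following is a minimal sketch under stated assumptions, not the package's implementation: bic_for_support() is a hypothetical helper, and the list supports is a made-up stand-in for the sets of non-zero coefficients that a lasso path would produce.

```r
# Simulated data, as in the Examples section
set.seed(15)
drugs <- matrix(rbinom(100 * 20, 1, 0.2), nrow = 100, ncol = 20)
colnames(drugs) <- paste0("drugs", 1:ncol(drugs))
ae <- rbinom(100, 1, 0.3)

# Hypothetical helper: BIC of the unpenalized logistic model on one covariate set
bic_for_support <- function(x, y, support) {
  dat <- data.frame(y = y, x[, support, drop = FALSE])
  fit <- glm(y ~ ., data = dat, family = binomial)
  # df counts covariates only, as in the formula (the intercept is excluded)
  -2 * as.numeric(logLik(fit)) + length(support) * log(length(y))
}

# Made-up supports standing in for the non-zero sets along a lasso path
supports <- list(
  c("drugs1"),
  c("drugs1", "drugs5"),
  c("drugs1", "drugs5", "drugs9")
)

# One BIC value per candidate set; the retained set minimizes the BIC
bic_values <- sapply(supports, function(s) bic_for_support(drugs, ae, s))
best <- supports[[which.min(bic_values)]]
```

Note that R's own BIC() also counts the intercept as a parameter, so its value differs from this one by ln(N) for every model; the ranking of candidate sets is unchanged.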

Examples

# NOT RUN {
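# Simulate 100 observations of 20 binary drug exposures and a binary outcome
set.seed(15)
drugs <- matrix(rbinom(100*20, 1, 0.2), nrow = 100, ncol = 20)
colnames(drugs) <- paste0("drugs",1:ncol(drugs))
ae <- rbinom(100, 1, 0.3)
# Fit the lasso and select covariates with the BIC
lb <- lasso_bic(x = drugs, y = ae, maxp = 20)
# Names of the selected covariates
lb$selected_variables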
# }