ci.uniReg: Compute bootstrap confidence intervals for a univariate guided regression model

Description

Fit a univariate-guided sparse regression (lasso), by a two-stage procedure. The first stage fits p separate univariate models to the response. The second stage gives more weight to the more important univariate features, and preserves their signs. Conveniently, it returns an objects that inherits from glmnet, so that all of the methods for glmnet can be applied, such as predict, plot, coef andprint.

Fit a univariate-guided regression, by a two-stage procedure. The first stage fits p separate univariate models to the response. The second stage fits a regression model, preserving the univariate signs.

Usage

ci.uniReg(
  x,
  y,
  family = c("gaussian", "binomial", "cox"),
  weights = NULL,
  B = 500,
  alpha = 0.05,
  ...
)
uniLasso(
  x,
  y,
  family = c("gaussian", "binomial", "cox"),
  weights = NULL,
  loo = TRUE,
  lower.limits = 0,
  standardize = FALSE,
  info = NULL,
  loob.nit = 2,
  loob.ridge = 0,
  loob.eps = 1e-06,
  ...
)
uniReg(
  x,
  y,
  family = c("gaussian", "binomial", "cox"),
  weights = NULL,
  loo = TRUE,
  lower.limits = 0,
  standardize = FALSE,
  info = NULL,
  loob.nit = 2,
  loob.ridge = 0,
  loob.eps = 1e-04,
  hard.zero = TRUE,
  ...
)

Value

An object that inherits from "glmnet". There is one additional parameter returned, which is info and has two components. They are beta0 and beta, the intercepts and slopes for the usual (non-LOO) univariate fits from stage 1.

Arguments

x: Input matrix, of dimension nobs x nvars; each row is an observation vector.
y: Response variable. Quantitative for family = "gaussian" or family = "poisson" (non-negative counts). For family="binomial", should be a numeric vector consisting of 0s and 1s. For family="cox", y should be a two-column matrix with columns named 'time' and 'status'. The latter is a binary variable, with '1' indicating death, and '0' indicating right-censored.
family: one of "gaussian","binomial" or "cox". Currently only these families are implemented. In the future others will be added.
weights: optional vector of non-negative weights, default is NULL which results in all weights = 1.
B: Number of bootstrap samples. Default is 500.
alpha: size of confidence interval.
loo: TRUE (the default) means that uniLasso uses the prevalidated loo fits (approximate loo or 'alo' for "binomial" and "cox") for each univariate model as features to avoid overfitting. loo=FALSE means it uses the univariate fitted predictor.
lower.limits: = 0 (default) means that uniLasso constrains the sign of the coefs produced in the second round to be the same as those in the univariate fits. (Since uniLasso uses the univariate fits as features, a positivity constraint at the second stage is equivalent.)
standardize: input argument to glmnet for final non-negative lasso fit. Strongly recommend standardize=FALSE (default) since the univariate fit determines the correct scale for each variable.
info: Users can supply results of uniInfo on external datasets rather than compute them on the same data used to fit the model.
loob.nit: Number of Newton iterations for GLM or Cox in computing univariate linear predictors. Default is 2.
loob.ridge: A nonnegative number to apply ridge penalization to the slope parameters. This is helpful if some of the variables are near constant or have very small standard deviations. Default is 0.0.
loob.eps: A small number used in regularizing the Hessian for the Cox model. Default is 1e-6.
hard.zero: if TRUE (default), the model fits the unpenalized regression. This is potentially unstable when p > n. In this case hard.zero = FALSE might be preferable, and the model is then fit using the smallest value of lambda in the path.
...: additional arguments passed to glmnet.

Details

Fits a two stage lasso model. First stage replaces each feature by the univariate fit for that feature. Second stage fits a (positive) lasso using the first stage features (which preserves the signs of the first stage model). Hence the second stage selects and modifies the coefficients of the first stage model, similar to the adaptive lasso. Leads to sparser and more interpretable models.

For "binomial" family y is a binary response. For "cox" family, y should be a Surv object for right censored data, or a matrix with columns labeled 'time' and 'status' Although glmnet has more flexible options say for binary responses, and for cox responses, these are not yet implemented (but are possible and will appear in future versions). Likewise, other glm families are possible as well, but not yet implemented.

loo = TRUE means it uses the prevalidated loo fits (approximate loo or 'alo' for binomial and cox) for each univariate model as features to avoid overfitting in the second stage. The coefficients are then multiplied into the original univariate coefficients to get the final model.

loo = FALSE means it uses the univariate fitted predictor, and hence it is a form of adaptive lasso, but tends to overfit. lower.limits = 0 means uniLasso constrains the sign of the coefs in the second round to be that of the univariate fits.

Examples

Run this code

# ci.uniReg usage

sigma =3
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)*sigma
ci <- ci.uniReg(x, y, B=100)
print(ci)

# uniLasso usage
# Default uniLasso usage for Gaussian data

sigma =3
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)*sigma
xtest=matrix(rnorm(n * p), n, p)
ytest <- xtest %*% beta + rnorm(n)*sigma

fit <- uniLasso(x, y)
plot(fit)
predict(fit,xtest[1:10,],s=1) #predict on test data

# Two-stage variation where we carve off a small dataset for computing the univariate coefs.

cset=1:20
info = uniInfo(x[cset,],y[cset])
fit_two_stage <- uniLasso(x[-cset,], y[-cset], info = info)
plot(fit_two_stage)

# Binomial response uniLasso

yb =as.numeric(y>0)
fitb = uniLasso(x, y)
predict(fitb, xtest[1:10,], s=1, type="response")

# uniLasso with same positivity constraints, but starting `beta`
# from univariate fits on the same data. With loo=FALSE, does not tend to do as well,
# probably due to overfitting.

 fit_pos_adapt <- uniLasso(x, y, loo = FALSE)
 plot(fit_pos_adapt)

# uniLasso with no constraints, but starting `beta` from univariate fits.
# This is a version of the adaptive lasso, which tends to overfit, and loses interpretability.

 fit_adapt <- uniLasso(x, y, loo = FALSE, lower.limits = -Inf)
 plot(fit_adapt)

# Cox response uniLasso

set.seed(10101)
N = 1000
p = 30
nzc = p/3
x = matrix(rnorm(N * p), N, p)
beta = rnorm(nzc)
fx = x[, seq(nzc)] %*% beta/3
hx = exp(fx)
ty = rexp(N, hx)
tcens = rbinom(n = N, prob = 0.3, size = 1)  # censoring indicator
y = cbind(time = ty, status = 1 - tcens)  # y=Surv(ty,1-tcens) with library(survival)
fitc = uniLasso(x, y, family = "cox")
plot(fitc)

# uniReg usage

sigma =3
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)*sigma
xtest=matrix(rnorm(n * p), n, p)

fit <- uniReg(x, y)
predict(fit,xtest[1:10,]) #predict on test data
coef(fit)
print(fit)

fita <- uniReg(x, y, hard.zero = FALSE)
print(fita)

fitb <- uniReg(x, y>0, family = "binomial")
coef(fitb)
print(fitb)

Run the code above in your browser using DataLab