stabsel: Stability Selection

Description

Selection of influential variables or model components with error control.

Usage

## generic stability selection funcion
stabsel(x, ...)
## a method to fit models with stability selection
# S3 method for matrix
stabsel(x, y, fitfun = lars.lasso,
        args.fitfun = list(), cutoff, q, PFER,
        folds = subsample(rep(1, nrow(x)), B = B),
        B = ifelse(sampling.type == "MB", 100, 50),
        assumption = c("unimodal", "r-concave", "none"),
        sampling.type = c("SS", "MB"),
        papply = mclapply, mc.preschedule = FALSE,
        verbose = TRUE, FWER, eval = TRUE, ...)
## essentially a wrapper for data.frames (see details)
# S3 method for data.frame
stabsel(x,  y, intercept = FALSE, ...)

Arguments

a matrix or a data.frame containing the predictors.

a vector or matrix containing the outcome.

intercept

logical. If x is a data.frame, this argument determines if the resulting model matrix should contain a separate intercept or not.

fitfun

a function that takes the arguments x, y as above, and additionally the number of variables to include in each model q. The function then needs to fit the model and to return a logical vector that indicates which variable was selected (among the q selected variables).

args.fitfun

a named list containing additional arguments that are passed to the fitting function; see also argument args in do.call.

cutoff

cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used.

number of (unique) selected variables (or groups of variables depending on the model) that are selected on each subsample.

PFER

upper bound for the per-family error rate. This specifies the amount of falsely selected base-learners, which is tolerated. See details.

folds

a weight matrix with number of rows equal to the number of observations, see subsample. Usually one should not change the default here as subsampling with a fraction of \(1/2\) is needed for the error bounds to hold. One usage scenario where specifying the folds by hand might be the case when one has dependent data (e.g. clusters) and thus wants to draw clusters (i.e., multiple rows together) not individuals.

assumption

Defines the type of assumptions on the distributions of the selection probabilities and simultaneous selection probabilities. Only applicable for sampling.type = "SS". For sampling.type = "MB" we always use code"none".

sampling.type

use sampling scheme of of Shah & Samworth (2013), i.e., with complementarty pairs (sampling.type = "SS"), or the original sampling scheme of Meinshausen & Buehlmann (2010).

number of subsampling replicates. Per default, we use 50 complementary pairs for the error bounds of Shah & Samworth (2013) and 100 for the error bound derived in Meinshausen & Buehlmann (2010). As we use \(B\) complementray pairs in the former case this leads to \(2B\) subsamples.

papply

(parallel) apply function, defaults to mclapply. Alternatively, parLapply can be used. In the latter case, usually more setup is needed (see example of cvrisk for some details).

mc.preschedule

preschedule tasks if papply = mclapply (default: mc.preschedule = FALSE)? For details see mclapply.

verbose

logical (default: TRUE) that determines wether warnings should be issued.

FWER

deprecated. Only for compatibility with older versions, use PFER instead.

eval

logical. Determines whether stability selection is evaluated (eval = TRUE; default) or if only the parameter combination is returned.

…

additional arguments to parallel apply methods such as mclapply.

Value

An object of class stabsel with a special print method. The object has the following elements:

phat

selection probabilities.

selected

elements with maximal selection probability greater cutoff.

max

maximum of selection probabilities.

cutoff

cutoff used.

average number of selected variables used.

PFER

(realized) upper bound for the per-family error rate.

specifiedPFER

specified upper bound for the per-family error rate.

the number of effects subject to selection.

the number of subsamples.

sampling.type

the sampling type used for stability selection.

assumption

the assumptions made on the selection probabilities.

call

the call.

Details

This function implements the stability selection procedure by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013). For details see also Hofner et al. (2014). The error bounds are implemented in the function stabsel_parameters. Two of the three arguments cutoff, q and PFER must be specified. The per-family error rate (PFER), i.e., the expected number of false positives \(E(V)\), where \(V\) is the number of false positives, is bounded by the argument PFER.

As controlling the PFER is more conservative as controlling the family-wise error rate (FWER), the procedure also controlls the FWER, i.e., the probability of selecting at least one non-influential variable (or model component) is less than PFER.

Predefined fitfuns functions exist but more can be easily implemented. Note that stepwise regression methods are usually not advised as they tend to be relatively unstable. See example below.

The function stabsel for data.frames is essentially just a wrapper to the matrix function with the same argments. The only difference is that in a pre-processing step, the data set is converted to a model matrix using the function model.matrix. The additional argument intercept determines if an explicit intercept should be added to the model matrix. This is often not neccessary but depends on the fitfun.

References

B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16:144. 10.1186/s12859-015-0575-3.

N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417--473.

R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55--80.

Examples

Run this code

# NOT RUN {
  
  if (require("TH.data")) {
      ## make data set available
      data("bodyfat", package = "TH.data")
  } else {
      ## simulate some data if TH.data not available. 
      ## Note that results are non-sense with this data.
      bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
  }
  
  ## set seed
  set.seed(1234)
  
  ####################################################################
  ### using stability selection with Lasso methods:

  if (require("lars")) {
      (stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                             fitfun = lars.lasso, cutoff = 0.75,
                             PFER = 1))
      (stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                                fitfun = lars.stepwise, cutoff = 0.75,
                                PFER = 1))
      par(mfrow = c(2, 1))
      plot(stab.lasso, main = "Lasso")
      plot(stab.stepwise, main = "Stepwise Selection")
      ## --> stepwise selection seems to be quite unstable even in this low
      ##     dimensional example!
  }

  ## set seed (again to make results comparable)
  set.seed(1234)
  if (require("glmnet")) {
      (stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                              fitfun = glmnet.lasso, cutoff = 0.75,
                              PFER = 1))
      par(mfrow = c(2, 1))
      plot(stab.glmnet, main = "Lasso (glmnet)")
      if (exists("stab.lasso"))
          plot(stab.lasso, main = "Lasso (lars)")    
  }
  
  
  ## Select variables with maximum coefficients based on lasso estimate
  
  set.seed(1234) # reset seed
  if (require("glmnet")) {
      ## use cross-validated lambda 
      lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min
      (stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
                               fitfun = glmnet.lasso_maxCoef, 
                               # specify additional parameters to fitfun
                               args.fitfun = list(lambda = lambda.min),
                               cutoff = 0.75, PFER = 1))
                               
      ## WARNING: Using a fixed penalty (lambda) is usually not permitted and 
      ##          not sensible. See ?fitfun for details.
      
      ## now compare standard lasso with "maximal parameter estimates" from lasso
      par(mfrow = c(2, 1))
      plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)")
      plot(stab.glmnet, main = "Lasso (glmnet)")
      ## --> very different results.
  }

  ####################################################################
  ### using stability selection directly on computed boosting models
  ### from mboost


  if (require("mboost")) {
      ### low-dimensional example
      mod <- glmboost(DEXfat ~ ., data = bodyfat)

      ## compute cutoff ahead of running stabsel to see if it is a sensible
      ## parameter choice.
      ##   p = ncol(bodyfat) - 1 (= Outcome) + 1 ( = Intercept)
      stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1,
                         sampling.type = "MB")
      ## the same:
      stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE)

      
# }
# NOT RUN {
### Do not test the following code per default on CRAN as it takes some time to run:
      ## now run stability selection
      (sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB"))
      opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
      plot(sbody)
      par(opar)
      plot(sbody, type = "maxsel", ymargin = 6)
      
# }
# NOT RUN {
  }
# }

Run the code above in your browser using DataLab