Multiple resampling procedures for selecting variables for a final network model. Three resampling methods are available, each of which can be parameterized in a variety of ways. The ultimate goal is to fit models across iterated resamples with variable selection procedures built in, so as to home in on the best predictors to include in a given model. The available methods are: bootstrapped resampling, multi-sample splitting, and stability selection.
resample(
data,
m = NULL,
niter = 10,
sampMethod = "bootstrap",
criterion = "AIC",
method = "glmnet",
rule = "OR",
gamma = 0.5,
nfolds = 10,
nlam = 50,
which.lam = "min",
threshold = FALSE,
bonf = FALSE,
alpha = 0.05,
exogenous = TRUE,
split = 0.5,
center = TRUE,
scale = FALSE,
varSeed = NULL,
seed = NULL,
verbose = TRUE,
lags = NULL,
binary = NULL,
type = "g",
saveMods = TRUE,
saveData = FALSE,
saveVars = FALSE,
fitit = TRUE,
nCores = 1,
cluster = "mclapply",
block = FALSE,
beepno = NULL,
dayno = NULL,
...
)
data: An n x k dataframe. A matrix cannot be supplied as input.
m: Character vector or numeric vector indicating the moderator(s), if any. Can also specify "all" to make every variable serve as a moderator, or 0 to indicate that there are no moderators. If the length of m is k - 1 or longer, then it will not be possible to treat the moderators as exogenous variables, and exogenous will automatically be set to FALSE.
niter: Number of iterations for the resampling procedure.
sampMethod: Character string indicating which type of procedure to use. "bootstrap" is a standard bootstrapping procedure. "split" is the multi-sample split procedure, where the data are split into disjoint training and test sets, the variables to be modeled are selected based on the training set, and then the final model is fit to the test set. "stability" is stability selection, where models are fit to each of two disjoint subsamples of the data, and it is calculated how frequently each variable is selected in each subset, as well as how frequently the variables are simultaneously selected in both subsets at each iteration.
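For concreteness, a minimal sketch of invoking each procedure (using the ggmDat example dataset from the Examples section; all other arguments left at their defaults):

```r
# Sketch: the same data run through each of the three resampling procedures.
# 'ggmDat' is the example dataset used in the Examples section.
boot1  <- resample(ggmDat, niter = 10, sampMethod = "bootstrap")
split1 <- resample(ggmDat, niter = 10, sampMethod = "split")
stab1  <- resample(ggmDat, niter = 10, sampMethod = "stability")
```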
criterion: The criterion for the variable selection procedure. Options include: "cv", "aic", "bic", "ebic", "cp", "rss", "adjr2", "rsq", "r2". "CV" refers to cross-validation; the information criteria are "AIC", "BIC", "EBIC", and "Cp", which refers to Mallows' Cp. "RSS" is the residual sum of squares, "adjR2" is adjusted R-squared, and "Rsq" or "R2" is R-squared. Capitalization is ignored. For methods based on the LASSO, only "CV", "AIC", "BIC", and "EBIC" are available. For methods based on subset selection, only "Cp", "BIC", "RSS", "adjR2", and "R2" are available.
method: Character string indicating which method to use for variable selection. Options include "lasso" and "glmnet", both of which use the LASSO via the glmnet package (either with glmnet::glmnet or glmnet::cv.glmnet, depending upon the criterion). "subset", "backward", "forward", and "seqrep" all call different types of subset selection using the leaps::regsubsets function. Finally, "glinternet" applies the hierarchical lasso, and is the only method available for moderated network estimation (either with glinternet::glinternet or glinternet::glinternet.cv, depending upon the criterion). If one or more moderators are specified, then method will automatically default to "glinternet".
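A brief sketch of how the method argument interacts with moderators (ggmDat as in the Examples section; exact results depend on the data):

```r
# With a moderator specified, method automatically defaults to "glinternet"
fitMod <- resample(ggmDat, m = "M", niter = 10)

# Without moderators, other selection methods are available,
# e.g. best-subset selection with Mallows' Cp as the criterion
fitSub <- resample(ggmDat, niter = 10, method = "subset", criterion = "Cp")
```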
rule: Only applies to GGMs (including between-subjects networks) when a threshold is supplied. The "AND" rule will only preserve an edge when both corresponding coefficients have p-values below the threshold, while the "OR" rule will preserve an edge so long as at least one of the two coefficients has a p-value below the supplied threshold.
gamma: Numeric value of the hyperparameter for the "EBIC" criterion. Only relevant if criterion = "EBIC". Recommended to use a value between 0 and .5, where larger values impose a larger penalty on the criterion.
nfolds: Only relevant if criterion = "CV". Determines the number of folds to use in cross-validation.
nlam: If method = "glinternet", determines the number of lambda values to evaluate in the selection path.
which.lam: Character string. Only applies if criterion = "CV". Options include "min", which uses the lambda value that minimizes the objective function, and "1se", which uses the lambda value at 1 standard error above the value that minimizes the objective function.
threshold: Logical or numeric. If TRUE, then a default value of .05 will be set. Indicates whether a p-value threshold should be placed on the models at each iteration of the sampling. This choice can meaningfully affect the results, and so merits careful consideration by the researcher.
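As a sketch, a numeric threshold can be combined with the rule argument described above:

```r
# Apply a p-value threshold of .01; with the "AND" rule, both coefficients
# associated with an edge must fall below the threshold for the edge to be kept
fitThr <- resample(ggmDat, niter = 10, threshold = .01, rule = "AND")
```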
bonf: Logical. Determines whether to apply a Bonferroni adjustment to the distribution of p-values for each coefficient.
alpha: Type I error rate. Defaults to .05.
exogenous: Logical. Indicates whether moderator variables should be treated as exogenous or not. If they are exogenous, they will not be modeled as outcomes/nodes in the network. If the number of moderators reaches k - 1 or k, then exogenous will automatically be set to FALSE.
split: If sampMethod = "split" or sampMethod = "stability", a value between 0 and 1 that indicates the proportion of the sample to be used for the training set. When sampMethod = "stability" there is no meaningful distinction between the labels "training" and "test", although this value will still cause the two subsamples to be drawn with complementary sizes.
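A minimal sketch of a non-default split (ggmDat as in the Examples section):

```r
# Multi-sample splitting with 60% of cases used for training (variable
# selection) and the remaining 40% used to fit the selected model
fitSplit <- resample(ggmDat, niter = 10, sampMethod = "split", split = 0.6)
```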
center: Logical. Determines whether to mean-center the variables.
scale: Logical. Determines whether to standardize the variables.
varSeed: Numeric value providing a seed to be set at the beginning of the selection procedure. Recommended for reproducible results. Importantly, this seed will be used for the variable selection models at each iteration of the resampler. Note that this means that while each model is run with a different sample, it will always have the same seed.
seed: Can be a single value, used to set a seed before drawing niter random seeds to be used across iterations. Alternatively, one can supply a vector of seeds of length niter. For reproducibility, this argument is recommended over the varSeed argument.
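A sketch of both seeding styles (ggmDat as in the Examples section):

```r
# Two runs with the same single-value seed should yield identical results
fitA <- resample(ggmDat, niter = 10, seed = 1)
fitB <- resample(ggmDat, niter = 10, seed = 1)

# Equivalently, supply a vector of seeds, one per iteration
fitC <- resample(ggmDat, niter = 10, seed = 1:10)
```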
verbose: Logical. Determines whether information about the modeling progress should be displayed in the console.
lags: Numeric or logical. Can only be 0, 1, TRUE, or FALSE. NULL is interpreted as FALSE. Indicates whether to fit a time-lagged network or a GGM.
binary: Numeric vector indicating which columns of the data contain binary variables.
type: Determines whether to use gaussian models ("g") or binomial models ("c"). Can also use "gaussian" or "binomial". Moreover, a vector of length k can be provided such that a value is given for every variable. Ultimately this is not necessary, though, as such values are automatically detected.
saveMods: Logical. Indicates whether to save the models fit to the samples at each iteration or not.
saveData: Logical. Determines whether to save the data from each subsample across iterations or not.
saveVars: Logical. Determines whether to save the variable selection models at each iteration.
fitit: Logical. Determines whether to fit the final selected model on the original sample. If FALSE, then this can still be done later with fitNetwork and modSelect.
nCores: Numeric value indicating the number of CPU cores to use for the resampling. If TRUE, then parallel::detectCores will be used to maximize the number of cores available.
cluster: Character string indicating which type of parallelization to use, if nCores > 1. Options include "mclapply" and "SOCK".
block: Logical or numeric. If specified, then this indicates that lags != 0 or lags != NULL. If numeric, then block bootstrapping will be used, and the value specifies the block size. If TRUE, then an appropriate block size will be estimated automatically.
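A sketch of block bootstrapping for temporal data ('gvarDat' as in the Examples section):

```r
# Block bootstrap for a time-lagged network; block = TRUE estimates an
# appropriate block size automatically
fitLag <- resample(gvarDat, niter = 10, lags = 1, block = TRUE)

# Or specify the block size directly
fitLag10 <- resample(gvarDat, niter = 10, lags = 1, block = 10)
```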
beepno: Character string or numeric value indicating which variable (if any) encodes the survey number within a single day. Must be used in conjunction with the dayno argument.
dayno: Character string or numeric value indicating which variable (if any) encodes the survey day. Must be used in conjunction with the beepno argument.
...: Additional arguments.
resample
output
Sampling methods can be specified via the sampMethod
argument.
Bootstrapped resampling: a standard bootstrapping procedure, wherein a bootstrapped sample of size n is drawn with replacement at each iteration. Then, a variable selection procedure is applied to the sample, and the selected model is fit to obtain the parameter values. P-values and confidence intervals for the parameter distributions are then estimated.
Multi-sample splitting: involves taking two disjoint samples from the original data -- a training sample and a test sample. At each iteration the variable selection procedure is applied to the training sample, and then the resultant model is fit to the test sample. Parameters are then aggregated based on the coefficients in the models fit to the test samples.
Stability selection: begins the same as multi-sample splitting, in that two disjoint samples are drawn from the data at each iteration. However, the variable selection procedure is then applied to each of the two subsamples at each iteration. The objective is to compute the proportion of times that each predictor was selected in each subsample across iterations, as well as the proportion of times that it was simultaneously selected in both disjoint samples. At the end of the resampling, the final model is selected by setting a frequency threshold between 0 and 1: the minimum proportion of samples in which a variable must have been selected for it to be retained in the final model.
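A sketch of that workflow; the call to modSelect (referenced in this documentation) and its 'thresh' argument name are assumptions:

```r
# Run stability selection without fitting a final model immediately
stab1 <- resample(ggmDat, niter = 10, sampMethod = "stability", fitit = FALSE)

# Retain variables selected in at least 60% of subsamples and fit the final
# model (the 'thresh' argument name here is an assumption)
final <- modSelect(stab1, thresh = 0.6)
```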
For the bootstrapping and multi-sample split methods, p-values are aggregated for each parameter using a method developed by Meinshausen, Meier, & Buhlmann (2009) that employs error control based on the false-discovery rate. The same procedure is employed for creating adjusted confidence intervals.
A key distinguishing feature of the bootstrapping procedure implemented in
this function versus the bootNet
function is that the latter is
designed to estimate the parameter distributions of a single model, whereas
the version here is aimed at using the bootstrapped resamples to select a
final model. In a practical sense, this boils down to using the bootstrapping
method in the resample
function to perform variable selection
at each iteration of the resampling, rather than taking a single constrained
model and applying it equally at all iterations.
Meinshausen, N., Meier, L., & Buhlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association. 104, 1671-1681.
Meinshausen, N., & Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 417-473.
plot.resample, modSelect, fitNetwork,
bootNet, mlGVAR, plotNet, plotCoefs,
plotBoot, plotPvals, plotStability, net,
netInts, glinternet::glinternet,
glinternet::glinternet.cv,
glmnet::glmnet,
glmnet::cv.glmnet,
leaps::regsubsets
# NOT RUN {
fit1 <- resample(ggmDat, m = 'M', niter = 10)
net(fit1)
netInts(fit1)
plot(fit1)
plot(fit1, what = 'coefs')
plot(fit1, what = 'bootstrap', multi = TRUE)
plot(fit1, what = 'pvals', outcome = 2, predictor = 4)
fit2 <- resample(gvarDat, m = 'M', niter = 10, lags = 1, sampMethod = 'stability')
plot(fit2, what = 'stability', outcome = 3)
# }