Multiple resampling procedures for selecting variables for a final network model. Three resampling methods are available, each of which can be parameterized in a variety of ways. The ultimate goal is to fit models across iterated resamples with variable selection procedures built in, so as to home in on the best predictors to include in a given model. The available methods are: bootstrapped resampling, multi-sample splitting, and stability selection.
resample(
data,
m = NULL,
niter = 10,
sampMethod = "bootstrap",
criterion = "AIC",
method = "glmnet",
rule = "OR",
gamma = 0.5,
nfolds = 10,
nlam = 50,
which.lam = "min",
threshold = FALSE,
bonf = FALSE,
alpha = 0.05,
exogenous = TRUE,
split = 0.5,
center = TRUE,
scale = FALSE,
varSeed = NULL,
seed = NULL,
verbose = TRUE,
lags = NULL,
binary = NULL,
type = "g",
saveMods = TRUE,
saveData = FALSE,
saveVars = FALSE,
fitit = TRUE,
nCores = 1,
cluster = "mclapply",
block = FALSE,
beepno = NULL,
dayno = NULL,
...
)
data: An n x k dataframe. A matrix cannot be supplied as input.
m: Character vector or numeric vector indicating the moderator(s), if any. Can also specify "all" to make every variable serve as a moderator, or 0 to indicate that there are no moderators. If the length of m is k - 1 or longer, then it will not be possible to treat the moderators as exogenous variables, and exogenous will automatically be set to FALSE.
niter: Number of iterations for the resampling procedure.
sampMethod: Character string indicating which type of procedure to use. "bootstrap" is a standard bootstrapping procedure. "split" is the multi-sample split procedure, where the data are split into disjoint training and test sets, the variables to be modeled are selected based on the training set, and then the final model is fit to the test set. "stability" is stability selection, where models are fit to each of two disjoint subsamples of the data, and it is calculated how frequently each variable is selected in each subset, as well as how frequently the variables are simultaneously selected in both subsets at each iteration.
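For concreteness, a minimal sketch of invoking each procedure (using the ggmDat example dataset from the Examples section; all other arguments left at their defaults):

```r
# Sketch: the same data run through each of the three resampling procedures.
# 'ggmDat' is the example dataset used in the Examples section.
boot1  <- resample(ggmDat, niter = 10, sampMethod = "bootstrap")
split1 <- resample(ggmDat, niter = 10, sampMethod = "split")
stab1  <- resample(ggmDat, niter = 10, sampMethod = "stability")
```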
criterion: The criterion for the variable selection procedure. Options include: "cv", "aic", "bic", "ebic", "cp", "rss", "adjr2", "rsq", "r2". "CV" refers to cross-validation; the information criteria are "AIC", "BIC", "EBIC", and "Cp", which refers to Mallows' Cp. "RSS" is the residual sum of squares, "adjR2" is adjusted R-squared, and "Rsq" or "R2" is R-squared. Capitalization is ignored. For methods based on the LASSO, only "CV", "AIC", "BIC", and "EBIC" are available. For methods based on subset selection, only "Cp", "BIC", "RSS", "adjR2", and "R2" are available.
method: Character string indicating which method to use for variable selection. Options include "lasso" and "glmnet", both of which use the LASSO via the glmnet package (either with glmnet::glmnet or glmnet::cv.glmnet, depending upon the criterion). "subset", "backward", "forward", and "seqrep" all call different types of subset selection using the leaps::regsubsets function. Finally, "glinternet" applies the hierarchical lasso, and is the only method available for moderated network estimation (either with glinternet::glinternet or glinternet::glinternet.cv, depending upon the criterion). If one or more moderators are specified, then method will automatically default to "glinternet".
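A brief sketch of how the method argument interacts with moderators (ggmDat as in the Examples section; exact results depend on the data):

```r
# With a moderator specified, method automatically defaults to "glinternet"
fitMod <- resample(ggmDat, m = "M", niter = 10)

# Without moderators, other selection methods are available,
# e.g. best-subset selection with Mallows' Cp as the criterion
fitSub <- resample(ggmDat, niter = 10, method = "subset", criterion = "Cp")
```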
rule: Only applies to GGMs (including between-subjects networks) when a threshold is supplied. The "AND" rule will only preserve an edge when both corresponding coefficients have p-values below the threshold, while the "OR" rule will preserve an edge so long as at least one of the two coefficients has a p-value below the supplied threshold.
gamma: Numeric value of the hyperparameter for the "EBIC" criterion. Only relevant if criterion = "EBIC". Recommended to use a value between 0 and .5, where larger values impose a larger penalty on the criterion.
nfolds: Only relevant if criterion = "CV". Determines the number of folds to use in cross-validation.
nlam: If method = "glinternet", determines the number of lambda values to evaluate in the selection path.
which.lam: Character string. Only applies if criterion = "CV". Options include "min", which uses the lambda value that minimizes the objective function, and "1se", which uses the lambda value at 1 standard error above the value that minimizes the objective function.
threshold: Logical or numeric. If TRUE, then a default value of .05 will be set. Indicates whether a p-value threshold should be placed on the models at each iteration of the sampling. This choice can meaningfully affect the results, and so merits careful consideration by the researcher.
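As a sketch, a numeric threshold can be combined with the rule argument described above:

```r
# Apply a p-value threshold of .01; with the "AND" rule, both coefficients
# associated with an edge must fall below the threshold for the edge to be kept
fitThr <- resample(ggmDat, niter = 10, threshold = .01, rule = "AND")
```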
bonf: Logical. Determines whether to apply a Bonferroni adjustment to the distribution of p-values for each coefficient.
alpha: Type I error rate. Defaults to .05.
exogenous: Logical. Indicates whether moderator variables should be treated as exogenous or not. If they are exogenous, they will not be modeled as outcomes/nodes in the network. If the number of moderators reaches k - 1 or k, then exogenous will automatically be set to FALSE.
split: If sampMethod = "split" or sampMethod = "stability", a value between 0 and 1 that indicates the proportion of the sample to be used for the training set. When sampMethod = "stability" there is no meaningful distinction between the labels "training" and "test", although this value will still cause the two subsamples to be drawn with complementary sizes.
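A minimal sketch of a non-default split (ggmDat as in the Examples section):

```r
# Multi-sample splitting with 60% of cases used for training (variable
# selection) and the remaining 40% used to fit the selected model
fitSplit <- resample(ggmDat, niter = 10, sampMethod = "split", split = 0.6)
```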
center: Logical. Determines whether to mean-center the variables.
scale: Logical. Determines whether to standardize the variables.
varSeed: Numeric value providing a seed to be set at the beginning of the selection procedure. Recommended for reproducible results. Importantly, this seed will be used for the variable selection models at each iteration of the resampler. Note that this means that while each model is run with a different sample, it will always have the same seed.
seed: Can be a single value, used to set a seed before drawing niter random seeds to be used across iterations. Alternatively, one can supply a vector of seeds of length niter. For reproducibility, this argument is recommended over the varSeed argument.
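A sketch of both seeding styles (ggmDat as in the Examples section):

```r
# Two runs with the same single-value seed should yield identical results
fitA <- resample(ggmDat, niter = 10, seed = 1)
fitB <- resample(ggmDat, niter = 10, seed = 1)

# Equivalently, supply a vector of seeds, one per iteration
fitC <- resample(ggmDat, niter = 10, seed = 1:10)
```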
verbose: Logical. Determines whether information about the modeling progress should be displayed in the console.
lags: Numeric or logical. Can only be 0, 1, TRUE, or FALSE. NULL is interpreted as FALSE. Indicates whether to fit a time-lagged network or a GGM.
binary: Numeric vector indicating which columns of the data contain binary variables.
type: Determines whether to use gaussian models ("g") or binomial models ("c"). Can also use "gaussian" or "binomial". Moreover, a vector of length k can be provided such that a value is given for every variable. Ultimately this is not necessary, though, as such values are automatically detected.
saveMods: Logical. Indicates whether to save the models fit to the samples at each iteration or not.
saveData: Logical. Determines whether to save the data from each subsample across iterations or not.
saveVars: Logical. Determines whether to save the variable selection models at each iteration.
fitit: Logical. Determines whether to fit the final selected model on the original sample. If FALSE, then this can still be done later with fitNetwork and modSelect.
nCores: Numeric value indicating the number of CPU cores to use for the resampling. If TRUE, then parallel::detectCores will be used to maximize the number of cores available.
cluster: Character string indicating which type of parallelization to use, if nCores > 1. Options include "mclapply" and "SOCK".
block: Logical or numeric. If specified, then this indicates that lags != 0 or lags != NULL. If numeric, then block bootstrapping will be used, and the value specifies the block size. If TRUE, then an appropriate block size will be estimated automatically.
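A sketch of block bootstrapping for temporal data ('gvarDat' as in the Examples section):

```r
# Block bootstrap for a time-lagged network; block = TRUE estimates an
# appropriate block size automatically
fitLag <- resample(gvarDat, niter = 10, lags = 1, block = TRUE)

# Or specify the block size directly
fitLag10 <- resample(gvarDat, niter = 10, lags = 1, block = 10)
```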
beepno: Character string or numeric value indicating which variable (if any) encodes the survey number within a single day. Must be used in conjunction with the dayno argument.
dayno: Character string or numeric value indicating which variable (if any) encodes the survey day. Must be used in conjunction with the beepno argument.
...: Additional arguments.
resample
output
Sampling methods can be specified via the sampMethod
argument.
Bootstrapped resampling: a standard bootstrapping procedure, wherein a bootstrapped sample of size n is drawn with replacement at each iteration. Then, a variable selection procedure is applied to the sample, and the selected model is fit to obtain the parameter values. P-values and confidence intervals for the parameter distributions are then estimated.
Multi-sample splitting: involves taking two disjoint samples from the original data -- a training sample and a test sample. At each iteration the variable selection procedure is applied to the training sample, and then the resultant model is fit to the test sample. Parameters are then aggregated based on the coefficients in the models fit to the test samples.
Stability selection: begins the same as multi-sample splitting, in that two disjoint samples are drawn from the data at each iteration. However, the variable selection procedure is then applied to each of the two subsamples at each iteration. The objective is to compute the proportion of times that each predictor was selected in each subsample across iterations, as well as the proportion of times that it was simultaneously selected in both disjoint samples. At the end of the resampling, the final model is selected by setting a frequency threshold between 0 and 1: the minimum proportion of samples in which a variable must have been selected for it to be retained in the final model.
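A sketch of that workflow; the call to modSelect (referenced in this documentation) and its 'thresh' argument name are assumptions:

```r
# Run stability selection without fitting a final model immediately
stab1 <- resample(ggmDat, niter = 10, sampMethod = "stability", fitit = FALSE)

# Retain variables selected in at least 60% of subsamples and fit the final
# model (the 'thresh' argument name here is an assumption)
final <- modSelect(stab1, thresh = 0.6)
```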
For the bootstrapping and multi-sample split methods, p-values are aggregated for each parameter using a method developed by Meinshausen, Meier, & Buhlmann (2009) that employs error control based on the false-discovery rate. The same procedure is employed for creating adjusted confidence intervals.
A key distinguishing feature of the bootstrapping procedure implemented in
this function versus the bootNet
function is that the latter is
designed to estimate the parameter distributions of a single model, whereas
the version here is aimed at using the bootstrapped resamples to select a
final model. In a practical sense, this boils down to using the bootstrapping
method in the resample
function to perform variable selection
at each iteration of the resampling, rather than taking a single constrained
model and applying it equally at all iterations.
Meinshausen, N., Meier, L., & Buhlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association. 104, 1671-1681.
Meinshausen, N., & Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 417-473.
plot.resample, modSelect, fitNetwork,
bootNet, mlGVAR, plotNet, plotCoefs,
plotBoot, plotPvals, plotStability, net,
netInts, glinternet::glinternet,
glinternet::glinternet.cv,
glmnet::glmnet,
glmnet::cv.glmnet,
leaps::regsubsets
# NOT RUN {
fit1 <- resample(ggmDat, m = 'M', niter = 10)
net(fit1)
netInts(fit1)
plot(fit1)
plot(fit1, what = 'coefs')
plot(fit1, what = 'bootstrap', multi = TRUE)
plot(fit1, what = 'pvals', outcome = 2, predictor = 4)
fit2 <- resample(gvarDat, m = 'M', niter = 10, lags = 1, sampMethod = 'stability')
plot(fit2, what = 'stability', outcome = 3)
# }