Runs stability selection regression models with different combinations of parameters controlling the sparsity of the underlying selection algorithm (e.g. penalty parameter for regularised models) and thresholds in selection proportions. These two parameters are jointly calibrated by maximising the stability score of the model (possibly under a constraint on the expected number of falsely stably selected features). This function uses a serial implementation and requires the grid of parameters controlling the underlying algorithm as input (for internal use only).
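To make the idea of selection proportions concrete, the base R sketch below refits a simple selector on subsamples for several sparsity levels and records how often each predictor is selected. It is not the package's implementation: the simulated data, the selector (a cut-off on absolute marginal correlations standing in for a penalised model) and the cut-off values are all assumptions made for illustration.

```r
# Toy data (simulated for illustration): only var1 and var2 contribute to the outcome
set.seed(1)
n <- 100
p <- 10
xdata <- matrix(rnorm(n * p), nrow = n)
colnames(xdata) <- paste0("var", seq_len(p))
ydata <- xdata[, 1] + xdata[, 2] + rnorm(n)

# Stand-in for the grid of sparsity parameters: cut-offs on absolute correlations
# (larger values give sparser models, as with a penalty parameter)
Lambda <- c(0.4, 0.3, 0.2)
K <- 50     # number of resampling iterations
tau <- 0.5  # proportion of observations in each subsample

# Count how often each predictor is selected for each value in Lambda
counts <- matrix(0,
  nrow = length(Lambda), ncol = p,
  dimnames = list(NULL, colnames(xdata))
)
for (k in seq_len(K)) {
  ids <- sample.int(n, size = floor(tau * n)) # subsampling without replacement
  abs_cor <- abs(cor(xdata[ids, ], ydata[ids]))[, 1]
  for (i in seq_along(Lambda)) {
    counts[i, ] <- counts[i, ] + (abs_cor > Lambda[i])
  }
}

# Selection proportions: rows follow Lambda, columns follow the predictors in xdata
selprop <- counts / K
print(round(selprop, 2))
```

A predictor would then be called stably selected for a given row if its selection proportion reaches the chosen threshold in pi_list.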
SerialRegression(
  xdata,
  ydata = NULL,
  Lambda,
  pi_list = seq(0.6, 0.9, by = 0.01),
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = 3,
  family = "gaussian",
  implementation = PenalisedRegression,
  resampling = "subsampling",
  cpss = FALSE,
  PFER_method = "MB",
  PFER_thr = Inf,
  FDP_thr = Inf,
  group_x = NULL,
  group_penalisation = FALSE,
  output_data = FALSE,
  verbose = TRUE,
  ...
)
A list with:
a matrix of the best stability scores for different parameters controlling the level of sparsity in the underlying algorithm.
a matrix of parameters controlling the level of sparsity in the underlying algorithm.
a matrix of the average number of selected features by the underlying algorithm with different parameters controlling the level of sparsity.
a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity.
a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm.
a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity.
a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity.
a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
a matrix of selection proportions. Columns correspond to predictors from xdata.
an array of model coefficients. Columns correspond to predictors from xdata. Indices along the third dimension correspond to different resampling iterations. With multivariate outcomes, indices along the fourth dimension correspond to outcome-specific coefficients.
a list with type="variable_selection" and values used for arguments implementation, family, resampling, cpss and PFER_method.
a list with values used for arguments K, pi_list, tau, n_cat, pk, n (number of observations), PFER_thr, FDP_thr and seed. The datasets xdata and ydata are also included if output_data=TRUE.
For all matrices and arrays returned, the rows are ordered in the same way and correspond to parameter values stored in Lambda.
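The way the matrices of scores and error bounds combine during calibration can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the stability scores and average numbers of selected features are made-up toy values, the helper name PFER_MB_sketch is hypothetical, and the bound used is the Meinshausen and Bühlmann (2010) upper-bound q^2 / ((2 * pi - 1) * p), which requires thresholds above 0.5.

```r
# Hypothetical helper: Meinshausen and Buhlmann (2010) upper-bound of the PFER,
# with q the average number of selected features and pi_thr the threshold (> 0.5)
PFER_MB_sketch <- function(q, pi_thr, n_features) {
  q^2 / ((2 * pi_thr - 1) * n_features)
}

pi_list <- c(0.6, 0.7, 0.8, 0.9) # thresholds in selection proportions (columns)
q <- c(5, 10, 20)                # toy average numbers of selected features (rows)
n_features <- 100

# Matrix of PFER upper-bounds: one row per sparsity parameter, one column per threshold
pfer <- outer(q, pi_list, PFER_MB_sketch, n_features = n_features)

# Toy stability scores with the same layout (made up for illustration)
scores <- matrix(c(
  10, 12, 11, 9,
  20, 25, 22, 15,
  18, 30, 21, 14
), nrow = 3, byrow = TRUE)

# Constrained calibration: ignore pairs whose bound exceeds the threshold,
# then pick the pair of parameters with the highest remaining score
PFER_thr <- 3
constrained <- scores
constrained[pfer > PFER_thr] <- NA
best <- which(constrained == max(constrained, na.rm = TRUE), arr.ind = TRUE)[1, ]
print(best)
```

With these toy values the unconstrained maximum sits in the third row, but its bound exceeds PFER_thr, so the constrained choice falls back to the second row and the threshold 0.7. Setting PFER_thr to Inf leaves every pair eligible, which corresponds to unconstrained calibration.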
xdata: matrix of predictors with observations as rows and variables as columns.
ydata: optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.
Lambda: matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in implementation. With implementation="glmnet", Lambda contains penalty parameters.
pi_list: vector of thresholds in selection proportions. If n_cat=NULL or n_cat=2, these values must be >0 and <1. If n_cat=3, these values must be >0.5 and <1.
K: number of resampling iterations.
tau: subsample size, given as the proportion of observations included in each subsample. Only used if resampling="subsampling" and cpss=FALSE.
seed: value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).
n_cat: computation options for the stability score. Use NULL for the score based on a z test, or 2 or 3 for the score based on the negative log-likelihood. In this function, the default is n_cat=3.
family: type of regression model. This argument is defined as in glmnet. Possible values include "gaussian" (linear regression), "binomial" (logistic regression), "multinomial" (multinomial regression), and "cox" (survival analysis).
implementation: function to use for variable selection. Possible functions are: PenalisedRegression, SparsePLS, GroupPLS and SparseGroupPLS. Alternatively, a user-defined function can be provided.
resampling: resampling approach. Possible values are: "subsampling" for sampling without replacement of a proportion tau of the observations, or "bootstrap" for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.
cpss: logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered as selected if it is selected for both subsamples. With this method, the data is split K/2 times (K models are fitted). Only used if PFER_method="MB".
PFER_method: method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If PFER_method="MB", the method proposed by Meinshausen and Bühlmann (2010) is used. If PFER_method="SS", the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.
PFER_thr: threshold in PFER for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
FDP_thr: threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
group_x: vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. Only used for models with group penalisation (e.g. implementation=GroupPLS or implementation=SparseGroupPLS).
group_penalisation: logical indicating if a group penalisation should be considered in the stability score. The use of group_penalisation=TRUE strictly applies to group (not sparse-group) penalisation.
output_data: logical indicating if the input datasets xdata and ydata should be included in the output.
verbose: logical indicating if a loading bar and messages should be printed.
...: additional parameters passed to the functions provided in implementation or resampling.
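The resampling options described above can be sketched in base R. The first part shows what a user-defined resampling function might look like, given the stated requirement that it uses arguments named data and tau and returns observation IDs. The second part illustrates one complementary pair split. The function name SubsampleIDs, the toy data and the selection statuses are hypothetical, and which dataset the package passes as data is not stated here, so NROW is used to handle both vectors and matrices.

```r
# Hypothetical user-defined resampling function (name is illustrative).
# Returns the IDs (row indices) of a proportion tau of the observations,
# sampled without replacement.
SubsampleIDs <- function(data, tau, ...) {
  n <- NROW(data) # works for vectors and matrices
  sample.int(n, size = floor(tau * n), replace = FALSE)
}

set.seed(1)
toy_data <- matrix(rnorm(100 * 5), nrow = 100)
ids <- SubsampleIDs(data = toy_data, tau = 0.5)
length(ids)

# One complementary pair: two non-overlapping subsets of half of the observations
n <- 20
shuffled <- sample.int(n)
half <- floor(n / 2)
ids_first <- shuffled[seq_len(half)]
ids_second <- shuffled[half + seq_len(half)]

# Hypothetical selection status of four features in each half (TRUE = selected)
selected_first <- c(var1 = TRUE, var2 = TRUE, var3 = FALSE, var4 = FALSE)
selected_second <- c(var1 = TRUE, var2 = FALSE, var3 = TRUE, var4 = FALSE)

# As described for cpss, a feature counts as selected only if it is
# selected in both halves of the pair
selected_pair <- selected_first & selected_second
print(selected_pair)
```

In this toy pair only var1 is retained, since it is the only feature selected in both halves.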