Performs stability selection for dimensionality reduction. The underlying variable selection algorithm (e.g. sparse PLS) is run with different combinations of parameters controlling the sparsity (e.g. number of selected variables per component) and thresholds in selection proportions. These hyper-parameters are jointly calibrated by maximisation of the stability score.
BiSelection(
  xdata,
  ydata = NULL,
  group_x = NULL,
  group_y = NULL,
  LambdaX = NULL,
  LambdaY = NULL,
  AlphaX = NULL,
  AlphaY = NULL,
  ncomp = 1,
  scale = TRUE,
  pi_list = seq(0.01, 0.99, by = 0.01),
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = NULL,
  family = "gaussian",
  implementation = SparsePLS,
  resampling = "subsampling",
  cpss = FALSE,
  PFER_method = "MB",
  PFER_thr = Inf,
  FDP_thr = Inf,
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)
An object of class bi_selection. A list with:
summary: a matrix of the best stability scores and corresponding parameters controlling the level of sparsity in the underlying algorithm for different numbers of components. Possible columns include: comp (component index), nx (number of predictors to include, parameter of the underlying algorithm), alphax (sparsity within the predictor groups, parameter of the underlying algorithm), pix (threshold in selection proportion for predictors), ny (number of outcomes to include, parameter of the underlying algorithm), alphay (sparsity within the outcome groups, parameter of the underlying algorithm), piy (threshold in selection proportion for outcomes), S (stability score). Columns that are not relevant to the model are not reported (e.g. alphax and alphay are not returned for sparse PLS models).
summary_full: a matrix of the best stability scores for different combinations of parameters controlling the sparsity and components.
selectedX: a binary matrix encoding stably selected predictors.
selpropX: a matrix of calibrated selection proportions for predictors.
selectedY: a binary matrix encoding stably selected outcomes. Only returned for PLS models.
selpropY: a matrix of calibrated selection proportions for outcomes. Only returned for PLS models.
selected: a binary matrix encoding stable relationships between predictor and outcome variables. Only returned for PLS models.
selectedX_full: a binary matrix encoding stably selected predictors.
selpropX_full: a matrix of selection proportions for predictors.
selectedY_full: a binary matrix encoding stably selected outcomes. Only returned for PLS models.
selpropY_full: a matrix of selection proportions for outcomes. Only returned for PLS models.
coefX: an array of estimated loadings coefficients for the different components (rows), for the predictors (columns), as obtained across the K visited models (along the third dimension).
coefY: an array of estimated loadings coefficients for the different components (rows), for the outcomes (columns), as obtained across the K visited models (along the third dimension). Only returned for PLS models.
a list with type="bi_selection" and values used for arguments implementation, family, scale, resampling, cpss and PFER_method.
a list with values used for arguments K, group_x, group_y, LambdaX, LambdaY, AlphaX, AlphaY, pi_list, tau, n_cat, pk, n (number of observations), PFER_thr, FDP_thr and seed. The datasets xdata and ydata are also included if output_data=TRUE.
The rows of summary and columns of selectedX, selectedY, selpropX, selpropY, selected, coefX and coefY are ordered in the same way and correspond to components and parameter values stored in summary. The rows of summary_full and columns of selectedX_full, selectedY_full, selpropX_full and selpropY_full are ordered in the same way and correspond to components and parameter values stored in summary_full.
xdata: matrix of predictors with observations as rows and variables as columns.
ydata: optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.
group_x: vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. Only used for models with group penalisation (e.g. implementation=GroupPLS or implementation=SparseGroupPLS).
group_y: optional vector encoding the grouping structure among outcomes. This argument indicates the number of variables in each group. Only used if implementation=GroupPLS or implementation=SparseGroupPLS.
LambdaX: matrix of parameters controlling the number of selected variables (for sparse PCA/PLS) or groups (for group and sparse group PLS) in X.
LambdaY: matrix of parameters controlling the number of selected variables (for sparse PLS) or groups (for group or sparse group PLS) in Y. Only used if family="gaussian".
AlphaX: matrix of parameters controlling the level of sparsity within groups in X. Only used if implementation=SparseGroupPLS.
AlphaY: matrix of parameters controlling the level of sparsity within groups in Y. Only used if implementation=SparseGroupPLS and family="gaussian".
ncomp: number of components.
scale: logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one).
pi_list: vector of thresholds in selection proportions. If n_cat=NULL or n_cat=2, these values must be >0 and <1. If n_cat=3, these values must be >0.5 and <1.
K: number of resampling iterations.
tau: subsample size, expressed as a proportion of the observations. Only used if resampling="subsampling" and cpss=FALSE.
seed: value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).
n_cat: computation options for the stability score. Default is NULL to use the score based on a z test. Other possible values are 2 or 3 to use the score based on the negative log-likelihood.
family: type of PLS model. This parameter must be set to family="gaussian" for continuous outcomes, or to family="binomial" for categorical outcomes. Only used if ydata is provided.
implementation: function to use for feature selection. Possible functions are: SparsePCA, SparsePLS, GroupPLS, SparseGroupPLS.
resampling: resampling approach. Possible values are: "subsampling" for sampling without replacement of a proportion tau of the observations, or "bootstrap" for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.
cpss: logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered as selected if it is selected for both subsamples. With this method, the data is split K/2 times (K models are fitted). Only used if PFER_method="MB".
PFER_method: method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If PFER_method="MB", the method proposed by Meinshausen and Bühlmann (2010) is used. If PFER_method="SS", the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.
PFER_thr: threshold in PFER for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
FDP_thr: threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
n_cores: number of cores to use for parallel computing (see argument workers in multisession). Using n_cores>1 is only supported with optimisation="grid_search".
output_data: logical indicating if the input datasets xdata and ydata should be included in the output.
verbose: logical indicating if a loading bar and messages should be printed.
beep: sound indicating the end of the run. Possible values are: NULL (no sound) or an integer between 1 and 11 (see argument sound in beep).
...: additional parameters passed to the functions provided in implementation or resampling.
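As an illustration of the function form of the resampling argument, here is a minimal base R sketch of a custom resampling function. The name my_resampling and the toy data are hypothetical. The only requirements taken from the description above are the argument names data and tau and the returned observation IDs. How the function is called internally is not shown here.

```r
# Hypothetical custom resampling function (illustration only).
# It uses arguments named `data` and `tau` and returns the IDs (row indices)
# of the observations to include in the resampled dataset.
my_resampling <- function(data, tau) {
  n <- NROW(data) # number of observations (works for vectors and matrices)
  sort(sample.int(n, size = floor(tau * n), replace = FALSE))
}

# Toy check on simulated data
set.seed(1)
toy <- matrix(rnorm(40), nrow = 20)
ids <- my_resampling(data = toy, tau = 0.5)
print(ids)
```

Under these assumptions, such a function could be supplied as resampling = my_resampling, with any extra arguments it needs passed through the ... argument.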
In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (LambdaX, LambdaY, AlphaX, and/or AlphaY). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions at or above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) (denoted by \(\lambda\)) for the underlying algorithm, and the threshold in selection proportion (denoted by \(\pi\)). Writing \(p_{\lambda}(j)\) for the selection proportion of feature \(j\), the set of stably selected features is:
\(V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}\)
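The definition of the stable set can be sketched in base R as follows. This is a toy illustration only: the selection indicators are simulated from hypothetical probabilities rather than obtained from a fitted model, and all object names are illustrative.

```r
# Toy illustration with simulated selection indicators (not real model output)
set.seed(1)
K <- 100 # number of resampling iterations
p <- 6   # number of features
# Hypothetical probabilities of each feature being selected in one iteration
prob <- c(0.95, 0.9, 0.6, 0.3, 0.1, 0.05)

# K x p binary matrix: entry (k, j) is 1 if feature j is selected at iteration k
selected_k <- matrix(
  rbinom(K * p, size = 1, prob = rep(prob, each = K)),
  nrow = K, ncol = p,
  dimnames = list(NULL, paste0("var", seq_len(p)))
)

# Selection proportion p_lambda(j) of each feature across the K iterations
selprop <- colMeans(selected_k)

# Stable set: features with selection proportion at or above the threshold pi
pi_thr <- 0.8
stable <- names(selprop)[selprop >= pi_thr]
print(selprop)
print(stable)
```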
For sparse and sparse group dimensionality reduction, "feature" refers to variable (variable selection model). For group PLS, "feature" refers to group (group selection model). For (sparse) group PLS, groups need to be defined a priori and specified in arguments group_x and/or group_y.
These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to check that the grids of sparsity parameters (e.g. LambdaX) and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid of sparsity parameters may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performances. This is not necessarily the highest peak (which is automatically retained by the functions in this package). The user can decide to manually choose another peak.
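The edge check described above can also be done numerically. The following base R sketch uses a hypothetical matrix of stability scores, with rows indexing the sparsity parameter values and columns indexing the thresholds. This layout and all object names are assumptions made for illustration, not the structure returned by the package.

```r
# Hypothetical grids and stability scores (illustration only)
lambda_grid <- 1:5
pi_grid <- c(0.6, 0.7, 0.8, 0.9)
# Toy scores that increase with the sparsity parameter, so the maximum
# falls on the last value of the grid
score <- outer(lambda_grid, pi_grid, function(l, p) l * p)

# Location of the maximum score (first one in case of ties)
best <- which(score == max(score, na.rm = TRUE), arr.ind = TRUE)[1, ]

# TRUE if the maximum is reached at the first or last sparsity parameter value,
# which suggests that the grid should be extended
on_edge <- best["row"] %in% c(1, nrow(score))
print(best)
print(on_edge)
```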
To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
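To make the constraint concrete, the sketch below evaluates the upper bound of Meinshausen and Bühlmann (2010), PFER <= q^2 / ((2 * pi - 1) * p), which holds for thresholds pi > 0.5, where q is the average number of features selected by the underlying algorithm and p is the total number of features. The function name pfer_mb and the numeric values are hypothetical, and the package's internal computation may differ in its details.

```r
# Hypothetical helper (not a package function): Meinshausen and Buhlmann (2010)
# upper bound on the PFER, valid for thresholds strictly above 0.5
pfer_mb <- function(q, p, pi_thr) {
  stopifnot(all(pi_thr > 0.5), all(pi_thr <= 1))
  q^2 / ((2 * pi_thr - 1) * p)
}

q <- 10  # average number of selected features per model (toy value)
p <- 100 # total number of features (toy value)
pi_grid <- c(0.6, 0.75, 0.9)
bounds <- pfer_mb(q = q, p = p, pi_thr = pi_grid)

# Constrained calibration: keep only thresholds with a bound below PFER_thr
PFER_thr <- 3
admissible <- pi_grid[bounds < PFER_thr]
print(bounds)
print(admissible)
```

With these toy values the bound decreases as the threshold increases, so a stricter PFER_thr excludes the lower thresholds.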
Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
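The three procedures can be sketched for a single resampling iteration in base R. This is an illustration with hypothetical object names and toy selection indicators, not the package's internal implementation.

```r
set.seed(1)
n <- 20    # number of observations
tau <- 0.5 # subsample proportion

# (i) Subsampling: a proportion tau of the observations, without replacement
ids_sub <- sample(n, size = floor(tau * n), replace = FALSE)

# (ii) Bootstrap: n observations drawn with replacement
ids_boot <- sample(n, size = n, replace = TRUE)

# (iii) Complementary pair: two non-overlapping subsets of half of the observations
ids_half1 <- sample(n, size = floor(n / 2), replace = FALSE)
ids_half2 <- sample(setdiff(seq_len(n), ids_half1), size = floor(n / 2))

# With complementary pairs, a feature counts as selected at this iteration
# only if it is selected in both halves (toy indicators for three features)
sel_half1 <- c(TRUE, TRUE, FALSE)
sel_half2 <- c(TRUE, FALSE, FALSE)
sel_pair <- sel_half1 & sel_half2
print(sel_pair)
```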
For categorical outcomes (argument family is "binomial" or "multinomial"), the proportions of observations from each category in all subsamples or bootstrap samples are the same as in the full sample.
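A minimal base R sketch of such a stratified subsample is given below. The data and object names are hypothetical, and the category sizes are chosen so that the proportions are preserved exactly.

```r
set.seed(1)
# Toy categorical outcome with three categories (50, 30 and 20 observations)
y <- factor(rep(c("a", "b", "c"), times = c(50, 30, 20)))
tau <- 0.5

# Within each category, draw a proportion tau of the observations
ids <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(tau * length(idx)))]
}))

# Category proportions in the full sample and in the subsample
prop_full <- table(y) / length(y)
prop_sub <- table(y[ids]) / length(ids)
print(prop_full)
print(prop_sub)
```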
To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession.
ourstabilityselectionsharp
stabilityselectionSSsharp
stabilityselectionMBsharp
sparsegroupPLSsharp
sparsePLSsharp
sparsePCASVDsharp
sparsePCAsharp
SparsePCA, SparsePLS, GroupPLS, SparseGroupPLS, VariableSelection, Resample, StabilityScore
Other stability functions: Clustering(), GraphicalModel(), StructuralModel(), VariableSelection()
if (requireNamespace("sgPLS", quietly = TRUE)) {
  oldpar <- par(no.readonly = TRUE)
  par(mar = c(12, 5, 1, 1))

  ## Sparse Principal Component Analysis

  # Data simulation
  set.seed(1)
  simul <- SimulateComponents(pk = c(5, 3, 4))

  # sPCA: sparsity on X (unsupervised)
  stab <- BiSelection(
    xdata = simul$data,
    ncomp = 2,
    LambdaX = seq_len(ncol(simul$data) - 1),
    implementation = SparsePCA
  )
  print(stab)

  # Calibration plot
  CalibrationPlot(stab)

  # Visualisation of the results
  summary(stab)
  plot(stab)
  SelectedVariables(stab)

  ## Sparse (Group) Partial Least Squares

  # Data simulation (continuous outcomes)
  set.seed(1)
  simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian")
  x <- simul$xdata
  y <- simul$ydata

  # sPLS: sparsity on X
  stab <- BiSelection(
    xdata = x, ydata = y,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(x) - 1),
    implementation = SparsePLS
  )
  CalibrationPlot(stab)
  summary(stab)
  plot(stab)

  # sPLS: sparsity on both X and Y
  stab <- BiSelection(
    xdata = x, ydata = y,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(x) - 1),
    LambdaY = seq_len(ncol(y) - 1),
    implementation = SparsePLS,
    n_cat = 2
  )
  CalibrationPlot(stab)
  summary(stab)
  plot(stab)

  # sgPLS: sparsity on X
  stab <- BiSelection(
    xdata = x, ydata = y, K = 10,
    group_x = c(2, 8, 5),
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1),
    implementation = SparseGroupPLS
  )
  CalibrationPlot(stab)
  summary(stab)

  par(oldpar)
}