smle_select: Elaborative feature selection with SMLE

Description

Given a response and a set of K features, this function first runs SMLE (fast=TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series according to a selection criterion. When the EBIC criterion is used, users can choose to repeat the selection with different values of the tuning parameter, \(\gamma\), and conduct importance voting for each feature.
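The selection step can be illustrated with a minimal base-R sketch. It assumes the common EBIC form from Chen and Chen, \(-2\log L + k\log n + 2\gamma\log\binom{p}{k}\); the exact penalty used inside the package may differ. The helper ebic_sketch is hypothetical, and the candidate series is built by marginal-correlation ranking purely as a stand-in for the sub-models that SMLE would generate.

```r
# Hypothetical helper (not part of the SMLE package): EBIC-type criterion.
# fit: fitted model with a logLik() method; p: number of candidate features;
# gamma: EBIC tuning parameter in [0, 1]. gamma = 0 gives a BIC-type penalty.
ebic_sketch <- function(fit, p, gamma = 0.5) {
  k <- length(coef(fit)) - 1            # number of features, excluding intercept
  n <- nobs(fit)
  -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * gamma * lchoose(p, k)
}

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("V", seq_len(p))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # only V1 and V2 carry signal

# Stand-in for the SMLE sub-model series: rank features by |marginal correlation|
ord <- order(abs(cor(X, y)), decreasing = TRUE)

# Evaluate the criterion for every sparsity level from k_min to k_max
k_min <- 1; k_max <- 10
crit <- sapply(k_min:k_max, function(k) {
  dat <- data.frame(y = y, X[, ord[seq_len(k)], drop = FALSE])
  ebic_sketch(lm(y ~ ., data = dat), p = p, gamma = 0.5)
})

# The selected model is the one with the smallest criterion value
best_k <- (k_min:k_max)[which.min(crit)]
selected <- sort(ord[seq_len(best_k)])
```

Here crit plays the role of Criterion_value and selected the role of ID_Selected, under the assumptions stated above.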

Usage

smle_select(x, ...)
# S3 method for smle
smle_select(x, ...)
# S3 method for sdata
smle_select(
  x,
  k_min = 1,
  k_max = 10,
  sub_model = NULL,
  gamma_ebic = 0.5,
  vote = FALSE,
  tune = c("ebic", "aic", "bic"),
  codingtype = NULL,
  gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = NULL,
  para = FALSE,
  num_cores = NULL,
  ...
)
# S3 method for default
smle_select(x, X = NULL, family = "gaussian", ...)

Arguments

x

Object of class 'smle' or 'sdata'. Users can also input a response vector and a feature matrix. See Examples.

...

Other parameters.

k_min

The lower bound of candidate model sparsity. Default is 1.

k_max

The upper bound of candidate model sparsity. Default is the same as the number of columns in the input.

sub_model

An index vector indicating which features (columns of the feature matrix) are to be selected. Not applicable if a 'smle' object is the input.

gamma_ebic

The EBIC parameter in \([0 , 1]\). Default is 0.5.

vote

Logical flag indicating whether to perform the voting procedure. Only available when tune = 'ebic'.

tune

Selection criterion: one of 'ebic', 'aic', or 'bic'. Default is 'ebic'.

codingtype

Coding types for categorical features; see SMLE for details.

gamma_seq

The sequence of values for gamma_ebic when vote = TRUE.

vote_threshold

A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold.

para

Logical flag to use parallel computing to do voting selection. Default is FALSE. See Details.

num_cores

The number of cores to use. The default will be all cores detected.

X

Input feature matrix. Used only when the response vector and feature matrix are supplied directly by the user.

family

Model assumption; see SMLE. Default is Gaussian linear.

When input is 'smle' or 'sdata', the same model will be used in the selection.

Value

Returns a 'selection' object with

ID_Selected

A list of selected features.

Coef_Selected

Fitted model coefficients based on the selected features.

Criterion_value

Values of selection criterion for the candidate models with various sparsity.

ID_Voted

A list of voting selection results; returned only when vote = TRUE.

Details

This function accepts three types of input for GLM data:

1. A 'smle' object, as output by SMLE.
2. A 'sdata' object, as output by Gen_Data.
3. A response vector and feature matrix supplied directly by the user.

Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly on ultra-high-dimensional data without screening.
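The voting procedure available with tune = 'ebic' can be sketched in base R as follows. This is one plausible reading of the argument descriptions (gamma_seq, vote_threshold), not the package's implementation: the selection is repeated for each value in gamma_seq, and a feature is kept when the fraction of runs that selected it reaches the threshold. The function select_for_gamma, the threshold value 0.6, and the marginal-correlation ranking used in place of SMLE sub-models are all assumptions made for illustration.

```r
set.seed(2)
n <- 120; p <- 15
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # only features 1 and 2 carry signal

# Stand-in candidate series: features ranked by |marginal correlation|
ord <- order(abs(cor(X, y)), decreasing = TRUE)

# Hypothetical helper: indices of the features in the sub-model that
# minimises an EBIC-type criterion for a single value of gamma
select_for_gamma <- function(gamma, k_min = 1, k_max = 10) {
  ks <- k_min:k_max
  crit <- sapply(ks, function(k) {
    fit <- lm(y ~ X[, ord[seq_len(k)], drop = FALSE])
    -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * gamma * lchoose(p, k)
  })
  ord[seq_len(ks[which.min(crit)])]
}

# Repeat the selection over the sequence of gamma values
gamma_seq <- seq(0, 1, 0.2)
picks <- lapply(gamma_seq, select_for_gamma)

# Vote share: fraction of gamma values under which each feature was selected
votes <- tabulate(unlist(picks), nbins = p) / length(gamma_seq)

vote_threshold <- 0.6   # assumed here to be a fraction rather than a percentage
id_voted <- which(votes >= vote_threshold)
```

Larger values of gamma penalise model size more heavily, so features that survive across the whole sequence are the ones least sensitive to the choice of tuning parameter.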

References

Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-P sparse GLM." Statistica Sinica: 555-574.

Examples

# NOT RUN {
# A simple example under the Gaussian assumption.
# Simulate data, screen down to 20 features with SMLE, then run the selection.
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Data$Y, Data$X, k = 20, family = "gaussian")
E <- smle_select(fit)
plot(E)
# }
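The input types described in Details can be exercised roughly as follows. This sketch uses only the call forms shown under Usage and assumes the SMLE package is installed (the code is skipped otherwise). Passing the first 20 columns of Data$X in the direct-input call is an arbitrary choice made to keep the feature matrix small, not a recommended way to screen.

```r
# Runs only if the SMLE package is available.
if (requireNamespace("SMLE", quietly = TRUE)) {
  library(SMLE)
  set.seed(1)
  Data <- Gen_Data(correlation = "MA", family = "gaussian")

  # 1. 'smle' object: screen first, then select among the retained features.
  fit <- SMLE(Data$Y, Data$X, k = 20, family = "gaussian")
  sel_smle <- smle_select(fit)

  # 2. 'sdata' object: smle_select(Data) would consider every column of
  #    Data$X, which Details advises against for ultra-high-dimensional data.

  # 3. Direct input: response vector as x, feature matrix as X.
  #    The 20-column subset is illustrative only.
  sel_direct <- smle_select(Data$Y, X = Data$X[, 1:20], family = "gaussian")
}
```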
