smle_select: Elaborative feature selection with SMLE

Description

Given a response and a set of K features, this function first runs SMLE (fast=TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion. When the EBIC criterion is used, users can choose to repeat the selection with different values of the tuning parameter gamma and conduct importance voting for each feature.
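For reference, the EBIC of Chen and Chen (2008) for a sub-model s can be written as below. This is the published definition, not a transcription of the package internals; the symbols (n for sample size, the log-likelihood notation, and |s| for the number of features in s) are notation assumed here, and the exact form computed by smle_select may differ in constants or scaling.

```latex
\mathrm{EBIC}_{\gamma}(s)
  = -2\,\ell_n\bigl(\hat{\beta}_s\bigr)
  + |s|\,\log n
  + 2\gamma\,\log\binom{K}{|s|},
\qquad 0 \le \gamma \le 1 .
```

With gamma = 0 the last term vanishes and the criterion reduces to the ordinary BIC; larger gamma penalizes larger models more heavily when K is large.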

Usage

smle_select(x, ...)
# S3 method for smle
smle_select(x, ...)
# S3 method for sdata
smle_select(x, k_min = 1, k_max = 10,
  sub_model = NULL, gamma_ebic = 0.5, vote = FALSE,
  tune = c("ebic", "aic", "bic"), gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = NULL, para = FALSE, num_cores = NULL, ...)
# S3 method for default
smle_select(x, X = NULL, family = "gaussian", ...)

Arguments

x

Object of class "smle" or "sdata", or the response Y when the data pair (Y, X) is supplied directly.

...

Other parameters.

k_min

The lower bound of target model sparsity. Default is 1.

k_max

The upper bound of target model sparsity. Default is the same as the number of columns in the input.

sub_model

A subset of columns indicating which columns are eligible for selection. (Only for objects of class "sdata" and (Y, X) pairs.)

gamma_ebic

Parameter for the Extended Bayesian Information Criterion (EBIC). Must be a value in (0, 1). Default is 0.5.

vote

Logical flag for whether to perform the voting procedure. Only available when tune = 'ebic'. Default is FALSE.

tune

Selection criterion; must be one of 'aic', 'bic', or 'ebic'. Default is 'ebic'.

gamma_seq

The sequence of values for gamma_ebic when vote = TRUE. Default is seq(0, 1, 0.2).

vote_threshold

A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold.

para

Logical flag for whether to use parallel computing for the voting selection. Default is FALSE. See Details.

num_cores

The number of cores to use. The default will be all cores detected.

X

Input features matrix.

family

Response type (see SMLE); default is gaussian. When the input object is of class "smle" or "sdata", the same model will be used in the selection step.
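The interplay of vote, gamma_seq and vote_threshold can be illustrated with a small base R sketch. It does not call the package: the selected feature sets are made-up data, and the rule used (a feature is kept when the share of gamma values that retained it reaches the threshold) is an assumption about how a relative voting threshold might be applied, not a description of the actual implementation.

```r
# Hypothetical input: one vector of retained feature IDs per gamma value.
gamma_seq <- seq(0, 1, 0.2)
selected <- list(
  c(1, 3, 5, 7, 9),
  c(1, 3, 5, 7),
  c(1, 3, 5, 7),
  c(1, 3, 5),
  c(1, 3, 5),
  c(1, 3)
)

# Each selection casts one vote for every feature it retains.
votes <- table(unlist(selected))

# Relative vote share: fraction of gamma values that retained each feature.
vote_share <- as.vector(votes) / length(gamma_seq)

# Keep features whose share reaches the (assumed) threshold.
vote_threshold <- 0.6
voted_ids <- as.integer(names(votes)[vote_share >= vote_threshold])
print(voted_ids)
```

In this toy data, features retained across most gamma values survive the vote, while those picked only under a weak penalty are dropped.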

Value

Returns a "selection" object with

Retained_Feature_IDs

A list of the selected variable IDs.

Coeffients_of_Retained_Features

A list of coefficients for the selected features, fit by glmnet.

Criterion_value

A list of criterion values, computed under the selected criterion for each model sparsity.

Voting_Retained_Feature_IDs

A list of voting selection results; returned only when vote = TRUE.

Details

Three types of input are allowed: an object of class "smle", the output of the main function SMLE; an object of class "sdata", the output of Gen_Data; or a data pair supplied directly as Y, X. Using an object of class "sdata" or the data matrices X, Y directly is not recommended for high-dimensional data.

References

Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-P sparse GLM." Statistica Sinica: 555-574.

Chen, J. and Chen, Z. (2008). "Extended Bayesian information criteria for model selection with large model spaces." Biometrika 95.3: 759-771.

Chen, Z. and Chen, J. (2009). "Tournament screening cum EBIC for feature selection with high-dimensional feature spaces." Science in China Series A: Mathematics 52.6: 1327-1341.

Examples

# NOT RUN {
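# This is a simple example under the Gaussian assumption.
# Generate data, screen down to k = 20 features, then select the final model.
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Data$Y, Data$X, k = 20, family = "gaussian")
E <- smle_select(fit)
plot(E)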
# }