edrSelec: Variable selection based on sliced inverse regression

Description

Gathers several procedures for determining which explanatory variables have an effect on a dependent variable. The procedures apply whether or not the number of explanatory variables exceeds the number of observations. Returns an object of class edrSelec.

Usage

edrSelec(Y, X, H, K, method, pZero=NULL, NZero=NULL, zeta=NULL,
 rho=NULL, baseEst=NULL, btspSamp=NULL, lassoParam=NULL)

Arguments

Y

A numeric vector representing the dependent variable (a response vector).

X

A matrix representing the quantitative explanatory variables (one variable per column).

H

When method="SR-SIR" or method="RSIR", the chosen number of slices. When method="CSS", a vector with various numbers of slices.

K

The chosen dimension K.

method

This character string specifies the selection method. It should be either "CSS", "RSIR" or "SR-SIR".

pZero

When method="CSS", the number of variables to pick when creating a submodel.

NZero

When method="CSS", the number of submodels to create.

zeta

When method="CSS", the proportion of 'best' submodels selected from the NZero submodels.

rho

When method="CSS" and zeta is not provided, the threshold above which a submodel is considered 'best'. It must be a real number in the open interval ]0,1[.

baseEst

An initial estimate of the EDR space on which each method relies.

btspSamp

When method="RSIR", the bootstrap sample size for estimating the asymptotic distribution of the estimated EDR directions.

lassoParam

When method="SR-SIR", a vector of lasso parameters from which the optimal one is chosen, using the RIC criterion.
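
The two ways of defining the 'best' submodels for the "CSS" method (a proportion zeta, or a threshold rho) can be illustrated with a small base R sketch. It does not call edrSelec: the squared correlations in sqCor are made-up values, the variable names are hypothetical, and the rounding rule used to turn zeta into a number of submodels is an assumption rather than the package's documented behaviour.

```r
# Hypothetical squared correlations for NZero = 10 submodels (made-up values).
sqCor <- c(0.12, 0.85, 0.40, 0.91, 0.33, 0.78, 0.05, 0.66, 0.95, 0.50)
NZero <- length(sqCor)

# Rule 1 (zeta): keep the proportion zeta of submodels with the
# highest squared correlation.
zeta <- 0.5
nTop <- ceiling(zeta * NZero)   # rounding rule is an assumption
bestByZeta <- sort(order(sqCor, decreasing = TRUE)[seq_len(nTop)])

# Rule 2 (rho): keep every submodel whose squared correlation exceeds rho.
rho <- 0.8
bestByRho <- which(sqCor > rho)

print(bestByZeta)
print(bestByRho)
```

With zeta the number of retained submodels is fixed in advance, whereas with rho it depends on how many submodels reach the threshold.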

Value

edrSelec returns an object of class edrSelec, with some of the following attributes, depending on the value of method:

scoreVar

A numeric vector containing a score for each explanatory variable. Variables with a high score should be kept. For the "CSS" method, the score measures the presence of the variable in the 'best' submodels. For "RSIR", it is one minus the p-value of the test. For the "SR-SIR" procedure, it is a boolean that indicates whether the variable should be kept when using the optimal lasso parameter.

K

The chosen dimension.

H

The chosen number(s) of slices.

n

The sample size.

method

The variable selection method used.

X

The matrix of the quantitative explanatory variables (one variable per column).

Y

The numeric vector of the dependent variable (a response vector).

matModels

An NZero x pZero matrix that contains the variables of each created submodel, for the "CSS" method.

matModTop

A matrix with pZero columns made of the variables of each 'best' submodel, for the "CSS" method.

vectSqCor

A vector containing the squared correlation between indices for each submodel, for the "CSS" method.

aic

A vector made of values of the Akaike information criterion for every lasso parameter considered by the "SR-SIR" procedure.

bic

A vector made of values of the Bayesian information criterion for every lasso parameter considered by the "SR-SIR" procedure.

ric

A vector made of values of the residual information criterion for every lasso parameter considered by the "SR-SIR" procedure.

matEDR

A list which gives, for each lasso parameter studied with the "SR-SIR" procedure, a matrix spanning the estimated EDR space.
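
Since scoreVar has a different meaning for each method, the following base R sketch shows one possible way of turning it into a set of selected variables. The three score vectors are made-up values, not output of edrSelec, and the cutoffs (half of the maximum count for "CSS", a 5% level for "RSIR") are arbitrary assumptions chosen for illustration.

```r
# Hypothetical scoreVar vectors for p = 5 explanatory variables (made-up values).
scoreCSS   <- c(48, 51, 3, 2, 45)                 # appearances in 'best' submodels
scoreRSIR  <- c(0.999, 0.97, 0.40, 0.10, 0.96)    # one minus the p-value
scoreSRSIR <- c(TRUE, TRUE, FALSE, FALSE, TRUE)   # kept with the optimal lasso parameter

# "CSS": keep variables that appear often; the cutoff is an arbitrary assumption.
keepCSS <- which(scoreCSS >= 0.5 * max(scoreCSS))

# "RSIR": a score above 1 - alpha corresponds to a p-value below alpha.
alpha <- 0.05
keepRSIR <- which(scoreRSIR > 1 - alpha)

# "SR-SIR": the score is already a keep/drop indicator.
keepSRSIR <- which(scoreSRSIR)

print(keepCSS)
print(keepRSIR)
print(keepSRSIR)
```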

Details

The "CSS" method builds NZero submodels, each using only pZero explanatory variables, and estimates the indices for each of them. It then computes the squared correlation between these indices and those found with the whole set of explanatory variables. Only the submodels with the highest squared correlation are kept. Finally, the method counts how many times each explanatory variable appears in these 'best' submodels.

The "RSIR" procedure uses an asymptotic test on each element of the estimated EDR directions. It was translated from Matlab code written by Peng Zeng.

The "SR-SIR" procedure relies on a lasso penalty. The underlying parameter is chosen using the residual information criterion (RIC). It was adapted from R code written by Lexin Li.
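
The steps of the "CSS" method described above can be sketched in base R as follows. This is not the package's implementation: sirDir is a hypothetical helper implementing plain sliced inverse regression, the simulation uses more observations than variables so that plain SIR can serve as the estimate on the whole set of variables (the package relies on baseEst for this), and a single number of slices is used instead of a vector. Names such as matModels, matModTop and vectSqCor merely mirror the components documented under Value.

```r
set.seed(1)

# Hypothetical helper: plain SIR, returns the first K estimated EDR directions.
sirDir <- function(Y, X, H, K) {
  n <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  R <- chol(crossprod(Xc) / n)            # covariance of X equals t(R) %*% R
  # Slice the ordered response into H groups of near-equal size
  slice <- cut(rank(Y, ties.method = "first"), breaks = H, labels = FALSE)
  # Weighted covariance matrix of the slice means
  M <- matrix(0, ncol(X), ncol(X))
  for (h in seq_len(H)) {
    idx <- slice == h
    m <- colMeans(Xc[idx, , drop = FALSE])
    M <- M + (sum(idx) / n) * tcrossprod(m)
  }
  # Symmetric form of the eigenproblem: t(R)^{-1} M R^{-1}
  A <- backsolve(R, t(backsolve(R, M, transpose = TRUE)), transpose = TRUE)
  V <- eigen((A + t(A)) / 2, symmetric = TRUE)$vectors[, seq_len(K), drop = FALSE]
  backsolve(R, V)                         # map back to the original scale
}

# Simulated data: only the first four variables act on the response.
n <- 200; p <- 20; K <- 1; H <- 5
NZero <- 500; pZero <- 5; zeta <- 0.1
beta <- c(1, 1, 1, 1, rep(0, p - 4))
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector((X %*% beta)^3 + rnorm(n, sd = 10))

# Index estimated with the whole set of explanatory variables
idxFull <- X %*% sirDir(Y, X, H, K)

# NZero random submodels of pZero variables each (one submodel per row)
matModels <- t(replicate(NZero, sort(sample(p, pZero))))

# Squared correlation between each submodel's index and the full index
vectSqCor <- apply(matModels, 1, function(vars) {
  Xs <- X[, vars, drop = FALSE]
  as.numeric(cor(Xs %*% sirDir(Y, Xs, H, K), idxFull))^2
})

# Keep the proportion zeta of 'best' submodels, then count variable appearances
nTop <- ceiling(zeta * NZero)
top <- order(vectSqCor, decreasing = TRUE)[seq_len(nTop)]
matModTop <- matModels[top, , drop = FALSE]
scoreVar <- tabulate(matModTop, nbins = p)
print(scoreVar)
```

In this sketch the variables that drive the response tend to appear more often in the retained submodels, which is what the score is meant to capture.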

References

Coudret, R., Liquet, B. and Saracco, J. Comparison of sliced inverse regression approaches for underdetermined cases. Journal de la Société Française de Statistique, in press.

Li, L. and Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64(1):124-131.

Zhong, W., Zeng, P., Ma, P., Liu, J. S., and Zhu, Y. (2005). RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics, 21(22):4169-4175.

Examples

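# NOT RUN {
library(edrGraphicalTools)  # provides edrSelec
library(mvtnorm)            # provides rmvnorm

n <- 100        # sample size
p <- 110        # number of explanatory variables (p > n)
K <- 1          # chosen dimension
H <- 5:12       # numbers of slices considered by "CSS"
NZero <- 1000   # number of submodels
pZero <- 10     # number of variables per submodel
zeta <- 0.1     # proportion of 'best' submodels
# Only the first four variables act on the response
beta <- c(1, 1, 1, 1, rep(0, p - 4))
U <- matrix(runif(p^2, -0.05, 0.05), ncol = p)
X <- rmvnorm(n, sigma = diag(p) + U %*% t(U))
eps <- rnorm(n, sd = 10)
Y <- (X %*% beta)^3 + eps
result <- edrSelec(Y, X, H, K, "CSS", NZero = NZero, pZero = pZero, zeta = zeta)
summary(result)
plot(result)
# }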
