SelectV: Variable Selection for High-Dimensional Discriminant Analysis.

Description

Selects variables to be used in a Discriminant Analysis classification rule.

Usage

SelectV(data, grouping, Selmethod=c("ExpHC","HC","Fdr","Fair","fixedp"),
        NullDist=c("locfdr","Theoretical"), uselocfdr=c("onlyHC","always"),
        minlocfdrp=200, comvar=TRUE, Fdralpha=0.5, ExpHCalpha=0.5, HCalpha0=0.1,
        maxp=ncol(data), tol=1E-12, ...)

Arguments

data

Matrix or data frame of observations.

grouping

Factor specifying the class for each observation.

Selmethod

The method used to choose the number of variables selected. Current alternatives are:

ExpHC (default) for the Expanded Higher Criticism scheme of Duarte Silva (2011)

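HC for the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008)

Fdr for False Discovery Rate control as suggested by Benjamini and Hochberg (1995)

Fair for the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008)

fixedp for a number of selected variables fixed at a pre-defined constant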

NullDist

The Null distribution used to compute p-values from t-scores or F-scores. Current alternatives are Theoretical for the corresponding theoretical distributions, and locfdr for an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).

uselocfdr

Flag indicating the statistics for which the Null empirical distribution estimated by the locfdr approach should be used. Current alternatives are onlyHC (default) and always.

minlocfdrp

Minimum number of variables required to estimate empirical Null distributions by the locfdr method. When the number of variables is below minlocfdrp, theoretical Nulls are always employed.

comvar

Boolean flag indicating if a common group variance is to be assumed (default) in the computation of the t-scores used for problems with two groups.

Fdralpha

Control level for variable selection based on False Discovery Rate Control (see Benjamini and Hochberg (1995)).

ExpHCalpha

Control level for the first step of the Expanded Higher Criticism scheme (see Duarte Silva (2011)).

HCalpha0

Proportion of p-values used to compute the HC statistic (see Donoho and Jin (2004, 2008)).

maxp

Maximum number of variables to be used in the discriminant rule.

tol

Numerical precision for distinguishing p-values from 0 and 1. Computed p-values below tol are set to tol, and those above 1-tol are set to 1-tol.

...

Arguments passed from other methods.
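As a rough illustration of how comvar and tol could act on two-group p-values, the sketch below computes per-variable t-test p-values with base R and then clamps them to [tol, 1-tol]. The helper two_group_pvalues and the simulated data are assumptions made for this example; this is not SelectV's internal code, only the Theoretical Null is covered, and the locfdr empirical Null is not reproduced.

```r
# Illustrative sketch on simulated data; not SelectV's internal code.
set.seed(1)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), n, p)              # 40 observations, 6 variables
g <- factor(rep(c("a", "b"), each = n / 2))  # two groups of 20
X[g == "b", 1] <- X[g == "b", 1] + 3         # make variable 1 informative

# Hypothetical helper: theoretical p-values from two-sample t-scores.
two_group_pvalues <- function(X, g, comvar = TRUE, tol = 1e-12) {
  lev <- levels(g)
  pv <- apply(X, 2, function(x) {
    # var.equal = TRUE pools the group variances; FALSE gives the Welch test
    t.test(x[g == lev[1]], x[g == lev[2]], var.equal = comvar)$p.value
  })
  pmin(pmax(pv, tol), 1 - tol)               # keep p-values inside [tol, 1 - tol]
}

pv_pooled <- two_group_pvalues(X, g, comvar = TRUE)
pv_welch  <- two_group_pvalues(X, g, comvar = FALSE)
print(round(cbind(pooled = pv_pooled, welch = pv_welch), 4))
```

With equal group sizes, as here, the pooled and Welch p-values are usually close; they can differ more when group sizes and variances are unbalanced.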

Value

A list with two components:

nvkpt

The number of variables to be used in the Discriminant rule.

vkptInd

The indices of the variables to be used in the Discriminant rule.
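The returned indices can be used to subset the columns of the data before fitting a classifier. The sketch below uses a hand-made list with the same two component names; the values in sel and the matrix X are invented for illustration and do not come from an actual SelectV call.

```r
# Hypothetical stand-in for a SelectV result (values invented for illustration).
sel <- list(nvkpt = 3L, vkptInd = c(7L, 2L, 5L))

set.seed(1)
X <- matrix(rnorm(10 * 8), 10, 8, dimnames = list(NULL, paste0("V", 1:8)))

# Keep only the selected variables; drop = FALSE preserves the matrix shape
# even when a single variable is selected.
X_reduced <- X[, sel$vkptInd, drop = FALSE]
print(colnames(X_reduced))
```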

Details

The function SelectV selects variables to be used in a Discriminant classification rule by the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008), the Expanded Higher Criticism scheme proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by Benjamini and Hochberg (1995), the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008), or simply by fixing the number of selected variables to some pre-defined constant.
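To make the Fdr option more concrete, the following sketch applies the Benjamini-Hochberg step-up rule to a vector of p-values using base R's p.adjust. The helper bh_select and the example p-values are assumptions for illustration; the package's own implementation may differ in its details.

```r
# Hypothetical helper: indices of variables kept by Benjamini-Hochberg
# control at level alpha (not SelectV's internal code).
bh_select <- function(pvalues, alpha = 0.5) {
  adj <- p.adjust(pvalues, method = "BH")  # BH-adjusted p-values
  which(adj <= alpha)
}

pv <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.60, 0.74, 0.90)  # made-up p-values
print(bh_select(pv, alpha = 0.05))
print(bh_select(pv, alpha = 0.10))
```

A larger control level keeps more variables, which is why a fairly permissive level can be reasonable when the goal is classification rather than formal testing.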

The Fdr method is, by default, based on simple p-values derived from t-scores (problems with two groups) or ANOVA F-scores (problems with more than two groups). When the argument NullDist is set to Theoretical, these p-values are also used in the HC method. Otherwise, the HC p-values are derived from an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).
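A minimal sketch of Higher Criticism thresholding, given a vector of p-values, is shown below. It assumes one common form of the HC objective, standardized by the sorted p-values themselves; other variants standardize by i/p instead, and the form used inside the package is not stated here. The helper hc_select is hypothetical, with alpha0 playing the role of HCalpha0 and tol the role of the tol argument.

```r
# Hypothetical helper: number of smallest p-values kept by HC thresholding.
# Assumed objective: HC(i) = sqrt(p) * (i/p - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
# maximized over the first alpha0 * p ordered p-values.
hc_select <- function(pvalues, alpha0 = 0.1, tol = 1e-12) {
  p  <- length(pvalues)
  ps <- sort(pmin(pmax(pvalues, tol), 1 - tol))  # clamp away from 0 and 1, then sort
  i  <- seq_len(p)
  hc <- sqrt(p) * (i / p - ps) / sqrt(ps * (1 - ps))
  kmax <- max(1, floor(alpha0 * p))              # search only the smallest p-values
  which.max(hc[seq_len(kmax)])
}

# Made-up example: 5 very small p-values among 100.
pv <- c(rep(1e-8, 5), seq(0.2, 1, length.out = 95))
print(hc_select(pv))
```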

The variable rankings are based on absolute-value t-scores or ANOVA F-scores.
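The ranking step can be sketched with base R as follows. The scores are made-up t-scores and top_k is a hypothetical helper; keeping the first k ranked variables corresponds in spirit to fixing the number of selected variables in advance.

```r
# Made-up t-scores for five variables (illustration only).
scores <- c(V1 = -0.4, V2 = 3.1, V3 = -5.2, V4 = 1.7, V5 = 0.2)

# Rank variables by decreasing absolute score.
ranking <- order(abs(scores), decreasing = TRUE)

# Hypothetical helper: indices of the k highest-ranked variables.
top_k <- function(scores, k) {
  order(abs(scores), decreasing = TRUE)[seq_len(min(k, length(scores)))]
}

print(ranking)
print(names(scores)[top_k(scores, 2)])
```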

References

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B, 57, 289-300.

Donoho, D. and Jin, J. (2004) Higher criticism for detecting sparse heterogeneous mixtures, Annals of Statistics 32, 962-994.

Donoho, D. and Jin, J. (2008) Higher criticism thresholding: Optimal feature selection when useful features are rare and weak, Proceedings of the National Academy of Sciences, USA 105, 14790-14795.

Efron, B. (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis, Journal of the American Statistical Association 99, 96-104.

Fan, J. and Fan, Y. (2008) High-dimensional classification using features annealed independence rules, Annals of Statistics, 36 (6), 2605-2637.

Duarte Silva, A.P. (2011) Two Group Classification with High-Dimensional Correlated Data: A Factor Model Approach, Computational Statistics and Data Analysis, 55 (11), 2975-2990.

Examples

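# Compare the number of variables selected by the four methods
# currently available on Alon's Colon Cancer Data set
# (after a logarithmic transformation).
# Use classical p-values in the original HC approach.

library(HiDimDA)  # provides SelectV and the AlonDS data set

# The first column of AlonDS holds the class labels; the rest are gene expressions
log10genes <- log10(AlonDS[,-1])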

Res <- array(dim=4)
names(Res) <- c("ExpHC","HC","Fdr","Fair")
Res[1] <- SelectV(log10genes,AlonDS[,1])$nvkpt
Res[2] <- SelectV(log10genes,AlonDS[,1],
                  Selmethod="HC",NullDist="Theoretical")$nvkpt
Res[3] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fdr")$nvkpt
Res[4] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fair")$nvkpt

print(Res)