
HiDimDA (version 0.2-0)

SelectV: Variable Selection for High-Dimensional Discriminant Analysis.

Description

Selects variables to be used in a Discriminant Analysis classification rule.

Usage

SelectV(data, grouping, Selmethod=c("ExpHC","HC","Fdr","Fair","fixedp"),
NullDist=c("locfdr","Theoretical"), uselocfdr=c("onlyHC","always"), 
minlocfdrp=200, comvar=TRUE, Fdralpha=0.5, ExpHCalpha=0.5, HCalpha0=0.1,
maxp=ncol(data), tol=1E-12, ...)

Arguments

data
Matrix or data frame of observations.
grouping
Factor specifying the class for each observation.
Selmethod
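The method used to choose the number of variables selected. Current alternatives are:

ExpHC (default) for the Expanded Higher Criticism scheme of Duarte Silva (2011)

HC for the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008)

Fdr for False Discovery Rate control, as suggested by Benjamini and Hochberg (1995)

Fair for the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008)

fixedp for a fixed, pre-defined number of selected variables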

NullDist
The Null distribution used to compute p-values from t-scores or F-scores. Current alternatives are Theoretical for the corresponding theoretical distributions, and locfdr for an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).
uselocfdr
Flag indicating the statistics for which the Null empirical distribution estimated by the locfdr approach should be used. Current alternatives are onlyHC (default) and always.
minlocfdrp
Minimum number of variables required to estimate empirical Null distributions by the locfdr method. When the number of variables is below minlocfdrp, theoretical Nulls are always employed.
comvar
Boolean flag indicating if a common group variance is to be assumed (default) in the computation of the t-scores used for problems with two groups.
Fdralpha
Control level for variable selection based on False Discovery Rate Control (see Benjamini and Hochberg (1995)).
ExpHCalpha
Control level for the first step of the Expanded Higher Criticism scheme (see Duarte Silva (2011)).
HCalpha0
Proportion of p-values used to compute the HC statistic (see Donoho and Jin (2004, 2008)).
maxp
Maximum number of variables to be used in the discriminant rule.
tol
Numerical precision for distinguishing p-values from 0 and 1. Computed p-values below tol are set to tol, and those above 1-tol are set to 1-tol.
...
Arguments passed from other methods.
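
The clamping rule described for tol can be sketched in base R as follows. This is only an illustration of the stated rule, not the package's internal code; the p-values are made up, and the quantile transform at the end is an assumed example of why p-values of exactly 0 or 1 can be a problem.

```r
# Hypothetical p-values, including exact 0 and 1 (illustrative only)
pvals <- c(0, 3e-15, 0.02, 0.5, 1 - 1e-14, 1)
tol <- 1e-12

# Values below tol are raised to tol; values above 1 - tol are lowered to 1 - tol
clamped <- pmin(pmax(pvals, tol), 1 - tol)
print(clamped)

# After clamping, a normal-quantile transform stays finite
# (qnorm(0) and qnorm(1) would give -Inf and Inf)
z <- qnorm(clamped)
print(all(is.finite(z)))
```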

Value

A list with two components:
  • nvkpt: the number of variables to be used in the Discriminant rule
  • vkptInd: the indices of the variables to be used in the Discriminant rule
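
A minimal sketch of how the two components might be used to reduce a data matrix to the selected variables. The object sel below is a hand-written, hypothetical stand-in with the same two component names; it is not the output of an actual SelectV call, and the data are simulated.

```r
set.seed(1)
X <- matrix(rnorm(10 * 6), nrow = 10, ncol = 6)  # 10 observations, 6 variables

# Hypothetical stand-in for a SelectV result (assumed values, for illustration)
sel <- list(nvkpt = 3, vkptInd = c(5, 2, 6))

# Keep only the selected columns; drop = FALSE preserves the matrix shape
Xsel <- X[, sel$vkptInd, drop = FALSE]
print(dim(Xsel))
```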

Details

The function SelectV selects variables to be used in a Discriminant classification rule by the Higher Criticism (HC) approach of Donoho and Jin (2004, 2008), the Expanded Higher Criticism scheme proposed by Duarte Silva (2011), False Discovery Rate (Fdr) control as suggested by Benjamini and Hochberg (1995), the FAIR (Features Annealed Independence Rules) approach of Fan and Fan (2008), or simply by fixing the number of selected variables to some pre-defined constant.
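
To make the Higher Criticism idea concrete, here is a rough base-R sketch of HC thresholding on a vector of p-values. It uses one common form of the HC objective; published variants differ in the denominator, and the package's own computation may differ. The function name hc_select and the simulated p-values are assumptions made for this example; alpha0 plays the role described for HCalpha0.

```r
# Sketch of HC thresholding (hypothetical helper, not part of HiDimDA)
hc_select <- function(pvals, alpha0 = 0.1) {
  p <- length(pvals)
  ord <- order(pvals)                 # most significant variables first
  ps <- pvals[ord]                    # sorted p-values
  i <- seq_len(p)
  # Standardized gap between the empirical fraction i/p and the sorted p-value
  hc <- sqrt(p) * (i / p - ps) / sqrt(ps * (1 - ps))
  kmax <- max(1, floor(alpha0 * p))   # only the smallest alpha0 * p p-values
  k <- which.max(hc[seq_len(kmax)])   # position where the HC objective peaks
  list(nvkpt = k, vkptInd = ord[seq_len(k)], HC = hc[k])
}

set.seed(42)
# 980 null p-values (uniform) plus 20 strong signals (simulated)
pv <- c(runif(980), runif(20, 0, 1e-4))
res <- hc_select(pv)
print(res$nvkpt)
```

The variables kept are those whose p-values do not exceed the p-value at which the objective is maximized.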

The Fdr method is, by default, based on simple p-values derived from t-scores (problems with two groups) or ANOVA F-scores (problems with more than two groups). When the argument NullDist is set to Theoretical, these p-values are also used in the HC method. Otherwise, the HC p-values are derived from an empirical Null of z-scores estimated by the maximum likelihood approach of Efron (2004).
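
The two-group Fdr route can be approximated in base R as below: per-variable p-values from pooled-variance t-tests (mirroring comvar=TRUE), followed by the Benjamini and Hochberg adjustment. This is a sketch on simulated data, not the package's implementation; the level alpha = 0.05 is chosen for illustration and is not the Fdralpha default shown in Usage.

```r
set.seed(123)
n1 <- 15; n2 <- 15; p <- 200
X <- matrix(rnorm((n1 + n2) * p), nrow = n1 + n2, ncol = p)
grp <- factor(rep(c("a", "b"), c(n1, n2)))

# Shift the first 10 variables in group "b" so that they are informative
X[grp == "b", 1:10] <- X[grp == "b", 1:10] + 3

# Two-sample t-test p-value for each column, assuming a common group variance
pv <- apply(X, 2, function(x) t.test(x ~ grp, var.equal = TRUE)$p.value)

# Keep the variables whose BH-adjusted p-value is at most alpha
alpha <- 0.05
keep <- which(p.adjust(pv, method = "BH") <= alpha)
print(length(keep))
```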

The variable rankings are based on absolute-value t-scores or ANOVA F-scores.
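
For problems with more than two groups, a ranking by ANOVA F-scores could look like the following base-R sketch. The data are simulated and the code is an illustration of the idea rather than the routine used inside SelectV.

```r
set.seed(7)
n <- 30; p <- 8
grp <- factor(rep(c("a", "b", "c"), each = 10))
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Make variable 3 strongly informative by shifting it in group "c"
X[grp == "c", 3] <- X[grp == "c", 3] + 4

# One-way ANOVA F statistic for each column
fscore <- apply(X, 2, function(x) anova(lm(x ~ grp))[1, "F value"])

# Rank variables from largest to smallest F-score
ranking <- order(fscore, decreasing = TRUE)
print(ranking)
```

With two groups, the same pattern applies with abs() of the t-scores in place of the F-scores.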

References

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society B, 57, 289-300.

Donoho, D. and Jin, J. (2004) Higher criticism for detecting sparse heterogeneous mixtures, Annals of Statistics, 32, 962-994.

Donoho, D. and Jin, J. (2008) Higher criticism thresholding: Optimal feature selection when useful features are rare and weak, Proceedings of the National Academy of Sciences USA, 105, 14790-14795.

Efron, B. (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis, Journal of the American Statistical Association 99, 96-104.

Fan, J. and Fan, Y. (2008) High-dimensional classification using features annealed independence rules, Annals of Statistics, 36 (6), 2605-2637.

Duarte Silva, A. P. (2011) Two Group Classification with High-Dimensional Correlated Data: A Factor Model Approach, Computational Statistics and Data Analysis, 55 (11), 2975-2990.

See Also

Dlda, Mlda, Slda, RFlda, AlonDS

Examples

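# Compare the number of variables selected by four of the available
# methods on Alon's Colon Cancer data set (after a logarithmic
# transformation). Classical p-values are used in the original HC approach.

library(HiDimDA)  # provides SelectV and the AlonDS data set

# The first column of AlonDS holds the grouping factor; the rest are genes
log10genes <- log10(AlonDS[,-1])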

Res <- array(dim=4)
names(Res) <- c("ExpHC","HC","Fdr","Fair")
Res[1] <- SelectV(log10genes,AlonDS[,1])$nvkpt
Res[2] <- SelectV(log10genes,AlonDS[,1],
Selmethod="HC",NullDist="Theoretical")$nvkpt
Res[3] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fdr")$nvkpt
Res[4] <- SelectV(log10genes,AlonDS[,1],Selmethod="Fair")$nvkpt

print(Res)
