selectFeatures(elist = NULL, n1 = NULL, n2 = NULL, label1 = "A", label2 = "B",
  log = NULL, cutoff = 10, selection.method = "rf.rfe",
  preselection.method = "mMs", subruns = 100, k = 10, subsamples = 10,
  bootstraps = 10, candidate.number = 300, above = 1500, between = 400,
  panel.selection.criterion = "accuracy", importance.measure = "MDA",
  ntree = 500, mtry = NULL, plot = FALSE, output.path = NULL, verbose = FALSE,
  method = "frequency")
elist: EListRaw or EList object containing all microarray data (mandatory).
selection.method: "rf.rfe" (default), "svm.rfe" or "rj.rfe". Has no effect when method="ensemble".
preselection.method: "mMs" (default), "tTest", "mrmr" or "none". Has no effect when method="ensemble".
[...] method="ensemble". [...] method="frequency". [...] method="frequency" only.
candidate.number: default "300". Has no effect when method="ensemble".
above: default "1500". Has no effect when method="ensemble".
between: default "400". Has no effect when method="ensemble".
panel.selection.criterion: "accuracy" (default), "sensitivity" or "specificity". Has no effect when method="ensemble".
importance.measure: "MDA" (default) or "MDG". Has no effect when method="ensemble".
ntree: default "500". Has no effect when method="ensemble".
mtry: default sqrt(p), where p is the number of predictors. Has no effect when method="ensemble".
If method is "frequency", the results list contains the following elements:
If method is "ensemble", the results list contains the following elements:
selectFeatures() takes an EListRaw or EList object, group-specific sample numbers, group labels, and parameters that choose and configure a multivariate feature selection method (frequency-based or ensemble feature selection) in order to select a panel of differential features. When an output path is defined (via output.path), results are saved to disk, and when verbose is TRUE, additional information is printed to the console.

Frequency-based feature selection (method="frequency"): The whole data set is split into k cross-validation pairs of training and test sets. For each training set, a multivariate feature selection procedure is performed. The resulting k feature subsets are tested on the corresponding test sets (via classification). As a result, selectFeatures() returns the average k-fold cross-validation classification accuracy as well as the selected feature panel (i.e., the union of the k fold-specific feature subsets). The supported multivariate feature selection methods are random forest recursive feature elimination (RF-RFE), random jungle recursive feature elimination (RJ-RFE) and support vector machine recursive feature elimination (SVM-RFE). To reduce running times, univariate feature preselection can optionally be performed (controlled via preselection.method). The supported univariate preselection methods are mMs ("mMs"), Student's t-test ("tTest") and mRMR ("mrmr"). Alternatively, preselection can be disabled ("none"). This approach is similar to the method proposed in Baek et al.
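The frequency-based procedure described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not PAA's internal code: the data are simulated, a ranking by absolute t-statistic stands in for RF-RFE, a nearest-centroid classifier stands in for the real classifier, and the helper names select_top() and predict_centroid() are hypothetical.

```r
# Sketch: per-fold feature selection, test-set scoring, and the union of the
# fold-specific subsets. ASSUMPTIONS: simulated data, a |t|-statistic ranking
# instead of RF-RFE, and a nearest-centroid classifier (not PAA internals).
set.seed(1)
n <- 40; p <- 50; k <- 5; top <- 5
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("f", 1:p)))
y <- factor(rep(c("A", "B"), each = n / 2))
x[y == "B", 1:3] <- x[y == "B", 1:3] + 2   # three truly differential features

# Assign every sample to one of k folds, stratified by group.
fold <- integer(n)
for (g in levels(y)) {
  idx <- which(y == g)
  fold[idx] <- sample(rep_len(1:k, length(idx)))
}

# Hypothetical helper: names of the `top` features with the largest |t|.
select_top <- function(x, y, top) {
  g <- levels(y)
  t_abs <- vapply(colnames(x), function(j)
    abs(unname(t.test(x[y == g[1], j], x[y == g[2], j])$statistic)),
    numeric(1))
  names(sort(t_abs, decreasing = TRUE))[seq_len(top)]
}

# Hypothetical helper: assign each test sample to the closest group centroid.
predict_centroid <- function(train_x, train_y, test_x) {
  centroids <- sapply(levels(train_y), function(g)
    colMeans(train_x[train_y == g, , drop = FALSE]))
  d <- apply(test_x, 1, function(r) colSums((centroids - r)^2))
  factor(levels(train_y)[apply(d, 2, which.min)], levels = levels(train_y))
}

subsets <- vector("list", k)
acc <- numeric(k)
for (i in 1:k) {
  tr <- fold != i                                  # training samples of fold i
  subsets[[i]] <- select_top(x[tr, ], y[tr], top)  # selection on training set only
  pred <- predict_centroid(x[tr, subsets[[i]], drop = FALSE], y[tr],
                           x[!tr, subsets[[i]], drop = FALSE])
  acc[i] <- mean(pred == y[!tr])                   # test-set accuracy of fold i
}
panel <- Reduce(union, subsets)  # union of the k fold-specific subsets
mean_acc <- mean(acc)            # average k-fold cross-validation accuracy
```

Selecting features inside each fold, rather than once on the full data, keeps the test samples out of the selection step, so the averaged accuracy is not inflated by selection bias.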
Ensemble feature selection (method="ensemble"): The previously defined number of subsamples is drawn from the whole data set, yielding pairs of training and test sets. For each training set, a previously defined number of bootstrap samples is then drawn. For each bootstrap sample, SVM-RFE is performed and a feature ranking is obtained. To obtain a final ranking for a particular training set, all associated bootstrap rankings are aggregated into a single ranking. To score the cutoff best features, the test set of each subsample is classified (using an SVM trained with the cutoff best features from the training set) and the classification accuracy is determined. Finally, the stability of the subsample-specific panels is assessed (via the Kuncheva index; Kuncheva LI, 2007), all subsample-specific rankings are aggregated, the top n features (defined by cutoff) are selected, the average classification accuracy is computed, and all these results are returned in a list. This approach was proposed in Abeel et al.
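The ranking aggregation and the stability assessment described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not PAA's implementation: the data are simulated, a ranking by absolute t-statistic replaces SVM-RFE, rankings are aggregated by their mean rank, each subsample uses 90% of the samples per group, the test-set classification step is omitted, and the helper names rank_features() and kuncheva() are hypothetical.

```r
# Sketch: bootstrap rankings -> one aggregated ranking per subsample ->
# top-`cutoff` panel -> Kuncheva stability across subsamples.
# ASSUMPTIONS: simulated data, |t|-statistic rankings instead of SVM-RFE,
# mean-rank aggregation, 90% subsamples, no test-set classification.
set.seed(2)
n <- 40; p <- 50; cutoff <- 5; subsamples <- 4; bootstraps <- 10
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("f", 1:p)))
y <- rep(c("A", "B"), each = n / 2)
x[y == "B", 1:3] <- x[y == "B", 1:3] + 2   # three truly differential features

# Hypothetical helper: rank 1 = most differential feature.
rank_features <- function(x, y) {
  t_abs <- vapply(seq_len(ncol(x)), function(j)
    abs(unname(t.test(x[y == "A", j], x[y == "B", j])$statistic)),
    numeric(1))
  setNames(rank(-t_abs), colnames(x))
}

# Kuncheva (2007) consistency index for two subsets of equal size s drawn
# from n_features features: (r * n - s^2) / (s * (n - s)), r = |a and b|.
kuncheva <- function(a, b, n_features) {
  s <- length(a)
  r <- length(intersect(a, b))
  (r * n_features - s^2) / (s * (n_features - s))
}

panels <- vector("list", subsamples)
agg <- matrix(NA_real_, p, subsamples, dimnames = list(colnames(x), NULL))
for (i in seq_len(subsamples)) {
  # Training part of subsample i: 18 of the 20 samples in each group.
  tr <- unlist(lapply(split(seq_len(n), y), function(idx) sample(idx, 18)),
               use.names = FALSE)
  boot_ranks <- sapply(seq_len(bootstraps), function(b) {
    # Stratified bootstrap so that both groups stay represented.
    bs <- unlist(lapply(split(tr, y[tr]), function(idx)
      sample(idx, replace = TRUE)), use.names = FALSE)
    rank_features(x[bs, ], y[bs])
  })
  agg[, i] <- rowMeans(boot_ranks)                  # aggregate bootstrap rankings
  panels[[i]] <- names(sort(agg[, i]))[seq_len(cutoff)]
}

# Stability: mean Kuncheva index over all pairs of subsample-specific panels.
pairs <- combn(subsamples, 2)
stability <- mean(apply(pairs, 2, function(ix)
  kuncheva(panels[[ix[1]]], panels[[ix[2]]], p)))

# Final panel: aggregate the subsample rankings and keep the top `cutoff`.
final_panel <- names(sort(rowMeans(agg)))[seq_len(cutoff)]
```

The index equals 1 for identical panels and is close to 0 when the overlap is no larger than expected by chance, which is why it is corrected for the panel size relative to the total number of features.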
Abeel T, Helleputte T, Van de Peer Y, Dupont P, Saeys Y: Robust biomarker identification for cancer diagnosis with ensemble feature selection methods. Bioinformatics. 2010 Feb 1;26(3):392-8.
Kuncheva LI: A stability index for feature selection. Proceedings of the IASTED International Conference on Artificial Intelligence and Applications. February 12-14, 2007. Pages: 390-395.
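library(PAA)
# Load the example data set shipped with the PAA package
cwd <- system.file(package="PAA")
load(paste(cwd, "/extdata/Alzheimer.RData", sep=""))
# Keep only the features from blocks 1 to 9
elist <- elist[elist$genes$Block < 10,]
# Column names of the AD and NDC samples
c1 <- paste(rep("AD",20), 1:20, sep="")
c2 <- paste(rep("NDC",20), 1:20, sep="")
# Univariate preselection (t-test), then drop the discarded features
pre.sel.results <- preselect(elist=elist, columns1=c1, columns2=c2, label1="AD",
  label2="NDC", log=FALSE, discard.threshold=0.1, fold.thresh=1.9,
  discard.features=TRUE, method="tTest")
elist <- elist[-pre.sel.results$discard,]
# Ensemble feature selection with a small number of subsamples and bootstraps
selectFeatures.results <- selectFeatures(elist, n1=20, n2=20, label1="AD",
  label2="NDC", log=FALSE, subsamples=2, bootstraps=1, candidate.number=20,
  method="ensemble")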