fs.ensembl.stability: Ensemble Classification & Feature Selection

Description

Fits an ensemble of models to high-dimensional data in order to both classify samples and identify the features most important for classification. The function bootstraps the data a user-specified number of times so that the stability of the selected features can be measured. This provides an important metric for biomarker investigations: whether the important variables are still identified when the models are refit on 'different' data.

Usage

fs.ensembl.stability(X, Y, method, k = 10, p = 0.9,
    f = ceiling(ncol(X)/10), bags = 40,
    aggregation.metric = "CLA", stability.metric = "jaccard",
    optimize = TRUE, optimize.resample = FALSE, tuning.grid = NULL,
    k.folds = if (optimize) 10 else NULL,
    repeats = if (k.folds == "LOO") NULL else if (optimize) 3 else NULL,
    resolution = if (optimize) 3 else NULL,
    metric = "Accuracy", model.features = FALSE,
    allowParallel = FALSE, verbose = "none", ...)

Arguments

X

A matrix containing the numeric values of each feature (samples in rows, features in columns)

Y

A factor vector containing the group membership of the samples

method

A vector listing models to be fit. Available options are "plsda" (Partial Least Squares Discriminant Analysis), "rf" (Random Forest), "gbm" (Gradient Boosting Machine), "svm" (Support Vector Machines), "glmnet" (Elastic-net Generalized Linear Model), and "pam" (Prediction Analysis of Microarrays)

k

Number of bootstrapped iterations. Default "k = 10"

p

Proportion of the data used for training. Default "p = 0.9"

f

Number of features desired. Default is the top 10% of features, "f = ceiling(ncol(X)/10)". If rank correlation is desired, set "f = NULL"
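As a quick illustration of the default, the sketch below evaluates the expression from the Usage section on a made-up matrix. The matrix dimensions are hypothetical and chosen only to show the rounding behaviour of ceiling().

```r
# Hypothetical data: 20 samples (rows) and 95 features (columns)
set.seed(1)
X <- matrix(rnorm(20 * 95), nrow = 20, ncol = 95)

# Default number of features to keep: one tenth of the columns, rounded up
f <- ceiling(ncol(X) / 10)
f
```

With 95 features, 95/10 = 9.5 is rounded up, so 10 features would be requested.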

bags

Number of iterations for ensemble bagging. Default "bags = 40"

aggregation.metric

String indicating which aggregation metric to use for the features selected during bagging. Available options are "CLA" (Complete Linear), "EM" (Ensemble Mean), "ES" (Ensemble Stability), and "EE" (Ensemble Exponential)
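To give a feel for what aggregating features across bags means, here is a minimal base R sketch of mean-rank aggregation, in the spirit of the "EM" option. It is an illustration only and not the package's internal code; the rank matrix and the feature and bag names are made up.

```r
# Hypothetical feature ranks: rows are features, columns are bags
# (rank 1 = most important within that bag)
ranks <- matrix(c(1, 2, 3, 4,
                  2, 1, 4, 3,
                  1, 3, 2, 4),
                nrow = 4,
                dimnames = list(paste0("feat", 1:4), paste0("bag", 1:3)))

# Aggregate by averaging each feature's rank over all bags
mean_rank <- rowMeans(ranks)

# Keep the f features with the best (lowest) aggregated rank
f <- 2
top_features <- names(sort(mean_rank))[seq_len(f)]
top_features
```

The other options presumably combine or weight the per-bag ranks differently; consult the package source for the exact definitions.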

stability.metric

String indicating the type of stability metric. Available options are "jaccard" (Jaccard Index/Tanimoto Distance), "sorensen" (Dice-Sorensen's Index), "ochiai" (Ochiai's Index), "pof" (Percent of Overlapping Features), "kuncheva" (Kuncheva's Stability Measures), "spearman" (Spearman Rank Correlation), and "canberra" (Canberra Distance)
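The set-based metrics compare the feature subsets selected in different bootstrap iterations. The sketch below computes the textbook Jaccard and Dice-Sorensen indices for two made-up feature sets in base R. The helper names jaccard_index and sorensen_index are hypothetical, and the package may aggregate pairwise values differently.

```r
# Jaccard index: size of the intersection over size of the union
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Dice-Sorensen index: twice the intersection over the summed set sizes
sorensen_index <- function(a, b) {
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Hypothetical features selected in two bootstrap iterations
run1 <- c("g1", "g2", "g3", "g4")
run2 <- c("g2", "g3", "g4", "g5")

jaccard_index(run1, run2)   # 3 shared / 5 distinct = 0.6
sorensen_index(run1, run2)  # 2 * 3 / (4 + 4) = 0.75
```

Both indices range from 0 (no shared features) to 1 (identical selections).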

optimize

Logical argument determining if each model should be optimized. Default "optimize = TRUE"

optimize.resample

Logical argument determining if each resample should be re-optimized. Default "optimize.resample = FALSE": only one optimization run is performed, and subsequent models use the initially determined parameters

tuning.grid

Optional list of grids containing the parameters to optimize for each algorithm. Default "tuning.grid = NULL" lets the function create a grid determined by "resolution"

k.folds

Number of folds generated during cross-validation. May optionally be set to "LOO" for leave-one-out cross-validation. Default "k.folds = 10" when "optimize = TRUE"

repeats

Number of times cross-validation is repeated. Default "repeats = 3" when "optimize = TRUE"; not used when "k.folds = "LOO""

resolution

Optional - Resolution of the model optimization grid. Default "resolution = 3"

metric

Criterion for model optimization. Available options are "Accuracy" (Prediction Accuracy), "Kappa" (Kappa Statistic), and "AUC-ROC" (Area Under the Curve - Receiver Operating Characteristic)

model.features

Logical argument determining if the number of features selected should be set by the individual model runs. Default "model.features = FALSE"

allowParallel

Logical argument dictating if parallel processing is allowed via the foreach package. Default "allowParallel = FALSE"

verbose

Character argument specifying how much output progress to print. Options are 'none', 'minimal' or 'full'.

...

Extra arguments that the user would like to apply to the models

Value

Methods: Vector of models fit to data
performance: Performance metrics of each model and bootstrap iteration
RPT: Robustness-Performance Trade-Off as defined in Saeys 2008
features: List concerning the features determined via each algorithm's feature selection criteria.
stability.models: Function perturbation metric - i.e. how similar are the features selected by each model.
all.tunes: If "optimize.resample = TRUE" then returns list of optimized parameters for each bagging and bootstrap iteration.
final.best.tunes: If "optimize.resample = TRUE" then returns list of optimized parameters for each bootstrap of the bagged models refit to aggregated selected features.
specs: List with the following elements:

References

Saeys Y., Abeel T., et al. (2008) Machine Learning and Knowledge Discovery in Databases. 313-325. http://link.springer.com/chapter/10.1007/978-3-540-87481-2_21
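For readers unfamiliar with the Robustness-Performance Trade-Off returned in RPT, the sketch below shows it as a weighted harmonic mean of stability and classification performance, which is how the measure is commonly stated from Saeys et al. (2008). The function name rpt_score and the input values are hypothetical, and the package's own computation may differ in detail.

```r
# Robustness-Performance Trade-Off as a weighted harmonic mean.
# beta = 1 weights stability and performance equally (assumed default here).
rpt_score <- function(stability, performance, beta = 1) {
  ((beta^2 + 1) * stability * performance) /
    (beta^2 * stability + performance)
}

# Hypothetical values: Jaccard stability of 0.6, accuracy of 0.9
rpt_score(stability = 0.6, performance = 0.9)  # 1.08 / 1.5 = 0.72
```

Because it is a harmonic mean, the score is pulled toward the lower of the two inputs, so a model that is accurate but unstable is penalized.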

Examples


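## Not run: 
# `vars` (numeric feature matrix) and `groups` (factor of class labels)
# are assumed to already exist in the workspace.
fits <- fs.ensembl.stability(vars,
                             groups,
                             method = c("plsda", "rf"),
                             f = 10,
                             k = 3,
                             k.folds = 10,
                             verbose = 'none')
## End(Not run)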
