
pheble (version 0.1.0)

ph_ensemble: Classify phenotypes via ensemble learning.

Description

The ph_ensemble function uses classification predictions from a list of algorithms to train an ensemble model. This can be a list of manually trained algorithms from train or, more conveniently, the output from ph_train. The hyperparameter tuning and model evaluations are handled internally to simplify the ensembling process. This function assumes some preprocessing has been performed, hence the training, validation, and test set requirements.
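The general idea of stacking base-model predictions into a metalearner can be sketched in base R. This is a generic illustration, not the pheble implementation: the iris subset, the two glm base models, and the logistic-regression metalearner are stand-ins, and training the metalearner on validation-set predictions is an assumption suggested by the vali_preds component that the function returns.

```r
set.seed(123)
# Two-class subset of iris as stand-in data (not the ph_crocs data).
df <- droplevels(subset(iris, Species != "setosa"))

# Fixed-size random 60/20/20 split into training, validation, and test sets.
idx <- sample(rep(c("train", "vali", "test"), times = c(60, 20, 20)))
train_df <- df[idx == "train", ]
vali_df <- df[idx == "vali", ]
test_df <- df[idx == "test", ]

# Two simple base models, each using different predictors.
m1 <- glm(Species ~ Sepal.Length + Sepal.Width, data = train_df, family = binomial)
m2 <- glm(Species ~ Petal.Length, data = train_df, family = binomial)

# Base-model class probabilities become the metalearner's predictors.
stack_preds <- function(newdata) {
  data.frame(p1 = predict(m1, newdata, type = "response"),
             p2 = predict(m2, newdata, type = "response"))
}
meta_train <- cbind(stack_preds(vali_df), Species = vali_df$Species)

# Metalearner trained on validation predictions, then applied to the test set.
meta <- glm(Species ~ p1 + p2, data = meta_train, family = binomial)
ensemble_prob <- predict(meta, stack_preds(test_df), type = "response")

# glm models the probability of the second factor level.
ensemble_pred <- ifelse(ensemble_prob > 0.5,
                        levels(df$Species)[2], levels(df$Species)[1])
accuracy <- mean(ensemble_pred == test_df$Species)
```

ph_ensemble wraps the analogous steps (model selection, tuning, and evaluation) in a single call, so none of this needs to be written by hand.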

Usage

ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)

Value

A list containing the following components:

ensemble_test_preds

The ensemble predictions for the test set.

vali_preds

The validation predictions for the top models.

test_preds

The test predictions for the top models.

all_test_preds

The test predictions for every successfully trained model.

all_test_results

The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes.

ensemble_model

The ensemble train object.

var_imps

The ensemble variable importances obtained via weighted averaging. The original train importances are multiplied by the model's importance in the ensemble, then averaged across models and normalized.

train_df

The training data frame.

vali_df

The validation data frame.

test_df

The test data frame.

train_models

The train models for the ensemble.

ctrl

A trainControl object.

metric

The summary metric used to select the optimal model.

task

The type of classification task.

tune_length

The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

top_models

The number of top methods selected for the ensemble.

metalearner

The algorithm used to train the ensemble.
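The weighted averaging described for var_imps can be illustrated with a small base R sketch. All numbers and names below (predictors PC1 to PC3, the three base models, and their ensemble weights) are hypothetical, and normalizing so the importances sum to 100 is an assumption; the package may scale them differently.

```r
# Hypothetical per-model variable importances
# (rows = predictors, columns = base models; matrix() fills column by column).
model_imps <- matrix(
  c(80, 20, 0,    # lda
    50, 30, 20,   # nnet
    10, 60, 30),  # pda
  nrow = 3,
  dimnames = list(c("PC1", "PC2", "PC3"), c("lda", "nnet", "pda"))
)

# Hypothetical importance of each base model within the ensemble.
model_weights <- c(lda = 0.5, nnet = 0.3, pda = 0.2)

# Multiply each model's importances by its ensemble weight,
# then average across models.
weighted <- sweep(model_imps, 2, model_weights, `*`)
avg_imps <- rowMeans(weighted)

# Normalize so that the importances sum to 100 (assumed scaling).
var_imps <- 100 * avg_imps / sum(avg_imps)
```

With these illustrative inputs, PC1 ends up with the largest share because the most heavily weighted model (lda) relies on it most.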

Arguments

train_models

A list of at least two train models.

train_df

A data.frame containing a class column and the training data.

vali_df

A data.frame containing a class column and the validation data.

test_df

A data.frame containing a class column and the test data.

class_col

A character value for the name of the class column. This should be consistent across data frames.

ctrl

A list containing the resampling strategy (e.g., "boot") and other parameters for trainControl. Automatically create one via ph_ctrl or manually create it with trainControl.

train_seed

A numeric value to set the training seed and control the randomness of creating resamples: 123 (default).

n_cores

An integer value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value to, e.g., parallel::detectCores() - 1.

task

A character value for the type of classification task: "multi" (default), "binary".

metric

A character value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".

top_models

A numeric value for the top n training models to ensemble: 3 (default). All training models are ranked by their final metric value (e.g., "ROC" or "Kappa") and the top n models are selected.

metalearner

A character value for the algorithm used to train the ensemble: "glmnet" (default for "multi"), "rf" (default for "binary"). Other methods, such as those listed in ph_train methods, may also be used.

tune_length

If search = "random" (default), this is an integer value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if search = "grid", this is an integer value for the number of levels of each hyperparameter to test for each model.

quiet

A logical value for whether progress output should be suppressed: FALSE (default), TRUE.
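The ranking behind top_models can be sketched in a few lines of base R. The Kappa values below are hypothetical, and sorting in decreasing order assumes a metric where higher is better (for "logLoss" the order would be reversed); this is an illustration of the selection rule, not the package's internal code.

```r
# Hypothetical final Kappa values for five trained models.
model_metrics <- c(lda = 0.81, mda = 0.86, nnet = 0.90,
                   pda = 0.79, sparseLDA = 0.84)
top_models <- 3

# Rank models from best to worst, then keep the names of the top n.
ranked <- sort(model_metrics, decreasing = TRUE)
top_n <- names(head(ranked, top_models))
```

Here top_n holds "nnet", "mda", and "sparseLDA", which are the models that would be passed on to the metalearner.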

Examples

## Import data.
data(ph_crocs)
# \donttest{
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Create the control object for the train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet",
                                     "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)
# }
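The notes above recommend increasing n_cores. A minimal sketch of choosing that value follows, using the parallel package that ships with R. The fallback to 2 mirrors the documented default; the guard against NA is there because detectCores() can return NA on some platforms.

```r
# Detect the number of available cores; this may be NA on some platforms.
available <- parallel::detectCores()

# Leave one core free for the rest of the system, but never go below one.
# Fall back to the documented default of 2 if detection fails.
n_cores <- if (is.na(available)) 2L else max(1L, available - 1L)
```

The resulting value can then be supplied as n_cores = n_cores in the ph_ensemble call shown in the examples.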
