ph_ensemble: Classify phenotypes via ensemble learning.

Description

The ph_ensemble function trains an ensemble model on the classification predictions of a list of algorithms. The list can contain models trained manually with train or, more conveniently, the output of ph_train. Hyperparameter tuning and model evaluation are handled internally to simplify the ensembling process. The function assumes the data have already been preprocessed and split, which is why it requires separate training, validation, and test sets.
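The idea of training an ensemble on the predictions of other models (stacking) can be sketched in base R. This is a minimal illustration, not the package's internals: the two `glm` base learners, the `glm` metalearner, the `iris` subset, and the choice to fit the metalearner on validation-set predictions are all assumptions made for the example.

```r
## Minimal stacking sketch in base R (illustrative only, not ph_ensemble's code).
set.seed(123)

## Binary toy problem: two iris species, split into train/validation/test.
df <- iris[iris$Species != "setosa", ]
df$Species <- factor(df$Species)  # drop the unused "setosa" level
idx <- sample(rep(c("train", "vali", "test"), length.out = nrow(df)))
train_df <- df[idx == "train", ]
vali_df  <- df[idx == "vali", ]
test_df  <- df[idx == "test", ]

## Two simple base learners fit on the training set.
m1 <- glm(Species ~ Sepal.Length, data = train_df, family = binomial)
m2 <- glm(Species ~ Sepal.Width,  data = train_df, family = binomial)

## Collect each base learner's predicted probability as a feature.
base_probs <- function(newdata) {
  data.frame(
    p1 = predict(m1, newdata = newdata, type = "response"),
    p2 = predict(m2, newdata = newdata, type = "response")
  )
}

## Assumption: the metalearner is fit on validation-set predictions.
meta_df <- cbind(base_probs(vali_df), Species = vali_df$Species)
meta <- glm(Species ~ p1 + p2, data = meta_df, family = binomial)

## Ensemble predictions for the held-out test set.
test_prob <- predict(meta, newdata = base_probs(test_df), type = "response")
ensemble_preds <- ifelse(test_prob > 0.5,
                         levels(df$Species)[2], levels(df$Species)[1])
test_accuracy <- mean(ensemble_preds == test_df$Species)
```

Keeping the three sets separate means the metalearner never sees the test classes, so `test_accuracy` remains an honest estimate of ensemble performance.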

Usage

ph_ensemble(
  train_models,
  train_df,
  vali_df,
  test_df,
  class_col,
  ctrl,
  train_seed = 123,
  n_cores = 2,
  task = "multi",
  metric = ifelse(task == "multi", "Kappa", "ROC"),
  top_models = 3,
  metalearner = ifelse(task == "multi", "glmnet", "rf"),
  tune_length = 10,
  quiet = FALSE
)

Value

A list containing the following components:

`ensemble_test_preds`	The ensemble predictions for the test set.

`vali_preds`	The validation predictions for the top models.

`test_preds`	The test predictions for the top models.

`all_test_preds`	The test predictions for every successfully trained model.

`all_test_results`	The confusion matrix results obtained from comparing the model test predictions (i.e., original models and ensemble) against the actual test classes.

`ensemble_model`	The ensemble `train` object.

`var_imps`	The ensemble variable importances obtained via weighted averaging. The original train importances are multiplied by the model's importance in the ensemble, then averaged across models and normalized.

`train_df`	The training data frame.

`vali_df`	The validation data frame.

`test_df`	The test data frame.

`train_models`	The `train` models for the ensemble.

`ctrl`	A `trainControl` object.

`metric`	The summary metric used to select the optimal model.

`task`	The type of classification task.

`tune_length`	The maximum number of hyperparameter combinations ("random") or individual hyperparameter depth ("grid").

`top_models`	The number of top methods selected for the ensemble.

`metalearner`	The algorithm used to train the ensemble.
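The weighted averaging described for `var_imps` can be illustrated with a small base R sketch. The importance values, the model weights, and the 0 to 100 normalization scale below are hypothetical; the package may compute or scale these quantities differently.

```r
## Hypothetical per-model importances (rows = predictors, columns = models).
model_imps <- cbind(
  lda  = c(PC1 = 80, PC2 = 15, PC3 = 5),
  nnet = c(PC1 = 50, PC2 = 40, PC3 = 10),
  pda  = c(PC1 = 60, PC2 = 30, PC3 = 10)
)

## Hypothetical importance of each model within the ensemble.
model_weights <- c(lda = 0.5, nnet = 0.3, pda = 0.2)

## Multiply each model's importances (columns) by its ensemble weight.
weighted <- sweep(model_imps, 2, model_weights, `*`)

## Average across models, then normalize so the largest value is 100
## (the scale is an assumption for this sketch).
avg <- rowMeans(weighted)
var_imps <- 100 * avg / max(avg)
```

A predictor ranks highly only when it matters to models that themselves carry weight in the ensemble.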

Arguments

train_models: A list of at least two train models.
train_df: A data.frame containing a class column and the training data.
vali_df: A data.frame containing a class column and the validation data.
test_df: A data.frame containing a class column and the test data.
class_col: A character value for the name of the class column. This should be consistent across data frames.
ctrl: A list containing the resampling strategy (e.g., "boot") and other parameters for trainControl. Automatically create one via ph_ctrl or manually create it with trainControl.
train_seed: A numeric value to set the training seed and control the randomness of creating resamples: 123 (default).
n_cores: An integer value for the number of cores to include in the cluster: 2 (default). We highly recommend increasing this value to, e.g., parallel::detectCores() - 1.
task: A character value for the type of classification task: "multi" (default), "binary".
metric: A character value for which summary metric should be used to select the optimal model: "ROC" (default for "binary") and "Kappa" (default for "multi"). Other options include "logLoss", "Accuracy", "Mean_Balanced_Accuracy", and "Mean_F1".
top_models: A numeric value for the top n training models to ensemble: 3 (default). Training models are ranked by their final metric value (e.g., "ROC" or "Kappa") and the top n are selected.
metalearner: A character value for the algorithm used to train the ensemble: "glmnet" (default for "multi"), "rf" (default for "binary"). Other methods, such as those listed in ph_train methods, may also be used.
tune_length: If the control object uses search = "random" (default), this is an integer value for the maximum number of hyperparameter combinations to test for each training model in the ensemble; if search = "grid", this is an integer value for the number of levels of each hyperparameter to test for each model.
quiet: A logical value for whether progress messages should be suppressed: FALSE (default), TRUE.
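Two of the arguments above can be illustrated in base R: how top_models picks the best-ranked models, and how to choose a larger n_cores. The metric values are hypothetical, and the ranking shown is only a sketch of the described behavior rather than the package's own code.

```r
## Hypothetical final metric values (e.g., Kappa) for five trained models.
model_metrics <- c(lda = 0.91, mda = 0.88, nnet = 0.93,
                   pda = 0.90, sparseLDA = 0.85)
top_models <- 3

## Rank from best to worst and keep the names of the top n models.
top_names <- names(sort(model_metrics, decreasing = TRUE))[seq_len(top_models)]

## Leave one core free; fall back to 1 if the core count cannot be detected.
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
```

The resulting `n_cores` value can then be passed to ph_train and ph_ensemble in place of the default of 2.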

Examples

## Import data.
data(ph_crocs)
# \donttest{
## Remove anomalies with autoencoder.
rm_outs <- ph_anomaly(df = ph_crocs, ids_col = "Biosample",
                      class_col = "Species", method = "ae")
## Preprocess anomaly-free data frame into train, validation, and test sets
## with PCs as predictors.
pc_dfs <- ph_prep(df = rm_outs$df, ids_col = "Biosample",
                  class_col = "Species", vali_pct = 0.15,
                  test_pct = 0.15, method = "pca")
## Create the control object for the train function.
ctrl <- ph_ctrl(ph_crocs$Species, resample_method = "boot")
## Train all models for ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = "all",
                         tune_length = 5,
                         quiet = FALSE)
## You can also train just a few, although more is preferable.
## Note: Increasing n_cores will dramatically reduce train time.
train_models <- ph_train(train_df = pc_dfs$train_df,
                         vali_df = pc_dfs$vali_df,
                         test_df = pc_dfs$test_df,
                         class_col = "Species",
                         ctrl = ctrl,
                         task = "multi",
                         methods = c("lda", "mda", "nnet",
                                     "pda", "sparseLDA"),
                         tune_length = 5,
                         quiet = FALSE)
## Train the ensemble.
## Note: Increasing n_cores will dramatically reduce train time.
ensemble_model <- ph_ensemble(train_models = train_models$train_models,
                              train_df = pc_dfs$train_df,
                              vali_df = pc_dfs$vali_df,
                              test_df = pc_dfs$test_df,
                              class_col = "Species",
                              ctrl = ctrl,
                              task = "multi",
                              top_models = 3,
                              metalearner = "glmnet",
                              tune_length = 25,
                              quiet = FALSE)
# }
