RFE_SCE: Recursive Feature Elimination for SCE Models

Description

This function implements Recursive Feature Elimination (RFE) to identify the most important predictors for SCE models. It iteratively removes the least important predictors based on Wilks' feature importance scores and evaluates model performance. The function supports both single and multiple predictants, with comprehensive input validation and performance tracking across iterations.

The package also provides a Plot_RFE function for visualizing RFE results, showing validation and testing R2 values as a function of the number of predictors.

Usage

RFE_SCE(
  Training_data,
  Testing_data,
  Predictors,
  Predictant,
  Nmin,
  Ntree,
  alpha = 0.05,
  resolution = 1000,
  step = 1,
  verbose = TRUE,
  parallel = TRUE
)

Plot_RFE(
  rfe_result,
  main = "Validation and Testing R2 vs Number of Predictors",
  col_validation = "blue",
  col_testing = "red",
  pch = 16,
  lwd = 2,
  cex = 1.2,
  legend_pos = "bottomleft",
  ...
)

Value

RFE_SCE: A list containing:

summary: Data.frame with columns:
- n_predictors: Number of predictors at each iteration
- predictors: Comma-separated list of predictors used
performances: List of performance evaluations for each iteration
- For single predictant: Direct performance data.frame
- For multiple predictants: Named list of performance data.frames
importance_scores: List of Wilks' importance scores for each iteration

Plot_RFE: Invisibly returns a list containing:

n_predictors: Vector of predictor counts
validation_r2: Vector of validation R2 values
testing_r2: Vector of testing R2 values
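
As a small illustration of working with the summary component described above, the sketch below recovers the predictor names retained at a given iteration from the comma-separated predictors column. The data.frame is a hand-made mock with invented values, and predictors_at is a hypothetical helper, not part of the SCE package. The exact separator used by RFE_SCE is not stated here, so the helper splits on commas and trims whitespace.

```r
# Mock of the documented `summary` component; values are invented for illustration.
rfe_summary <- data.frame(
  n_predictors = c(4, 3, 2),
  predictors = c("Prcp, SRad, Tmax, Tmin", "Prcp, SRad, Tmax", "Prcp, Tmax"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the predictor names retained when `n` predictors remain.
predictors_at <- function(summary_df, n) {
  row <- summary_df[summary_df$n_predictors == n, , drop = FALSE]
  if (nrow(row) != 1) stop("expected exactly one iteration with n = ", n)
  # Split on commas and trim surrounding whitespace (separator is an assumption).
  trimws(strsplit(row$predictors, ",", fixed = TRUE)[[1]])
}

selected <- predictors_at(rfe_summary, 3)
print(selected)
```

With a real result, the same helper could be applied to result$summary, provided the column layout matches the description above.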

Arguments

Training_data: A data.frame containing the training data. Must include all specified predictors and predictants.
Testing_data: A data.frame containing the testing data. Must include all specified predictors and predictants.
Predictors: A character vector specifying the names of independent variables to be evaluated (e.g., c("Prcp","SRad","Tmax")). Must contain at least 2 elements.
Predictant: A character vector specifying the name(s) of dependent variable(s) (e.g., c("swvl3","swvl4")). Must be non-empty.
Nmin: Integer specifying the minimum number of samples a leaf node must contain to be considered for cutting.
Ntree: Integer specifying the number of trees in the ensemble.
alpha: Numeric significance level for clustering, between 0 and 1. Default value is 0.05.
resolution: Numeric value specifying the resolution for splitting. Default value is 1000.
step: Integer specifying the number of predictors to remove at each iteration. Must be between 1 and (number of predictors - number of predictants). Default value is 1.
verbose: A logical value indicating whether to print progress information during RFE iterations. Default value is TRUE.
parallel: A logical value indicating whether to use parallel processing for SCE model construction. When TRUE, uses multiple CPU cores for faster computation. When FALSE, processes trees sequentially. Default value is TRUE.
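
The constraints listed above can be checked before a long run. The following is a minimal sketch of such a pre-flight check, written from the argument descriptions only; check_rfe_inputs is a hypothetical helper and is not the package's internal validation. It assumes alpha must lie strictly between 0 and 1, which the description does not state explicitly.

```r
# Hypothetical pre-flight check mirroring the documented argument constraints.
check_rfe_inputs <- function(Training_data, Testing_data, Predictors, Predictant,
                             alpha = 0.05, step = 1) {
  stopifnot(is.data.frame(Training_data), is.data.frame(Testing_data))
  stopifnot(is.character(Predictors), length(Predictors) >= 2)
  stopifnot(is.character(Predictant), length(Predictant) >= 1)

  # Both data sets must contain every predictor and predictant column.
  needed <- c(Predictors, Predictant)
  missing_train <- setdiff(needed, names(Training_data))
  missing_test <- setdiff(needed, names(Testing_data))
  if (length(missing_train) > 0) {
    stop("Training_data lacks: ", paste(missing_train, collapse = ", "))
  }
  if (length(missing_test) > 0) {
    stop("Testing_data lacks: ", paste(missing_test, collapse = ", "))
  }

  # Assumption: open interval (0, 1) for the significance level.
  if (!is.numeric(alpha) || alpha <= 0 || alpha >= 1) {
    stop("alpha must lie between 0 and 1")
  }

  # step must be a whole number in 1..(number of predictors - number of predictants).
  max_step <- length(Predictors) - length(Predictant)
  if (step < 1 || step > max_step || step != round(step)) {
    stop("step must be an integer between 1 and ", max_step)
  }
  invisible(TRUE)
}

toy <- data.frame(Prcp = 1:3, SRad = 4:6, Tmax = 7:9, swvl3 = c(0.1, 0.2, 0.3))
ok <- check_rfe_inputs(toy, toy, c("Prcp", "SRad", "Tmax"), "swvl3", step = 2)
bad <- tryCatch(
  check_rfe_inputs(toy, toy, c("Prcp", "SRad", "Tmax"), "swvl3", step = 3),
  error = function(e) conditionMessage(e)
)
print(bad)
```

In the toy call, three predictors and one predictant give a maximum step of 2, so step = 3 is rejected.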

Plot_RFE Arguments:

rfe_result: The result object returned by RFE_SCE, containing the summary and performances components.
main: Title for the plot. Default is "Validation and Testing R2 vs Number of Predictors".
col_validation: Color for validation line. Default is "blue".
col_testing: Color for testing line. Default is "red".
pch: Point character for markers. Default is 16 (filled circle).
lwd: Line width. Default is 2.
cex: Point size. Default is 1.2.
legend_pos: Position of legend. Default is "bottomleft".
...: Additional arguments passed to the plot function.

Author

Kailong Li <lkl98509509@gmail.com>

Details

RFE_SCE Process: RFE proceeds through the following steps:

Input validation:
- Data frame structure validation
- Predictor and predictant validation
- Step size validation
Initialization:
- Set up history tracking structures
- Initialize current predictor set
Main RFE loop (continues while the number of predictors exceeds the number of predictants plus 2):
- Train SCE model with current predictors
- Generate predictions using Model_simulation
- Evaluate model using SCE_Model_evaluation
- Store performance metrics and importance scores
- Remove least important predictors based on Wilks' scores
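
The loop structure above can be sketched with a stand-in model. The example below is not the SCE implementation: it swaps in a linear model (lm) for the SCE ensemble and uses absolute t-statistics in place of Wilks' importance scores, purely to show the train, evaluate, rank, and remove cycle. The function rfe_sketch, the min_predictors floor used as a stopping rule, and the simulated data are all assumptions made for illustration.

```r
# Simulated data: only x1 and x2 carry signal (illustrative, not from the package).
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
  d$y <- 3 * d$x1 - 2 * d$x2 + rnorm(n, sd = 0.5)
  d
}
set.seed(1)
train <- make_data(200)
test <- make_data(200)

# Hypothetical RFE loop with lm() as a stand-in for the SCE model.
rfe_sketch <- function(train, test, predictors, response,
                       step = 1, min_predictors = 2) {
  current <- predictors
  history <- list()
  repeat {
    # Train on the current predictor set and predict the testing data.
    fit <- lm(reformulate(current, response = response), data = train)
    pred <- predict(fit, newdata = test)
    obs <- test[[response]]
    r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

    # Stand-in importance: absolute t-statistic of each coefficient.
    imp <- abs(coef(summary(fit))[current, "t value"])

    history[[length(history) + 1]] <- data.frame(
      n_predictors = length(current),
      predictors = paste(current, collapse = ", "),
      testing_r2 = r2,
      stringsAsFactors = FALSE
    )

    # Remove the `step` least important predictors, never going below the floor.
    n_drop <- min(step, length(current) - min_predictors)
    if (n_drop < 1) break
    current <- setdiff(current, names(sort(imp))[seq_len(n_drop)])
  }
  do.call(rbind, history)
}

res <- rfe_sketch(train, test, paste0("x", 1:5), "y", step = 1)
print(res)
```

Recording the predictor set and score before each removal gives one row per iteration, which matches the shape of the summary component described in the Value section.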

The function handles:

Single and multiple predictants
Performance tracking across iterations
Importance score calculation
Step-wise predictor removal

Plot_RFE Function: Creates a base R plot showing validation and testing R2 values as a function of the number of predictors during the RFE process. The function:

Extracts R2 values from RFE results
Converts formatted strings to numeric values
Creates a line plot with points and lines
Includes a legend distinguishing validation and testing performance
Supports customization of colors, line styles, and plot appearance
Uses only base R graphics (no external dependencies)
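
For readers who want to see the kind of base R plot described above, here is a minimal sketch. It does not call Plot_RFE and does not reproduce its internals; plot_r2_sketch is a hypothetical function, and the R2 values are invented strings that stand in for the formatted values mentioned above.

```r
# Invented values; stored as strings to mimic formatted performance output.
n_predictors <- c(5, 4, 3, 2)
validation_r2 <- as.numeric(c("0.812", "0.820", "0.815", "0.790"))
testing_r2 <- as.numeric(c("0.760", "0.771", "0.765", "0.731"))

# Hypothetical base-graphics plot of both R2 series against predictor count.
plot_r2_sketch <- function(n_predictors, validation_r2, testing_r2,
                           main = "Validation and Testing R2 vs Number of Predictors",
                           col_validation = "blue", col_testing = "red",
                           pch = 16, lwd = 2, cex = 1.2,
                           legend_pos = "bottomleft", ...) {
  # Draw the validation series first, with a y-range that covers both series.
  plot(n_predictors, validation_r2, type = "b",
       col = col_validation, pch = pch, lwd = lwd, cex = cex,
       ylim = range(c(validation_r2, testing_r2)),
       xlab = "Number of predictors", ylab = "R2", main = main, ...)
  # Overlay the testing series on the same axes.
  lines(n_predictors, testing_r2, type = "b",
        col = col_testing, pch = pch, lwd = lwd, cex = cex)
  legend(legend_pos, legend = c("Validation", "Testing"),
         col = c(col_validation, col_testing), pch = pch, lwd = lwd)
  # Return the plotted data invisibly, as the Value section describes for Plot_RFE.
  invisible(list(n_predictors = n_predictors,
                 validation_r2 = validation_r2,
                 testing_r2 = testing_r2))
}

out <- plot_r2_sketch(n_predictors, validation_r2, testing_r2)
```

Fixing ylim to the combined range keeps the second series from being clipped when the two curves differ in level.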

Examples

# This example is computationally intensive and may take a long time to run.
# It is recommended to run this example on a machine with a high-performance CPU.

## Load SCE package and the supporting packages
library(SCE)
library(parallel)

data(Streamflow_training_22var)
data(Streamflow_testing_22var)

# Define predictors and predictants
Predictors <- c(
  "Precipitation", "Radiation", "Tmax", "Tmin", "VP",
  "Precipitation_2Mon", "Radiation_2Mon", "Tmax_2Mon", "Tmin_2Mon", "VP_2Mon",
  "PNA", "Nino3.4", "IPO", "PDO",
  "PNA_lag1", "Nino3.4_lag1", "IPO_lag1", "PDO_lag1",
  "PNA_lag2", "Nino3.4_lag2", "IPO_lag2", "PDO_lag2"
)
Predictants <- c("Flow")

# Perform RFE
set.seed(123)
result <- RFE_SCE(
  Training_data = Streamflow_training_22var,
  Testing_data = Streamflow_testing_22var,
  Predictors = Predictors,
  Predictant = Predictants,
  Nmin = 5,
  Ntree = 48,
  alpha = 0.05,
  resolution = 1000,
  step = 3,  # Number of predictors to remove at each iteration
  verbose = TRUE,
  parallel = TRUE
)

## Access results
summary <- result$summary
performances <- result$performances
importance_scores <- result$importance_scores

## Plot RFE results
Plot_RFE(result)

## Customized plot
Plot_RFE(result,
         main = "My RFE Results",
         col_validation = "darkblue",
         col_testing = "darkred",
         lwd = 3,
         cex = 1.5)

## Note: The RFE_SCE function internally uses S3 methods for SCE models
## including importance() and evaluate() for model analysis
