
SCE (version 1.0.0)

SCA_importance: Calculate Variable Importance for a Single SCA Tree

Description

This function calculates how much each independent variable (predictor) contributes to explaining the variability of the dependent variables (predictants) in a single Stepwise Cluster Analysis (SCA) tree, using the Wilks' Lambda statistic. A predictor's importance is based on its contribution to the reduction in Wilks' Lambda at each split in the tree.

For calculating importance scores across all trees in an SCE ensemble, use Wilks_importance instead.

Usage

SCA_importance(model)

Value

A data.frame containing:

  • Predictor: Names of the predictors

  • Relative_Importance: Normalized importance scores (sum to 1)

Arguments

model

A single SCA tree object containing:

  • Tree: Tree structure with Wilks' Lambda values and split information

  • XName: Names of predictors used

Author

Kailong Li <lkl98509509@gmail.com>

Details

The importance calculation process involves the following steps:

  1. Extract Wilks' Lambda values and split information from the tree

  2. Replace negative Wilks' Lambda values with zero

  3. Calculate raw importance for each split:

    • Importance = (left_samples + right_samples) / total_samples * (1 - Wilks' Lambda)

  4. Aggregate importance scores by predictor

  5. Normalize importance scores to sum to 1
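The steps above can be sketched in base R on a toy table of splits. This is only an illustration under stated assumptions: the `splits` data frame, its column names, and the numbers in it are hypothetical and do not reflect the internal layout of an SCA tree object. The code follows steps 2 to 5 literally as documented.

```r
# Hypothetical split table: one row per split of a toy tree.
# Column names and values are illustrative, not the SCE internal format.
splits <- data.frame(
  predictor     = c("Prcp", "Tmax", "Prcp", "VP"),
  wilks_lambda  = c(0.40, 0.75, -0.05, 0.90),
  left_samples  = c(60, 25, 20, 10),
  right_samples = c(40, 35, 15, 15)
)
total_samples <- 100

# Step 2: replace negative Wilks' Lambda values with zero
splits$wilks_lambda <- pmax(splits$wilks_lambda, 0)

# Step 3: raw importance of each split
splits$raw <- (splits$left_samples + splits$right_samples) / total_samples *
  (1 - splits$wilks_lambda)

# Step 4: aggregate raw importance by predictor
agg <- aggregate(raw ~ predictor, data = splits, FUN = sum)

# Step 5: normalize so the scores sum to 1
importance <- data.frame(
  Predictor = as.character(agg$predictor),
  Relative_Importance = agg$raw / sum(agg$raw)
)
print(importance)
```

In this toy tree, Prcp is used in two splits, so its raw contributions are summed before normalization.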

The function handles:

  • Different sets of predictors in the tree

  • Missing or invalid splits

  • Both single and multiple predictants

  • Trees with no splits (returns NULL)
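Because a tree with no splits yields NULL rather than a data.frame, downstream code may want to check for that case before sorting or plotting. A minimal sketch follows, assuming only the documented return shape (columns Predictor and Relative_Importance); the helper `rank_importance` and the toy input are hypothetical and not part of the SCE package.

```r
# Hypothetical helper (not part of SCE): sort importance scores in
# decreasing order, passing a NULL result through unchanged.
rank_importance <- function(imp) {
  if (is.null(imp)) {
    message("Tree has no splits; no importance scores available.")
    return(invisible(NULL))
  }
  imp[order(-imp$Relative_Importance), , drop = FALSE]
}

# Toy input shaped like the documented return value
toy <- data.frame(
  Predictor = c("Tmax", "Prcp", "VP"),
  Relative_Importance = c(0.2, 0.7, 0.1)
)

ranked <- rank_importance(toy)   # sorted data.frame
empty  <- rank_importance(NULL)  # NULL, with a message
print(ranked)
```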

Relationship with Wilks_importance:

  • SCA_importance calculates importance scores for a single SCA tree

  • Wilks_importance calculates importance scores across all trees in an SCE ensemble

  • Both functions use the same underlying importance calculation method

  • Wilks_importance with OOB_weight=FALSE is equivalent to taking the median of SCA_importance scores across all trees
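The median relationship in the last point can be illustrated with toy per-tree scores. This is a sketch, not the package's implementation: the three data frames below are made-up stand-ins for SCA_importance results, every tree is assumed to use the same predictors, and whether Wilks_importance rescales the medians afterwards is not stated here.

```r
# Made-up per-tree results, shaped like SCA_importance output
per_tree <- list(
  data.frame(Predictor = c("Prcp", "Tmax", "VP"),
             Relative_Importance = c(0.70, 0.20, 0.10)),
  data.frame(Predictor = c("Prcp", "Tmax", "VP"),
             Relative_Importance = c(0.60, 0.30, 0.10)),
  data.frame(Predictor = c("Prcp", "Tmax", "VP"),
             Relative_Importance = c(0.50, 0.10, 0.40))
)

# Stack the per-tree tables and take the median score per predictor
stacked <- do.call(rbind, per_tree)
median_importance <- aggregate(
  Relative_Importance ~ Predictor,
  data = stacked,
  FUN = median
)
print(median_importance)
```

Note that per-predictor medians need not sum to 1, even though each tree's scores do.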

References

Li, Kailong, Guohe Huang, and Brian Baetz. "Development of a Wilks feature importance method with improved variable rankings for supporting hydrological inference and modelling." Hydrology and Earth System Sciences 25.9 (2021): 4947-4966.

Examples

## Load the SCE package
library(SCE)

## Load the training and testing data files
data("Streamflow_training_10var")
data("Streamflow_testing_10var")

## Define the independent (x) and dependent (y) variables
Predictors <- c("Prcp", "SRad", "Tmax", "Tmin", "VP", "smlt", "swvl1", "swvl2", "swvl3", "swvl4")
Predictants <- c("Flow")

## Build a single SCA tree
SCA_tree <- SCA(
	Training_data = Streamflow_training_10var,
	X = Predictors,
	Y = Predictants,
	Nmin = 5,
	alpha = 0.05,
	resolution = 1000
)

## Calculate variable importance for the single tree
Tree_importance <- SCA_importance(SCA_tree)

## Print the results
print("Single tree importance scores:")
print(Tree_importance)

## Visualize the importance scores
Importance_ranking_sorted <- Tree_importance[order(-Tree_importance$Relative_Importance), ]
barplot(
  Importance_ranking_sorted$Relative_Importance,
  names.arg = Importance_ranking_sorted$Predictor,
  las = 2, # vertical labels
  col = "skyblue",
  main = "Variable Importance (single SCA tree)",
  ylab = "Importance",
  xlab = "Predictor"
)
