SCE: Build a Stepwise Clustered Ensemble (SCE) Model

Description

This function builds a Stepwise Clustered Ensemble (SCE) model for multivariate data analysis. An SCE model is an ensemble of Stepwise Cluster Analysis (SCA) trees, where each tree is built from a bootstrap sample of the training data and a randomly selected subset of the predictors. The function validates its inputs before building the model: it checks data types, missing values, and sample size requirements.

Usage

SCE(Training_data, X, Y, mfeature, Nmin, Ntree, 
    alpha = 0.05, resolution = 1000, verbose = FALSE, parallel = TRUE)

Value

A list containing the ensemble model with the following components:

Trees: A list of SCA tree models, each containing:
- Tree: The SCA tree structure
- Map: Mapping information
- XName: Names of predictors used
- YName: Names of predictants
- type: Mapping type
- totalNodes: Total number of nodes
- leafNodes: Number of leaf nodes
- cuttingActions: Number of cutting actions
- mergingActions: Number of merging actions
- OOB_error: Out-of-bag (OOB) R-squared of the tree
- OOB_sim: Out-of-bag predictions
- Sample: Bootstrap sample indices
- Tree_Info: Tree-specific information
- Training_data: Training data used for the tree
- weight: Tree weight based on OOB performance

Arguments

Training_data: A data.frame or matrix containing the training data. Must contain all specified predictors and predictants. Must not contain missing values.
X: A character vector specifying the names of independent (predictor) variables (e.g., c("Prcp","SRad","Tmax")). Must be present in Training_data. All variables must be numeric.
Y: A character vector specifying the name(s) of dependent (predictant) variable(s) (e.g., c("Flow") or c("swvl3","swvl4")). Must be present in Training_data. All variables must be numeric.
mfeature: An integer specifying how many features will be randomly selected for each tree. Recommended value is round(0.5 * length(X)).
Nmin: An integer specifying the minimum number of samples a leaf node must contain to be considered for cutting. Must be greater than the number of predictants.
Ntree: An integer specifying how many trees (ensemble members) will be built. Recommended values range from 50 to 500 depending on data complexity.
alpha: Numeric significance level for clustering, between 0 and 1. Default value is 0.05.
resolution: Numeric value specifying the resolution for splitting. Controls the granularity of the search for optimal split points. Default value is 1000.
verbose: A logical value indicating whether to print progress information during model building. Default value is FALSE.
parallel: A logical value indicating whether to use parallel processing for tree construction. When TRUE, uses multiple CPU cores for faster computation. When FALSE, processes trees sequentially. Default value is TRUE.
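
The recommendations for mfeature and Nmin above can be turned into starting values with a few lines of base R. The helper below is a minimal sketch and is not part of the SCE package: the name suggest_sce_params and the floor of 5 for Nmin are assumptions made for illustration (5 is simply the value used in the Examples section).

```r
# Hypothetical helper, not part of the SCE package: derive starting values
# for mfeature and Nmin from the documented recommendations.
suggest_sce_params <- function(X, Y, nmin_floor = 5L) {
  stopifnot(is.character(X), is.character(Y), length(X) > 0, length(Y) > 0)

  # Documented recommendation: about half of the predictors per tree.
  # At least one feature is always kept.
  mfeature <- max(1L, as.integer(round(0.5 * length(X))))

  # Documented constraint: Nmin must be greater than the number of predictants.
  # nmin_floor is an illustrative assumption, not a package default.
  Nmin <- max(as.integer(nmin_floor), length(Y) + 1L)

  list(mfeature = mfeature, Nmin = Nmin)
}

params <- suggest_sce_params(
  X = c("Prcp", "SRad", "Tmax", "Tmin", "VP", "smlt",
        "swvl1", "swvl2", "swvl3", "swvl4"),
  Y = c("Flow")
)
print(params)
```

Note that round() in R rounds halves to the nearest even number, so an odd number of predictors can give a slightly smaller mfeature than expected.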

Author

Kailong Li <lkl98509509@gmail.com>

Details

The SCE model is built using the following steps:

Input Validation:
- Data type and structure checks
- Missing value detection
- Numeric data validation
- Sample size requirements verification
Data Preparation:
- Conversion to appropriate format
- Dimension checks
- Parameter initialization
Tree Construction:
- Generation of bootstrap samples
- Random feature selection for each tree
- Parallel construction of SCA trees
Model Evaluation:
- Calculation of out-of-bag (OOB) errors
- Weighting of trees based on OOB performance
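
The input validation step listed above can be approximated in base R before calling SCE(), which gives earlier and more specific error messages. This is a sketch under stated assumptions, not the package's internal code: the function name validate_sce_inputs, the error messages, and the rule that the data must have at least Nmin rows are all hypothetical.

```r
# Hypothetical pre-flight check, not the package's internal validation.
validate_sce_inputs <- function(Training_data, X, Y, Nmin) {
  # Data type and structure checks
  if (!is.data.frame(Training_data) && !is.matrix(Training_data)) {
    stop("Training_data must be a data.frame or matrix")
  }
  Training_data <- as.data.frame(Training_data)

  missing_cols <- setdiff(c(X, Y), colnames(Training_data))
  if (length(missing_cols) > 0) {
    stop("Variables not found in Training_data: ",
         paste(missing_cols, collapse = ", "))
  }
  used <- Training_data[, c(X, Y), drop = FALSE]

  # Numeric data validation
  if (!all(vapply(used, is.numeric, logical(1)))) {
    stop("All predictors and predictants must be numeric")
  }

  # Missing value detection (checked here on the modelled columns only)
  if (any(is.na(used))) {
    stop("Training_data must not contain missing values")
  }

  # Sample size requirements
  if (Nmin <= length(Y)) {
    stop("Nmin must be greater than the number of predictants")
  }
  if (nrow(used) < Nmin) {
    # Assumption: fewer rows than Nmin leaves nothing to cut.
    stop("Training_data has fewer rows than Nmin")
  }

  invisible(TRUE)
}

toy <- data.frame(Prcp = c(1, 2, 3, 4, 5, 6),
                  Tmax = c(10, 12, 11, 13, 14, 12),
                  Flow = c(0.5, 0.7, 0.9, 1.1, 1.4, 1.2))
validate_sce_inputs(toy, X = c("Prcp", "Tmax"), Y = "Flow", Nmin = 5)
```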

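Compared with a single SCA tree, the ensemble generally gives more accurate and more robust predictions. Because each tree is evaluated on the samples left out of its bootstrap sample, the OOB errors also provide an estimate of generalization performance without requiring a separate validation set.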
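
The tree construction and model evaluation steps can be illustrated with a small, self-contained sketch in base R. It is not the SCE implementation: lm() stands in for an SCA tree, the data are simulated, and the weighting rule (OOB R-squared clipped at zero and normalized to sum to one) is an assumption chosen for illustration, since the exact formula used by the package is not stated here.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2 only; x3 and x4 are noise.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.5)

X <- c("x1", "x2", "x3", "x4")
mfeature <- round(0.5 * length(X))
Ntree <- 25

members <- lapply(seq_len(Ntree), function(i) {
  # Bootstrap sample; rows never drawn form the out-of-bag (OOB) set.
  in_bag <- sample(n, n, replace = TRUE)
  oob <- setdiff(seq_len(n), in_bag)

  # Random feature selection for this ensemble member.
  feats <- sample(X, mfeature)

  # Stand-in learner: a linear model instead of an SCA tree.
  fit <- lm(reformulate(feats, response = "y"), data = toy[in_bag, ])

  # OOB R-squared computed on the rows this member never saw.
  oob_pred <- predict(fit, newdata = toy[oob, ])
  oob_r2 <- 1 - sum((toy$y[oob] - oob_pred)^2) /
    sum((toy$y[oob] - mean(toy$y[oob]))^2)

  list(fit = fit, features = feats, sample = in_bag, oob_r2 = oob_r2)
})

# Assumed weighting rule: clip negative R-squared to zero, then normalize.
oob_r2 <- vapply(members, function(m) m$oob_r2, numeric(1))
weights <- pmax(oob_r2, 0)
weights <- weights / sum(weights)

# Weighted ensemble prediction: one column of predictions per member.
ensemble_predict <- function(newdata) {
  preds <- vapply(members,
                  function(m) predict(m$fit, newdata = newdata),
                  numeric(nrow(newdata)))
  as.vector(preds %*% weights)
}

head(ensemble_predict(toy))
```

Members that happen to draw only the noise variables get an OOB R-squared near or below zero and therefore little or no weight, which is the intuition behind weighting trees by OOB performance.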

References

Li, K., Huang, G., and Baetz, B. (2021). Development of a Wilks feature importance method with improved variable rankings for supporting hydrological inference and modelling. Hydrology and Earth System Sciences, 25(9): 4947-4966.

Wang, X., Huang, G., Lin, Q., Nie, X., Cheng, G., Fan, Y., Li, Z., Yao, Y., and Suo, M. (2013). A stepwise cluster analysis approach for downscaled climate projection - A Canadian case study. Environmental Modelling & Software, 49: 141-151.

Huang, G. (1992). A stepwise cluster analysis method for predicting air quality in an urban environment. Atmospheric Environment (Part B. Urban Atmosphere), 26(3): 349-357.

Liu, Y. Y. and Wang, Y. L. (1979). Application of stepwise cluster analysis in medical research. Scientia Sinica, 22(9): 1082-1094.

Examples

# \donttest{
	## Load required packages
	library(SCE)
	library(parallel)

	## Load example datasets
	data("Streamflow_training_10var")
	data("Streamflow_testing_10var")

	## Define predictors and predictants
	Predictors <- c("Prcp","SRad","Tmax","Tmin","VP","smlt","swvl1","swvl2","swvl3","swvl4")
	Predictants <- c("Flow")

	## Build the SCE model
	Model <- SCE(
		Training_data = Streamflow_training_10var,
		X = Predictors,
		Y = Predictants,
		mfeature = round(0.5 * length(Predictors)),
		Nmin = 5,
		Ntree = 48,
		alpha = 0.05,
		resolution = 1000,
		parallel = FALSE
	)

	## Generate predictions for test data
	predictions <- SCE_Prediction(
		X_sample = Streamflow_testing_10var,
		model = Model
	)

	## Conduct comprehensive model evaluation
	Results <- Model_simulation(
		Testing_data = Streamflow_testing_10var,
		model = Model
	)

	## Access different prediction components
	training_predictions <- Results$Training
	validation_predictions <- Results$Validation
	testing_predictions <- Results$Testing

	## Calculate variable importance with OOB weighting (default)
	Importance_weighted <- Wilks_importance(Model)

	## Calculate variable importance without OOB weighting
	Importance_unweighted <- Wilks_importance(Model, OOB_weight = FALSE)

	## Visualize the importance scores
	Importance_ranking_sorted <- Importance_weighted[
		order(-Importance_weighted$Relative_Importance), 
	]
	barplot(
		Importance_ranking_sorted$Relative_Importance,
		names.arg = Importance_ranking_sorted$Predictor,
		las = 2,
		col = "skyblue",
		main = "Variable Importance (SCE)",
		ylab = "Importance",
		xlab = "Predictor"
	)
# }
