build_model: Create ML Model

Description

The function build_model() constructs an ML model and attaches it to an existing analysis object, which contains the preprocessed dataset generated in the previous step by the preprocessing() function. It supports several popular algorithms: Neural Network, Random Forest, XGBOOST, and SVM (James et al., 2021). Based on the specified model type and optional hyperparameters, it initializes the corresponding hyperparameter class, updates the analysis object with these settings, and invokes the appropriate model creation function. For SVM models, it further distinguishes between kernel types (linear, rbf, polynomial) to ensure the correct implementation. Before returning the enriched object, the function also records the model name, the model itself, and the current processing stage in it, which streamlines the subsequent training, evaluation, or prediction steps. Encapsulating both the model and its configuration in a single structured object keeps ML pipelines flexible and reproducible.
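
The dispatch described above can be pictured with a minimal base R sketch. It is not MLwrap's source code: build_model_sketch(), the field names engine, kernel and stage, and the plain list standing in for the analysis object are all hypothetical. Only the model names, engine names, and kernel types are taken from this page.

```r
# Hypothetical sketch of the dispatch logic; NOT MLwrap's actual implementation.
build_model_sketch <- function(analysis_object, model_name, hyperparameters = NULL) {
  # Engine names as listed in the Hyperparameters section of this page.
  engine <- switch(
    model_name,
    "Neural Network" = "brulee",
    "Random Forest"  = "ranger",
    "XGBOOST"        = "xgboost",
    "SVM"            = "kernlab",
    stop("Unknown model_name: ", model_name)
  )

  # SVM models are further split by kernel type; "linear" is the documented default.
  kernel <- NULL
  if (model_name == "SVM") {
    kernel <- if (is.null(hyperparameters$type)) "linear" else hyperparameters$type
    kernel <- match.arg(kernel, c("linear", "rbf", "polynomial"))
  }

  # Record the configuration on the object and return the enriched copy.
  # Existing fields (e.g. the preprocessed data) are left untouched.
  analysis_object$model_name      <- model_name
  analysis_object$engine          <- engine
  analysis_object$kernel          <- kernel
  analysis_object$hyperparameters <- hyperparameters
  analysis_object$stage           <- "build_model"
  analysis_object
}

obj <- build_model_sketch(
  list(data = "preprocessed"),
  model_name = "SVM",
  hyperparameters = list(type = "rbf", cost = 1)
)
str(obj)
```

Returning a modified copy, rather than changing the object in place, is what makes it safe to reassign the result to the same variable name at every step.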

Usage

build_model(analysis_object, model_name, hyperparameters = NULL)

Value

An updated analysis_object containing the machine learning model, the model name, the specified hyperparameters, and the current processing stage. The object retains all information stored by the preprocessing step and adds the results of the model-building process, so that subsequent training, evaluation, or prediction tasks work from a single coherent and reproducible record.

Arguments

analysis_object: analysis_object created by the preprocessing() function.
model_name: Name of the ML model. A string, one of "Neural Network", "Random Forest", "SVM" or "XGBOOST".
hyperparameters: Hyperparameters of the ML model. A named list in which each name is a hyperparameter and each element is its value or range of values. Default: NULL.
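
As an illustration of the two accepted forms, the list below fixes one hyperparameter and gives a range for another. The values are arbitrary, and describe_entry() is a hypothetical helper written only to label the entries; it is not part of MLwrap.

```r
# Named list: each name is a hyperparameter, each element is either a single
# number (fixed value) or a length-2 vector c(min_val, max_val) (range).
rf_hyperparameters <- list(
  trees = 200,      # fixed value
  mtry  = c(2, 4)   # range c(min_val, max_val)
)
# min_n is not set here; this page documents NULL as selecting its default range.

# Hypothetical helper, for illustration only: label each entry by its form.
describe_entry <- function(x) if (length(x) == 1) "fixed" else "range"
entry_forms <- vapply(rf_hyperparameters, describe_entry, character(1))
print(entry_forms)
```

A list like this would then be passed as the hyperparameters argument of build_model().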

Hyperparameters

Neural Network

Parsnip model using brulee engine. Hyperparameters:

hidden_units: Number of Hidden Neurons. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(5, 20).
activation: Activation Function. A vector with any of ("relu", "sigmoid", "tanh") or NULL for default values c("relu", "sigmoid", "tanh").
learn_rate: Learning Rate. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-3, -1) in log10 scale.
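
Ranges marked "in log10 scale" are exponents of 10 rather than raw values. The following base R sketch converts the default learn_rate range to the raw learning rates it covers. Whether a single fixed value is read on the raw or the log10 scale is not stated on this page, so the sketch covers ranges only.

```r
# Default learn_rate range from this page, expressed as log10 exponents.
log10_range <- c(-3, -1)

# Raw learning rates covered by that range: 0.001 up to 0.1.
raw_range <- 10^log10_range
print(raw_range)

# The reverse direction: a raw range of 0.0001 to 0.01 corresponds to c(-4, -2).
back_to_log10 <- log10(c(1e-4, 1e-2))
print(back_to_log10)
```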

Random Forest

Parsnip model using ranger engine. Hyperparameters:

trees: Number of Trees. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(100, 300).
mtry: Number of variables randomly selected as candidates at each split. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(3, 8).
min_n: Minimum Number of samples to split at each node. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(2, 25).

XGBOOST

Parsnip model using xgboost engine. Hyperparameters:

trees: Number of Trees. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(100, 300).
mtry: Number of variables randomly selected as candidates at each split. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(3, 8).
min_n: Minimum Number of samples to split at each node. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(5, 25).
tree_depth: Maximum tree depth. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(3, 10).
learn_rate: Learning Rate. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-4, -1) in log10 scale.
loss_reduction: Minimum loss reduction required to make a further partition on a leaf node. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-5, 1.5) in log10 scale.

SVM

Parsnip model using kernlab engine. Hyperparameters:

cost: Penalty parameter that regulates model complexity and misclassification tolerance. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-3, 3) in log10 scale.
margin: Distance between the separating hyperplane and the nearest data points. A single value, a vector with range values c(min_val, max_val) or NULL for default range c(0, 0.2).
type: Kernel to be used. A single value from ("linear", "rbf", "polynomial"). Default: "linear".
rbf_sigma: Sigma parameter of the radial basis function kernel (rbf kernel only). A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-5, 0) in log10 scale.
degree: Polynomial Degree (polynomial kernel only). A single value, a vector with range values c(min_val, max_val) or NULL for default range c(1, 3).
scale_factor: Scaling coefficient applied to inputs (polynomial kernel only). A single value, a vector with range values c(min_val, max_val) or NULL for default range c(-5, -1) in log10 scale.
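
The kernel-specific hyperparameters above can be summarized with a small lookup. svm_hyperparameter_names() is a hypothetical helper written for this page, not an MLwrap function. It encodes only which names this section lists for each kernel, and it can be used to check a hyperparameter list before passing it to build_model().

```r
# Hypothetical helper: hyperparameter names documented for each SVM kernel.
svm_hyperparameter_names <- function(type = "linear") {
  type <- match.arg(type, c("linear", "rbf", "polynomial"))
  shared <- c("cost", "margin")  # listed without a kernel restriction
  extra <- switch(
    type,
    linear     = character(0),
    rbf        = "rbf_sigma",
    polynomial = c("degree", "scale_factor")
  )
  c(shared, extra)
}

# A polynomial-kernel configuration using only the documented names.
poly_hyperparameters <- list(
  type         = "polynomial",
  cost         = c(-2, 2),   # range, log10 scale
  degree       = 2,          # fixed value
  scale_factor = c(-3, -1)   # range, log10 scale
)

# TRUE when every entry other than "type" is listed for the chosen kernel.
is_consistent <- all(
  setdiff(names(poly_hyperparameters), "type") %in%
    svm_hyperparameter_names(poly_hyperparameters$type)
)
print(is_consistent)
```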

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1

Examples

# Example 1: Random Forest for regression task

library(MLwrap)

data(sim_data) # sim_data is a simulated dataset with psychological variables

wrap_object <- preprocessing(
  df = sim_data,
  formula = psych_well ~ depression + emot_intel + resilience + life_sat,
  task = "regression"
)

wrap_object <- build_model(
  analysis_object = wrap_object,
  model_name = "Random Forest",
  hyperparameters = list(
    mtry = 3,
    trees = 100
  )
)
# It is safe to reuse the same object name (e.g., wrap_object) at each step,
# as all previous results and information are retained within the updated analysis object.

# Example 2: SVM for classification task

data(sim_data) # sim_data is a simulated dataset with psychological variables

wrap_object <- preprocessing(
  df = sim_data,
  formula = psych_well_bin ~ depression + emot_intel + resilience + life_sat,
  task = "classification"
)

wrap_object <- build_model(
  analysis_object = wrap_object,
  model_name = "SVM",
  hyperparameters = list(
    type = "rbf",
    cost = 1,
    margin = 0.1,
    rbf_sigma = 0.05
  )
)
