fastml: Fast Machine Learning Function

Description

Trains and evaluates multiple classification or regression models, automatically detecting the task from the type of the target variable.

Usage

fastml(
  data,
  label,
  algorithms = "all",
  test_size = 0.2,
  resampling_method = "cv",
  folds = ifelse(grepl("cv", resampling_method), 10, 25),
  repeats = ifelse(resampling_method == "repeatedcv", 1, NA),
  event_class = "first",
  exclude = NULL,
  recipe = NULL,
  tune_params = NULL,
  metric = NULL,
  algorithm_engines = NULL,
  n_cores = 1,
  stratify = TRUE,
  impute_method = "error",
  impute_custom_function = NULL,
  encode_categoricals = TRUE,
  scaling_methods = c("center", "scale"),
  summaryFunction = NULL,
  use_default_tuning = FALSE,
  tuning_strategy = "grid",
  tuning_iterations = 10,
  early_stopping = FALSE,
  adaptive = FALSE,
  learning_curve = FALSE,
  seed = 123
)

Value

An object of class fastml_model containing the best model, performance metrics, and other information.

Arguments

data

A data frame containing the features and target variable.

label

A string specifying the name of the target variable.

algorithms

A vector of algorithm names to use. Default is "all" to run all supported algorithms.

test_size

A numeric value between 0 and 1 indicating the proportion of the data to use for testing. Default is 0.2.

resampling_method

A string specifying the resampling method for model evaluation. Default is "cv" (cross-validation). Other options include "none", "boot", "repeatedcv", etc.

folds

An integer specifying the number of folds for cross-validation. Default is 10 for methods containing "cv" and 25 otherwise.

repeats

An integer specifying the number of times to repeat cross-validation (only applicable for methods like "repeatedcv"). Default is 1 when resampling_method is "repeatedcv" and NA otherwise.
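The defaults for folds and repeats depend on resampling_method. The sketch below copies the two default expressions from the Usage section into small helpers to show how they resolve. The helper names default_folds and default_repeats are hypothetical and are not part of fastml.

```r
# Hypothetical helpers wrapping the default expressions shown under Usage.
default_folds <- function(resampling_method) {
  ifelse(grepl("cv", resampling_method), 10, 25)
}
default_repeats <- function(resampling_method) {
  ifelse(resampling_method == "repeatedcv", 1, NA)
}

methods <- c("cv", "repeatedcv", "boot", "none")

# Any method whose name contains "cv" gets 10; everything else gets 25.
folds_by_method <- sapply(methods, default_folds)

# Only "repeatedcv" gets a repeat count; the others resolve to NA.
repeats_by_method <- sapply(methods, default_repeats)

print(folds_by_method)
print(repeats_by_method)
```

Passing folds or repeats explicitly overrides these expressions, as with any R default argument.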

event_class

A single string. Either "first" or "second" to specify which level of truth to consider as the "event". Default is "first".

exclude

A character vector specifying the names of the columns to exclude from training. Default is NULL.

recipe

A user-defined recipe object for custom preprocessing. If provided, the internal recipe steps (imputation, encoding, scaling) are skipped. Default is NULL.

tune_params

A list specifying hyperparameter tuning ranges. Default is NULL.

metric

The performance metric to optimize during training. Default is NULL.

algorithm_engines

A named list specifying the engine (or engines) to use for each algorithm, keyed by algorithm name. Default is NULL.

n_cores

An integer specifying the number of CPU cores to use for parallel processing. Default is 1.

stratify

Logical indicating whether to use stratified sampling when splitting the data. Default is TRUE for classification and FALSE for regression.

impute_method

Method for handling missing values. Options include:

"medianImpute"

Impute missing values using median imputation (recipe-based).

"knnImpute"

Impute missing values using k-nearest neighbors (recipe-based).

"bagImpute"

Impute missing values using bagging (recipe-based).

"remove"

Remove rows with missing values from the data (recipe-based).

"mice"

Impute missing values using MICE (Multiple Imputation by Chained Equations).

"missForest"

Impute missing values using the missForest algorithm.

"custom"

Use a user-provided imputation function (see impute_custom_function).

"error"

Do not perform imputation; if missing values are detected, stop execution with an error.

NULL

Equivalent to "error". No imputation is performed, and the function will stop if missing values are present.

Default is "error".
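Because the default stops with an error when missing values are present, it can help to check the data before choosing an imputation method. The following is a minimal base R sketch that uses the built-in airquality dataset purely as an illustration; it does not call fastml.

```r
# Count missing values per column in a data frame that is known to contain NAs.
data(airquality)
na_counts <- colSums(is.na(airquality))

# Keep only the columns that actually contain missing values.
cols_with_na <- na_counts[na_counts > 0]
print(cols_with_na)
```

If this check reports any columns, either clean the data beforehand or pass one of the imputation options listed above instead of "error".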

impute_custom_function

A function that takes a data.frame as input and returns an imputed data.frame. Used only if impute_method = "custom".
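As an illustration of the contract described here (a data.frame in, an imputed data.frame out), the sketch below fills numeric columns with the median and factor or character columns with the most frequent value. The function name median_mode_impute and the toy data are hypothetical. The fastml call is shown only as a comment and uses the arguments documented on this page.

```r
# Hypothetical custom imputer: data.frame in, imputed data.frame out.
median_mode_impute <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    # Skip columns with nothing to fill, or with no observed values to fill from.
    if (!anyNA(x) || all(is.na(x))) next
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      # table() drops NA by default, so this is the most frequent observed value.
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[col]] <- x
  }
  df
}

# Toy data with one missing value in each column.
toy <- data.frame(
  age = c(21, NA, 35, 40),
  group = factor(c("a", "b", NA, "b"))
)
imputed <- median_mode_impute(toy)
print(imputed)

# Passing it to fastml would then look like this (not run here):
# fastml(data = toy, label = "group",
#        impute_method = "custom",
#        impute_custom_function = median_mode_impute)
```

The sketch is intended for numeric, factor, and character columns only; other column types would need their own handling.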

encode_categoricals

Logical indicating whether to encode categorical variables. Default is TRUE.

scaling_methods

Vector of scaling methods to apply. Default is c("center", "scale").

summaryFunction

A custom summary function for model evaluation. Default is NULL.

use_default_tuning

Logical indicating whether to use default tuning grids when tune_params is NULL. Default is FALSE.

tuning_strategy

A string specifying the tuning strategy. Options might include "grid", "bayes", or "none". Default is "grid".

tuning_iterations

Number of tuning iterations (applicable for Bayesian or other iterative search methods). Default is 10.

early_stopping

Logical indicating whether to use early stopping in Bayesian tuning methods (if supported). Default is FALSE.

adaptive

Logical indicating whether to use adaptive/racing methods for tuning. Default is FALSE.

learning_curve

Logical. If TRUE, generate learning curves (performance vs. training size). Default is FALSE.

seed

An integer value specifying the random seed for reproducibility. Default is 123.

Details

Trains and evaluates multiple classification or regression models. The function automatically detects the task based on the target variable type and can perform advanced hyperparameter tuning using various tuning strategies.

Examples

library(fastml)

# Example 1: Binary classification on the iris dataset (excluding 'setosa')
data(iris)
iris <- iris[iris$Species != "setosa", ]  # Keep two classes only
iris$Species <- factor(iris$Species)      # Drop the unused 'setosa' level

# Train models, trying several engines for the random forest
model <- fastml(
  data = iris,
  label = "Species",
  algorithms = c("rand_forest", "xgboost", "svm_rbf"),
  algorithm_engines = list(
    rand_forest = c("ranger", "aorsf", "partykit", "randomForest")
  )
)

# View model summary
summary(model)
