check_all: Check all params that don't return a value

Description

Check all parameters that don't return a value.

Usage

check_all(
  dataset,
  method,
  permute,
  kfold,
  training_frac,
  perf_metric_function,
  perf_metric_name,
  groups,
  group_partitions,
  corr_thresh,
  seed,
  hyperparameters
)

Arguments

dataset

Data frame with an outcome variable and other columns as features.

method

ML method. Options: c("glmnet", "rf", "rpart2", "svmRadial", "xgbTree").

glmnet: linear, logistic, or multiclass regression
rf: random forest
rpart2: decision tree
svmRadial: support vector machine
xgbTree: xgboost

kfold

Number of folds for k-fold cross-validation (default: 5).

training_frac

Fraction of data for training set (default: 0.8). Rows from the dataset will be randomly selected for the training set, and all remaining rows will be used in the testing set. Alternatively, if you provide a vector of integers, these will be used as the row indices for the training set. All remaining rows will be used in the testing set.
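The two accepted forms of training_frac can be sketched with base R as follows. This is only an illustration of the behavior described above, not the package's own splitting code; the toy data frame and the variable names train_idx, train_data, and test_data are assumptions made for the example.

```r
# Toy data frame; the column names are made up for illustration.
dataset <- data.frame(outcome = rep(c("a", "b"), 5), feat1 = 1:10)

# Form 1: training_frac is a fraction, so training rows are sampled at random.
set.seed(2019)
training_frac <- 0.8
n_train <- round(nrow(dataset) * training_frac)
train_idx <- sample(nrow(dataset), n_train)

# Form 2: training_frac is a vector of row indices, used exactly as given.
# train_idx <- c(1, 2, 3, 5, 8)

# All rows not selected for training go to the testing set.
train_data <- dataset[train_idx, ]
test_data <- dataset[-train_idx, ]
```

With either form, every row ends up in exactly one of the two sets.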

perf_metric_function

Function to calculate the performance metric to be used for cross-validation and test performance. Some functions are provided by caret (see caret::defaultSummary()). Defaults: binary classification = twoClassSummary, multi-class classification = multiClassSummary, regression = defaultSummary.

perf_metric_name

The column name from the output of the function provided to perf_metric_function that is to be used as the performance metric. Defaults: binary classification = "ROC", multi-class classification = "logLoss", regression = "RMSE".
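As a rough sketch of how perf_metric_function and perf_metric_name relate: the function returns a named vector of metrics, and the name picks one element from it. The function below follows caret's summary-function convention (arguments data, lev, and model, with observed and predicted values in data$obs and data$pred); rmse_summary and the toy predictions are hypothetical names introduced for this example.

```r
# Hypothetical custom metric function using caret's summary-function signature.
rmse_summary <- function(data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)))
}

# Toy observed and predicted values, made up for illustration.
preds <- data.frame(obs = c(1, 2, 3), pred = c(1, 2, 5))
metrics <- rmse_summary(preds)

# perf_metric_name must match one of the names in the returned vector.
perf_metric_name <- "RMSE"
stopifnot(perf_metric_name %in% names(metrics))
metric_value <- metrics[[perf_metric_name]]
```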

groups

Vector of groups to keep together when splitting the data into train and test sets. If the number of groups in the training set is larger than kfold, the groups will also be kept together for cross-validation. Its length must match the number of rows in the dataset (default: NULL).
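One way to picture "keeping groups together" is to sample whole groups rather than individual rows. The base R sketch below is an assumption about the general idea, not the package's actual algorithm; the group labels and variable names are invented for the example.

```r
# One group label per row of the dataset (10 rows, 5 groups).
groups <- c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E")

set.seed(1)
unique_groups <- unique(groups)

# Sample whole groups for training, so no group is split across the sets.
n_train_groups <- round(length(unique_groups) * 0.8)
train_groups <- sample(unique_groups, n_train_groups)

train_idx <- which(groups %in% train_groups)
test_idx <- which(!groups %in% train_groups)
```

Because selection happens at the group level, the achieved training fraction can only approximate the requested one when group sizes are uneven.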

group_partitions

Specify how to assign groups to the training and testing partitions (default: NULL). If groups specifies that some samples belong to group "A" and some belong to group "B", then setting group_partitions = list(train = c("A", "B"), test = c("B")) will result in all samples from group "A" being placed in the training set, some samples from "B" also in the training set, and the remaining samples from "B" in the testing set. The partition sizes will be as close to training_frac as possible. If the number of groups in the training set is larger than kfold, the groups will also be kept together for cross-validation.
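The example above can be written out in base R to show which groups are restricted to one partition and which are shared. This is a sketch of how the list could be inspected, with made-up group labels; it does not reproduce the package's partitioning logic.

```r
# One group label per row of the dataset.
groups <- c("A", "A", "A", "B", "B", "B", "B")

group_partitions <- list(train = c("A", "B"), test = c("B"))

# Groups listed only under train always go to the training set.
train_only <- setdiff(group_partitions$train, group_partitions$test)

# Groups listed under both may be divided between the two sets.
shared <- intersect(group_partitions$train, group_partitions$test)

# Sanity check: every group in the data is assigned to some partition.
all_assigned <- all(unique(groups) %in% unlist(group_partitions))
```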

corr_thresh

For feature importance, group together features whose correlation is greater than or equal to corr_thresh (range 0 to 1; default: 1).
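To illustrate what a correlation threshold selects, the sketch below finds feature pairs whose correlation meets the threshold. It assumes Pearson correlation from stats::cor() and uses a threshold of 0.9 on toy data; both choices are assumptions for the example and may differ from what the package does internally.

```r
# Toy features: x2 is a multiple of x1, x3 is unrelated.
features <- data.frame(
  x1 = 1:10,
  x2 = 2 * (1:10),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
corr_thresh <- 0.9

cor_mat <- cor(features)
# Blank out the diagonal and upper triangle so each pair is counted once.
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Row and column positions of pairs at or above the threshold.
hits <- which(cor_mat >= corr_thresh, arr.ind = TRUE)
correlated_pairs <- data.frame(
  feature1 = rownames(cor_mat)[hits[, "row"]],
  feature2 = colnames(cor_mat)[hits[, "col"]]
)
```

Here only x1 and x2 would be grouped; x3 stays on its own.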

seed

Random seed (default: NA). Your results will only be reproducible if you set a seed.
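The effect of setting a seed can be seen with base R alone: repeating a random operation after resetting the same seed gives the same result. The seed value 2019 is arbitrary.

```r
# Same seed, same random draw.
set.seed(2019)
first_draw <- sample(10)

set.seed(2019)
second_draw <- sample(10)

same_result <- identical(first_draw, second_draw)
```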

hyperparameters

Data frame of hyperparameters (default: NULL; sensible defaults will be chosen automatically).

Author

Kelly Sovacool, sovacool@umich.edu