Check all params that don't return a value
check_all(
  dataset,
  method,
  permute,
  kfold,
  training_frac,
  perf_metric_function,
  perf_metric_name,
  groups,
  group_partitions,
  corr_thresh,
  seed,
  hyperparameters
)
dataset: Data frame with an outcome variable and other columns as features.
method: ML method. Options: c("glmnet", "rf", "rpart2", "svmRadial", "xgbTree").
glmnet: linear, logistic, or multiclass regression
rf: random forest
rpart2: decision tree
svmRadial: support vector machine
xgbTree: xgboost
kfold: Number of folds for k-fold cross-validation (default: 5).
training_frac: Fraction of data for training set (default: 0.8). Rows from the dataset will be randomly selected for the training set, and all remaining rows will be used in the testing set. Alternatively, if you provide a vector of integers, these will be used as the row indices for the training set. All remaining rows will be used in the testing set.
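The two ways of specifying training_frac described above can be sketched in base R. This is only an illustration of the described behavior, not the package's implementation; split_rows() is a hypothetical helper introduced here for the example.

```r
# Hypothetical helper (not part of the package): returns train/test row indices.
split_rows <- function(n_rows, training_frac = 0.8) {
  if (length(training_frac) == 1 && training_frac > 0 && training_frac < 1) {
    # A single fraction: randomly sample that share of the rows for training.
    train_idx <- sort(sample(n_rows, size = floor(n_rows * training_frac)))
  } else {
    # A vector of integers: use it directly as the training row indices.
    train_idx <- sort(as.integer(training_frac))
  }
  # All remaining rows form the testing set.
  list(train = train_idx, test = setdiff(seq_len(n_rows), train_idx))
}

set.seed(2019)
by_fraction <- split_rows(10, training_frac = 0.8)
by_indices <- split_rows(10, training_frac = c(1, 2, 3, 5, 8))
```

In both cases every row ends up in exactly one of the two partitions.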
perf_metric_function: Function to calculate the performance metric to be used for cross-validation and test performance. Some functions are provided by caret (see caret::defaultSummary()). Defaults: binary classification = twoClassSummary, multi-class classification = multiClassSummary, regression = defaultSummary.
perf_metric_name: The column name from the output of the function provided to perf_metric_function that is to be used as the performance metric. Defaults: binary classification = "ROC", multi-class classification = "logLoss", regression = "RMSE".
groups: Vector of groups to keep together when splitting the data into train and test sets. Its length must match the number of rows in the dataset (default: NULL). If the number of groups in the training set is larger than kfold, the groups will also be kept together for cross-validation.
group_partitions: Specify how to assign groups to the training and testing partitions (default: NULL). If groups specifies that some samples belong to group "A" and some belong to group "B", then setting group_partitions = list(train = c("A", "B"), test = c("B")) will result in all samples from group "A" being placed in the training set, some samples from "B" also in the training set, and the remaining samples from "B" in the testing set. The partition sizes will be as close to training_frac as possible. If the number of groups in the training set is larger than kfold, the groups will also be kept together for cross-validation.
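To make the "A"/"B" example concrete, here is a rough base R sketch of how such a partition could be formed. It is an assumption about the general idea, not the package's actual algorithm; the toy groups vector and all intermediate variable names are made up for illustration.

```r
# Toy data: four samples from group "A" and six from group "B".
groups <- c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
group_partitions <- list(train = c("A", "B"), test = c("B"))
training_frac <- 0.8

# Groups listed only under `train` go entirely to the training set.
train_only <- setdiff(group_partitions$train, group_partitions$test)
# Groups listed under both partitions are shared between them.
shared <- intersect(group_partitions$train, group_partitions$test)

# Aim for a training set as close to training_frac as possible.
n_train_target <- round(length(groups) * training_frac)

set.seed(2019)
train_idx <- which(groups %in% train_only)
shared_idx <- which(groups %in% shared)
# Top up the training set with randomly chosen samples from shared groups.
n_extra <- min(max(0, n_train_target - length(train_idx)), length(shared_idx))
extra <- shared_idx[sample.int(length(shared_idx), n_extra)]
train_idx <- sort(c(train_idx, extra))
test_idx <- setdiff(seq_along(groups), train_idx)
```

Here all "A" samples land in the training set, and the "B" samples are divided so that the training set holds 8 of the 10 rows.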
corr_thresh: For feature importance, group together features whose correlation is greater than or equal to corr_thresh (range 0 to 1; default: 1).
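As a rough illustration of grouping features by a correlation threshold, the sketch below uses base R only. It assumes absolute Pearson correlation and single-linkage clustering, both of which are choices made for this example and not necessarily what the package does; the features data frame is invented.

```r
# Toy feature table: f2 is nearly collinear with f1, f3 is unrelated.
features <- data.frame(
  f1 = c(1, 2, 3, 4, 5, 6),
  f2 = c(2, 4, 6, 8, 10, 13),
  f3 = c(5, 1, 4, 2, 6, 3)
)
corr_thresh <- 0.9

# Assumption: use the absolute Pearson correlation between features.
cor_mat <- abs(cor(features))

# Single-linkage clustering on 1 - |cor|, cut at 1 - corr_thresh, places
# features in the same group when a chain of pairwise correlations at or
# above the threshold connects them.
tree <- hclust(as.dist(1 - cor_mat), method = "single")
feature_groups <- cutree(tree, h = 1 - corr_thresh)
```

With this toy data, f1 and f2 share a group label while f3 gets its own. Lowering corr_thresh merges more features into the same group.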
seed: Random seed (default: NA). Your results will only be reproducible if you set a seed.
hyperparameters: Data frame of hyperparameters (default: NULL; sensible defaults will be chosen automatically).
Kelly Sovacool, sovacool@umich.edu