
cvms (version 0.2.0)

validate: Validate regression models on a test set

Description

Train Gaussian or binomial models on a full training set and validate them by predicting the test/validation set. Returns results in a tibble for easy reporting, along with the trained models.

Usage

validate(train_data, models, test_data = NULL,
  partitions_col = ".partitions", family = "gaussian", link = NULL,
  control = NULL, REML = FALSE, cutoff = 0.5, positive = 2,
  err_nc = FALSE, rm_nc = FALSE, parallel = FALSE,
  model_verbose = FALSE)

Arguments

train_data

Data Frame.

models

Model formulas as strings. (Character)

E.g. c("y~x", "y~z").

Can contain random effects.

E.g. c("y~x+(1|r)", "y~z+(1|r)").

test_data

Data Frame. If specifying partitions_col, this can be NULL.

partitions_col

Name of grouping factor for identifying partitions. (Character)

Rows with the value 1 in partitions_col are used as the training set, and rows with the value 2 are used as the test set.

N.B. Only used if test_data is NULL.

family

Name of family. (Character)

Currently supports "gaussian" and "binomial".

link

Link function. (Character)

E.g. link = "log" with family = "gaussian" will use family = gaussian(link = "log").

See stats::family for available link functions.

Default link functions

Gaussian: 'identity'.

Binomial: 'logit'.

control

Construct control structures for mixed model fitting (i.e. lmer and glmer). See lme4::lmerControl and lme4::glmerControl.

N.B. Ignored if fitting lm or glm models.

REML

Restricted Maximum Likelihood. (Logical)

cutoff

Threshold for predicted classes. (Numeric)

N.B. Binomial models only.

positive

Level from the dependent variable to predict. Either as a character or as the level index (1 or 2, ordered alphabetically).

E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".

Used when calculating confusion matrix metrics and creating ROC curves.

N.B. Only affects evaluation metrics, not the model training or returned predictions.

N.B. Binomial models only.

err_nc

Raise error if model does not converge. (Logical)

rm_nc

Remove non-converged models from output. (Logical)

parallel

Whether to validate the list of models in parallel. (Logical)

Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel. See the sketch after this argument list.

model_verbose

Message the name of the model function used on each iteration. (Logical)
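
To illustrate parallel validation: a minimal sketch, assuming the doParallel package is installed; data_partitioned is a data frame with a .partitions column like the one created in the Examples section, and the extra formulas are placeholders.

library(doParallel)
registerDoParallel(cores = 2)  # register two workers as the parallel backend

validate(data_partitioned,
         models = c("score~diagnosis", "score~age", "score~diagnosis+age"),
         partitions_col = '.partitions',
         family = 'gaussian',
         REML = FALSE,
         parallel = TRUE)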

Value

List containing a tibble (tbl) with results and the trained model object(s). The results tibble contains:

Gaussian Results

RMSE, MAE, r2m, r2c, AIC, AICc, and BIC.

Count of convergence warnings. Consider discarding the model if it did not converge.

Specified family.

A nested tibble with model coefficients.

A nested tibble with the predictions and targets.

Name of dependent variable.

Names of fixed effects.

Names of random effects if any.

Binomial Results

Based on the predictions of the test set, a confusion matrix and a ROC curve are used to get the following:

ROC:

AUC, Lower CI, and Upper CI

Confusion Matrix:

Balanced Accuracy, F1, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Kappa, Detection Rate, Detection Prevalence, Prevalence, and MCC (Matthews correlation coefficient).

A nested tibble with model coefficients.

Count of convergence warnings. Consider discarding the model if it did not converge.

Count of Singular Fit messages. See ?lme4::isSingular for more information.

Specified family.

A tibble with predictions, predicted classes (depends on cutoff), and the targets.

A tibble with the sensitivities and specificities from the ROC curve.

Name of dependent variable.

Names of fixed effects.

Names of random effects if any.
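
A small sketch of inspecting the returned list (train_data and test_data as in the Examples below; the element positions and the "Predictions" column name are assumptions, so check names() on the actual object):

v <- validate(train_data,
              test_data = test_data,
              models = "score~diagnosis",
              family = 'gaussian',
              REML = FALSE)

names(v)            # names of the list elements
results <- v[[1]]   # the results tibble, taken by position here
results             # one row per model; metrics, family, etc. as columns
# Nested tibbles (coefficients, predictions, ...) are list columns and can be
# pulled out with e.g. results[["Predictions"]][[1]], if that column exists.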

Details

Packages used:

Models

Gaussian: stats::lm, lme4::lmer

Binomial: stats::glm, lme4::glmer

Results

Gaussian:

r2m : MuMIn::r.squaredGLMM

r2c : MuMIn::r.squaredGLMM

AIC : stats::AIC

AICc : AICcmodavg::AICc

BIC : stats::BIC

Binomial:

Confusion matrix: caret::confusionMatrix

ROC: pROC::roc

MCC: mltools::mcc
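
For reference, a minimal sketch of calling some of these metric functions directly on made-up predictions (not part of validate()'s interface; assumes the pROC, mltools, and caret packages are installed):

probs   <- c(0.2, 0.8, 0.6, 0.1, 0.9)   # hypothetical predicted probabilities
targets <- c(0, 1, 1, 0, 1)             # hypothetical 0/1 targets
classes <- as.integer(probs > 0.5)      # apply a 0.5 cutoff

pROC::roc(response = targets, predictor = probs)   # ROC object; AUC is computed from it
mltools::mcc(preds = classes, actuals = targets)   # Matthews correlation coefficient
caret::confusionMatrix(data = factor(classes, levels = c(0, 1)),
                       reference = factor(targets, levels = c(0, 1)),
                       positive = "1")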

Examples

# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(7)

# Partition data
# Keep as single data frame
# We could also have fed validate() separate train and test sets.
data_partitioned <- partition(data,
                              p = 0.7,
                              cat_col = 'diagnosis',
                              id_col = 'participant',
                              list_out = FALSE) %>%
    arrange(.partitions)

# Validate a model

# Gaussian
validate(data_partitioned,
         models = "score~diagnosis",
         partitions_col = '.partitions',
         family = 'gaussian',
         REML = FALSE)

# Binomial
validate(data_partitioned,
         models = "diagnosis~score",
         partitions_col = '.partitions',
         family = 'binomial')
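
# Binomial with an explicit positive class and cutoff
# (a sketch; in participant.scores, diagnosis is coded 0/1, so positive = 2
# refers to the second level)
validate(data_partitioned,
         models = "diagnosis~score",
         partitions_col = '.partitions',
         family = 'binomial',
         positive = 2,
         cutoff = 0.5)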

# Use non-default link functions

validate(data_partitioned,
         models = "score~diagnosis",
         partitions_col = '.partitions',
         family = 'gaussian',
         link = 'log',
         REML = FALSE)

## Feed separate train and test sets

# Partition data to list of data frames
# The first data frame will be train (70% of the data)
# The second will be test (30% of the data)
data_partitioned <- partition(data, p = 0.7,
                              cat_col = 'diagnosis',
                              id_col = 'participant',
                              list_out = TRUE)
train_data <- data_partitioned[[1]]
test_data <- data_partitioned[[2]]

# Validate a model

# Gaussian
validate(train_data,
         test_data = test_data,
         models = "score~diagnosis",
         family = 'gaussian',
         REML = FALSE)
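
# Gaussian mixed model with a random intercept
# (a sketch; assumes the 'session' column in participant.scores is usable
# as a random-effects grouping factor)
validate(train_data,
         test_data = test_data,
         models = "score~diagnosis+(1|session)",
         family = 'gaussian',
         REML = FALSE)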

