cv: Conduct cross-validation

Description

Function to easily cross-validate (including fold assignment, merging of fold outputs, etc.).

Usage

cv(x, y, family = c("binomial", "cox", "gaussian"), fit_fun, predict_fun, site = NULL,
covar = NULL, nfolds = 10, pred.format = NA, verbose = TRUE, ...)

Value

A list with the predictions and the models used.

Arguments

x: predictors. A matrix or data.frame (rows are observations and columns are variables), or a vector or factor (if there is only one predictor).
y: response to be predicted. A binary vector for "binomial", a "Surv" object for "cox", or a numeric vector for "gaussian".
family: distribution of y: "binomial", "cox", or "gaussian".
fit_fun: function to create the prediction model using the training subsets. It can have between two and four arguments (the first two are compulsory): x_training (training X data.frame), y_training (training Y outcomes), site_training (training site names), and covar_training (training covariates). It must return the overall prediction model, which may be a list of the different submodels used in different steps and/or derived from different imputations.
predict_fun: function to apply the prediction model to the test sets. It can have between two and four arguments (the first two are compulsory): model (the overall prediction model), x_test (test X data.frame), site_test (test site names), and covar_test (test covariates). It must return the predictions.
site: vector or factor with the sites' names, or NULL for studies conducted in a single site.
covar: other covariates that can be passed to fit_fun and predict_fun. A matrix or data.frame (rows are observations and columns are variables), or a vector or factor (if there is only one covariate).
...: other arguments that can be passed to fit_fun and predict_fun.
nfolds: number of folds, only used if folds is NULL.
pred.format: format of the predictions returned by each fold. E.g., if the prediction is an array, use NA.
verbose: (optional) logical, whether to print some messages during execution.
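
As an illustration of the four-argument form of fit_fun and predict_fun described above, here is a minimal sketch in base R. The names fit_fun4 and predict_fun4, the use of glm, and the simulated data are assumptions made for this example; they are not part of the package. The sketch also assumes the arguments are supplied in the order listed above, and it simply ignores the site arguments.

```r
# Hypothetical four-argument callbacks (illustrative names, not from the package).
fit_fun4 <- function(x_training, y_training, site_training, covar_training) {
  # Combine outcome, predictors and covariates; site_training is unused here
  d <- data.frame(y = y_training, x_training, covar_training)
  # Return the overall model as a list of submodels
  list(glm = glm(y ~ ., data = d, family = binomial()))
}
predict_fun4 <- function(model, x_test, site_test, covar_test) {
  # Build the test data with the same column names used in training
  d <- data.frame(x_test, covar_test)
  predict(model$glm, newdata = d, type = "response")
}

# Small simulated check that does not call cv()
set.seed(1)
x <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
covar <- data.frame(age = rnorm(100, 40, 10))
site <- factor(rep(c("A", "B"), each = 50))
y <- rbinom(100, 1, plogis(x$x1))
train <- 1:70
test <- 71:100
m <- fit_fun4(x[train, ], y[train], site[train], covar[train, , drop = FALSE])
p <- predict_fun4(m, x[test, ], site[test], covar[test, , drop = FALSE])
head(p)
```

Returning a named list from the fitting function keeps the door open for adding further submodels (e.g., an imputation step) without changing the prediction function's signature.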

Author

Joaquim Radua

Details

This function iteratively divides the dataset into a training dataset, with which it fits the model using the function fit_fun, and a test dataset, to which it applies the model using the function predict_fun. It saves the models fit with the training datasets and the predictions obtained in the test datasets. The folds are assigned automatically using assign.folds, accounting for the site if this is not NULL.
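
The procedure described above can be sketched in a few lines of base R. This is a simplified stand-in, not the package's implementation: the function simple_cv is hypothetical, folds are assigned purely at random (there is no site-aware assignment as in assign.folds), and the layout of the returned list (a predictions data.frame with fold, y and y.pred columns, plus a list of models) is an assumption chosen to resemble the output used in the Examples.

```r
# Simplified k-fold cross-validation loop (illustrative only).
simple_cv <- function(x, y, fit_fun, predict_fun, nfolds = 10) {
  # Random, roughly balanced fold labels; no site handling
  folds <- sample(rep(seq_len(nfolds), length.out = nrow(x)))
  y_pred <- rep(NA_real_, nrow(x))
  models <- vector("list", nfolds)
  for (k in seq_len(nfolds)) {
    test <- folds == k
    # Fit on the training rows, then predict the held-out rows
    models[[k]] <- fit_fun(x[!test, , drop = FALSE], y[!test])
    y_pred[test] <- predict_fun(models[[k]], x[test, , drop = FALSE])
  }
  list(
    predictions = data.frame(fold = folds, y = y, y.pred = y_pred),
    models = models
  )
}

# Usage with a plain linear model on simulated data
set.seed(42)
x <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
y <- as.numeric(x %*% c(1, -1, 0.5) + rnorm(200, 0, 0.1))
fit_lm <- function(x_training, y_training) {
  lm(y ~ ., data = data.frame(y = y_training, x_training))
}
predict_lm <- function(model, x_test) {
  predict(model, newdata = data.frame(x_test))
}
res <- simple_cv(x, y, fit_lm, predict_lm, nfolds = 5)
cor(res$predictions$y, res$predictions$y.pred)
```

Because every observation is predicted exactly once, by a model that never saw it during fitting, the merged predictions can be scored as a single out-of-sample set.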

Examples

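# Create random x (predictors) and y (binary)
# Fix the random seed so that the simulated data are reproducible
set.seed(1)
# 500 observations, 50 predictors; only the first 5 predictors drive y
x = matrix(rnorm(25000), ncol = 50)
y = 1 * (plogis(apply(x[,1:5], 1, sum) + rnorm(500, 0, 0.1)) > 0.5)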

# Predict y via cross-validation
fit_fun = function (x_training, y_training) {
  list(
    lasso = glmnet_fit(x_training, y_training, family = "binomial")
  )
}
predict_fun = function (m, x_test) {
  glmnet_predict(m$lasso, x_test)
}
# Only 2 folds to ensure the example runs quickly
res = cv(x, y, family = "binomial", fit_fun = fit_fun, predict_fun = predict_fun, nfolds = 2)

# Show accuracy
se = mean(res$predictions$y.pred[res$predictions$y == 1] > 0.5)
sp = mean(res$predictions$y.pred[res$predictions$y == 0] < 0.5)
bac = (se + sp) / 2
cat("Sensitivity:", round(se, 2), "\n")
cat("Specificity:", round(sp, 2), "\n")
cat("Balanced accuracy:", round(bac, 2), "\n")
