This function computes K-fold cross-validated estimates of the area under the receiver operating characteristic (ROC) curve (hereafter, AUC). This quantity can be interpreted as the probability that a randomly selected case will have higher predicted risk than a randomly selected control.
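As a minimal illustration of this interpretation (using hypothetical predicted risks, not output from this package), the empirical AUC is the proportion of case/control pairs in which the case receives the higher predicted risk:

# hypothetical predicted risks for 50 cases (Y = 1) and 50 controls (Y = 0)
set.seed(1)
risk_cases <- runif(50)
risk_controls <- runif(50)
# proportion of case/control pairs in which the case is ranked higher,
# i.e., an empirical estimate of P(risk_case > risk_control)
mean(outer(risk_cases, risk_controls, ">"))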
cv_auc(
  Y,
  X,
  K = 10,
  learner = "glm_wrapper",
  nested_cv = TRUE,
  nested_K = K - 1,
  parallel = FALSE,
  max_cvtmle_iter = 10,
  cvtmle_ictol = 1/length(Y),
  prediction_list = NULL,
  ...
)
Y: A numeric vector of outcomes, assumed to equal 0 or 1.
X: A data.frame or matrix of variables for prediction.
K: The number of cross-validation folds (default is 10).
learner: A wrapper that implements the desired method for building a prediction algorithm. See ?glm_wrapper or read the package vignette for more information on formatting learners.
nested_cv: A boolean indicating whether nested cross-validation should be used to estimate the distribution of the prediction function. The default (TRUE) is the best choice for aggressive learners, while FALSE is reasonable for smooth learners (e.g., logistic regression).
nested_K: If nested cross-validation is used, how many inner folds should there be? The default (K - 1) affords quicker computation by reusing training-fold learner fits.
parallel: A boolean indicating whether prediction algorithms should be trained in parallel. Defaults to FALSE.
max_cvtmle_iter: Maximum number of iterations for the bias-correction step of the CV-TMLE estimator (default 10).
cvtmle_ictol: The CV-TMLE iterates until max_cvtmle_iter is reached or the mean of the cross-validated efficient influence function is less than cvtmle_ictol.
prediction_list: For power users: a list of predictions made by learner that has a format compatible with cvauc.
...: Other arguments, not currently used.
An object of class "cvauc" with the following components:

est_cvtmle: cross-validated targeted minimum loss-based estimate of K-fold CV AUC
iter_cvtmle: iterations needed to achieve convergence of the CV-TMLE algorithm
cvtmle_trace: the value of the CV-TMLE at each iteration of the targeting algorithm
se_cvtmle: estimated standard error based on targeted nuisance parameters
est_init: plug-in estimate of CV AUC where nuisance parameters are estimated in the training sample
est_empirical: the standard K-fold CV AUC estimator
se_empirical: estimated standard error for the standard estimator
est_onestep: cross-validated one-step estimate of K-fold CV AUC
se_onestep: estimated standard error for the one-step estimator
est_esteq: cross-validated estimating-equations estimate of K-fold CV AUC
se_esteq: estimated standard error for the estimating-equations estimator (same as for the one-step)
folds: list of observation indexes in each validation fold
ic_cvtmle: influence function evaluated at the targeted nuisance parameter estimates
ic_onestep: influence function evaluated at the training-fold-estimated nuisance parameters
ic_esteq: influence function evaluated at the training-fold-estimated nuisance parameters
ic_empirical: influence function evaluated at the validation-fold-estimated nuisance parameters
prediction_list: a list of output from the cross-validated model training; see the individual wrapper function documentation for further details
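For example, a rough sketch (not taken from the package itself) of 95% Wald-style confidence intervals built from the returned point estimates and their estimated standard errors, assuming fit is an object returned by cv_auc:

# hedged sketch: Wald-style 95% confidence interval for the CV AUC (CV-TMLE)
fit$est_cvtmle + c(-1.96, 1.96) * fit$se_cvtmle
# analogous interval for the one-step estimator
fit$est_onestep + c(-1.96, 1.96) * fit$se_onestep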
To estimate the AUC of a particular prediction algorithm, K-fold cross-validation is commonly used: data are partitioned into K distinct groups and the prediction algorithm is developed using K-1 of these groups. In standard K-fold cross-validation, the AUC of this prediction algorithm is estimated using the remaining fold. This can be problematic when the number of observations is small or the number of cross-validation folds is large.
Here, we estimate relevant nuisance parameters in the training sample and use
the validation sample to perform some form of bias correction -- either through
cross-validated targeted minimum loss-based estimation, estimating equations,
or one-step estimation. When aggressive learning algorithms are applied, it is
necessary to use an additional layer of cross-validation in the training sample
to estimate the nuisance parameters. This is controlled via the nested_cv
option.
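As a hedged, self-contained sketch of this choice (simulated data for illustration only; assumes the package providing cv_auc is attached), a smooth learner such as logistic regression can reasonably skip the inner layer of cross-validation by setting nested_cv = FALSE:

# smooth learner: logistic regression without nested cross-validation
set.seed(123)
X_sim <- data.frame(matrix(rnorm(100 * 5), nrow = 100, ncol = 5))
Y_sim <- rbinom(100, 1, plogis(X_sim[, 1]))
fit_glm <- cv_auc(Y = Y_sim, X = X_sim, K = 5,
                  learner = "glm_wrapper", nested_cv = FALSE)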
# simulate data
n <- 200
p <- 10
X <- data.frame(matrix(rnorm(n*p), nrow = n, ncol = p))
Y <- rbinom(n, 1, plogis(X[,1] + X[,10]))
# get cv auc estimates for logistic regression
cv_auc_ests <- cv_auc(Y = Y, X = X, K = 5, learner = "glm_wrapper")
# get cv auc estimates for random forest
# using nested cross-validation for nuisance parameter estimation
fit <- cv_auc(Y = Y, X = X, K = 5,
learner = "randomforest_wrapper",
nested_cv = TRUE)
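As a follow-up sketch (assuming the calls above have been run), the different point estimates stored in the returned object can be compared directly:

# compare the bias-corrected estimators with the standard empirical estimator
c(empirical = fit$est_empirical,
  cvtmle    = fit$est_cvtmle,
  onestep   = fit$est_onestep,
  esteq     = fit$est_esteq)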