modelvalid: R function for binary Logistic Regression internal validation

Description

The function performs internal validation of a binary logistic regression model, implementing most of the procedure described in: Arboretti Giancristofaro R, Salmaso L. "Model performance analysis and model validation in logistic regression". Statistica 2003(63): 375–396.

Usage

modelvalid(data, fit, B = 200, g = 10, oneplot = TRUE,
  excludeInterc = FALSE)

Arguments

data

Data frame containing the dataset (the dependent variable must be stored in the first, leftmost column).

fit

Object returned by the glm() function.

B

Desired number of iterations (200 by default).

g

Number of groups to be used for the Hosmer-Lemeshow test (10 by default).

oneplot

TRUE (default) if the user wants the charts returned in a single visualization.

excludeInterc

If set to TRUE, the y-axis limits of the chart showing the boxplots of the parameters' distribution across the selected iterations correspond to the minimum and maximum of the parameters' values; this gives a better display of the boxplots of the model parameters when they would otherwise appear squeezed because the intercept takes comparatively higher or lower values. FALSE is the default.

Value

The function returns:

-a chart with boxplots representing the fitting distribution of the estimated model's coefficients; coefficients' labels are flagged with an asterisk when the proportion of p-values smaller than 0.05 across the selected iterations is at least 95 percent;

-a chart with boxplots representing the fitting and the validation distribution of the AUC value across the selected iterations; for an example of the interpretation of the chart, see the aforementioned article, especially pages 390-391;

-a chart of the levels of the dependent variable plotted against the predicted probabilities (if the model has a high discriminatory power, the two stripes of points will tend to be well separated, i.e. the positive outcome of the dependent variable will tend to cluster around high values of the predicted probability, while the opposite will hold true for the negative outcome of the dependent variable);

-a list containing:

$overall.model.significance: statistics related to the overall model p-value and to its distribution across the selected iterations
$parameters.stability: statistics related to the stability of the estimated coefficients across the selected iterations
$p.values.stability: statistics related to the stability of the estimated p-values across the selected iterations
$AUCstatistics: statistics about the fitting and validation AUC distribution
$Hosmer-Lemeshow statistics: statistics about the fitting and validation distribution of the HL test p-values

As for the abovementioned statistics:

-full: statistic estimated on the full dataset;

-median: median of the statistic across the selected iterations;

-QRNG: interquartile range across the selected iterations;

-QRNGoverMedian: ratio between the QRNG and the median, expressed as a percentage;

-min: minimum of the statistic across the selected iterations;

-max: maximum of the statistic across the selected iterations;

-percent_smaller_0.05 (only for $overall.model.significance, $p.values.stability, and $Hosmer-Lemeshow statistics): proportion of iterations in which the p-value is smaller than 0.05; note that for the overall model significance and for the p-values stability it is desirable that the percentage is at least 95 percent, whereas for the HL test p-values it is desirable that the proportion is not larger than 5 percent (in line with the interpretation of the test p-value, which has to be NOT significant in order to hint at a good fit);

-significant (only for $p.values.stability): asterisk indicating that the p-value of the corresponding coefficient was smaller than 0.05 in at least 95 percent of the iterations.
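
To make the meaning of the statistics listed above more concrete, the following is a minimal sketch of how they could be computed from a vector of values collected across iterations. The helper name summarise_iterations and its arguments are hypothetical and are not part of the package; the sketch also assumes that percent_smaller_0.05 is expressed as a percentage, which the description above does not state unambiguously.

```r
# Hypothetical helper (not part of the package): summarises a statistic
# collected across B iterations using the labels described above.
summarise_iterations <- function(x, full = NA_real_, is_pvalue = FALSE) {
  med  <- median(x)
  qrng <- IQR(x)  # interquartile range
  out <- c(full           = full,               # value on the full dataset
           median         = med,
           QRNG           = qrng,
           QRNGoverMedian = 100 * qrng / med,   # as a percentage
           min            = min(x),
           max            = max(x))
  if (is_pvalue) {
    # share of iterations with p < 0.05 (assumed here to be a percentage)
    out <- c(out, percent_smaller_0.05 = 100 * mean(x < 0.05))
  }
  out
}

# Toy p-values from 10 imaginary iterations (made-up numbers)
p <- c(0.001, 0.01, 0.02, 0.03, 0.04, 0.04, 0.045, 0.049, 0.06, 0.2)
toy_summary <- summarise_iterations(p, full = 0.03, is_pvalue = TRUE)
print(toy_summary)
```

In this toy vector 8 of the 10 p-values fall below 0.05, so under the 95 percent rule described above the corresponding coefficient would not be flagged with an asterisk.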

Details

The procedure consists of the following steps:

(1) the whole dataset is split into two random parts, a fitting (75 percent) and a validation (25 percent) portion;

(2) the model is fitted on the fitting portion (i.e., its coefficients are computed considering only the observations in that portion) and its performance is evaluated on both the fitting and the validation portion, using AUC as performance measure;

(3) the model's estimated coefficients, p-values, and the p-value of the Hosmer and Lemeshow test are stored;

(4) steps 1-3 are repeated B times, yielding a fitting and a validation distribution of the AUC values and of the HL test p-values, as well as a fitting distribution of the coefficients and of the associated p-values. The AUC fitting distribution provides an estimate of the performance of the model in the population of all the theoretical fitting samples; the AUC validation distribution represents an estimate of the model's performance on new and independent data.
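
The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's actual implementation: the data are simulated, the auc() helper (based on the rank-sum identity) and all object names are hypothetical, and the Hosmer-Lemeshow p-value of step 3 is left out for brevity.

```r
# Illustrative sketch of the repeated split-sample procedure;
# NOT the code used by modelvalid().
set.seed(1)

# Simulated dataset (an assumption), dependent variable in the first column
n   <- 400
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B <- 50  # number of iterations
auc_fit <- auc_val <- numeric(B)
coefs <- pvals <- matrix(NA_real_, nrow = B, ncol = 3,
                         dimnames = list(NULL, c("(Intercept)", "x1", "x2")))

for (i in seq_len(B)) {
  # step 1: random split into fitting (75 percent) and validation (25 percent)
  idx      <- sample(nrow(dat), size = round(0.75 * nrow(dat)))
  fit_part <- dat[idx, ]
  val_part <- dat[-idx, ]

  # step 2: fit on the fitting portion, compute the AUC on both portions
  m <- glm(y ~ x1 + x2, data = fit_part, family = binomial)
  auc_fit[i] <- auc(fit_part$y, predict(m, type = "response"))
  auc_val[i] <- auc(val_part$y,
                    predict(m, newdata = val_part, type = "response"))

  # step 3: store the estimated coefficients and their p-values
  s <- coef(summary(m))
  coefs[i, ] <- s[, "Estimate"]
  pvals[i, ] <- s[, "Pr(>|z|)"]
}

# step 4: inspect the resulting distributions
print(summary(auc_fit))
print(summary(auc_val))
print(colMeans(pvals < 0.05))  # share of iterations with p < 0.05
```

Comparing summary(auc_fit) with summary(auc_val) mirrors the comparison of the fitting and validation AUC boxplots returned by the function.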

Examples

# NOT RUN {
# load the sample dataset
data(log_regr_data)

# fit a logistic regression model, storing the results into an object called 'model'
model <- glm(admit ~ gre + gpa + rank, data = log_regr_data, family = "binomial")

# run the function, using 100 iterations, and store the result in the 'res' object
res <- modelvalid(data=log_regr_data, fit=model, B=100)

# }