
cNORM (version 3.0.3)

cnorm.cv: Cross validation for term selection

Description

This function assists in selecting the number of terms for the model by repeated Monte Carlo cross validation, using 80 percent of the data as training data and 20 percent as validation data. The cases are drawn randomly but stratified by norm group. Successive models are fitted with an increasing number of terms, and the RMSE of the raw scores fitted by the regression model is plotted for the training, validation and complete dataset.

In addition to this analysis on the raw score level, the mean norm score reliability and crossfit measures can be estimated (the default); for this, set the norms parameter to TRUE. Due to the high computational load of computing norm scores, the function takes time to finish when doing repeated cross validation or when comparing models up to the maximum number of terms. When using the cv = "full" option, the ranking is done separately for the training and validation datasets (always based on T scores), resulting in a complete cross validation. In order to validate only the modeling, you can instead use a pre-ranked dataset, e.g. with prepareData(elfe) already applied. In this case, the training and validation data are drawn from the already ranked data and the scores for the validation set should improve. It is, however, not an independent test, as the ranking of the two samples is interlinked.

The output reports the RMSE of the raw score models, the norm score R2 and delta R2, the crossfit, and the norm score SE sensu Oosterhuis, van der Ark, & Sijtsma (2016). Assessing whether, and to what extent, a model overfits the data requires cross validation. Overfitting is assumed to occur when a model captures more of the variance of the observed norm scores in the training sample than in the validation sample. The overfit can therefore be described as:

$$CROSSFIT = \frac{R(Training; Model)^2}{R(Validation; Model)^2}$$

A CROSSFIT higher than 1 is a sign of overfitting. Values lower than 1 indicate an underfit due to a suboptimal modeling procedure, i.e. the method may not have captured all the variance of the observed data it could possibly capture. Values around 1 are ideal, as long as the raw score RMSE is low and the norm score validation R2 reaches high levels. As a suggestion for real tests:

  • Use visual inspection of the percentiles with plotPercentiles or plotPercentileSeries

  • Combine the visual inspection of the percentiles with a repeated cross validation (e.g. 10 repetitions)

  • Focus on a low raw score RMSE and a high norm score R2 in the validation dataset, and avoid a number of terms with a high overfit (e.g. crossfit > 1.1); a sketch of this workflow follows below
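
The following sketch combines these suggestions on the inbuilt elfe dataset. The term range and the number of repetitions are illustrative choices, and it assumes that plotPercentileSeries accepts a cnorm object directly, as other functions of the package do:

library(cNORM)

# fit a preliminary model and inspect the percentile curves
# for a series of term numbers
model <- cnorm(raw = elfe$raw, group = elfe$group)
plotPercentileSeries(model, start = 3, end = 8)

# repeated cross validation (10 repetitions) including norm score R2
# and crossfit; computationally intensive
cv <- cnorm.cv(prepareData(elfe), repetitions = 10, min = 3, max = 8)

# prefer the smallest number of terms with a low validation RMSE, a high
# norm score R2 and a crossfit close to 1 (e.g. not exceeding 1.1)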

Usage

cnorm.cv(
  data,
  formula = NULL,
  repetitions = 5,
  norms = TRUE,
  min = 1,
  max = 12,
  cv = "full",
  pCutoff = NA,
  width = NA,
  raw = NA,
  group = NA,
  age = NA
)

Value

Table with results per number of terms, including the raw score RMSE for the training, validation and complete sample, the R2 and delta R2 of the norm scores, and the crossfit measure (1 = ideal, < 1 = underfit, > 1 = overfit)
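
A short usage sketch; printing the result as shown assumes the table is returned visibly, and the column layout in the comment follows the description above:

cv <- cnorm.cv(prepareData(elfe), repetitions = 2, max = 8)
cv   # one row per number of terms: raw score RMSE (training, validation,
     # complete), norm score R2 and delta R2, crossfit, norm score SE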

Arguments

data

data frame of the norm sample including ranking, powers and interactions of L (location) and A (age), or a cnorm object

formula

prespecified formula, e.g. from an existing regression model; the min and max options will then be ignored. In case a cnorm object is used, the function automatically draws on the formula of the inbuilt regression model

repetitions

number of repetitions for cross validation

norms

if set to TRUE (default), the norm score crossfit and R2 are determined. This option is computationally intensive, and the duration increases with sample size, the number of repetitions and the maximum number of terms (max option)

min

Minimum number of terms to start from, default = 1

max

Maximum number of terms in the model, up to 2*k + k^2

cv

If set to "full" (default), the data is split into training and validation samples and ranked afterwards; otherwise, a pre-ranked dataset has to be provided, which is then split into training and validation data (in that case, only the modeling, but not the ranking, is independent)
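
A sketch of both modes; per the description, any value other than "full" selects the pre-ranked mode, so the token "ranked" below is an illustrative assumption:

# complete cross validation: split first, then rank training and
# validation samples separately (based on T scores)
cnorm.cv(prepareData(elfe), cv = "full", norms = FALSE)

# pre-ranked mode: the split is drawn from already ranked data, so the
# modeling, but not the ranking, is validated independently
cnorm.cv(prepareData(elfe), cv = "ranked", norms = FALSE)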

pCutoff

The function checks the stratification for unbalanced data sampling by performing a t-test per group. pCutoff specifies the minimum p-value each group's test result has to reach. To minimize beta error, the value is set to .2 by default

width

If provided, ranking is done via rankBySlidingWindow, otherwise by group

raw

Name of the raw variable

group

Name of the grouping variable

age

Name of the age variable
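
For continuous age variables without discrete norm groups, a hedged sketch using the inbuilt ppvt dataset; it assumes that, when width is provided, the function ranks the named raw and age columns via rankBySlidingWindow, and the window width of 1 is an illustrative choice:

# ranking via sliding window instead of a grouping variable
cnorm.cv(ppvt, raw = "raw", age = "age", width = 1, norms = FALSE)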

References

Oosterhuis, H. E. M., van der Ark, L. A., & Sijtsma, K. (2016). Sample Size Requirements for Traditional and Regression-Based Norms. Assessment, 23(2), 191–202. https://doi.org/10.1177/1073191115580638

See Also

Other model: bestModel(), checkConsistency(), derive(), modelSummary(), print.cnorm(), printSubset(), rangeCheck(), regressionFunction(), summary.cnorm()

Examples

# plot cross validation RMSE by number of terms from 3 to 7
# (default of 5 repetitions, norm score computation switched off)
data <- prepareData(elfe)
cnorm.cv(data, min = 3, max = 7, norms = FALSE)

# cross validate a prespecified formula
# here, we use the formula from an existing model to cross validate it
# and to retrieve the norm score RMSE
# own regression functions can of course be used as well
# result <- cnorm(raw = elfe$raw, group = elfe$group)
# cnorm.cv(result, repetitions = 5)
