catsib: CATSIB DIF Detection Procedure

Description

This function performs DIF analysis on items using the CATSIB procedure (Nandakumar & Roussos, 2004), a modified version of SIBTEST (Shealy & Stout, 1993). The CATSIB procedure is suitable for computerized adaptive testing (CAT) environments. In CATSIB, examinees are matched on IRT-based ability estimates that have been adjusted using a regression correction method (Shealy & Stout, 1993). The correction reduces the statistical bias in the CATSIB statistic caused by impact, that is, by true differences between the ability distributions of the reference and focal groups.
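The idea behind the regression correction can be sketched in base R. This is a minimal illustration under stated assumptions, not the code used inside catsib(): the helper regress_theta() is hypothetical, and the within-group reliability estimate (one minus the mean squared standard error divided by the variance of the ability estimates) is an assumed form of the correction.

```r
# Hypothetical helper (not part of irtQ): shrink each ability estimate toward
# the mean of the examinee's own group.
regress_theta <- function(score, se, group) {
  out <- rep(NA_real_, length(score))
  for (g in unique(group)) {
    idx <- which(group == g & !is.na(score) & !is.na(se))
    mu <- mean(score[idx])
    # Assumed reliability estimate: share of observed variance that is not error
    rel <- 1 - mean(se[idx]^2) / var(score[idx])
    rel <- min(max(rel, 0), 1)
    out[idx] <- mu + rel * (score[idx] - mu)
  }
  out
}

set.seed(1)
group <- rep(c(0, 1), each = 200)
# Impact: the focal group (1) is lower on average
theta <- rnorm(400, mean = ifelse(group == 1, -0.5, 0))
se    <- rep(0.4, 400)
score <- theta + rnorm(400, sd = se)
theta_star <- regress_theta(score, se, group)

# Group means are preserved, while the spread within each group shrinks
tapply(score, group, sd)
tapply(theta_star, group, sd)
```

Because the shrinkage is applied within each group, examinees with the same raw estimate but different group membership can receive different matching values, which is what adjusts the matching variable for impact.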

Usage

catsib(
  x = NULL,
  data,
  score = NULL,
  se = NULL,
  group,
  focal.name,
  item.skip = NULL,
  D = 1,
  n.bin = c(80, 10),
  min.binsize = 3,
  max.del = 0.075,
  weight.group = c("comb", "foc", "ref"),
  alpha = 0.05,
  missing = NA,
  purify = FALSE,
  max.iter = 10,
  min.resp = NULL,
  method = "ML",
  range = c(-5, 5),
  norm.prior = c(0, 1),
  nquad = 41,
  weights = NULL,
  ncore = 1,
  verbose = TRUE,
  ...
)

Value

This function returns a list consisting of four elements:

no_purify

A list containing the results of the DIF analysis without applying a purification procedure. This list includes:

dif_stat: A data frame containing the results of the CATSIB statistics for all evaluated items. The columns include the item ID, CATSIB (beta) statistic, standard error of beta, standardized beta, p-value for beta, sample size of the reference group, sample size of the focal group, and total sample size.

dif_item: A numeric vector identifying items flagged as potential DIF items based on the CATSIB statistic.

contingency: A list of contingency tables used for computing the CATSIB statistics for each item.

purify

A logical value indicating whether a purification procedure was applied.

with_purify

A list containing the results of the DIF analysis with a purification procedure. This list includes:

dif_stat: A data frame containing the results of the CATSIB statistics for all evaluated items. The columns include the item ID, CATSIB (beta) statistic, standard error of beta, standardized beta, p-value for beta, sample size of the reference group, sample size of the focal group, total sample size, and the iteration number (n) in which the CATSIB statistics were computed.

dif_item: A numeric vector identifying items flagged as potential DIF items based on the CATSIB statistic.

n.iter: An integer indicating the total number of iterations performed during the purification process.

complete: A logical value indicating whether the purification process was completed. If FALSE, the process reached the maximum number of iterations without full convergence.

contingency: A list of contingency tables used for computing the CATSIB statistics for each item during the purification process.

alpha

The significance level \(\alpha\) used in the hypothesis tests of the CATSIB statistics.

Arguments

x

A data frame containing item metadata (e.g., item parameters, number of categories, IRT model types, etc.). See est_irt() or simdat() for more details about the item metadata. This data frame can be easily created using the shape_df() function.

data

A matrix of examinees' item responses corresponding to the items specified in the x argument. Rows represent examinees and columns represent items.

score

A numeric vector containing examinees' ability estimates (theta values). If not provided, catsib() will estimate ability parameters internally before computing the CATSIB statistics. See est_score() for more information on scoring methods. Default is NULL.

se

A vector of standard errors corresponding to the ability estimates. The order of the standard errors must match the order of the ability estimates provided in the score argument. Default is NULL.

group

A numeric or character vector indicating examinees' group membership. The length of the vector must match the number of rows in the response data matrix.

focal.name

A single numeric or character value specifying the focal group. For instance, given group = c(0, 1, 0, 1, 1) and '1' indicating the focal group, set focal.name = 1.

item.skip

A numeric vector of item indices to exclude from DIF analysis. If NULL, all items are included. Useful for omitting specific items based on prior insights.

D

A scaling constant used in IRT models to make the logistic function closely approximate the normal ogive function. A value of 1.7 is commonly used for this purpose. Default is 1.

n.bin

A numeric vector of two positive integers specifying the maximum and minimum numbers of bins (or intervals) on the ability scale. The first and second values represent the maximum and minimum numbers of bins, respectively. Default is c(80, 10). See the Details section below for more information.

min.binsize

A positive integer specifying the minimum number of examinees required in each bin. To ensure stable statistical estimation, each bin must contain at least the specified number of examinees from both the reference and focal groups in order to be included in the calculation of \(\hat{\beta}\). Bins that do not meet this minimum are excluded from the computation. Default is 3. See the Details section for further explanation.

max.del

A numeric value specifying the maximum allowable proportion of examinees that may be excluded from either the reference or focal group during the binning process. This threshold is used when determining the number of bins on the ability scale automatically. Default is 0.075. See the Details section for more information.

weight.group

A character string specifying the target ability distribution used to compute the expected DIF measure \(\hat{\beta}\) and its corresponding standard error. Available options are: "comb" for the combined distribution of both the reference and focal groups, "foc" for the focal group's distribution, and "ref" for the reference group's distribution. Default is "comb". See the Details section below for more information.

alpha

A numeric value specifying the significance level (\(\alpha\)) for the hypothesis test associated with the CATSIB (beta) statistic. Default is 0.05.

missing

A value indicating missing responses in the data set. Default is NA.

purify

Logical. Indicates whether to apply a purification procedure. Default is FALSE.

max.iter

A positive integer specifying the maximum number of iterations allowed for the purification process. Default is 10.

min.resp

A positive integer specifying the minimum number of valid item responses required from an examinee in order to compute an ability estimate. Default is NULL. See Details for more information.

method

A character string indicating the scoring method to use. Available options are:

"ML": Maximum likelihood estimation
"WL": Weighted likelihood estimation (Warm, 1989)
"MAP": Maximum a posteriori estimation (Hambleton et al., 1991)
"EAP": Expected a posteriori estimation (Bock & Mislevy, 1982)

Default is "ML".

range

A numeric vector of length two specifying the lower and upper bounds of the ability scale. This is used for the following scoring methods: "ML", "WL", and "MAP". Default is c(-5, 5).

norm.prior

A numeric vector of length two specifying the mean and standard deviation of the normal prior distribution. These values are used to generate the Gaussian quadrature points and weights. Ignored if method is "ML" or "WL". Default is c(0, 1).

nquad

An integer indicating the number of Gaussian quadrature points to be generated from the normal prior distribution. Used only when method is "EAP". Ignored for "ML", "WL", and "MAP". Default is 41.

weights

A two-column matrix or data frame containing the quadrature points (in the first column) and their corresponding weights (in the second column) for the latent variable prior distribution. The weights and points can be conveniently generated using the function gen.weight().

If NULL and method = "EAP", default quadrature values are generated based on the norm.prior and nquad arguments. Ignored if method is "ML", "WL", or "MAP".

ncore

An integer specifying the number of logical CPU cores to use for parallel processing. Default is 1. See est_score() for details.

verbose

Logical. If TRUE, progress messages from the purification procedure will be displayed; if FALSE, the messages will be suppressed. Default is TRUE.

...

Additional arguments passed to the est_score() function.

Author

Hwanggyu Lim hglim83@gmail.com

Details

In the CATSIB procedure (Nandakumar & Roussos, 2004), \(\hat{\theta}^{\ast}\), the expected value of \(\theta\) regressed on \(\hat{\theta}\), is a continuous variable. The range of \(\hat{\theta}^{\ast}\) is divided into K equal-width intervals, and each examinee is assigned to one of these K intervals based on his or her \(\hat{\theta}^{\ast}\) value. To ensure statistical stability, any interval containing fewer than min.binsize examinees from either the reference or focal group is excluded from the computation of \(\hat{\beta}\), the DIF effect size. The default of min.binsize = 3 follows Nandakumar and Roussos (2004).
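The binning step described above can be sketched in base R as follows. The helper bin_examinees() is hypothetical and only illustrates the equal-width intervals and the minimum bin size rule; it is not taken from the package.

```r
# Hypothetical helper: assign regressed ability values to K equal-width
# intervals and flag the intervals that meet the minimum bin size in both groups.
bin_examinees <- function(theta_star, group, focal.name, K, min.binsize = 3) {
  breaks <- seq(min(theta_star), max(theta_star), length.out = K + 1)
  bin <- cut(theta_star, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  is_foc <- group == focal.name
  n_ref <- tabulate(bin[!is_foc], nbins = K)
  n_foc <- tabulate(bin[is_foc], nbins = K)
  keep <- n_ref >= min.binsize & n_foc >= min.binsize
  list(bin = bin, n_ref = n_ref, n_foc = n_foc, keep = keep)
}

set.seed(2)
theta_star <- rnorm(1000)
group <- rep(c(0, 1), each = 500)
res <- bin_examinees(theta_star, group, focal.name = 1, K = 20)

# One row per interval: group counts and whether the interval is retained
cbind(n_ref = res$n_ref, n_foc = res$n_foc, keep = res$keep)
```

Intervals in the tails of the ability scale are the ones most likely to be dropped, since few examinees from either group fall there.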

To determine an appropriate number of intervals (K), catsib() automatically decreases K from a large starting value (e.g., 80) based on the rule proposed by Nandakumar and Roussos (2004). Specifically, if more than 7.5% of the examinees in either the reference or focal group are excluded due to small bin sizes, the number of bins is reduced by one and the process is repeated. This continues until the retained examinees in each group comprise at least 92.5% of that group. To avoid ending up with too few bins, they recommended a minimum of K = 10. Therefore, the default maximum and minimum numbers of bins are set to 80 and 10, respectively, via n.bin. Likewise, the maximum allowable proportion of excluded examinees is set to 0.075 by default through the max.del argument.
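A minimal base R sketch of this search over K is given below. Both helpers, prop_deleted() and choose_k(), are hypothetical names introduced for illustration; the sketch assumes equal-width bins over the observed range and may differ in detail from the package's internal implementation.

```r
# Hypothetical helper: proportion of each group lost when bins with fewer than
# min.binsize examinees in either group are dropped.
prop_deleted <- function(theta_star, is_foc, K, min.binsize) {
  breaks <- seq(min(theta_star), max(theta_star), length.out = K + 1)
  bin <- cut(theta_star, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  n_ref <- tabulate(bin[!is_foc], nbins = K)
  n_foc <- tabulate(bin[is_foc], nbins = K)
  drop <- n_ref < min.binsize | n_foc < min.binsize
  c(ref = sum(n_ref[drop]) / sum(n_ref), foc = sum(n_foc[drop]) / sum(n_foc))
}

# Hypothetical helper: step K down from n.bin[1] until the exclusion rate in
# both groups is at most max.del, or until the floor n.bin[2] is reached.
choose_k <- function(theta_star, is_foc, n.bin = c(80, 10),
                     min.binsize = 3, max.del = 0.075) {
  K <- n.bin[1]
  while (K > n.bin[2] &&
         any(prop_deleted(theta_star, is_foc, K, min.binsize) > max.del)) {
    K <- K - 1
  }
  K
}

set.seed(3)
theta_star <- rnorm(600)
is_foc <- rep(c(FALSE, TRUE), each = 300)
K <- choose_k(theta_star, is_foc)
K
prop_deleted(theta_star, is_foc, K, min.binsize = 3)
```

Note that when the floor of 10 bins is reached, the exclusion rate may still exceed max.del; the sketch stops there anyway, mirroring the recommended minimum.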

For the target ability distribution used to compute \(\hat{\beta}\), Li and Stout (1996) and Nandakumar and Roussos (2004) employed the combined-group distribution, which is the default option in weight.group. See Nandakumar and Roussos (2004) for further details about the CATSIB method.
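To make the role of the target distribution concrete, the sketch below computes a SIBTEST-style effect size for a single dichotomous item. It is an illustration under assumptions rather than the package's code: catsib_beta() is a hypothetical helper, \(\hat{\beta}\) is taken to be the weighted sum of reference-minus-focal differences in bin means, its variance is taken to be the weighted sum of the within-bin sampling variances, and the test is assumed to be two-sided.

```r
# Hypothetical helper: CATSIB-style effect size for one dichotomous item.
# y: 0/1 item responses; bin: interval index per examinee; is_foc: focal flag.
catsib_beta <- function(y, bin, is_foc, min.binsize = 3,
                        weight.group = c("comb", "foc", "ref")) {
  weight.group <- match.arg(weight.group)
  bins <- sort(unique(bin))
  n_ref <- sapply(bins, function(k) sum(bin == k & !is_foc))
  n_foc <- sapply(bins, function(k) sum(bin == k & is_foc))
  keep <- n_ref >= min.binsize & n_foc >= min.binsize
  bins <- bins[keep]; n_ref <- n_ref[keep]; n_foc <- n_foc[keep]

  m_ref <- sapply(bins, function(k) mean(y[bin == k & !is_foc]))
  m_foc <- sapply(bins, function(k) mean(y[bin == k & is_foc]))
  v_ref <- sapply(bins, function(k) var(y[bin == k & !is_foc]))
  v_foc <- sapply(bins, function(k) var(y[bin == k & is_foc]))

  # Bin weights from the chosen target ability distribution
  n_w <- switch(weight.group, comb = n_ref + n_foc, foc = n_foc, ref = n_ref)
  p <- n_w / sum(n_w)

  beta <- sum(p * (m_ref - m_foc))                        # assumed sign: reference minus focal
  se <- sqrt(sum(p^2 * (v_ref / n_ref + v_foc / n_foc)))  # assumed variance form
  z <- beta / se
  c(beta = beta, se = se, z = z, p = 2 * pnorm(-abs(z)))
}

set.seed(4)
n <- 2000
is_foc <- rep(c(FALSE, TRUE), each = n / 2)
theta <- rnorm(n)
bin <- cut(theta, breaks = 10, labels = FALSE)
# The item is harder for the focal group by 0.7 logits (uniform DIF)
y <- rbinom(n, 1, plogis(theta - 0.7 * is_foc))
catsib_beta(y, bin, is_foc)
```

Switching weight.group to "foc" or "ref" changes only the bin weights, so the three options differ mainly when the two groups are distributed differently across the ability scale.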

Although Nandakumar and Roussos (2004) did not propose a purification procedure for DIF analysis using CATSIB, catsib() can implement an iterative purification process in a manner similar to that of Lim et al. (2022). Specifically, at each iteration, examinees' latent abilities are recalculated using the purified set of items and the scoring method specified in the method argument. The iterative purification process terminates either when no additional DIF items are detected or when the number of iterations reaches the limit set by max.iter. See Lim et al. (2022) for more details on the purification procedure.
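The termination logic of the purification loop can be sketched as below. This is a schematic only: purify_loop() and toy_detect() are hypothetical, the DIF test itself is replaced by a stand-in function, and the sketch removes every newly flagged item from the scoring set at once, which may differ from how catsib() selects items to remove at each iteration.

```r
# Hypothetical driver: `detect` is any function that takes the indices of the
# items currently used for scoring and returns the indices it flags as DIF.
purify_loop <- function(detect, n_items, max.iter = 10) {
  dif_item <- integer(0)
  complete <- FALSE
  n_iter <- 0
  while (n_iter < max.iter) {
    n_iter <- n_iter + 1
    anchor <- setdiff(seq_len(n_items), dif_item)  # purified scoring set
    flagged <- detect(anchor)
    new_dif <- setdiff(flagged, dif_item)
    if (length(new_dif) == 0) {  # nothing new: the process has converged
      complete <- TRUE
      break
    }
    dif_item <- sort(union(dif_item, new_dif))
  }
  list(dif_item = dif_item, n.iter = n_iter, complete = complete)
}

# Toy detector for illustration only: item 1 is always flagged, and item 2 is
# flagged once item 1 no longer contaminates the scoring set.
toy_detect <- function(anchor) if (1 %in% anchor) 1L else c(1L, 2L)
purify_loop(toy_detect, n_items = 10)
```

The returned n.iter and complete values play the same role as the elements of the same names in with_purify: complete is FALSE only when the loop stops because max.iter was reached.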

Scoring based on a limited number of items may result in large standard errors, which can negatively affect the effectiveness of DIF detection using the CATSIB procedure. The min.resp argument can be used to prevent the use of scores with large standard errors, particularly during the purification process. For example, if min.resp is not NULL (e.g., min.resp = 5), item responses from examinees whose total number of valid responses is below the specified threshold are treated as missing (i.e., NA). As a result, their ability estimates are also treated as missing and are excluded from the CATSIB statistic computation. If min.resp = NULL, a score will be computed for any examinee with at least one valid item response.
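The effect of min.resp on the response data can be illustrated with a small base R sketch. The helper apply_min_resp() is hypothetical and simply mimics the masking behavior described above.

```r
# Hypothetical helper: blank out all responses of examinees who answered fewer
# than min.resp items, so that no ability estimate is produced for them.
apply_min_resp <- function(data, min.resp = NULL) {
  if (is.null(min.resp)) return(data)
  n_valid <- rowSums(!is.na(data))
  data[n_valid < min.resp, ] <- NA
  data
}

# Three examinees with 6, 2, and 5 valid responses
resp <- matrix(c(1, 0, 1, 1, 0, 1,
                 1, NA, NA, NA, NA, 0,
                 0, 1, 1, NA, 1, 1),
               nrow = 3, byrow = TRUE)
apply_min_resp(resp, min.resp = 5)
```

With min.resp = 5, only the second examinee is masked; the third examinee has exactly five valid responses and is retained.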

References

Li, H. H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647-677.

Lim, H., Choe, E. M., & Han, K. T. (2022). A residual-based differential item functioning detection framework in item response theory. Journal of Educational Measurement.

Nandakumar, R., & Roussos, L. (2004). Evaluation of the CATSIB DIF procedure in a pretest setting. Journal of Educational and Behavioral Statistics, 29(2), 177-199.

Shealy, R. T., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika, 58(2), 159-194.

Examples


# \donttest{
# Load required package
library("dplyr")

## Uniform DIF Detection
###############################################
# (1) Simulate data with true uniform DIF
###############################################

# Import the "-prm.txt" output file from flexMIRT
flex_sam <- system.file("extdata", "flexmirt_sample-prm.txt", package = "irtQ")

# Select 36 3PLM items that are non-DIF
par_nstd <-
  bring.flexmirt(file = flex_sam, "par")$Group1$full_df %>%
  dplyr::filter(.data$model == "3PLM") %>%
  dplyr::filter(dplyr::row_number() %in% 1:36) %>%
  dplyr::select(1:6)
par_nstd$id <- paste0("nondif", 1:36)

# Generate four new items to contain uniform DIF
difpar_ref <-
  shape_df(
    par.drm = list(a = c(0.8, 1.5, 0.8, 1.5), b = c(0.0, 0.0, -0.5, -0.5), g = 0.15),
    item.id = paste0("dif", 1:4), cats = 2, model = "3PLM"
  )

# Introduce uniform DIF in the focal group by shifting b-parameters
difpar_foc <-
  difpar_ref %>%
  dplyr::mutate_at(.vars = "par.2", .funs = function(x) x + rep(0.7, 4))

# Combine the 4 DIF and 36 non-DIF items for both reference and focal groups
# Therefore, the first four items now exhibit uniform DIF
par_ref <- rbind(difpar_ref, par_nstd)
par_foc <- rbind(difpar_foc, par_nstd)

# Generate true theta values
set.seed(123)
theta_ref <- rnorm(500, 0.0, 1.0)
theta_foc <- rnorm(500, 0.0, 1.0)

# Simulate response data
resp_ref <- simdat(par_ref, theta = theta_ref, D = 1)
resp_foc <- simdat(par_foc, theta = theta_foc, D = 1)
data <- rbind(resp_ref, resp_foc)

###############################################
# (2) Estimate item and ability parameters
#     using the aggregated data
###############################################

# Estimate item parameters
est_mod <- est_irt(data = data, D = 1, model = "3PLM")
est_par <- est_mod$par.est

# Estimate ability parameters using ML
theta_est <- est_score(x = est_par, data = data, method = "ML")
score <- theta_est$est.theta
se <- theta_est$se.theta

###############################################
# (3) Conduct DIF analysis
###############################################
# Create a vector of group membership indicators
# where '1' indicates the focal group
group <- c(rep(0, 500), rep(1, 500))

# (a)-1 Compute the CATSIB statistic using provided scores,
#       without purification
dif_1 <- catsib(
  x = NULL, data = data, D = 1, score = score, se = se, group = group, focal.name = 1,
  weight.group = "comb", alpha = 0.05, missing = NA, purify = FALSE
)
print(dif_1)

# (a)-2 Compute the CATSIB statistic using provided scores,
#       with purification
dif_2 <- catsib(
  x = est_par, data = data, D = 1, score = score, se = se, group = group, focal.name = 1,
  weight.group = "comb", alpha = 0.05, missing = NA, purify = TRUE
)
print(dif_2)
# }
