balqual: Evaluate Matching Quality

Description

The balqual() function evaluates the covariate balance of a dataset after matching and compares it with the balance of the original, unmatched dataset. It computes summary statistics for each balancing variable and supports their interpretation by comparing them with user-specified cutoff values.

Usage

balqual(
  matched_data = NULL,
  formula = NULL,
  type = c("smd", "r", "var_ratio"),
  statistic = c("mean", "max"),
  cutoffs = NULL,
  round = 3
)

Value

If assigned to a name, returns a list of summary statistics of class quality containing:

quality_mean - A data frame with the mean values of the statistics specified in the type argument for all balancing variables used in formula.
quality_max - A data frame with the maximal values of the statistics specified in the type argument for all balancing variables used in formula.
perc_matched - A single numeric value indicating the percentage of observations in the original dataset that were matched.
statistic - A single string defining which statistic will be displayed in the console.
summary_head - A summary of the matching process. If the statistic argument includes "max", it contains the maximal observed values for each variable; otherwise, it contains the mean values.
n_before - The number of observations in the dataset before matching.
n_after - The number of observations in the dataset after matching.
count_table - A contingency table showing the distribution of the treatment variable before and after matching.

The balqual() function also prints a well-formatted table with the defined summary statistics for each variable in the formula to the console.

Arguments

matched_data

An object of class matched, generated by the match_gps() function. This object is essential for the balqual() function as it contains the final data.frame and attributes required to compute the quality coefficients.

formula

A valid R formula. It must be the same formula that was passed to estimate_gps() to compute the generalized propensity scores in the first step of the vector matching algorithm.

type

A character vector specifying the quality metrics to calculate. It can contain up to three values, combined with c(). Possible values are:

smd - Calculates standardized mean differences (SMD) between groups, defined as the difference in means divided by the standard deviation of the treatment group (Rubin, 2001).
r - Computes Pearson's r coefficient from the Z statistic of the Mann-Whitney U test.
var_ratio - Measures the dispersion differences between groups, calculated as the ratio of the larger variance to the smaller one.
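
The three definitions above can be sketched in base R for a single covariate and a single pair of groups. This is a minimal illustration, not the package's implementation: the helper name pairwise_metrics() is hypothetical, and the conversion r = Z / sqrt(N) with an uncorrected normal approximation for Z is an assumption about how the coefficient is obtained.

```r
# Hypothetical helper (not part of the package): computes the three metrics
# for one covariate `x` and a two-level grouping vector `g`.
pairwise_metrics <- function(x, g, treated) {
  x1 <- x[g == treated] # treatment group
  x0 <- x[g != treated] # comparison group
  n1 <- length(x1)
  n0 <- length(x0)

  # SMD: difference in means divided by the SD of the treatment group
  smd <- (mean(x1) - mean(x0)) / sd(x1)

  # r: Z statistic of the Mann-Whitney U test divided by sqrt(N).
  # Assumption: normal approximation without tie or continuity correction.
  u <- sum(rank(c(x1, x0))[seq_len(n1)]) - n1 * (n1 + 1) / 2
  z <- (u - n1 * n0 / 2) / sqrt(n1 * n0 * (n1 + n0 + 1) / 12)
  r <- z / sqrt(n1 + n0)

  # Variance ratio: larger variance over the smaller one, so always >= 1
  v <- c(var(x1), var(x0))
  var_ratio <- max(v) / min(v)

  c(smd = smd, r = r, var_ratio = var_ratio)
}

# Simulated data, for illustration only
set.seed(1)
x <- c(rnorm(50, mean = 0.5), rnorm(50, mean = 0))
g <- rep(c("treated", "control"), each = 50)
pairwise_metrics(x, g, treated = "treated")
```

Because the variance ratio always places the larger variance in the numerator, a value of 1 indicates identical dispersion and larger values indicate worse balance.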

statistic

A character vector specifying the type of statistics used to summarize the quality metrics. Since quality metrics are calculated for all pairwise comparisons between treatment levels, they need to be aggregated for the entire dataset.

max - Returns the maximum values of the statistics defined in the type argument (as suggested by Lopez and Gutman, 2017).
mean - Returns the corresponding averages.

To compute both, provide both names using the c() function.
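
The aggregation over pairwise comparisons can be illustrated with a short base R sketch. The data and the helper abs_smd() are hypothetical, and taking the first level of each pair as the treatment group for the SMD denominator is an assumption made only for this example.

```r
# Hypothetical example data: one covariate, three treatment levels
x <- c(1, 2, 3, 2, 3, 4, 4, 5, 6)
g <- rep(c("a", "b", "c"), each = 3)

# Absolute SMD for one pair of levels; the first level of the pair supplies
# the standard deviation in the denominator (an assumption for this sketch)
abs_smd <- function(pair) {
  x1 <- x[g == pair[1]]
  x0 <- x[g == pair[2]]
  abs(mean(x1) - mean(x0)) / sd(x1)
}

# All pairwise comparisons between treatment levels, one per column
pairs <- combn(unique(g), 2)
smd_by_pair <- apply(pairs, 2, abs_smd)
names(smd_by_pair) <- apply(pairs, 2, paste, collapse = " vs ")

# Aggregate across pairs: "max" is the worst-case imbalance, "mean" the average
agg <- c(max = max(smd_by_pair), mean = mean(smd_by_pair))
agg
```

The maximum is the more conservative summary: a variable passes a cutoff only if every pairwise comparison does.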

cutoffs

A numeric vector with one element for each metric specified in the type argument. Each element defines the cutoff for the corresponding metric, below which the dataset is considered balanced. If NULL, the default cutoffs are used: 0.1 for smd and r, and 2 for var_ratio.
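
As a rough illustration of how cutoffs line up with type, the following base R sketch applies the documented defaults when cutoffs is NULL. The helper is_balanced() is hypothetical and is not part of the package; matching cutoffs to metrics by position is an assumption based on the description above.

```r
# Default cutoffs as documented above
default_cutoffs <- c(smd = 0.1, r = 0.1, var_ratio = 2)

# Hypothetical helper: cutoffs are matched to metrics by position in `type`
is_balanced <- function(values, type, cutoffs = NULL) {
  if (is.null(cutoffs)) cutoffs <- default_cutoffs[type]
  stopifnot(length(cutoffs) == length(type))
  # A metric counts as balanced when its absolute value is below the cutoff
  setNames(abs(values) < cutoffs, type)
}

# Defaults: 0.1 for smd, 2 for var_ratio
is_balanced(c(0.08, 1.4), type = c("smd", "var_ratio"))

# A user-supplied cutoff relaxes the threshold for smd
is_balanced(0.15, type = "smd", cutoffs = 0.2)
```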

round

An integer specifying the number of decimal places to round the output to.

References

Rubin, D.B. Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation. Health Services & Outcomes Research Methodology 2, 169–188 (2001). https://doi.org/10.1023/A:1020363010465

Lopez, M.J., Gutman, R. Estimation of Causal Effects with Multiple Treatments: A Review and New Ideas. Statistical Science 32(3), 432–454 (2017).

Examples

# We balance the treatment groups (status) in the cancer dataset on the age
# and sex covariates
data(cancer)

# Firstly, we define the formula
formula_cancer <- formula(status ~ age * sex)

# Then we can estimate the generalized propensity scores
gps_cancer <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  reference = "control",
  verbose_output = TRUE
)

# ... and drop observations based on the common support region...
csr_cancer <- csregion(gps_cancer)

# ... to match the samples using `match_gps()`
matched_cancer <- match_gps(csr_cancer,
  reference = "control",
  caliper = 1,
  kmeans_cluster = 5,
  kmeans_args = list(n.iter = 100),
  verbose_output = TRUE
)

# Finally, we assess the quality of matching using `balqual()`
balqual(
  matched_data = matched_cancer,
  formula = formula_cancer,
  type = "smd",
  statistic = "max",
  round = 3,
  cutoffs = 0.2
)