This function performs energy-distance-based balancing: it selects a subset of the pool, based on energy distance, to approximate a randomized controlled trial. Optionally, it visualizes the balancing results.
VCG_sampler(formula, data, n, c_w = NULL, random = FALSE, plot = TRUE)
If `plot = TRUE`, returns a list with:
A data frame with added columns:
`VCG`: Indicator for selected pool units; `VCG == 1` marks the units selected into the VCG.
`e_weights`: Energy weights used for selection.
`<treated>_balanced`: A factor indicating balanced treated assignment.
A ggplot2 object showing the median and MAD differences before and after balancing, with a 95
If `plot = FALSE`, returns only the modified data frame.
`formula`: A formula specifying the treated indicator and covariates, e.g., `treated ~ cov1 + cov2 | stratum`. The treated variable must be binary (0 = pool, 1 = treated).
`data`: A data frame containing the variables specified in the formula.
`n`: Integer. Number of observations to sample from the pool, or a vector giving the number for each stratum.
`c_w`: Optional. Vector of positive weights for the covariates, reflecting their relative importance for balancing. Default: `NULL`.
`random`: Logical. If `TRUE`, observations are sampled with a probability based on their energy distance; otherwise, the nearest observations are used (deterministic). Default: `FALSE`.
`plot`: Logical. If `TRUE`, returns a visualization of the balancing effect. Default: `TRUE`.
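The two selection modes controlled by `random` can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helper `select_units()` and the vector `dist` are hypothetical, and it assumes each pool unit already has a positive distance score and that the random mode samples with probability inversely proportional to that score.

```r
# Illustrative sketch only; `select_units` and `dist` are hypothetical names
# and are not part of the package.
select_units <- function(dist, n, random = FALSE) {
  stopifnot(n <= length(dist), all(dist > 0))
  if (random) {
    # Sample without replacement, favouring units with a small distance.
    sample(seq_along(dist), size = n, prob = 1 / dist)
  } else {
    # Deterministic: indices of the n smallest distances.
    order(dist)[seq_len(n)]
  }
}

set.seed(1)
dist <- c(0.9, 0.2, 0.5, 0.1, 0.7)   # made-up scores for five pool units

det_idx <- select_units(dist, n = 2)                  # always units 4 and 2
rnd_idx <- select_units(dist, n = 2, random = TRUE)   # depends on the seed
```

The deterministic mode always returns the same units for the same data, while the random mode can return different units on each call unless a seed is set.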
If `random = FALSE`, the function selects the `n` units from the pool with the lowest energy distance and assigns them to the VCG group. If `random = TRUE`, the function samples `n` units from the pool with sampling probability inversely proportional to energy distance.
The quality of covariate balancing is visualized using differences in medians and median absolute deviations (MADs). Permutation ellipses are generated by randomly permuting the pool and treated group labels to estimate the variability expected under random assignment. Only the X and Y axes are computed directly; the ellipse is interpolated between the axes. This method is intended as a visual approximation rather than a precise statistical test.
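The balance diagnostics described above can be approximated in base R for a single covariate. This is a hedged sketch, not the function's internal code: `balance_diff()` is a hypothetical helper, the direction of the difference (treated minus pool), the 500 permutations, and the central 95% range are all assumptions made for illustration, and no ellipse is drawn.

```r
# Hypothetical helper: difference in median and MAD between the two groups
# (treated minus pool). mad() uses R's default consistency constant 1.4826.
balance_diff <- function(x, g) {
  c(median = median(x[g == 1]) - median(x[g == 0]),
    mad    = mad(x[g == 1]) - mad(x[g == 0]))
}

set.seed(42)
x <- c(rnorm(35, 10, 1), rnorm(15, 11, 1))  # simulated covariate
g <- rep(c(0, 1), c(35, 15))                # 0 = pool, 1 = treated

observed <- balance_diff(x, g)

# Permute the group labels to see how large the differences get by chance.
perm <- t(replicate(500, balance_diff(x, sample(g))))

# Central 95% range of the permuted differences on each axis.
limits <- apply(perm, 2, quantile, probs = c(0.025, 0.975))
```

Comparing `observed` against `limits` gives a rough, axis-by-axis analogue of checking whether a point falls inside the permutation ellipse.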
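# Simulate 35 pool units (treated = 0) and 15 treated units (treated = 1)
dat <- data.frame(
  cov1 = rnorm(50, 10, 1),
  cov2 = rnorm(50, 7, 1),
  cov3 = rnorm(50, 5, 1),
  treated = rep(c(0, 1), c(35, 15))
)
# Select 5 pool units; with the default plot = TRUE the result is a list
VCG_sampler(treated ~ cov1 + cov2 + cov3, data = dat, n = 5)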