bal.compute: Initialize and efficiently compute scalar balance statistics

Description

These functions are designed primarily for programmers who want to quickly compute one of several scalar (single-number) sample balance statistics, e.g., for use in selecting a tuning parameter when estimating balancing weights. bal.init() initializes the input so that computing the balance statistic is fast when bal.compute() is then called on its output with a set of weights. vignette("optimizing-balance") provides an overview and more examples of how to use these functions.
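The split between initialization and computation can be pictured with a small base R closure. This is only a sketch of the general pattern, not cobalt's implementation: the helper make_smd_max() is hypothetical, the data are simulated, and the standardization (a pooled unweighted standard deviation) is an assumption made for illustration.

```r
# Hypothetical helper, NOT part of cobalt: illustrates the init/compute split.
make_smd_max <- function(treat, covs) {
  covs <- as.matrix(covs)
  t1 <- treat == 1
  t0 <- treat == 0

  # Weight-independent piece, computed once:
  # pooled unweighted standard deviation of each covariate
  s <- sqrt((apply(covs[t1, , drop = FALSE], 2, var) +
             apply(covs[t0, , drop = FALSE], 2, var)) / 2)

  # Weight-dependent piece, cheap to call repeatedly
  function(weights = rep(1, length(treat))) {
    m1 <- colSums(covs[t1, , drop = FALSE] * weights[t1]) / sum(weights[t1])
    m0 <- colSums(covs[t0, , drop = FALSE] * weights[t0]) / sum(weights[t0])
    max(abs(m1 - m0) / s)
  }
}

# Simulated data (illustrative only)
set.seed(1)
n <- 200
covs <- cbind(x1 = rnorm(n), x2 = rnorm(n))
treat <- rbinom(n, 1, plogis(0.8 * covs[, "x1"]))

smd_max <- make_smd_max(treat, covs)  # plays the role of bal.init()
smd_max()                             # unweighted statistic

# Inverse probability weights for the ATE
ps <- glm(treat ~ covs, family = binomial)$fitted.values
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
smd_max(w)                            # plays the role of bal.compute(init, w)
```

Everything that does not depend on the weights is done once when the closure is created, so only the weighted means are recomputed on each call. That matters when the statistic is evaluated many times, as in a search over a tuning parameter.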

Usage

bal.init(stat, treat, covs, s.weights = NULL, ...)
bal.compute(init, weights = NULL)

Value

For bal.init(), a bal.init object containing the components created in the initialization and the function used to compute the balance statistic. For bal.compute(), a single numeric value.

Arguments

stat: the name of the statistic to compute. See Details.
treat: a vector containing the treatment variable.
covs: a matrix or data frame containing the covariates.
s.weights: optional; a vector of sampling weights.
...: other arguments used to specify options for the balance statistic. See Details for which arguments are allowed with each balance statistic.
init: a bal.init object created by bal.init().
weights: a vector of balancing weights used to compute the weighted statistic.

Details

The following list contains the allowable balance statistics that can be supplied to bal.init(), the additional arguments that can be used with each one, and the treatment types allowed with each one. For all balance statistics, lower values indicate better balance.

smd.mean, smd.max, smd.rms: The mean, maximum, or root-mean-squared absolute standardized mean difference, computed using col_w_smd(). The other allowable arguments include estimand (ATE, ATC, or ATT) to select the estimand, focal to identify the focal treatment group when the ATT is the estimand and the treatment has more than two categories, and pairwise to select whether mean differences should be computed between each pair of treatment groups or between each treatment group and the target group identified by estimand (default TRUE). Can be used with binary and multi-category treatments.
ks.mean, ks.max, ks.rms: The mean, maximum, or root-mean-squared Kolmogorov-Smirnov statistic, computed using col_w_ks(). The other allowable arguments include estimand (ATE, ATC, or ATT) to select the estimand, focal to identify the focal treatment group when the ATT is the estimand and the treatment has more than two categories, and pairwise to select whether statistics should be computed between each pair of treatment groups or between each treatment group and the target group identified by estimand (default TRUE). Can be used with binary and multi-category treatments.
ovl.mean, ovl.max, ovl.rms: The mean, maximum, or root-mean-squared overlapping coefficient complement, computed using col_w_ovl(). The other allowable arguments include estimand (ATE, ATC, or ATT) to select the estimand, integrate to select whether integration is done using integrate() (TRUE) or a Riemann sum (FALSE, the default), focal to identify the focal treatment group when the ATT is the estimand and the treatment has more than two categories, and pairwise to select whether statistics should be computed between each pair of treatment groups or between each treatment group and the target group identified by estimand (default TRUE). Can be used with binary and multi-category treatments.
mahalanobis: The Mahalanobis distance between the treatment group means. This is similar to smd.rms but the covariates are standardized to remove correlations between them and de-emphasize redundant covariates. The other allowable arguments include estimand (ATE, ATC, or ATT) to select the estimand and focal to identify the focal treatment group when the ATT is the estimand. Can only be used with binary treatments.
energy.dist: The total energy distance between each treatment group and the target sample, which is a scalar measure of the similarity between two multivariate distributions. The other allowable arguments include estimand (ATE, ATC, or ATT) to select the estimand, focal to identify the focal treatment group when the ATT is the estimand and the treatment has more than two categories, and improved to select whether the "improved" energy distance should be used, which emphasizes differences between treatment groups in addition to differences between each treatment group and the target sample (default TRUE). Can be used with binary and multi-category treatments.
l1.med: The median L1 statistic computed across a random selection of possible coarsenings of the data. The other allowable arguments include l1.min.bin (default 2) and l1.max.bin (default 12) to select the minimum and maximum number of bins with which to bin continuous variables and l1.n (default 101) to select the number of binnings used to select the binning at the median. covs should be supplied without splitting factors into dummies to ensure the binning works correctly. Can be used with binary and multi-category treatments.
r2, r2.2, r2.3: The post-weighting \(R^2\) of a model for the treatment. The other allowable arguments include poly to add polynomial terms of the supplied order to the model and int (default FALSE) to add two-way interactions between covariates to the model. Using r2.2 is a shortcut to requesting squares, and using r2.3 is a shortcut to requesting cubes. Can be used with binary and continuous treatments. For binary treatments, the McKelvey and Zavoina \(R^2\) from a logistic regression is used; for continuous treatments, the \(R^2\) from a linear regression is used.
p.mean, p.max, p.rms: The mean, maximum, or root-mean-squared absolute Pearson correlation between the treatment and covariates, computed using col_w_corr(). Can only be used with continuous treatments.
s.mean, s.max, s.rms: The mean, maximum, or root-mean-squared absolute Spearman correlation between the treatment and covariates, computed using col_w_corr(). Can only be used with continuous treatments.
distance.cov: The distance covariance between the scaled covariates and treatment, which is a scalar measure of the independence of two possibly multivariate distributions. Can only be used with continuous treatments.
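Several of the statistics above come in .mean, .max, and .rms variants, which differ only in how the per-covariate values are collapsed into one number. The base R sketch below shows the three summaries on a made-up vector of absolute standardized mean differences. The values and covariate names are illustrative assumptions, not output from col_w_smd().

```r
# Illustrative values only: absolute standardized mean differences
# for four covariates (not computed from real data)
abs_smd <- c(age = 0.05, educ = 0.10, married = 0.20, re74 = 0.25)

smd_mean <- mean(abs_smd)          # the summary used by "smd.mean"
smd_max  <- max(abs_smd)           # the summary used by "smd.max"
smd_rms  <- sqrt(mean(abs_smd^2))  # the summary used by "smd.rms"

c(mean = smd_mean, max = smd_max, rms = smd_rms)
```

For any set of non-negative values, the mean is no larger than the root-mean-square, which is no larger than the maximum. The maximum tracks only the worst-balanced covariate, the mean treats all covariates equally, and the root-mean-square sits in between by penalizing large imbalances more heavily.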

Examples

library(cobalt)

# The example needs the MatchIt package to form the subclasses
if (requireNamespace("MatchIt", quietly = TRUE)) {
# Select the optimal number of subclasses for
# subclassification:
data("lalonde", package = "cobalt")
covs <- c("age", "educ", "race", "married",
          "nodegree", "re74", "re75")

# Estimate propensity score
p <- glm(reformulate(covs, "treat"),
         data = lalonde, 
         family = "binomial")$fitted.values

# Function to compute subclassification weights
subclass_ATE <- function(treat, p, nsub) {
    m <- MatchIt::matchit(treat ~ 1,
                          data = lalonde,
                          distance = p,
                          method = "subclass",
                          estimand = "ATE",
                          subclass = nsub)
    return(m$weights)
}

# Initialize balance statistic; largest KS statistic
init <- bal.init("ks.max", treat = lalonde$treat,
                 covs = lalonde[covs],
                 estimand = "ATE")

# Testing 4 to 50 subclasses
nsubs <- 4:50
stats <- vapply(nsubs, function(n) {
    w <- subclass_ATE(lalonde$treat, p, n)
    bal.compute(init, w)
}, numeric(1L))

plot(stats ~ nsubs)

# 6 subclasses give the lowest ks.max value (.238)
nsubs[which.min(stats)]
stats[which.min(stats)]
}