bal.tab.df.formula: Balance Statistics for Data Sets

Description

Generates balance statistics for unadjusted, matched, weighted, or stratified data using either a data frame or formula interface.

Usage

## S3 method for class 'data.frame'
bal.tab(x, treat, data = NULL, weights = NULL, distance = NULL,
        subclass = NULL, method, int = FALSE, addl = NULL,
        continuous = c("std", "raw"), binary = c("raw", "std"),
        s.d.denom = c("treated", "pooled", "control"),
        m.threshold = NULL, v.threshold = NULL, r.threshold = NULL,
        un = FALSE, disp.means = FALSE, disp.v.ratio = FALSE,
        disp.subclass = FALSE, cluster = NULL, which.cluster = NULL,
        cluster.summary = TRUE, quick = FALSE, ...)

## S3 method for class 'formula'
bal.tab(formula, data, weights = NULL, distance = NULL,
        subclass = NULL, method, int = FALSE, addl = NULL,
        continuous = c("std", "raw"), binary = c("raw", "std"),
        s.d.denom = c("treated", "pooled", "control"),
        m.threshold = NULL, v.threshold = NULL, r.threshold = NULL,
        un = FALSE, disp.means = FALSE, disp.v.ratio = FALSE,
        disp.subclass = FALSE, cluster = NULL, which.cluster = NULL,
        cluster.summary = TRUE, quick = FALSE, ...)

Arguments

A data frame containing covariate values for each unit.

treat

Either a vector containing treatment status values for each unit or a string containing the name of the treatment variable in data.

formula

A formula with the treatment variable as the response and the covariates for which balance is to be assessed as the terms. All variables in the formula must be present as variable names in data.

data

For the data frame method: Optional; a data frame containing variables with the names used in treat, weights, distance, and/or subclass, if any.

For the formula method: Required; a data frame containing all covariates named in formula and variables with the names used in weights, distance, and/or subclass, if any.

weights

Optional; either a vector containing weights for each unit or a string containing the name of the weights variable in data. These can be weights generated by, e.g., inverse probability weighting or matching weights resulting from a matching algorithm. The type of weights must be indicated in method. If weights = NULL and subclass = NULL, balance information will be presented only for the unadjusted sample.

distance

Optional; either a vector containing distance values (e.g., propensity scores) for each unit or a string containing the name of the distance variable in data.

subclass

Optional; either a vector containing subclass membership for each unit or a string containing the name of the subclass variable in data. If weights = NULL and subclass = NULL, balance information will be presented only for the unadjusted sample.

method

A string containing the method of adjustment, if any. If weights are specified, the user must specify either "matching" or "weighting"; "weighting" is the default. If subclass is specified, "subclassification" is the default. Abbreviations allowed.

int

logical; whether or not to include 2-way interactions of covariates included in covs and in addl.

addl

A data frame of additional covariates for which to present balance. These may be covariates included in the original dataset but not included in covs. In general, it makes more sense to include all desired variables in covs than in addl. See note in Details for using addl.

continuous

Whether mean differences for continuous variables should be standardized ("std") or raw ("raw"). Default "std". Abbreviations allowed.

binary

Whether mean differences for binary variables (i.e., difference in proportion) should be standardized ("std") or raw ("raw"). Default "raw". Abbreviations allowed.

s.d.denom

Whether the denominator for standardized differences (if any are calculated) should be the standard deviation of the treated group ("treated"), the standard deviation of the control group ("control"), or the pooled standard deviation ("pooled"), computed as the square root of the mean of the group variances. Abbreviations allowed. The default is "treated".

m.threshold

A numeric value for the threshold for mean differences. .1 is recommended.

v.threshold

A numeric value for the threshold for variance ratios. Will automatically convert to the inverse if less than 1.

r.threshold

A numeric value for the threshold for correlations between covariates and treatment when treatment is continuous.

un

logical; whether to print statistics for the unadjusted sample as well as for the adjusted sample. If weights = NULL and subclass = NULL, un will be set to TRUE.

disp.means

logical; whether to print the group means in balance output.

disp.v.ratio

logical; whether to display variance ratios in balance output.

disp.subclass

logical; whether to display balance information for individual subclasses if subclassification is used in conditioning.

cluster

Either a vector containing cluster membership for each unit or a string containing the name of the cluster membership variable in data.

which.cluster

which cluster(s) to display. If NULL, all clusters in cluster will be displayed. If NA, no clusters will be displayed. Otherwise, can be a vector of cluster names or numerical indices for which to display balance. Indices correspond to the alphabetical order of cluster names.

cluster.summary

logical; whether to display the cluster summary table if cluster is specified. If which.cluster is NA, cluster.summary will be set to TRUE.

quick

logical; if TRUE, will not compute any values that will not be displayed. Leave FALSE if computed values not displayed will be used later.

...

further arguments passed to or from other methods. They are ignored in this function.
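
The three s.d.denom choices can be made concrete with a small base R sketch. The toy data and the helper name std_mean_diff are hypothetical, and this is not cobalt's internal code; it only follows the definitions given above, with the pooled standard deviation taken as the square root of the mean of the two group variances.

```r
# Hypothetical toy data: one continuous covariate and a binary treatment.
x     <- c(2, 4, 6, 8, 1, 3, 5, 7, 9, 11)
treat <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)

# Illustrative helper (not part of cobalt).
# match.arg() also accepts abbreviations such as "pool".
std_mean_diff <- function(x, treat,
                          s.d.denom = c("treated", "pooled", "control")) {
  s.d.denom <- match.arg(s.d.denom)
  x1 <- x[treat == 1]
  x0 <- x[treat == 0]
  denom <- switch(s.d.denom,
                  treated = sd(x1),
                  control = sd(x0),
                  pooled  = sqrt((var(x1) + var(x0)) / 2))
  (mean(x1) - mean(x0)) / denom
}

# Same raw difference, three different scalings.
smd <- sapply(c("treated", "pooled", "control"),
              function(d) std_mean_diff(x, treat, d))
smd
```

Because only the denominator changes, the sign of the difference is the same under all three options; the magnitude depends on which group's spread is used as the yardstick.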

Value

If clusters are not specified, an object of class "bal.tab" containing balance summaries for the data object. The elements of bal.tab differ depending on whether subclassification is used.

If clusters are specified, an object of class "bal.tab.cluster" containing balance summaries within each cluster and a summary of balance across clusters. Each balance summary is a balance table, as in the Balance element of bal.tab. The summary of balance across clusters displays the mean, median, and maximum mean difference and variance ratio after adjustment for each covariate across clusters. Minimum statistics are calculated as well, but not displayed. To see these, use the options in print.bal.tab.cluster.

If treatment is continuous, means, mean differences, and variance ratios are replaced by (weighted) Pearson correlations between each covariate and treatment. The r.threshold argument works the same as m.threshold or v.threshold, adding an extra column to the balance table output and creating additional summaries for balance tallies and maximum imbalances. All arguments related to the calculation or display of mean differences or variance ratios are ignored. The int, addl, un, and cluster arguments are still used as described above.
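
For the continuous-treatment case, the statistic described above can be sketched in base R. The data, the weights, the helper name weighted_cor, and the threshold value are all hypothetical, and cobalt's own computation may differ in detail; the sketch only shows what a weighted Pearson correlation between a covariate and treatment looks like.

```r
# Hypothetical toy data: a continuous treatment, one covariate, and weights.
treat <- c(0.5, 1.2, 2.0, 2.7, 3.1, 4.4)
age   <- c(23, 31, 28, 40, 37, 52)
w     <- c(1.0, 0.8, 1.5, 0.7, 1.2, 0.9)

# Illustrative helper (not part of cobalt): weighted Pearson correlation
# via stats::cov.wt(), which rescales the weights to sum to 1 internally.
weighted_cor <- function(x, y, w = rep(1, length(x))) {
  cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor[1, 2]
}

r_unadj <- weighted_cor(age, treat)      # equal weights: ordinary correlation
r_adj   <- weighted_cor(age, treat, w)   # correlation in the weighted sample

# Compare both against a hypothetical r.threshold.
r.threshold <- 0.1
abs(c(unadjusted = r_unadj, adjusted = r_adj)) <= r.threshold
```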

Details

bal.tab.data.frame() generates a list of balance summaries for the data frame of covariates and treatment status values given. bal.tab.formula() does the same but uses a formula interface instead. When the formula interface is used, the formula and data are reshaped into a treatment vector and data frame of covariates and then simply passed through the data frame method.
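
The reshaping step described above can be illustrated with base R alone. This is a minimal sketch under the assumption that the split is done with model.frame(); the toy data are hypothetical, and the package's actual implementation may differ.

```r
# Hypothetical toy data.
dat <- data.frame(treat = c(1, 0, 1, 0, 1),
                  age   = c(30, 41, 25, 52, 38),
                  educ  = c(12, 16, 10, 14, 12),
                  re74  = c(0, 5000, 1200, 8000, 300))

f <- treat ~ age + educ

# model.frame() keeps only the variables named in the formula;
# the response (here, the treatment) is the first column.
mf    <- model.frame(f, data = dat)
treat <- model.response(mf)       # treatment vector
covs  <- mf[, -1, drop = FALSE]   # data frame of covariates

length(treat)
names(covs)
```

The two pieces, a treatment vector and a covariate data frame, are the same inputs the data frame interface takes directly.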

bal.tab() behaves differently depending on whether subclasses are used in conditioning or not. If they are used, bal.tab creates balance statistics for each subclass and for the sample in aggregate. If weights are specified, subclass will be ignored unless method is specified as "subclassification".

The last four arguments of bal.tab affect display only; they are passed directly to print.bal.tab or print.bal.tab.subclass, and do not affect any calculations or the contents of the bal.tab object. All balance statistics are calculated whether they are displayed by print or not.

The threshold values (m.threshold, v.threshold, and r.threshold) control whether extra columns should be inserted into the Balance table describing whether the balance statistics in question exceeded or were within the threshold. Including these thresholds also creates summary tables tallying the number of variables that exceeded and were within the threshold and displaying the variables with the greatest imbalance on that balance measure. When subclassification is used, the extra threshold columns are placed within the balance tables for each subclass as well as in the aggregate balance table, and the summary tables display balance for each subclass.
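
The effect of the thresholds can be sketched with a hand-built table. Everything below is hypothetical: the statistics, the column names, and the "Balanced"/"Not Balanced" labels are illustrative rather than the package's actual output, and folding variance ratios onto the side above 1 is an assumption about how one threshold can cover both directions.

```r
# Hypothetical adjusted statistics for four covariates.
mean.diff <- c(age = 0.04, educ = -0.15, black = 0.02, hispan = 0.11)
v.ratio   <- c(age = 1.30, educ = 0.45, black = 1.05, hispan = 2.40)

m.threshold <- 0.1
v.threshold <- 0.5
# A variance-ratio threshold below 1 is converted to its inverse.
if (v.threshold < 1) v.threshold <- 1 / v.threshold

# Assumption: put every ratio on the ">= 1" side before comparing.
v.ratio.folded <- pmax(v.ratio, 1 / v.ratio)

balance <- data.frame(
  Diff        = mean.diff,
  M.Threshold = ifelse(abs(mean.diff) < m.threshold,
                       "Balanced", "Not Balanced"),
  V.Ratio     = v.ratio,
  V.Threshold = ifelse(v.ratio.folded < v.threshold,
                       "Balanced", "Not Balanced")
)
balance

# Tally of covariates inside and outside the mean-difference threshold,
# and the covariate with the largest absolute mean difference.
table(balance$M.Threshold)
names(which.max(abs(mean.diff)))
```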

The input to addl must be a data frame; if more than one variable is included, this is straightforward (i.e., because data[,c("v1", "v2")] is already a data frame), but if only one variable is used (e.g., data[,"v1"]), R will coerce it to a vector, thus making it unfit for input in addl. To avoid this, simply wrap the input to addl in data.frame() or use subset() if only one variable is to be added. Again, when more than one variable is included, the input is generally already a data frame and nothing needs to be done. It is recommended to include all desired variables in formula or covs rather than specifying additional variables using addl.
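
The coercion described above is standard R behavior and is easy to check on a toy data frame. The data and object names here are hypothetical; the third option, drop = FALSE, is not mentioned above but is another base R way to get the same result.

```r
# Hypothetical toy data.
dat <- data.frame(v1 = c(1.5, 2.0, 3.5),
                  v2 = c(0, 1, 1),
                  v3 = c(10, 20, 30))

# Two or more columns: the result is already a data frame.
is.data.frame(dat[, c("v1", "v2")])   # TRUE

# One column: `[` drops to a plain vector by default.
is.data.frame(dat[, "v1"])            # FALSE

# Three ways to keep a single column as a data frame for addl.
addl_a <- data.frame(v1 = dat[, "v1"])
addl_b <- subset(dat, select = v1)
addl_c <- dat[, "v1", drop = FALSE]

sapply(list(addl_a, addl_b, addl_c), is.data.frame)
```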

Examples

data("lalonde", package = "cobalt")

## Propensity score weighting using IPTW
glm1 <- glm(treat ~ age + educ + black + hispan, data = lalonde, 
            family = "binomial")
lalonde$distance <- glm1$fitted.values
lalonde$iptw.weights <- ifelse(lalonde$treat==1, 
                               1/lalonde$distance, 
                               1/(1-lalonde$distance))
covariates <- subset(lalonde, 
                     select = c(age, educ, black, hispan))

# Data frame interface:
bal.tab(covariates, treat = "treat", data = lalonde, 
      weights = "iptw.weights", method = "weighting", 
      s.d.denom = "pooled")

# Formula interface:
bal.tab(treat ~ age + educ + black + hispan, data = lalonde, 
      weights = "iptw.weights", method = "weighting", 
      s.d.denom = "pooled")