bmatch: Optimal bipartite matching in observational studies

Description

Function for optimal bipartite matching in observational studies that directly balances the observed covariates. bmatch allows the user to enforce different forms of covariate balance in the matched samples such as moment balance (e.g., of means, variances and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing) and exact matching. In observational studies where an instrumental variable is available, bmatch can be used to handle weak instruments and strengthen them by means of the far options (see Yang et al. 2014 for an example). bmatch can also be used in discontinuity designs by matching units in a neighborhood of the discontinuity (see Keele et al. 2015 for details). In any of these settings, bmatch either minimizes the total sum of covariate distances between matched units, maximizes the total number of matched units, or optimizes a combination of the two, subject to matching, covariate balancing, and representativeness constraints (see the examples below).

Usage

bmatch(t_ind, dist_mat = NULL, subset_weight = NULL, n_controls = 1,
	       total_groups = NULL, mom = NULL, ks = NULL, exact = NULL, 
	       near_exact = NULL, fine = NULL, near_fine = NULL, near = NULL, 
	       far = NULL, solver = NULL)

Value

A list containing the optimal solution, with the following objects:

obj_total: value of the objective function at the optimum;
obj_dist_mat: value of the total sum of distances term of the objective function at the optimum;
t_id: indexes of the matched treated units at the optimum;
c_id: indexes of the matched controls at the optimum;
group_id: matched pairs or groups at the optimum;
time: time elapsed to find the optimal solution.

Arguments

t_ind

treatment indicator: a vector of zeros and ones indicating treatment (1 = treated; 0 = control). Please note that the data needs to be sorted in decreasing order according to this treatment indicator.

dist_mat

distance matrix: a matrix of positive distances between treated units (rows) and controls (columns). If dist_mat = NULL and subset_weight = 1, then bmatch will solve the cardinality matching problem in Zubizarreta et al. (2014).

subset_weight

subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger subset_weight, the more importance will be given to the the total number of matched pairs relative to the total sum of distances between matched pairs. See Rosenbaum (2012) and Zubizarreta et al. (2013) for a discussion of this parameter.

If subset_weight = NULL, then bmatch will match all the treated observations, provided there exists a feasible solution.

If subset_weight = 1 and dist_mat = NULL, then bmatch will solve the cardinality matching problem in Zubizarreta et al. (2014).

n_controls

number of controls: a scalar defining the number of controls to be matched with a fixed rate to each treated unit. The default is n_controls = 1, i.e. pair matching.

total_groups

total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If total_groups = NULL, then no specific number of matched pairs is required before matching.

mom

moment balance parameters: a list with three arguments,

mom = list(covs = mom_covs, tols = mom_tols, targets = mom_targets).

mom_covs is a matrix where each column is a covariate whose mean is to be balanced. mom_tols is a vector of tolerances for the maximum difference in means for the covariates in mom_covs. mom_targets is a vector of target moments (e.g., means) of a distribution to be approximated by matched sampling. mom_targets is optional, but if mom_covs is specified then mom_tols needs to be specified too. If mom_targets is NULL, then bmatch will match treated and control units so that covariates in mom_covs differ at most by mom_tols. If mom_targets is specified, then bmatch will match treated and control units so that each matched group differs at most by mom_tols units from the respective moments in mom_targets. As a result, the matched groups will differ at most mom_tols * 2 from each other. Under certain assumptions, mom_targets can be used for constructing a representative matched sample. The lengths of mom_tols and mom_target have to be equal to the number of columns of mom_covs. Note that the columns of mom_covs can be transformations of the original covariates to balance higher order single-dimensional moments like variances and skewness, and multidimensional moments such as correlations (Zubizarreta 2012).

ks

Kolmogorov-Smirnov balance parameters: a list with three objects,

ks = list(covs = ks_covs, n_grid = ks_n_grid, tols = ks_tols).

ks_covs is a matrix where each column is a covariate whose difference in a coarsened version of the Kolmogorov-Smirnov statistic between treated units and matched controls is to be constrained. ks_n_grid is scalar defining the number of equispaced grid points for calculating the Kolmogorov-Smirnov statistic between the distributions of the treated units and matched controls for the covariates in ks_covs. The defaul is ks_n_grid = 10 for the deciles of the empirical cumulative distribution functions of the treated units for the covariate in question. ks_tols is a vector of tolerances for the maximum differences in means for the covariates defined in ks_covs. Note that the length of ks_tols has to be equal to the number of columns of ks_covs.

exact

Exact matching parameters: a list with one argument,

exact = list(covs = exact_covs),

where exact_covs is a matrix where each column is a nominal covariate for exact matching.

near_exact

Near-exact matching parameters: a list with two objects,

near_exact = list(covs = near_exact_covs, devs = near_exact_devs).

near_exact_covs are the near-exact matching covariates; specifically, a matrix where each column is a nominal covariate for near-exact matching. near_exact_devs are the maximum deviations from near-exact matching: a vector of scalars defining the maximum deviation allowed from exact matching for the covariates defined in near_exact_covs. Note that the length of near_exact_devs has to be equal to the number of columns of near_exact_covs. For detailed expositions of near-exact matching, see section 9.2 of Rosenbaum (2010) and Zubizarreta et al. (2011).

fine

Fine balance parameters: a list with one argument,

fine = list(covs = fine_covs),

where fine_covs is a matrix where each column is a nominal covariate for fine balance. Fine balance enforces exact distributional balance on nominal covariates, but without constraining treated and control units to be matched within each category of each nominal covariate as in exact matching. See chapter 10 of Rosenbaum (2010) for details.

near_fine

Near-fine balance parameters: a list with two objects,

near_fine = list(covs = near_fine_covs, devs = near_fine_devs).

near_fine_covs is a matrix where each column is a nominal covariate for near-fine matching. near_fine_devs is a vector of scalars defining the maximum deviation allowed from fine balance for the covariates in near_fine_covs. Note that the length of near_fine_devs has to be equal to the number of columns of near_fine_covs. See Yang et al. (2012) for a description of near-fine balance.

near

Near matching parameters: a list with three objects,

near = list(covs = near_covs, pairs = near_pairs, groups = near_groups).

near_covs is a matrix where each column is a variable for near matching. near_pairs is a vector determining the maximum distance between individual matched pairs for each variable in near_covs. near_groups is a vector determining the maximum average distance between matched groups (in aggregate) for each covariate in near_covs. If near_covs is specified, then either near_pairs, near_covs or both must be specified as well, and the length of near_pairs and/or near_groups has to be equal to the number of columns of near_covs. The near options can be useful for matching within a window of a discontinuity (see Keele et al. 2015 for details).

far

Far matching parameters: a list with three objects,

far = list(covs = far_covs, pairs = far_pairs, groups = far_groups).

far_covs is a matrix where each column is a variable (a covariate or an instrumental variable) for far matching. far_pairs is a vector determining the minimum distance between units in a matched pair for each variable in far_covs, and far_groups is a vector defining the minimum average (aggregate) distance between matched groups for each variable in far_covs. If far_covs is specified, then either far_pairs, far_covs, or both, must be specified, and the length of far_pairs and/or far_groups has to be equal to the number of columns of far_covs. See Yang et al. (2014) for strengthening an instrumental variable via far matching in a bipartite setting.

solver

Optimization solver parameters: a list with four objects,

solver = list(name = name, t_max = t_max, approximate = 1, round_cplex = 0,
trace_cplex = 0).

solver is a string that determines the optimization solver to be used. The options are: cplex, glpk, gurobi, highs, and symphony. The default solver is highs with approximate = 1, so that by default an approximate solution is found (see approximate below). For an exact solution, we strongly recommend using cplex or gurobi as they are much faster than the other solvers, but they do require a license (free for academics, but not for people outside universities). Between cplex and gurobi, note that installing the R interface for gurobi is much simpler.

t_max is a scalar with the maximum time limit for finding the matches. This option is specific to cplex and gurobi. If the optimal matches are not found within this time limit, a partial, suboptimal solution is given.

approximate is a scalar that determines the method of solution. If approximate = 1 (the default), an approximate solution is found via a relaxation of the original integer program. This method of solution is faster than approximate = 0, but some balancing constraints may be violated to some extent. This option works only with n_controls = 1, i.e. pair matching.

round_cplex is binary specific to cplex. round_cplex = 1 ensures that the solution found is integral by rounding and all the constraints are exactly statisfied; round_cplex = 0 (the default) encodes there is no rounding which may return slightly infeasible integer solutions.

trace is a binary specific to cplex and gurobi. trace = 1 turns the optimizer output on. The default is trace = 0.

Author

Jose R. Zubizarreta <zubizarreta@hcp.med.harvard.edu>, Cinar Kilcioglu <ckilcioglu16@gsb.columbia.edu>.

References

Keele, L., Titiunik, R., and Zubizarreta, J. R., (2015), "Enhancing a Geographic Regression Discontinuity Design Through Matching to Estimate the Effect of Ballot Initiatives on Voter Turnout," Journal of the Royal Statistical Society: Series A, 178, 223-239.

Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.

Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observa- tional studies," Journal of Computational and Graphical Statistics, 21, 57-71.

Yang, D., Small, D., Silber, J. H., and Rosenbaum, P. R. (2012), "Optimal Matching With Minimal Deviation From Fine Balance in a Study of Obesity and Surgical Outcomes," Biometrics, 68, 628-36.

Yang. F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.

Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.

Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), "Matching for Balance, Pairing for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit High Schools in Chile," Annals of Applied Statistics, 8, 204-231.

Examples

Run this code


## Uncomment the following examples
## Load, sort, and attach data
#data(lalonde)
#lalonde = lalonde[order(lalonde$treatment, decreasing = TRUE), ]
#attach(lalonde)

################################# 
## Example 1: cardinality matching
################################# 

## Cardinality matching finds the largest matched sample of pairs that meets balance 
## requirements.  Here the balance requirements are mean balance, fine balance and 
## exact matching for different covariates. The solver used is glpk with the 
## approximate option.

## Treatment indicator; note that the data needs to be sorted in decreasing order
## according to this treatment indicator
#t_ind = treatment
#t_ind 

## Distance matrix
#dist_mat = NULL 

## Subset matching weight
#subset_weight = 1

## Moment balance: constrain differences in means to be at most .05 standard deviations apart
#mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75)
#mom_tols = round(absstddif(mom_covs, t_ind, .05), 2)
#mom = list(covs = mom_covs, tols = mom_tols)

## Fine balance
#fine_covs = cbind(black, hispanic, married, nodegree)
#fine = list(covs = fine_covs)

## Exact matching
#exact_covs = cbind(black)
#exact = list(covs = exact_covs)

## Solver options
#t_max = 60*5
#solver = "glpk"
#approximate = 1
#solver = list(name = solver, t_max = t_max, approximate = approximate,
#round_cplex = 0, trace = 0)

## Match                   
#out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, 
#mom = mom, fine = fine, exact = exact, solver = solver)              
              
## Indices of the treated units and matched controls
#t_id = out$t_id  
#c_id = out$c_id	

## Time
#out$time/60

## Matched group identifier (who is matched to whom)
#out$group_id

## Assess mean balance
#meantab(mom_covs, t_ind, t_id, c_id)

## Assess fine balance (note here we are getting an approximate solution)
#for (i in 1:ncol(fine_covs)) {		
#  print(finetab(fine_covs[, i], t_id, c_id))
#}

## Assess exact matching balance
#table(exact_covs[t_id]==exact_covs[c_id])

################################## 
## Example 2: minimum distance matching
################################## 

## The goal here is to minimize the total of distances between matched pairs.  In 
## this example there are no covariate balance requirements.  Again, the solver 
## used is glpk with the approximate option

## Treatment indicator
#t_ind = treatment

## Matrix of covariates
#X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75)

## Distance matrix
#dist_mat = distmat(t_ind, X_mat)

## Subset matching weight
#subset_weight = NULL

## Total pairs to be matched
#total_groups = sum(t_ind)

## Solver options
#t_max = 60*5
#solver = "glpk"
#approximate = 1
#solver = list(name = solver, t_max = t_max, approximate = approximate, 
#round_cplex = 0, trace_cplex = 0)

## Match                   
#out = bmatch(t_ind = t_ind, dist_mat = dist_mat, total_groups = total_groups, 
#solver = solver)              
              
## Indices of the treated units and matched controls
#t_id = out$t_id  
#c_id = out$c_id	

## Total of distances between matched pairs
#out$obj_total

## Assess mean balance
#meantab(X_mat, t_ind, t_id, c_id)

################################## 
## Example 3: optimal subset matching
################################## 

## Optimal subset matching pursues two competing goals at 
## the same time: to minimize the total sum of covariate distances  
## while matching as many observations as possible.  The trade-off 
## between these two goals is regulated by the parameter subset_weight 
## (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion).
## Here the balance requirements are mean balance, near-fine balance 
## and near-exact matching for different covariates.
## Again, the solver used is glpk with the approximate option.

## Treatment indicator
#t_ind = treatment

## Matrix of covariates
#X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75)

## Distance matrix
#dist_mat = distmat(t_ind, X_mat)

## Subset matching weight
#subset_weight = median(dist_mat)

## Moment balance: constrain differences in means to be at most .05 standard deviations apart
#mom_covs = cbind(age, education, black, hispanic, married, nodegree, re74, re75)
#mom_tols = round(absstddif(mom_covs, t_ind, .05), 2)
#mom = list(covs = mom_covs, tols = mom_tols)

## Near-fine balance
#near_fine_covs = cbind(married, nodegree)
#near_fine_devs = rep(5, 2)
#near_fine = list(covs = near_fine_covs, devs = near_fine_devs)

## Near-exact matching
#near_exact_covs = cbind(black, hispanic)
#near_exact_devs = rep(5, 2)
#near_exact = list(covs = near_exact_covs, devs = near_exact_devs)

## Solver options
#t_max = 60*5
#solver = "glpk"
#approximate = 1
#solver = list(name = solver, t_max = t_max, approximate = approximate, 
#round_cplex = 0, trace_cplex = 0)

## Match                   
#out = bmatch(t_ind = t_ind, dist_mat = dist_mat, subset_weight = subset_weight, 
#mom = mom, near_fine = near_fine, near_exact = near_exact, solver = solver)              
              
## Indices of the treated units and matched controls
#t_id = out$t_id  
#c_id = out$c_id	

## Time
#out$time/60

## Matched group identifier (who is matched to whom)
#out$group_id

## Assess mean balance (note here we are getting an approximate solution)
#meantab(X_mat, t_ind, t_id, c_id)

## Assess fine balance
#for (i in 1:ncol(near_fine_covs)) {		
#	print(finetab(near_fine_covs[, i], t_id, c_id))
#}

## Assess exact matching balance
#for (i in 1:ncol(near_exact_covs)) {	
#	print(table(near_exact_covs[t_id, i]==near_exact_covs[c_id, i]))
#}

Run the code above in your browser using DataLab