nmatch: Optimal nonbipartite matching in randomized experiments and observational studies

Description

Function for optimal nonbipartite matching in randomized experiments and observational studies that directly balances the observed covariates. nmatch allows the user to enforce different forms of covariate balance in the matched samples, such as moment balance (e.g., of means, variances, and correlations), distributional balance (e.g., fine balance, near-fine balance, strength-k balancing), and exact matching. Among other uses, nmatch can be applied in the design of randomized experiments for matching before randomization (Greevy et al. 2004, Zou and Zubizarreta 2015), and in observational studies for matching with doses and strengthening an instrumental variable (Baiocchi et al. 2010, Lu et al. 2011).

Usage

nmatch(dist_mat, subset_weight = NULL, total_pairs = NULL, mom = NULL,
	       exact = NULL, near_exact = NULL, fine = NULL, near_fine = NULL,
	       near = NULL, far = NULL, solver = NULL)

Arguments

dist_mat

distance matrix: a matrix of positive distances between units.
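
As a minimal sketch, a matrix of this form can be built with base R's dist() and as.matrix(). The toy covariates below are invented for illustration and are not part of the package.

```r
# Toy covariate matrix (hypothetical data): 4 units, 2 covariates
X_mat <- cbind(age = c(25, 30, 41, 38), education = c(12, 16, 10, 12))

# Pairwise Euclidean distances between all units, expanded to a full square matrix
dist_mat <- as.matrix(dist(X_mat))

# Nonbipartite matching pairs units with each other, so the matrix has
# one row and one column per unit and is symmetric
dim(dist_mat)
isSymmetric(dist_mat)
```

In practice the covariates are often rescaled before computing distances so that no single variable dominates; that choice is left to the user.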

subset_weight

subset matching weight: a scalar that regulates the trade-off between the total sum of distances between matched pairs and the total number of matched pairs. The larger subset_weight, the more importance will be given to the total number of matched pairs relative to the total sum of distances between matched pairs. See Rosenbaum (2012) and Zubizarreta et al. (2013) for a discussion of this parameter. If subset_weight = NULL, then nmatch will match all the available units, provided a feasible solution exists.
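
To make the trade-off concrete, the sketch below assumes an objective of the form (sum of matched distances) minus subset_weight times (number of matched pairs). This form is an assumption made for illustration, not taken from the package source, and the sketch ignores balance constraints and the requirement that each unit appear in at most one pair.

```r
# Hypothetical distances for three non-overlapping candidate pairs
pair_dists <- c(0.4, 1.1, 2.6)

# Assumed objective (illustrative only): total distance of the kept pairs
# minus subset_weight times the number of kept pairs
objective <- function(keep, subset_weight) {
  sum(pair_dists[keep]) - subset_weight * sum(keep)
}

# Under this objective a pair lowers the total only if its distance is
# below subset_weight, so larger weights keep more pairs
weights <- c(0.5, 1.5, 3)
n_kept <- sapply(weights, function(w) sum(pair_dists < w))
n_kept
```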

total_pairs

total number of matched pairs: a scalar specifying the number of matched pairs to be obtained. If total_pairs = NULL then no specific number of matched pairs is required before matching.

mom

moment balance parameters: a list with two arguments,

mom = list(mom_covs, mom_tols).

mom_covs is a matrix where each column is a covariate whose mean is to be balanced. mom_tols is a vector of tolerances for the maximum differences in means for the covariates in mom_covs. Note that if mom_covs is specified, then mom_tols needs to be specified as well, and the length of mom_tols has to be equal to the number of columns of mom_covs. Note that the columns of mom_covs can be transformations of the original covariates to balance higher order single-dimensional moments like variances and skewness, and multidimensional moments such as correlations (Zubizarreta 2012).
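
A minimal sketch of how transformed columns might be added to mom_covs follows. The data are invented, and the list element names covs and tols follow the Examples section rather than the argument description above.

```r
# Toy covariates (hypothetical data)
age <- c(25, 30, 41, 38, 52, 29)
education <- c(12, 16, 10, 12, 14, 11)

# Centered versions, used to build second-moment columns
age_c <- age - mean(age)
edu_c <- education - mean(education)

# Columns: raw covariates (means), squared centered covariates (variances),
# and the product of centered covariates (covariance, hence correlation)
mom_covs <- cbind(age, education,
                  age_sq = age_c^2, edu_sq = edu_c^2,
                  age_x_edu = age_c * edu_c)

# One tolerance per column: here 0.1 standard deviations of each column
mom_tols <- apply(mom_covs, 2, sd) * 0.1

mom <- list(covs = mom_covs, tols = mom_tols)
```

Balancing the mean of a squared or cross-product column constrains the corresponding second moment in the matched groups, which is the idea described in Zubizarreta (2012).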

exact

Exact matching parameters: a list with one argument,

exact = list(exact_covs),

where exact_covs is a matrix where each column is a nominal covariate for exact matching.

near_exact

Near-exact matching parameters: a list with two arguments,

near_exact = list(near_exact_covs, near_exact_devs).

near_exact_covs are the near-exact matching covariates; specifically, a matrix where each column is a nominal covariate for near-exact matching. near_exact_devs are the maximum deviations from near-exact matching: a vector of scalars defining the maximum deviation allowed from exact matching for the covariates defined in near_exact_covs. Note that the length of near_exact_devs has to be equal to the number of columns of near_exact_covs. For detailed expositions of near-exact matching in the context of bipartite matching, see section 9.2 of Rosenbaum (2010) and Zubizarreta et al. (2011).

fine

Fine balance parameters: a list with one argument,

fine = list(fine_covs),

where fine_covs is a matrix where each column is a nominal covariate for fine balance. Fine balance enforces exact distributional balance on nominal covariates, but without constraining treated and control units to be matched within each category of each nominal covariate as in exact matching. See chapter 10 of Rosenbaum (2010) for details.

near_fine

Near-fine balance parameters: a list with two arguments,

near_fine = list(near_fine_covs, near_fine_devs).

near_fine_covs is a matrix where each column is a nominal covariate for near-fine matching. near_fine_devs is a vector of scalars defining the maximum deviation allowed from fine balance for the covariates in near_fine_covs. Note that the length of near_fine_devs has to be equal to the number of columns of near_fine_covs. See Yang et al. (2012) for a description of near-fine balance.

near

Near matching parameters: a list with three arguments,

near = list(near_covs, near_pairs, near_groups).

near_covs is a matrix where each column is a variable for near matching. near_pairs is a vector determining the maximum distance between individual matched pairs for each variable in near_covs. near_groups is a vector defining the maximum average distance (in aggregate) between matched groups for each covariate in near_covs. If near_covs is specified, then either near_pairs, near_groups, or both must be specified as well, and the length of near_pairs and/or near_groups has to be equal to the number of columns of near_covs.

far

Far matching parameters: a list with three arguments,

far = list(far_covs, far_pairs, far_groups).

far_covs is a matrix where each column is a variable (a covariate or an instrumental variable) for far matching. far_pairs is a vector determining the minimum distance between units in a matched pair for each variable in far_covs, and far_groups is a vector defining the minimum average (aggregate) distance between matched groups for each variable in far_covs. If far_covs is specified, then either far_pairs, far_groups, or both must be specified, and the length of far_pairs and/or far_groups has to be equal to the number of columns of far_covs. See Zubizarreta et al. (2013) for strengthening an instrumental variable with integer programming.
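
The sketch below shows one way a far-matching specification might be assembled for a single instrument. The instrument values are invented, and the element names covs and pairs are an assumption that mirrors the covs/tols naming used for mom in the Examples section.

```r
# Hypothetical continuous instrument for 6 units
iv <- c(0.2, 1.5, 0.9, 2.8, 0.1, 2.1)
far_covs <- cbind(iv)

# Minimum within-pair distance: one standard deviation of the instrument
far_pairs <- apply(far_covs, 2, sd)

# Assumed element names (covs, pairs); one entry of far_pairs per column of far_covs
far <- list(covs = far_covs, pairs = far_pairs)

# Which pairs of units would satisfy the pairwise constraint on its own
eligible <- abs(outer(iv, iv, "-")) >= far_pairs[[1]]
eligible
```

Inspecting eligible before matching gives a rough sense of whether a chosen far_pairs value leaves enough admissible pairs for a feasible solution.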

solver

Optimization solver parameters: a list with five objects,

solver = list(name, t_max, approximate = 1, round_cplex = 0, trace_cplex = 0).

name is a string that determines the optimization solver to be used. The options are: cplex, glpk, gurobi and symphony. The default solver is glpk with approximate = 1, so that by default an approximate solution is found (see approximate below). For an exact solution, we strongly recommend using cplex or gurobi as they are much faster than the other solvers, but they do require a license (free for academics, but not for people outside universities). Between cplex and gurobi, note that the installation of the gurobi interface for R is much simpler.

t_max is a scalar with the maximum time limit for finding the matches. This option is specific to cplex and gurobi. If the optimal matches are not found within this time limit, a partial, suboptimal solution is given.

approximate is a scalar that determines the method of solution. If approximate = 1 (the default), an approximate solution is found via a relaxation of the original integer program. This method of solution is faster than approximate = 0, but some balancing constraints may be violated to some extent.

round_cplex is a binary specific to cplex. round_cplex = 1 ensures that the solution found is integral by rounding and that all the constraints are exactly satisfied; round_cplex = 0 (the default) means that no rounding is performed, which may return slightly infeasible integer solutions.

trace_cplex is a binary specific to cplex and gurobi. trace_cplex = 1 turns the optimizer output on. The default is trace_cplex = 0.
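
As an illustration, two solver lists are sketched below using the element names shown in the Examples section. The gurobi configuration assumes that gurobi and its R interface are installed and licensed, which is not checked here.

```r
# Default-style configuration: glpk with an approximate (relaxation-based) solution
solver_approx <- list(name = "glpk", t_max = 60 * 5, approximate = 1,
                      round_cplex = 0, trace_cplex = 0)

# Exact configuration (assumes a working gurobi installation),
# with optimizer output turned on
solver_exact <- list(name = "gurobi", t_max = 60 * 5, approximate = 0,
                     round_cplex = 0, trace_cplex = 1)
```

Either list would then be passed as the solver argument of nmatch.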

Value

obj_total: value of the objective function at the optimum;
obj_dist_mat: value of the total sum of distances term of the objective function at the optimum;
id_1: indexes of the matched units in group 1 at the optimum;
id_2: indexes of the matched units in group 2 at the optimum;
group_id: matched pairs at the optimum;
time: time elapsed to find the optimal solution.
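
A short sketch of how the returned indices might be used follows. The id_1 and id_2 vectors are invented stand-ins for out$id_1 and out$id_2, and the code assumes that the k-th entry of id_1 is paired with the k-th entry of id_2.

```r
# Toy distance matrix for 4 units measured on one hypothetical covariate
dist_mat <- as.matrix(dist(c(1.0, 2.5, 1.2, 2.4)))

# Invented stand-ins for out$id_1 and out$id_2 (assumed aligned pair by pair)
id_1 <- c(1, 2)
id_2 <- c(3, 4)

# One row per matched pair, with the within-pair distance looked up in dist_mat
pairs <- data.frame(unit_1 = id_1, unit_2 = id_2,
                    distance = dist_mat[cbind(id_1, id_2)])
pairs

# Total distance across the matched pairs
sum(pairs$distance)
```

Under the alignment assumption, this total is the kind of quantity that obj_dist_mat describes.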

References

Baiocchi, M., Small, D., Lorch, S. and Rosenbaum, P. R. (2010), "Building a Stronger Instrument in an Observational Study of Perinatal Care for Premature Infants," Journal of the American Statistical Association, 105, 1285-1296.

Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), "Optimal Multivariate Matching Before Randomization," Biostatistics, 5, 263-275.

Lu, B., Greevy, R., Xu, X., and Beck C. (2011), "Optimal Nonbipartite Matching and its Statistical Applications," The American Statistician, 65, 21-30.

Rosenbaum, P. R. (2010), Design of Observational Studies, Springer.

Rosenbaum, P. R. (2012), "Optimal Matching of an Optimally Chosen Subset in Observational Studies," Journal of Computational and Graphical Statistics, 21, 57-71.

Yang, F., Zubizarreta, J. R., Small, D. S., Lorch, S. A., and Rosenbaum, P. R. (2014), "Dissonant Conclusions When Testing the Validity of an Instrumental Variable," The American Statistician, 68, 253-263.

Zou, J., and Zubizarreta, J. R. (2015), "Covariate Balanced Restricted Randomization: Optimal Designs, Exact Tests, and Asymptotic Results," working paper.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), "Matching for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following Surgery," The American Statistician, 65, 229-238.

Zubizarreta, J. R. (2012), "Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure after Surgery," Journal of the American Statistical Association, 107, 1360-1371.

Examples

# Load the package, then load and attach the data
library(designmatch)
data(lalonde)
attach(lalonde)

################################# 
# Example: optimal subset matching
################################# 

# Optimal subset matching pursues two competing goals at 
# the same time: to minimize the total of distances while 
# matching as many observations as possible.  The trade-off 
# between these two is regulated by the parameter subset_weight 
# (see Rosenbaum 2012 and Zubizarreta et al. 2013 for a discussion).
# Here the balance requirement is mean balance for age and
# education.  We require 50 pairs to be matched.
# The solver used is glpk with the approximate option.

# Matrix of covariates
X_mat = cbind(age, education, black, hispanic, married, nodegree, re74, re75)

# Distance matrix
dist_mat_covs = round(dist(X_mat, diag = TRUE, upper = TRUE), 1)
dist_mat = as.matrix(dist_mat_covs)

# Subset matching weight
subset_weight = 1

# Total pairs to be matched
total_pairs = 50

# Moment balance: constrain differences in means to be at most .1 standard deviations apart
mom_covs = cbind(age, education)
mom_tols = apply(mom_covs, 2, sd)*.1
mom = list(covs = mom_covs, tols = mom_tols)

# Solver options
t_max = 60*5
name = "glpk"
approximate = 1
solver = list(name = name, t_max = t_max, approximate = approximate,
              round_cplex = 0, trace_cplex = 0)

# Match
out = nmatch(dist_mat = dist_mat, subset_weight = subset_weight,
             total_pairs = total_pairs, mom = mom, solver = solver)

# Indices of the matched units in groups 1 and 2
id_1 = out$id_1
id_2 = out$id_2

# Assess mean balance
a = apply(mom_covs[id_1, ], 2, mean)
b = apply(mom_covs[id_2, ], 2, mean)
tab = round(cbind(a, b, a-b, mom_tols), 2)
colnames(tab) = c("Mean 1", "Mean 2", "Diffs", "Tols")
tab

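## Assess fine balance (note here we are getting an approximate solution).
## fine_covs is not defined in this example; uncomment only after
## specifying fine balance covariates.
#for (i in 1:ncol(fine_covs)) {
#	print(finetab(fine_covs[, i], id_1, id_2))
#}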