alittleArt: Artful Optimal Matching

Description

Implements a simple version of multivariate matching using a propensity score, near-exact matching, near-fine balance, and robust Mahalanobis distance matching. Provides fine control of the penalties used in matching.

Usage

alittleArt(dat, z, x = NULL, pr = NULL, xm = NULL, near = NULL,
  fine = NULL, xinteger = NULL, xbalance = NULL, ncontrols = 1,
  rnd = 2, solver = "rlemon", min.penalty = c(10, 1, 0.05),
  pr.penalty = c(2, 5, 25, 250), near.penalty = 1000,
  fine.penalty = 50, integer.penalty = 20)

Value

match: A dataframe containing the matched data set. match contains the rows of dat in a different order. match adds two columns to dat, called mset and matched, which identify matched pairs or matched sets. Specifically, matched is TRUE if a row is in the matched sample and is FALSE otherwise. Rows of dat that are in the same matched set have the same value of mset. The rows of match are sorted by mset with the treated individual before the matched controls. The unmatched controls with matched=FALSE appear as the last rows of match. When you analyze the matched data, you will want to remove rows of match with matched==FALSE.
balance: A matrix called the balance table. The matrix has one row for each covariate in x. It also has a first row for the propensity score. There are five columns. Column 1 is the mean of the covariate in the treated group. Column 2 is the mean of the covariate in the matched control group. Column 3 is the mean of the covariate among all controls prior to matching. Column 4 is the difference between columns 1 and 2 divided by a pooled estimate of the standard deviation of the covariate before matching. Column 5 is the difference between columns 1 and 3 divided by a pooled estimate of the standard deviation of the covariate before matching. Notice that columns 4 and 5 have the same denominator, but different numerators. Tom Love (2002) suggests a graphical display of this information.
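The five columns of one row of the balance table can be sketched by hand as follows. This is only an illustration on simulated data, not the package's internal code: the variable names, the stand-in "match" (keeping the controls closest in age to the treated mean), and the pooled standard deviation formula sqrt((var_treated + var_control)/2) are all assumptions made for the sketch.

```r
# Toy data (hypothetical): 60 treated and 140 controls, one covariate.
set.seed(1)
z   <- rep(c(1, 0), times = c(60, 140))          # 1 = treated, 0 = control
age <- rnorm(200, mean = 50 + 5 * z, sd = 10)

# Stand-in for a real match (assumption): keep the 60 controls whose
# age is closest to the treated mean.
ctrl <- which(z == 0)
keep <- ctrl[order(abs(age[ctrl] - mean(age[z == 1])))[1:60]]

# Pooled standard deviation before matching (assumed formula).
pooled_sd <- sqrt((var(age[z == 1]) + var(age[z == 0])) / 2)

m_treated <- mean(age[z == 1])   # column 1
m_matched <- mean(age[keep])     # column 2
m_all     <- mean(age[z == 0])   # column 3

bal_row <- c(treated    = m_treated,
             matched    = m_matched,
             all        = m_all,
             sd_matched = (m_treated - m_matched) / pooled_sd,  # column 4
             sd_all     = (m_treated - m_all) / pooled_sd)      # column 5
round(bal_row, 2)
```

Columns 4 and 5 share the denominator pooled_sd, so comparing them shows how much of the original imbalance the match removed.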

Arguments

dat: A dataframe containing the data set that will be matched. Let N be the number of rows of dat.
z: A binary vector with N coordinates where z[i]=1 if the ith row of dat describes a treated individual and z[i]=0 if the ith row of dat describes a control.
x: x is a numeric matrix with N rows. If pr is NULL, then the covariates in x are used to estimate a propensity score using a linear logit model that predicts z from x. An error will stop the program if pr and x are both NULL. If neither pr nor x is NULL, then a harmless warning message will remind you that your propensity score pr was used in matching and x was not used to estimate the propensity score. If xbalance is NULL, then the balance table will describe the covariates in x; so, those covariates should be continuous variables or binary variables that can be described by a mean or a proportion, not nominal categories.
pr: A vector with N coordinates containing an estimated propensity or similar quantity. If pr is NULL, then the program estimates the propensity score; see the discussion of x above.
xm: xm is a numeric matrix with N rows. The covariates in xm are used to define a robust Mahalanobis distance between treated and control individuals. The covariates in xm may be continuous variables like weight, integer covariates like number of rooms in a home, or binary variables; however, they should not be unordered nominal covariates like 1=New York, 2=Chicago, 3=London, 4=Tokyo.
near: A numeric vector of length N or a numeric matrix with N rows. Each column of near should represent levels of a nominal covariate with two or a few levels. The variables in near are used in near-exact matching.
fine: A numeric vector of length N or a numeric matrix with N rows. Each column of fine should represent levels of a nominal covariate with two or a few levels. The variables in fine are used in near-fine balancing.
xinteger: A numeric vector of length N or a numeric matrix with N rows. Each column of xinteger should represent levels of an integer covariate with three or a few levels. The variables in xinteger are used in near-fine balancing that prefers an imbalance from an adjacent category to an imbalance from a distant category. See the notes.
xbalance: If not NULL, xbalance is a numeric vector of length N or a numeric matrix with N rows. If xbalance is not NULL, then the balance table will describe the covariates in xbalance; so, those covariates should be continuous variables or binary variables that can be described by a mean or a proportion, not nominal categories. See also the discussion of x above and the notes.
ncontrols: A positive integer. ncontrols is the number of controls to be matched to each treated individual.
rnd: A nonnegative integer. The balance table is rounded for display to rnd digits.
solver: Either "rlemon" or "rrelaxiv". The rlemon solver is automatically available without special installation. The rrelaxiv solver requires a special installation. See the notes.
min.penalty: A vector of three nonnegative coordinates. The third coordinate must be strictly greater than zero and strictly less than one. See the notes.
pr.penalty: A vector with four nonnegative coordinates that determine aspects of matching for the propensity score. See the notes.
near.penalty: Either one nonnegative number or a vector of nonnegative numbers with one coordinate for each column of near. See the notes.
fine.penalty: Either one nonnegative number or a vector of nonnegative numbers with one coordinate for each column of fine. See the notes.
integer.penalty: Either one nonnegative number or a vector of nonnegative numbers with one coordinate for each column of xinteger. See the notes.
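As described for x and pr, a propensity score can be estimated with a linear logit model that predicts z from x. A minimal sketch on simulated data is given here; the covariate names and coefficients are invented for illustration, and the function's internal fit may differ in detail.

```r
# Simulated data (hypothetical names and coefficients).
set.seed(2)
N <- 300
x <- cbind(age    = rnorm(N, mean = 50, sd = 10),
           female = rbinom(N, 1, 0.5))
z <- rbinom(N, 1, plogis(-1 + 0.03 * (x[, "age"] - 50) + 0.5 * x[, "female"]))

# Linear logit model predicting treatment z from the columns of x.
fit <- stats::glm(z ~ x, family = binomial)

# Estimated propensity scores, one per row of the data.
pr <- fit$fitted.values
summary(pr)
```

A vector built this way has N coordinates and could be supplied as the pr argument, in which case x is not used to estimate the propensity score.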

Author

Paul R. Rosenbaum

Details

This function builds a matched treated-control sample from an unmatched data set. It asks you to designate roles for specific covariates, and it does the rest. Unlike artlessV2(), the function alittleArt() gives you control over the penalties used in matching. In particular, if in an initial match one covariate, say age, remains out of balance, then you can adjust a penalty specific to age to attempt to improve its balance.
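One way to decide which penalty to revisit is to scan column 4 of the balance table for covariates that remain out of balance after an initial match. The sketch below uses a small hand-made matrix with invented numbers in place of a real balance table, and the 0.1 cutoff is an assumption chosen for illustration, not a rule from the package.

```r
# Hypothetical balance table (invented numbers), laid out like the
# balance value: treated mean, matched-control mean, all-control mean,
# standardized difference after matching, standardized difference before.
bal <- matrix(c( 0.62,  0.61,  0.50, 0.05,  0.60,
                47.10, 45.00, 41.20, 0.17,  0.48,
                 0.40,  0.40,  0.55, 0.00, -0.30),
              ncol = 5, byrow = TRUE,
              dimnames = list(c("pr", "age", "female"), NULL))

threshold <- 0.1   # assumed cutoff for "still out of balance"

# Covariates whose standardized difference after matching is too large.
flagged <- rownames(bal)[abs(bal[, 4]) > threshold]
flagged
```

In this toy table only age is flagged, so the next match would raise the penalty attached to whichever role age plays, for example integer.penalty if an age category is supplied in xinteger.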

References

Bertsekas, D. P., Tseng, P. (1988) <doi:10.1007/BF02288322> The Relax codes for linear minimum cost network flow problems. Annals of Operations Research, 13, 125-190.

Bertsekas, D. P. (1990) <doi:10.1287/inte.20.4.133> The auction algorithm for assignment and other network flow problems: A tutorial. Interfaces, 20(4), 133-149.

Bertsekas, D. P., Tseng, P. (1994) <http://web.mit.edu/dimitrib/www/Bertsekas_Tseng_RELAX4_!994.pdf> RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems.

Greifer, N. and Stuart, E.A., (2021). <doi:10.1093/epirev/mxab003> Matching methods for confounder adjustment: an addition to the epidemiologist’s toolbox. Epidemiologic Reviews, 43(1), pp.118-129.

Hansen, B. B. and Klopfer, S. O. (2006) <doi:10.1198/106186006X137047> "Optimal full matching and related designs via network flows". Journal of Computational and Graphical Statistics, 15(3), 609-627. ('optmatch' package)

Hansen, B. B. (2007) <https://www.r-project.org/conferences/useR-2007/program/presentations/hansen.pdf> Flexible, optimal matching for observational studies. R News, 7, 18-24. ('optmatch' package)

Lee, K., Small, D.S. and Rosenbaum, P.R. (2018) <doi:10.1111/biom.12884> A powerful approach to the study of moderate effect modification in observational studies. Biometrics, 74(4), 1161-1170.

Love, Thomas E. (2002) Displaying covariate balance after adjustment for selection bias. Joint Statistical Meetings. Vol. 11. https://chrp.org/love/JSM_Aug11_TLove.pdf

Niknam, B.A. and Zubizarreta, J.R. (2022). <doi:10.1001/jama.2021.20555> Using cardinality matching to design balanced and representative samples for observational studies. JAMA, 327(2), pp.173-174.

Pimentel, S. D., Yoon, F., & Keele, L. (2015) <doi:10.1002/sim.6593> Variable‐ratio matching with fine balance in a study of the Peer Health Exchange. Statistics in Medicine, 34(30), 4070-4082.

Pimentel, S. D., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2015) <doi:10.1080/01621459.2014.997879> Large, sparse optimal matching with refined covariate balance in an observational study of the health outcomes produced by new surgeons. Journal of the American Statistical Association, 110, 515-527.

Rosenbaum, P. R. and Rubin, D. B. (1985) <doi:10.1080/00031305.1985.10479383> Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33-38.

Rosenbaum, P. R. (1989) <doi:10.1080/01621459.1989.10478868> Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024-1032.

Rosenbaum, P. R., Ross, R. N. and Silber, J. H. (2007) <doi:10.1198/016214506000001059> Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association, 102, 75-83.

Rosenbaum, P. R. (2020a) <doi:10.1007/978-3-030-46405-9> Design of Observational Studies (2nd Edition). New York: Springer.

Rosenbaum, P. R. (2020b). <doi:10.1146/annurev-statistics-031219-041058> Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7(1), 143-176.

Rosenbaum, P. R. and Zubizarreta, J. R. (2023). <doi:10.1201/9781003102670> Optimization Techniques in Multivariate Matching. Handbook of Matching and Weighting Adjustments for Causal Inference, pp.63-86. Boca Raton, FL: Chapman and Hall/CRC Press.

Rosenbaum, P. R. (2025) <doi:10.1007/978-3-031-90494-3> Introduction to the Theory of Observational Studies. New York: Springer.

Rubin, D. B. (1980) <doi:10.2307/2529981> Bias reduction using Mahalanobis-metric matching. Biometrics, 36, 293-298.

Rubin, D. B. (2008) <doi:10.1214/08-AOAS187> For objective causal inference, design trumps analysis. Annals of Applied Statistics, 2, 808-840.

Stuart, E.A., (2010). <doi:10.1214/09-STS313> Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1), 1-21.

Yang, D., Small, D. S., Silber, J. H. and Rosenbaum, P. R. (2012) <doi:10.1111/j.1541-0420.2011.01691.x> Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes. Biometrics, 68, 628-636.

Yu, R. and Rosenbaum, P. R. (2019) <doi:10.1111/biom.13098> Directional penalties for optimal matching in observational studies. Biometrics, 75(4), 1380-1390.

Yu, R., Silber, J. H., & Rosenbaum, P. R. (2020) <doi:10.1214/19-STS699> Matching methods for observational studies derived from large administrative databases. Statistical Science, 35(3), 338-355.

Yu, R. (2021) <doi:10.1111/biom.13374> Evaluating and improving a matched comparison of antidepressants and bone density. Biometrics, 77(4), 1276-1288.

Yu, R. & Rosenbaum, P. R. (2022) <doi:10.1080/10618600.2022.2058001> Graded matching for large observational studies. Journal of Computational and Graphical Statistics, 31(4), 1406-1415.

Yu, R. (2023) <doi:10.1111/biom.13771> How well can fine balance work for covariate balancing? Biometrics. 79(3), 2346-2356.

Zhang, B., D. S. Small, K. B. Lasater, M. McHugh, J. H. Silber, and P. R. Rosenbaum (2023) <doi:10.1080/01621459.2021.1981337> Matching one sample according to two criteria in observational studies. Journal of the American Statistical Association, 118, 1140-1151.

Zubizarreta, J.R. (2012) <doi:10.1080/01621459.2012.703874> Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500), pp.1360-1371.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H. and Rosenbaum, P. R. (2011) <doi:10.1198/tas.2011.11072> Matching for several sparse nominal variables in a case control study of readmission following surgery. The American Statistician, 65(4), 229-238.

Zubizarreta, J.R., Stuart, E.A., Small, D.S. and Rosenbaum, P.R. eds. (2023). <doi:10.1201/9781003102670> Handbook of Matching and Weighting Adjustments for Causal Inference. Boca Raton, FL: Chapman and Hall/CRC Press.

Examples

# \donttest{
# The example below uses the binge data from the iTOS package.
# See the documentation for binge in the iTOS package for more information.
#
library(iTOS)
data(binge)
b2<-binge[binge$AlcGroup!="P",] # Match binge drinkers to nondrinkers
z<-1*(b2$AlcGroup=="B") # Treatment/control indicator
b2<-cbind(b2,z)
rm(z)
rownames(b2)<-b2$SEQN
attach(b2)
# Estimate a propensity score
pr<-stats::glm(z~age+female+education+bmi+vigor+
      smokenow+smokeQuit+bpRX,family=binomial)$fitted.values
#
#  Create nominal covariates to include in near or fine
#
smoke<-1*(smokenow==1)
dontSmoke<-1*(smokenow==3)
age50<-1*(age>=50)
bmi30<-1*(bmi>=30)
ed2<-1*(education<=2)
#
#  near contains covariates to be matched as exactly as possible
#
near<-cbind(female,dontSmoke)
#
# xm contains covariates in the robust Mahalanobis distance
# Includes some continuous covariates.
#
xm<-cbind(age,bmi,vigor,smokenow,education)
#
# fine contains covariates that will be balanced, but not matched
#
fine<-cbind(ed2,smoke,dontSmoke)

# variable to be used in xinteger
ageCi<-as.integer(ageC)
xbalance<-cbind(pr,age,female,education,bmi,vigor,smokenow,smokeQuit,bpRX,
   ageCi,ed2,smoke,dontSmoke,bmi30,smoke,ed2,age50)
b2<-cbind(b2,pr)
rm(bmi30,smoke,ed2,age50,dontSmoke)
detach(b2)

mc<-alittleArt(b2,b2$z,pr=pr,xm=xm,near=near,fine=fine,xinteger=ageCi,
   ncontrols=3,xbalance=xbalance,pr.penalty = c(3, 5, 50, 250))
#
#  Here are the first two 1-to-3 matched sets.
#
mc$match[1:8,]
#
#  You can check that every matched set is exactly matched for
#  female and nonsmoking.  This is from near-exact matching.
#  In some other data set, the number of mismatches might be
#  minimized, not driven to zero.
#
#  The balance table shows that large imbalances in covariates
#  existed before matching, but are much smaller after matching.
#  Look, for example, at the propensity score, female, and
#  the several versions of the smoking variable.
#
mc$balance
m<-mc$match
m<-m[m$matched,] # Remove the unmatched controls
table(m$z)
prop.table(table(m$ageC,m$z),2)
# You could improve this table by setting integer.penalty=500.
# Other things might suffer a bit.  The boxplot of age is good as is.
boxplot(m$age~m$z)
boxplot(m$pr~m$z)
# }
