distance: Propensity scores and other distance measures

Description

Several matching methods require or can involve the distance between treated and control units. Options include the Mahalanobis distance, propensity score distance, or distance between user-supplied values. Propensity scores are also used for common support via the discard options and for defining calipers. This page documents the options that can be supplied to the distance argument to matchit().

Arguments

Allowable options

There are four ways to specify the distance argument: 1) as a string containing the name of a method for estimating propensity scores, 2) as a string containing the name of a method for computing pairwise distances from the covariates, 3) as a vector of values whose pairwise differences define the distance between units, or 4) as a distance matrix containing all pairwise distances. The options are detailed below.

Propensity score estimation methods

When distance is specified as the name of a method for estimating propensity scores (described below), a propensity score is estimated using the variables in formula and the method corresponding to the given argument. This propensity score can be used to compute the distance between units as the absolute difference between the propensity scores of pairs of units. Propensity scores can also be used to create calipers and common support restrictions, whether or not they are used in the actual distance measure used in the matching, if any.

In addition to the distance argument, two other arguments can be specified that relate to the estimation and manipulation of the propensity scores. The link argument allows for different links to be used in models that require them such as generalized linear models, for which the logit and probit links are allowed, among others. In addition to specifying the link, the link argument can be used to specify whether the propensity score or the linearized version of the propensity score should be used; by specifying link = "linear.{link}", the linearized version will be used.

The distance.options argument can also be specified, which should be a list of values passed to the propensity score-estimating function, for example, to choose specific options or tuning parameters for the estimation method. If formula, data, or verbose are not supplied to distance.options, the corresponding arguments from matchit() will be automatically supplied. See the Examples for demonstrations of the uses of link and distance.options. When s.weights is supplied in the call to matchit(), it will automatically be passed to the propensity score-estimating function as the weights argument unless otherwise described below.

The following methods for estimating propensity scores are allowed:

"glm": The propensity scores are estimated using a generalized linear model (e.g., logistic regression). The formula supplied to matchit() is passed directly to glm(), and predict.glm() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. distance = "glm" with link = "logit" (logistic regression) is the default in matchit().
"gam": The propensity scores are estimated using a generalized additive model. The formula supplied to matchit() is passed directly to mgcv::gam(), and mgcv::predict.gam() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. Note that unless the smoothing functions mgcv::s(), mgcv::te(), mgcv::ti(), or mgcv::t2() are used in formula, a generalized additive model is identical to a generalized linear model and will estimate the same propensity scores as glm. See the documentation for mgcv::gam(), mgcv::formula.gam(), and mgcv::gam.models() for more information on how to specify these models. Also note that the formula returned in the matchit() output object will be a simplified version of the supplied formula with smoothing terms removed (but all named variables present).
"gbm": The propensity scores are estimated using a generalized boosted model. The formula supplied to matchit() is passed directly to gbm::gbm(), and gbm::predict.gbm() is used to compute the propensity scores. The optimal tree is chosen using 5-fold cross-validation by default, and this can be changed by supplying an argument to method to distance.options; see gbm::gbm.perf() for details. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. No other links are allowed. The tuning parameter defaults differ from gbm::gbm(); they are as follows: n.trees = 1e4, interaction.depth = 3, shrinkage = .01, bag.fraction = 1, cv.folds = 5, keep.data = FALSE. These are the same defaults as used in WeightIt and twang, except for cv.folds and keep.data. Note this is not the same use of generalized boosted modeling as in twang; here, the number of trees is chosen based on cross-validation or out-of-bag error, rather than based on optimizing balance. twang should not be cited when using this method to estimate propensity scores.
"lasso", "ridge", "elasticnet": The propensity scores are estimated using a lasso, ridge, or elastic net model, respectively. The formula supplied to matchit() is processed with model.matrix() and passed to glmnet::cv.glmnet(), and glmnet::predict.cv.glmnet() is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial(), e.g., "logit", which is the default. When link is prepended by "linear.", the linear predictor is used instead of the predicted probabilities. When link = "log", a Poisson model is used. For distance = "elasticnet", the alpha argument, which controls how to prioritize the lasso and ridge penalties in the elastic net, is set to .5 by default and can be changed by supplying an argument to alpha in distance.options. For "lasso" and "ridge", alpha is set to 1 and 0, respectively, and cannot be changed. The cv.glmnet() defaults are used to select the tuning parameters and generate predictions and can be modified using distance.options. If the s argument is passed to distance.options, it will be passed to predict.cv.glmnet(). Note that because there is a random component to choosing the tuning parameter, results will vary across runs unless a seed is set.
"rpart": The propensity scores are estimated using a classification tree. The formula supplied to matchit() is passed directly to rpart::rpart(), and rpart::predict.rpart() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure.
"randomforest": The propensity scores are estimated using a random forest. The formula supplied to matchit() is passed directly to randomForest::randomForest(), and randomForest::predict.randomForest() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure.
"nnet": The propensity scores are estimated using a single-hidden-layer neural network. The formula supplied to matchit() is passed directly to nnet::nnet(), and fitted() is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure. An argument to size must be supplied to distance.options when using method = "nnet".
"cbps": The propensity scores are estimated using the covariate balancing propensity score (CBPS) algorithm, which is a form of logistic regression where balance constraints are incorporated to a generalized method of moments estimation of of the model coefficients. The formula supplied to matchit() is passed directly to CBPS::CBPS(), and fitted is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. No other links are allowed. The estimand argument supplied to matchit() will be used to select the appropriate estimand for use in defining the balance constraints, so no argument needs to be supplied to ATT in CBPS.
"bart": The propensity scores are estimated using Bayesian additive regression trees (BART). The formula supplied to matchit() is passed directly to dbarts::bart2(), and dbarts::fitted() is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. When s.weights is supplied to matchit(), it will not be passed to bart2 because the weights argument in bart2 does not correspond to sampling weights.

Methods for computing distances from covariates

The following methods involve computing a distance matrix from the covariates themselves without estimating a propensity score. Calipers on the distance measure and common support restrictions cannot be used, and the distance component of the output object will be empty because no propensity scores are estimated. The link and distance.options arguments are ignored with these methods. See the individual matching methods pages for whether these distances are allowed and how they are used. Each of these distance measures can also be calculated outside matchit() using its corresponding function.

"euclidean": The Euclidean distance is the raw distance between units, computed as $$d_{ij} = \sqrt{(x_i - x_j)(x_i - x_j)'}$$ It is sensitive to the scale of the covariates, so covariates with larger scales will take higher priority.

"scaled_euclidean"

The scaled Euclidean distance is the Euclidean distance computed on the scaled (i.e., standardized) covariates. This ensures the covariates are on the same scale. The covariates are standardized using the pooled within-group standard deviations, computed by treatment group-mean centering each covariate before computing the standard deviation in the full sample.

"mahalanobis"

The Mahalanobis distance is computed as $$d_{ij} = \sqrt{(x_i - x_j)\Sigma^{-1}(x_i - x_j)'}$$ where $\Sigma$ is the pooled within-group covariance matrix of the covariates, computed by treatment group-mean centering each covariate before computing the covariance in the full sample. This ensures the variables are on the same scale and accounts for the correlation between covariates.

"robust_mahalanobis"

The robust rank-based Mahalanobis distance is the Mahalanobis distance computed on the ranks of the covariates with an adjustment for ties. It is described in Rosenbaum (2010, ch. 8) as an alternative to the Mahalanobis distance that handles outliers and rare categories better than the standard Mahalanobis distance but is not affinely invariant.

To perform Mahalanobis distance matching and estimate propensity scores to be used for a purpose other than matching, the mahvars argument should be used along with a different specification to distance. See the individual matching method pages for details on how to use mahvars.

Distances supplied as a numeric vector or matrix

distance can also be supplied as a numeric vector whose values will be taken to function like propensity scores; their pairwise difference will define the distance between units. This might be useful for supplying propensity scores computed outside matchit() or resupplying matchit() with propensity scores estimated previously without having to recompute them.

distance can also be supplied as a matrix whose values represent the pairwise distances between units. The matrix should either be a square, with a row and column for each unit (e.g., as the output of a call to as.matrix(dist(.))), or have as many rows as there are treated units and as many columns as there are control units (e.g., as the output of a call to mahalanobis_dist() or optmatch::match_on()). Distance values of Inf will disallow the corresponding units to be matched. When distance is a supplied as a numeric vector or matrix, link and distance.options are ignored.

Outputs

When specifying an argument to distance that estimates a propensity score, the output of the function called to estimate the propensity score (e.g., the glm object when distance = "glm") will be included in the matchit() output object in the model component. When distance propensity score estimation method or a vector of distance values, the estimated or supplied distance measures will be included in the matchit() output object in the distance component. Otherwise, the distance component will be empty.

Examples

Run this code

data("lalonde")
# Linearized probit regression PS:
m.out1 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "glm", link = "linear.probit")
# GAM logistic PS with smoothing splines (s()):
m.out2 <- matchit(treat ~ s(age) + s(educ) + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "gam")
summary(m.out2$model)
if (FALSE) { # requireNamespace("CBPS", quietly = TRUE)
# CBPS for ATC matching w/replacement, using the just-
# identified version of CBPS (setting method = "exact"):
m.out3 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "cbps", estimand = "ATC",
                  distance.options = list(method = "exact"),
                  replace = TRUE)
}
# Mahalanobis distance matching - no PS estimated
m.out4 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "mahalanobis")

m.out4$distance #NULL

# Mahalanobis distance matching with PS estimated
# for use in a caliper; matching done on mahvars
m.out5 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "glm", caliper = .1,
                  mahvars = ~ age + educ + race + married +
                                nodegree + re74 + re75)

summary(m.out5)

# User-supplied propensity scores
p.score <- fitted(glm(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      family = binomial))

m.out6 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = p.score)

# User-supplied distance matrix using optmatch::match_on()
if (FALSE) { # requireNamespace("optmatch", quietly = TRUE)
dist_mat <- optmatch::match_on(
              treat ~ age + educ + race + nodegree +
                married + re74 + re75, data = lalonde,
              method = "rank_mahalanobis")

m.out7 <- matchit(treat ~ age + educ + race + nodegree +
                    married + re74 + re75, data = lalonde,
                  distance = dist_mat)
}

Run the code above in your browser using DataLab