match_on is generic. There are several supplied
methods, all providing the same basic output: a matrix
(or similar) object with treated units on the rows and
control units on the columns. Each cell [i,j] then
indicates the distance from a treated unit i to control
unit j. A pair whose entry is Inf is said to be
unmatchable: its two units are guaranteed never to be
placed in the same matched set. For problems with many
Inf entries, so-called sparse matching problems, match_on uses
a special data type that is more space efficient than a
standard R matrix. When problems are dense rather than
sparse, match_on uses the standard matrix type.
The match_on methods differ in the types of arguments
they take, making the function a single entry point for
many different ways of specifying matches: using
functions, formulas, models, and even simple scores. Many
of the methods require additional arguments, detailed
below. All methods take a within argument, a
distance specification made using
exactMatch or caliper (or
some additive combination of these or other
distance-creating functions). All match_on methods will use
the finite entries in the within argument as a
guide for producing the new distance. Any entry that is
Inf in within will be Inf in the
distance matrix returned by match_on. This
argument can reduce the processing time needed to compute
sparse distance matrices.
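The effect of within can be pictured with a small base-R sketch. It does not call optmatch; the names within_mask, x_treated and x_control are hypothetical, and an ordinary matrix stands in for the distance specification. Only pairs with finite entries in the mask have a distance computed, and every other pair stays Inf.

```r
# Base-R illustration only; it does not call optmatch.
# Hypothetical mask: 2 treated units (rows) by 3 control units (columns).
# 0 marks a permitted pair, Inf a forbidden one.
within_mask <- matrix(c(0,   Inf, 0,
                        Inf, 0,   0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("t1", "t2"), c("c1", "c2", "c3")))

# Hypothetical covariate values for the treated and control units.
x_treated <- c(t1 = 1.0, t2 = 4.0)
x_control <- c(c1 = 1.5, c2 = 3.0, c3 = 2.0)

# Start with every pair forbidden, then fill in only the permitted pairs.
dist <- matrix(Inf, nrow = 2, ncol = 3, dimnames = dimnames(within_mask))
ok <- which(is.finite(within_mask), arr.ind = TRUE)
dist[ok] <- abs(x_treated[ok[, "row"]] - x_control[ok[, "col"]])

print(dist)
```

Because the forbidden pairs are never visited, the work done grows with the number of finite entries rather than with the full size of the matrix.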
The match_on function is similar to the older, but
still supplied, mdist function. Future
development will concentrate on match_on, but
mdist remains available for users familiar with
its interface. For the most part, the two functions can
be used interchangeably.
The function method takes as its x argument
a function of three arguments: index, data,
and z. The data and z arguments will
be the same as those passed directly to match_on.
The index argument is a matrix of two columns,
representing the pairs of treated and control units that
are valid comparisons (given any within
arguments). The first column is the row name or id of the
treated unit in the data object. The second column
is the id for the control unit, again in the data
object. For each of these pairs, the function should
return the distance between the treated unit and control
unit. This may sound complicated, but it is simple in practice.
For example, a function that returned the absolute
difference between two units using a vector of data would
be f <- function(index, data, z) { abs(apply(index,
1, function(pair) { data[pair[1]] - data[pair[2]] })) }.
(Note: the numeric method handles exactly this simple
case.)
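To see what such a function receives and returns, the sketch below calls the function from the text directly on a hand-built index matrix. The data and z values are hypothetical, and the index matrix is constructed here by hand rather than by match_on.

```r
# The function from the text, laid out on several lines.
f <- function(index, data, z) {
  abs(apply(index, 1, function(pair) {
    data[pair[1]] - data[pair[2]]
  }))
}

# Hypothetical inputs: a named numeric vector and a treatment indicator.
data <- c(a = 1, b = 5, c = 2, d = 8)
z    <- c(a = 1, b = 0, c = 1, d = 0)

# Two-column index of (treated id, control id) pairs, built by hand here.
index <- as.matrix(expand.grid(treated = names(z)[z == 1],
                               control = names(z)[z == 0],
                               stringsAsFactors = FALSE))

# One distance per row of index.
dists <- unname(f(index, data, z))
print(cbind(as.data.frame(index), dist = dists))
```

The function returns one distance for each row of index, in the same order as the rows.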
The formula method produces, by default, a Mahalanobis
distance specification based on the formula Z ~ X1
+ X2 + ..., where Z is the treatment indicator.
The Mahalanobis distance is calculated as the square root
of d'Cd, where d is the vector of X-differences on a pair
of observations and C is an inverse (generalized inverse)
of the pooled covariance of the Xes. (The pooling is of the
covariance of X within the subset defined by Z==0
and within the complement of that subset.) This distance
is similar to a Euclidean distance calculated after
re-expressing the Xes in standard units, such that the
re-expressed variables all have pooled SDs of 1, except
that it addresses redundancies among the variables by
scaling down variables' contributions in proportion to
their correlations with other included variables.
Euclidean distance is also available, via
method="euclidean". Alternatively, you can implement
your own method; for hints, see
https://github.com/markmfredrickson/optmatch/wiki/How-to-write-your-own-compute-method
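The Mahalanobis calculation described above can be sketched in base R for a single treated/control pair. The data are simulated, and the degrees-of-freedom weighting used to pool the two within-group covariances is an assumption of this sketch, not necessarily the weighting match_on applies internally.

```r
set.seed(1)  # hypothetical simulated data
n <- 40
z <- rep(c(1, 0), each = n / 2)
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))

# Pooled covariance: average of the two within-group covariances,
# weighted by degrees of freedom (an assumption of this sketch).
n1 <- sum(z == 1)
n0 <- sum(z == 0)
S <- ((n1 - 1) * cov(X[z == 1, ]) + (n0 - 1) * cov(X[z == 0, ])) /
  (n1 + n0 - 2)

# C is the inverse of S; a generalized inverse would be needed if S
# were singular.
C <- solve(S)

# d: vector of X-differences between the first treated unit and the
# first control unit.
d <- X[which(z == 1)[1], ] - X[which(z == 0)[1], ]

# Mahalanobis distance: square root of d'Cd.
maha <- sqrt(drop(t(d) %*% C %*% d))
print(maha)
```

As a check, stats::mahalanobis returns the squared distance, so sqrt(mahalanobis(X[1, ], X[21, ], S)) should agree with maha.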
The glm method assumes its first argument to be a
fitted propensity model. From this it extracts distances
on the linear propensity score: fitted values of
the linear predictor, the link function applied to the
estimated conditional probabilities, as opposed to the
estimated conditional probabilities themselves (Rosenbaum
& Rubin, 1985). For example, a logistic model
(glm with family=binomial()) has the logit
function as its link, so from such models match_on
computes distances in terms of logits of the estimated
conditional probabilities, i.e. the estimated log odds.
Optionally these distances are also rescaled. The default
is to rescale, by the reciprocal of an outlier-resistant
variant of the pooled s.d. of propensity scores. (Outlier
resistance is obtained by the application of mad,
as opposed to sd, to linear propensity scores in
the treatment and control groups; this can be changed to
the actual s.d., or rescaling can be skipped entirely, by
setting argument standardization.scale to sd or NULL,
respectively.) The overall result records absolute
differences between treated and control units on linear,
possibly rescaled, propensity scores.
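The following base-R sketch walks through the same steps by hand on simulated data: fit a logistic model, extract the linear predictor, rescale, and take absolute differences. The pooled-MAD formula below is an assumption made for illustration and may differ in detail from the scale match_on computes.

```r
set.seed(2)  # hypothetical simulated data
n <- 60
x <- rnorm(n)
z <- rbinom(n, 1, plogis(0.5 * x))
fit <- glm(z ~ x, family = binomial())

# Linear propensity score: the linear predictor, i.e. the logit of the
# fitted probabilities, rather than the probabilities themselves.
lp <- predict(fit, type = "link")

# Outlier-resistant pooled scale: mad() within each group, pooled by
# degrees of freedom (the pooling formula is an assumption of this sketch).
n1 <- sum(z == 1)
n0 <- sum(z == 0)
pooled_mad <- sqrt(((n1 - 1) * mad(lp[z == 1])^2 +
                    (n0 - 1) * mad(lp[z == 0])^2) / (n1 + n0 - 2))

# Treated-by-control matrix of absolute differences on the rescaled score.
dist <- abs(outer(lp[z == 1], lp[z == 0], "-")) / pooled_mad
print(dim(dist))
```

Replacing mad with sd in the scale, or dropping the division altogether, corresponds to the two alternatives described above.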
In addition, one can impose a caliper in terms of these
distances by providing a scalar as a caliper
argument, forbidding matches between treatment and
control units differing in the calculated propensity
score by more than the specified caliper. For example,
Rosenbaum and Rubin's (1985) caliper of one-fifth of a
pooled propensity score s.d. would be imposed by
specifying caliper=.2, in tandem either with the
default rescaling or, to follow their example even more
closely, with the additional specification
standardization.scale=sd. Propensity calipers are
beneficial computationally as well as statistically, for
reasons indicated in the discussion of the numeric
method below.
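Conceptually, a caliper of 0.2 on rescaled scores behaves as in this base-R sketch. The score values are hypothetical and are assumed to be already expressed in s.d. units.

```r
# Hypothetical rescaled linear propensity scores, in s.d. units.
ps_treated <- c(t1 = 0.10, t2 = 0.90)
ps_control <- c(c1 = 0.25, c2 = 0.05, c3 = 1.50)

# Absolute differences between every treated and control unit.
dist <- abs(outer(ps_treated, ps_control, "-"))

# Forbid any pairing that differs by more than the caliper.
caliper_width <- 0.2
dist[dist > caliper_width] <- Inf
print(dist)
```

Here t1 keeps two candidate controls, while t2 has no control within the caliper and so cannot be matched at all.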
The bigglm method works analogously to the
glm method, but with bigglm objects,
created by the bigglm function from package
biglm, which can handle bigger data sets than
the ordinary glm function can.
The numeric method returns absolute differences
between treated and control units' values of x. If
a caliper is specified, pairings whose
x-differences exceed it are forbidden.
Conceptually, those distances are set to Inf;
computationally, if either of caliper and
within has been specified then only information
about permissible pairings will be stored, so the
forbidden pairings are simply omitted. Providing a
caliper argument here, as opposed to omitting it
and afterwards applying the caliper
function, reduces storage requirements and may otherwise
improve performance, particularly in larger problems.
For the numeric method, x must have names.
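The storage point can be illustrated in base R by keeping only the permissible pairs in a long-format table. This is a sketch of the idea, not of the special data type match_on uses, and the values of x and z are hypothetical.

```r
# Hypothetical named scores and treatment indicator.
x <- c(a = 1.0, b = 1.4, c = 3.0, d = 3.1, e = 7.0)
z <- c(a = 1,   b = 0,   c = 1,   d = 0,   e = 0)
caliper_width <- 0.5

# All treated/control pairs, identified by the names of x.
pairs <- expand.grid(treated = names(x)[z == 1],
                     control = names(x)[z == 0],
                     stringsAsFactors = FALSE)
pairs$dist <- unname(abs(x[pairs$treated] - x[pairs$control]))

# Keep only the permissible pairs; forbidden pairs are simply omitted
# instead of being stored as Inf. (For simplicity this sketch still
# enumerates every pair before filtering.)
sparse <- pairs[pairs$dist <= caliper_width, ]
print(sparse)
```

Of the six possible pairs only two survive the caliper, so only two rows are stored. The names of x are what identify the units in each row, which is one way to see why names are required.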
The matrix and InfinitySparseMatrix methods just
return their arguments, as these objects are already valid
distance specifications.