The match_gps() function performs sample matching based on
generalized propensity scores (GPS). It utilizes the k-means clustering
algorithm to partition the data into clusters and subsequently matches all
treatment groups within these clusters. This approach ensures efficient and
structured comparisons across treatment levels while accounting for the
propensity score distribution.
match_gps(
  csmatrix = NULL,
  method = "nnm",
  caliper = 0.2,
  reference = NULL,
  ratio = NULL,
  replace = NULL,
  order = NULL,
  ties = NULL,
  min_controls = NULL,
  max_controls = NULL,
  kmeans_args = NULL,
  kmeans_cluster = 5,
  verbose_output = FALSE,
  ...
)
A data.frame similar to the one provided as the data argument in the estimate_gps() function, containing the same columns but only the observations for which a match was found. The returned object includes two attributes, accessible with the attr() function:
original_data: A data.frame with the original data returned by the csregion() or estimate_gps() function, after the common support region (csr) has been estimated and observations outside it have been filtered out.
matching_filter: A logical vector indicating which rows from original_data were included in the final matched dataset.
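Reading these attributes needs only base R. The sketch below uses a hand-built data frame as a stand-in for a match_gps() result; its columns and values are invented for illustration, and only the attribute names original_data and matching_filter come from the description above.

```r
# Mock stand-in for a match_gps() result (all values are made up)
original <- data.frame(
  status = c("control", "control", "case", "case"),
  age = c(50, 61, 48, 70)
)
keep <- c(TRUE, FALSE, TRUE, FALSE)

matched <- original[keep, ]
attr(matched, "original_data") <- original
attr(matched, "matching_filter") <- keep

# Retrieve the two attributes with attr()
orig <- attr(matched, "original_data")
flt <- attr(matched, "matching_filter")

# Rows of the original data that did not end up in the matched dataset
unmatched <- orig[!flt, ]
```

The same pattern (orig[!flt, ]) is a simple way to inspect which observations were discarded by the matching step.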
csmatrix: An object of class gps and/or csr representing a data frame of generalized propensity scores. The first column must be the treatment variable, with additional attributes describing the calculation of the common support region and the estimation of generalized propensity scores. It is crucial that the common support region was calculated using the csregion() function to ensure compatibility.
method: A single string specifying the matching method to use. The default is "nnm", which applies the k-nearest neighbors matching algorithm. See the Details section for a full list of available methods.
caliper: A numeric value specifying the caliper width, which defines the allowable range within which observations can be matched. It is expressed as a multiple of the standard deviation of the logit-transformed generalized propensity scores (for example, 0.2 means 0.2 standard deviations). To perform matching without a caliper, set this parameter to a very large value. For exact matching, set caliper = 0 and enable the exact option by setting it to TRUE.
reference: A single string specifying the exact level of the treatment variable to be used as the reference in the matching process. All other treatment levels will be matched to this reference level. Ideally, this should be the control level. If no natural control is present, avoid selecting a level with extremely low or high covariate or propensity score values. Instead, choose a level whose covariate or propensity score distribution is centrally positioned among all treatment groups to maximize the number of matches.
ratio: A scalar for the number of matches which should be found for each control observation. The default is one-to-one matching. Only available for the methods "nnm" and "pairopt".
replace: Logical value indicating whether matching should be done with replacement. If FALSE, the order of matches generally matters. Matches are found in the same order as the data is sorted. Specifically, the matches for the first observation will be found first, followed by those for the second observation, and so on. Matching without replacement is generally not recommended as it tends to increase bias. However, in cases where the dataset is large and there are many potential matches, setting replace = FALSE often results in a substantial speedup with negligible or no bias. Only available for the method "nnm".
order: A string specifying the order in which logit-transformed GPS values are sorted before matching. The available options are:
"desc" – sorts GPS values from highest to lowest (default).
"asc" – sorts GPS values from lowest to highest.
"original" – preserves the original order of GPS values.
"random" – randomly shuffles GPS values. To generate different random orders, set a seed using set.seed().
ties: A logical flag indicating how tied matches should be handled. Available only for the "nnm" method, with a default value of FALSE (all tied matches are included in the final dataset, but only unique observations are retained). For more details, see the ties argument in Matching::Matchby().
min_controls: The minimum number of treatment observations that should be matched to each control observation. Available only for the "fullopt" method. For more details, see the min.controls argument in optmatch::fullmatch().
max_controls: The maximum number of treatment observations that can be matched to each control observation. Available only for the "fullopt" method. For more details, see the max.controls argument in optmatch::fullmatch().
kmeans_args: A list of arguments to pass to stats::kmeans. These arguments must be provided inside a list() in the paired name = value format.
kmeans_cluster: An integer specifying the number of clusters to pass to stats::kmeans.
verbose_output: A logical flag. If TRUE, a more verbose version of the function is run and the output is printed to the console.
...: Additional arguments to be passed to the matching function.
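As a rough illustration of how the cluster count and the list of k-means arguments relate to stats::kmeans(), the sketch below combines the two with do.call(). The score matrix is random toy data, and the forwarding shown here is an assumption about the general pattern, not the internal code of match_gps().

```r
set.seed(1)

# Toy stand-in for a matrix of generalized propensity scores (made-up data)
scores <- matrix(runif(40), ncol = 2)

# Values in the same shape as the match_gps() arguments of the same names
kmeans_cluster <- 2
kmeans_args <- list(iter.max = 200, algorithm = "Forgy")

# Merge the fixed arguments with the user-supplied list and call kmeans()
fit <- do.call(
  stats::kmeans,
  c(list(x = scores, centers = kmeans_cluster), kmeans_args)
)

# Number of observations assigned to each cluster
table(fit$cluster)
```

Every name in the list has to be a valid argument of stats::kmeans() (such as iter.max, nstart, or algorithm); otherwise the call fails with an unused-argument error.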
Propensity score matching can be performed using various matching algorithms. Lopez and Gutman (2017) do not explicitly specify the matching algorithm used, but it is assumed they applied the commonly used k-nearest neighbors matching algorithm, implemented as method = "nnm". However, this algorithm can sometimes be challenging to use, especially when treatment and control groups have unequal sizes. When replace = FALSE, the number of matches is strictly limited by the smaller group, and even with replace = TRUE, the results may not always be satisfactory. To address these limitations, we have implemented additional matching algorithms that aim to maximize the number of matched observations within a dataset.
The available matching methods are:
"nnm" – classic k-nearest neighbors matching, implemented using Matching::Matchby(). The tunable parameters in match_gps() are caliper, ratio, replace, order, and ties. Additional arguments can be passed to Matching::Matchby() via the ... argument.
"fullopt" – optimal full matching algorithm, implemented with optmatch::fullmatch(). This method calculates a discrepancy matrix to identify all possible matches, often optimizing the percentage of matched observations. The available tuning parameters are caliper, min_controls, and max_controls.
"pairmatch" – optimal 1:1 and 1:k matching algorithm, implemented using optmatch::pairmatch(), which is actually a wrapper around optmatch::fullmatch(). Like "fullopt", this method calculates a discrepancy matrix and finds matches that minimize its sum. The available tuning parameters are caliper and ratio.
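To make the difference between order-dependent nearest neighbor matching and sum-minimizing pair matching concrete, here is a toy sketch in base R. The scores are invented, the helper perms() is hypothetical, and the brute-force search over all assignments is only workable for a handful of observations; it is not how Matching::Matchby() or optmatch::pairmatch() are implemented.

```r
# Made-up logit-transformed scores for three treated and three control units
treated <- c(0.50, 0.40, 0.90)
control <- c(0.42, 0.70, 0.95)

# Discrepancy matrix: absolute score difference for every treated/control pair
D <- abs(outer(treated, control, "-"))
rows <- seq_len(nrow(D))

# Greedy 1:1 matching without replacement, processing treated units in order
greedy <- integer(nrow(D))
available <- seq_len(ncol(D))
for (i in rows) {
  j <- available[which.min(D[i, available])]
  greedy[i] <- j
  available <- setdiff(available, j)
}
greedy_total <- sum(D[cbind(rows, greedy)])

# Hypothetical helper: all permutations of a vector (toy sizes only)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Optimal 1:1 matching: the assignment with the smallest total discrepancy
candidates <- perms(seq_len(ncol(D)))
totals <- vapply(candidates, function(p) sum(D[cbind(rows, p)]), numeric(1))
optimal <- candidates[[which.min(totals)]]
optimal_total <- min(totals)

# Compare the two solutions
rbind(greedy = greedy, optimal = optimal)
c(greedy = greedy_total, optimal = optimal_total)
```

In this toy case the greedy pass gives the first treated unit its nearest control, which forces a poor match for the second unit, whereas the exhaustive search swaps those two assignments and reaches a smaller total discrepancy.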
Michael J. Lopez, Roee Gutman. "Estimation of Causal Effects with Multiple Treatments: A Review and New Ideas." Statistical Science 32(3), 432-454 (August 2017).
estimate_gps() for the calculation of generalized propensity scores; MatchIt::matchit(), optmatch::fullmatch() and optmatch::pairmatch() for the documentation of the matching functions; stats::kmeans() for the documentation of the k-means algorithm.
# Define the formula used for GPS estimation
formula_cancer <- formula(status ~ age + sex)

# Step 1: Estimate the generalized propensity scores
gp_scores <- estimate_gps(formula_cancer,
  data = cancer,
  method = "multinom",
  reference = "control",
  verbose_output = TRUE
)

# Step 2: Define the common support region
gps_csr <- csregion(gp_scores)

# Step 3: Match on the GPS within k-means clusters
matched_cancer <- match_gps(gps_csr,
  caliper = 0.25,
  reference = "control",
  method = "fullopt",
  kmeans_cluster = 2,
  kmeans_args = list(
    iter.max = 200,
    algorithm = "Forgy"
  ),
  verbose_output = TRUE
)