polymatch: Polymatching

Description

polymatch generates matched samples in designs with up to 10 groups.

Usage

polymatch(
  formulaMatch,
  start = "small.to.large",
  data,
  distance = "euclidean",
  exactMatch = NULL,
  vectorK = NULL,
  iterate = TRUE,
  niter_max = 50,
  withinGroupDist = TRUE,
  verbose = TRUE
)

Value

A list containing the following components:

match_id: A numeric vector identifying the matched sets; matched units share the same identifier.
total_distance: Total distance of the returned matched sample.
total_distance_start: Total distance at the starting point.

Arguments

formulaMatch

Formula with form group ~ x_1 + ... + x_p, where group is the name of the variable identifying the treatment groups/exposures and x_1,...,x_p are the matching variables.

start

An object specifying the starting point of the iterative algorithm. Three types of input are accepted:

start="small.to.large" (default): the starting matched set is generated by matching groups from the smallest to the largest.
Users can specify the order in which groups are matched to build the starting sample. For example, if there are four groups with labels "A", "B", "C" and "D", start="D-B-A-C" generates the starting sample by matching groups "D" and "B", then units from "A" to the "D"-"B" pairs, then units from "C" to the "D"-"B"-"A" triplets.
Users can provide the starting matched set and the algorithm will explore possible reductions in the total distance. In this case, start must be a vector with the IDs of the matched sets, i.e., a vector with length equal to the number of rows of data where matched subjects are flagged with the same value and non-matched subjects have value NA.
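As an illustration of the third option, the sketch below builds a vector in the format described above using base R only. The toy data frame and the name start_ids are hypothetical, and the particular row assignments are arbitrary; the point is only the shape of the vector (one entry per row of data, a shared ID within each matched set, NA for unmatched rows).

```r
# Toy data with three groups (hypothetical, for illustration only)
dat <- data.frame(group = c("A", "A", "B", "B", "B", "C", "C", "C", "C"))

# One entry per row of the data; NA marks rows left unmatched
start_ids <- rep(NA_integer_, nrow(dat))
start_ids[c(1, 3, 6)] <- 1L  # matched set 1: rows 1 (A), 3 (B), 6 (C)
start_ids[c(2, 4, 7)] <- 2L  # matched set 2: rows 2 (A), 4 (B), 7 (C)

# Sanity check for a 1:1:1 design: each set holds one unit per group
tab <- table(start_ids, dat$group)
valid <- all(tab == 1)
```

A vector built this way would then be passed as the start argument.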

data

The data.frame object with the data.

distance

String specifying whether the distance between pairs of observations should be computed with the Euclidean ("euclidean", default) or Mahalanobis ("mahalanobis") distance. See section 'Details' for further information.

exactMatch

Formula with form ~ z_1 + ... + z_k, where z_1,...,z_k must be factor variables. Subjects are exactly matched on z_1,...,z_k, i.e., matched within levels of these variables.

vectorK

A named vector with the number of subjects from each group in each matched set. The names of the vector must be the labels of the groups, i.e., the levels of the variable identifying the treatment groups/exposures. For example, with four groups labeled "A", "B", "C" and "D" and a desired 1:2:3:3 design (1 subject from A, 2 from B, 3 from C and 3 from D in each matched set), the parameter should be set to vectorK = c("A" = 1, "B" = 2, "C" = 3, "D" = 3). By default, the generated matched design includes 1 subject per group in each matched set, i.e., a 1:1: ... :1 matched design.
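The following base R sketch shows one way to set up vectorK and check it against the group labels before matching. The variable names (group, n_per_group, max_sets, set_size) are hypothetical. The count of matched sets is only a simple upper bound implied by the group sizes, not something the function is documented to report, and it ignores any further restriction from exactMatch.

```r
# Group labels with sizes 10, 20 and 30 (hypothetical data)
group <- factor(c(rep("A", 10), rep("B", 20), rep("C", 30)))

# Desired 1:2:3 design
vectorK <- c("A" = 1, "B" = 2, "C" = 3)

# The names of vectorK must be the group labels
stopifnot(setequal(names(vectorK), levels(group)))

# Units in each matched set
set_size <- sum(vectorK)

# Upper bound on the number of matched sets: each set uses vectorK[g]
# units from group g, so no more than floor(n_g / vectorK[g]) sets fit
n_per_group <- as.numeric(table(group)[names(vectorK)])
max_sets <- min(floor(n_per_group / vectorK))
```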

iterate

Boolean specifying whether iterations should be done (iterate=TRUE, default) or not (iterate=FALSE).

niter_max

Maximum number of iterations. Default is 50.

withinGroupDist

Boolean specifying whether the distances within the same treatment/exposure group should be considered in the total distance. For example, in a 1:2:3 matched design among the groups A, B and C, the parameter controls whether the distance between the two subjects in B and the three pairwise distances among the subjects in C should be counted in the total distance. The default value is TRUE.
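To make the 1:2:3 example concrete, the base R sketch below enumerates the pairs of units in a single matched set and separates within-group from between-group pairs. It only counts pairs and does not call polymatch; the variable names are hypothetical.

```r
# Group labels of the six units in one matched set of a 1:2:3 design
grp <- c("A", "B", "B", "C", "C", "C")

# All unordered pairs of units in the set
pairs <- t(combn(length(grp), 2))

# A pair is "within group" when both units carry the same label
same_group <- grp[pairs[, 1]] == grp[pairs[, 2]]

n_within  <- sum(same_group)   # 1 pair in B plus 3 pairs in C
n_between <- sum(!same_group)  # pairs across different groups
```

Of the 15 pairs in the set, 4 are within-group pairs. Following the description above, these are the pairs whose distances are counted with withinGroupDist=TRUE and dropped with withinGroupDist=FALSE.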

verbose

Boolean: should text be printed in the console? Default is TRUE.

Details

The function implements the conditionally optimal matching algorithm, which iteratively uses two-group optimal matching steps to generate matched samples with small total distance. In the current implementation, it is possible to generate matched samples with multiple subjects per group, with the matching ratio being specified by the vectorK parameter.

The steps of the algorithm are described with the following example. Consider a 4-group design with groups labels "A", "B", "C" and "D" and a 1:1:1:1 matching ratio. The algorithm requires a set of quadruplets as starting point. The argument start defines the approach to be used to generate such a starting point. polymatch generates the starting point by sequentially using optimal two-group matching. In the default setting (start="small.to.large"), the steps are:

1) optimally match the two smallest groups;
2) optimally match the third smallest group to the pairs generated in the first step;
3) optimally match the last group to the triplets generated in the second step.

Notably, we can use the optimal two-group algorithm in steps 2) and 3) because each of them is a two-sided problem: the elements of one group on one side, the fixed matched sets on the other. The order in which groups are considered when generating the starting point can be user-specified (e.g., start="D-B-A-C"). Alternatively, the user can provide a matched set to be used as the starting point.
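As a small illustration of the ordering, the base R sketch below derives the smallest-to-largest group order from the group sizes and writes it in the "D-B-A-C" style accepted by start. The data and variable names are hypothetical. How polymatch breaks ties between equally sized groups is not stated here, so the sketch uses groups of distinct sizes.

```r
# Hypothetical group labels with sizes A = 10, B = 20, C = 30, D = 5
group <- c(rep("A", 10), rep("B", 20), rep("C", 30), rep("D", 5))

# Order the groups from smallest to largest
order_small_to_large <- names(sort(table(group)))

# The same order written as an explicit start string
start_string <- paste(order_small_to_large, collapse = "-")
```

With these sizes the string is "D-A-B-C", which by the description above names the same matching order that start="small.to.large" would use.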

Given the starting matched set, the algorithm iteratively explores possible reductions in the total distance (if iterate=TRUE), by sequentially relaxing the connection to each group and rematching units of that group. In our example:

1) rematch "B-C-D" triplets within the starting quadruplets to units in group "A";
2) rematch "A-C-D" triplets within the starting quadruplets to units in group "B";
3) rematch "A-B-D" triplets within the starting quadruplets to units in group "C";
4) rematch "A-B-C" triplets within the starting quadruplets to units in group "D".

If none of the sets of quadruplets generated in 1)-4) has a smaller total distance than the starting point, the algorithm stops. Otherwise, the set of quadruplets with the smallest total distance is selected and the process is iterated until no reduction in the total distance is found or the maximum number of iterations is reached (niter_max=50 by default).

The total distance is defined as the sum of all the within-matched-set distances. The within-matched-set distance is defined as the sum of the pairwise distances between pairs of units in the matched set. The type of distance is specified with the distance argument. The current implementation supports Euclidean (distance="euclidean") and Mahalanobis (distance="mahalanobis") distances. In particular, for the Mahalanobis distance, the covariance matrix is defined only once on the full dataset.
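The definition above can be sketched in base R as follows. The helper total_distance is hypothetical and not part of the package. For the Mahalanobis case it estimates the covariance matrix once on all rows, as described, and then transforms the data so that ordinary Euclidean distances on the transformed data equal Mahalanobis distances on the original. Whether polymatch applies any additional scaling or preprocessing is not stated here, so the values need not match the function's output exactly.

```r
# Hypothetical helper: sum of pairwise distances within each matched set
total_distance <- function(X, match_id, distance = "euclidean") {
  X <- as.matrix(X)
  if (distance == "mahalanobis") {
    # Covariance estimated once on the full dataset
    S <- cov(X)
    # With S = R'R (Cholesky), Euclidean distance on X %*% solve(R)
    # equals Mahalanobis distance on X
    X <- X %*% solve(chol(S))
  }
  ids <- unique(match_id[!is.na(match_id)])
  per_set <- vapply(ids, function(id) {
    rows <- which(match_id == id)
    sum(dist(X[rows, , drop = FALSE]))  # all pairwise distances in the set
  }, numeric(1))
  sum(per_set)
}

# Toy data: two matched pairs and one unmatched unit
X <- matrix(c(0, 0,
              3, 4,
              0, 0,
              1, 0,
              10, 10), ncol = 2, byrow = TRUE)
match_id <- c(1, 1, 2, 2, NA)

d_euc <- total_distance(X, match_id)
d_mah <- total_distance(X, match_id, distance = "mahalanobis")
```

This helper counts every pair within a matched set, which corresponds to the withinGroupDist=TRUE setting.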

Examples

#Generate a dataset with a group indicator and four variables:
#- var1, continuous, sampled from normal distributions;
#- var2, continuous, sampled from beta distributions;
#- var3, categorical with 4 levels;
#- var4, binary.
set.seed(1234567)
dat <- data.frame(group = c(rep("A",10),rep("B",20),rep("C",30)),
               var1 = c(rnorm(10,mean=0,sd=1),
                        rnorm(20,mean=1,sd=2),
                        rnorm(30,mean=-1,sd=2)),
               var2 = c(rbeta(10,shape1=1,shape2=1),
                        rbeta(20,shape1=2,shape2=1),
                        rbeta(30,shape1=1,shape2=2)),
               var3 = factor(c(rbinom(10,size=3,prob=.4),
                               rbinom(20,size=3,prob=.5),
                               rbinom(30,size=3,prob=.3))),
               var4 = factor(c(rbinom(10,size=1,prob=.5),
                               rbinom(20,size=1,prob=.3),
                               rbinom(30,size=1,prob=.7))))

#Match on propensity score
#-------------------------

#With more than two groups, the PS requires a multinomial model
library(VGAM)
psModel <- vglm(group ~ var1 + var2 + var3 + var4,
                family=multinomial, data=dat)
#Estimated logits - 2 for each unit: log(P(group=A)/P(group=C)), log(P(group=B)/P(group=C))
logitPS <- predict(psModel, type = "link")
dat$logit_AvsC <- logitPS[,1]
dat$logit_BvsC <- logitPS[,2]

#Match on logits of PS
resultPs <- polymatch(group ~ logit_AvsC + logit_BvsC, data = dat,
                    distance = "euclidean")
dat$match_id_ps <- resultPs$match_id


#Match on covariates
#--------------------


#Match on continuous covariates with exact match on categorical/binary variables
resultCov <- polymatch(group ~ var1 + var2, data = dat,
                        distance = "mahalanobis",
                        exactMatch = ~var3+var4)
dat$match_id_cov <- resultCov$match_id