The Euclidean distance (computed using euclidean_dist()) is the raw distance between units, computed as
$$d_{ij} = \sqrt{(x_i - x_j)(x_i - x_j)'}$$
where \(x_i\) and \(x_j\) are vectors of covariates for units \(i\) and \(j\), respectively. The Euclidean distance is sensitive to the scales of the variables and their redundancy (i.e., correlation). It should probably not be used for matching unless all of the variables have been previously scaled appropriately or are already on the same scale. It forms the basis of the other distance measures.
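The scale sensitivity described above can be illustrated with a minimal sketch. This is plain Python written for illustration, not the package's implementation; the helper `euclidean` and the two example units (age in years, income in dollars) are hypothetical.

```python
import math

def euclidean(x_i, x_j):
    """Raw Euclidean distance between two covariate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical units: (age in years, income in dollars)
unit_a = (30.0, 50_000.0)
unit_b = (60.0, 50_500.0)
d_dollars = euclidean(unit_a, unit_b)

# The same two units with income expressed in thousands of dollars
unit_a_k = (30.0, 50.0)
unit_b_k = (60.0, 50.5)
d_thousands = euclidean(unit_a_k, unit_b_k)

print(d_dollars, d_thousands)
```

With income in dollars, the distance is driven almost entirely by the 500-dollar income gap; with income in thousands, it is driven almost entirely by the 30-year age gap. The units are the same in both cases; only the measurement scale changed.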
The scaled Euclidean distance (computed using scaled_euclidean_dist()) is the Euclidean distance computed on the scaled covariates. Typically the covariates are scaled by dividing by their standard deviations, but any scaling factor can be supplied using the var argument. This leads to a distance measure computed as
$$d_{ij} = \sqrt{(x_i - x_j)S_d^{-1}(x_i - x_j)'}$$
where \(S_d\) is a diagonal matrix with the squared scaling factors on the diagonal. Although this measure is not sensitive to the scales of the variables (because they are all placed on the same scale), it is still sensitive to redundancy among the variables. For example, if 5 variables measure approximately the same construct (i.e., are highly correlated) and 1 variable measures another construct, the first construct will have 5 times as much influence on the distance between units as the second construct. The Mahalanobis distance attempts to address this issue.
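As a rough sketch of the scaling step, the following plain Python example divides each covariate difference by that covariate's sample standard deviation before computing the Euclidean distance. The helper `scaled_euclidean` and the data are hypothetical, and the sketch assumes standard deviations as the scaling factors; it is not the package's code.

```python
import math
import statistics

def scaled_euclidean(X, i, j):
    """Euclidean distance between rows i and j of X after dividing
    each covariate by its sample standard deviation."""
    sds = [statistics.stdev(col) for col in zip(*X)]
    return math.sqrt(
        sum(((a - b) / s) ** 2 for a, b, s in zip(X[i], X[j], sds))
    )

# Hypothetical data: (age in years, income in dollars)
X_dollars = [
    (30.0, 50_000.0),
    (60.0, 50_500.0),
    (45.0, 80_000.0),
    (25.0, 20_000.0),
]
# The same data with income in thousands of dollars
X_thousands = [(age, income / 1000.0) for age, income in X_dollars]

d1 = scaled_euclidean(X_dollars, 0, 1)
d2 = scaled_euclidean(X_thousands, 0, 1)
print(d1, d2)
```

The two distances agree up to floating-point error, because changing a covariate's units changes its standard deviation by the same factor. Correlation between covariates is not handled by this scaling.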
The Mahalanobis distance (computed using mahalanobis_dist()) is computed as
$$d_{ij} = \sqrt{(x_i - x_j)S^{-1}(x_i - x_j)'}$$
where \(S\) is a scaling matrix, typically the covariance matrix of the covariates. It is essentially equivalent to the Euclidean distance computed on the scaled principal components of the covariates. This is the most popular distance measure for matching because it is not sensitive to the scale of the covariates and accounts for redundancy between them. The scaling matrix can also be supplied using the var argument.
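To show how the Mahalanobis distance accounts for redundancy, here is a small two-covariate sketch in plain Python. The helper `mahalanobis_2d` and the covariance matrix (unit variances, correlation 0.9) are assumptions chosen for illustration; this is not how the package computes the distance.

```python
import math

def mahalanobis_2d(x_i, x_j, S):
    """Mahalanobis distance for two covariates, inverting the 2x2
    scaling matrix S explicitly."""
    (a, b), (c, d) = S
    det = a * d - b * c
    S_inv = ((d / det, -b / det), (-c / det, a / det))
    diff = (x_i[0] - x_j[0], x_i[1] - x_j[1])
    q = sum(
        diff[r] * S_inv[r][k] * diff[k]
        for r in range(2)
        for k in range(2)
    )
    return math.sqrt(q)

# Assumed covariance matrix: unit variances, correlation 0.9
S = ((1.0, 0.9), (0.9, 1.0))

origin = (0.0, 0.0)
along = (1.0, 1.0)     # differs in the direction the covariates co-vary
against = (1.0, -1.0)  # differs against that direction

d_along = mahalanobis_2d(along, origin, S)
d_against = mahalanobis_2d(against, origin, S)
print(d_along, d_against)
```

Both points are the same Euclidean distance from the origin, but the Mahalanobis distance treats the difference along the shared direction as small (the two covariates carry mostly the same information) and the difference against it as large.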
The Mahalanobis distance can be sensitive to outliers and long-tailed or otherwise non-normally distributed covariates and may not perform well with categorical variables due to prioritizing rare categories over common ones. One solution is the rank-based robust Mahalanobis distance (computed using robust_mahalanobis_dist()), which is computed by first replacing the covariates with their ranks (using average ranks for ties) and rescaling each ranked covariate by a constant scaling factor before computing the usual Mahalanobis distance on the rescaled ranks.
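The ranking step can be sketched as follows. This plain Python example covers only the replacement of a covariate by its average ranks; the subsequent rescaling and the Mahalanobis computation are omitted. The helper `average_ranks` and the data are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of the
    ranks they jointly occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` to cover the full run of tied values
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks

# Hypothetical covariate with one tie and one extreme outlier
income = [20.0, 35.0, 35.0, 50.0, 10_000.0]
ranks = average_ranks(income)
print(ranks)
```

On the rank scale the outlier sits one step above its nearest neighbor rather than thousands of units away, which is why the rank-based variant is less sensitive to long tails.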
The Mahalanobis distance and its robust variant are computed internally by transforming the covariates in such a way that the Euclidean distance computed on the transformed covariates is equal to the requested distance. For the Mahalanobis distance, this involves replacing the covariate vector \(x_i\) with \(x_iS^{-.5}\), where \(S^{-.5}\) is the Cholesky factor of the (generalized) inverse of the covariance matrix \(S\).
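The equivalence between the transformed-covariate Euclidean distance and the Mahalanobis distance can be checked numerically. The sketch below is restricted to two covariates so the inverse and the Cholesky factor can be written out by hand in plain Python. The helper names and the matrix \(S\) are assumptions for illustration, and the sketch takes \(S^{-.5}\) to be the lower-triangular factor \(L\) with \(LL' = S^{-1}\).

```python
import math

def inverse_2x2(S):
    (a, b), (c, d) = S
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def cholesky_lower_2x2(M):
    """Lower-triangular L with L L' = M, for symmetric positive
    definite 2x2 M."""
    l11 = math.sqrt(M[0][0])
    l21 = M[1][0] / l11
    l22 = math.sqrt(M[1][1] - l21 ** 2)
    return ((l11, 0.0), (l21, l22))

def transform(x, L):
    """Row vector x times L."""
    return (x[0] * L[0][0] + x[1] * L[1][0],
            x[0] * L[0][1] + x[1] * L[1][1])

S = ((1.0, 0.9), (0.9, 1.0))  # assumed covariance matrix
S_inv = inverse_2x2(S)
L = cholesky_lower_2x2(S_inv)

x_i, x_j = (1.0, -1.0), (0.0, 0.0)

# Euclidean distance on the transformed covariates
z_i, z_j = transform(x_i, L), transform(x_j, L)
d_transformed = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_i, z_j)))

# Direct quadratic form (x_i - x_j) S^{-1} (x_i - x_j)'
diff = (x_i[0] - x_j[0], x_i[1] - x_j[1])
d_direct = math.sqrt(sum(
    diff[r] * S_inv[r][k] * diff[k] for r in range(2) for k in range(2)
))
print(d_transformed, d_direct)
```

The two values agree because \((x_iL - x_jL)(x_iL - x_jL)' = (x_i - x_j)LL'(x_i - x_j)'\).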
When a left-hand-side splitting variable is present in formula and var = NULL (i.e., the scaling matrix is computed internally), the covariance matrix used is the "pooled" covariance matrix, which is essentially a weighted average of the covariance matrices computed separately within each level of the splitting variable to capture within-group variation and reduce sensitivity to covariate imbalance. This is also true of the scaling factors used in the scaled Euclidean distance.
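For a single covariate, the idea of pooling can be sketched as below. The example assumes the common textbook weighting, in which each group's variance is weighted by its degrees of freedom (group size minus one); the weights actually used internally may differ. The helper `pooled_variance` and the data are hypothetical.

```python
import math
import statistics

def pooled_variance(groups):
    """Average of within-group sample variances, weighted by each
    group's degrees of freedom (an assumed weighting)."""
    num = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return num / den

# Hypothetical covariate values, badly imbalanced between groups
treated = [10.0, 12.0, 14.0]
control = [0.0, 2.0, 4.0]

pooled = pooled_variance([treated, control])
overall = statistics.variance(treated + control)
print(pooled, overall)
```

The overall variance is inflated by the difference in group means, whereas the pooled variance reflects only the spread within each group. This is the sense in which pooling reduces sensitivity to covariate imbalance.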