mdist: Calculation of Pairwise Distances for Mixed-Type Data

Description

Computes pairwise distances between observations described by numeric and/or categorical attributes, optionally between a validation set and a training set. The function can compute independent, dependent, and practice-based distances as defined in van de Velden et al. (2024), and supports a range of continuous and categorical distance metrics, scaling methods, and commensurability adjustments.

Usage

mdist(x, validate_x = NULL, response = NULL, distance_cont = "manhattan", 
      distance_cat = "tot_var_dist", commensurable = FALSE, scaling = "none",
      ncomp = ncol(x), threshold = NULL, preset = "custom")

Value

A matrix of pairwise distances. If validate_x is provided, rows correspond to validation observations and columns to training observations.
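
The row/column layout described above can be illustrated without the package. The following is a minimal base R sketch, not the function's implementation: the toy data frames `train` and `valid` and the helper `cross_manhattan()` are hypothetical names introduced here only to show a validation-by-training distance matrix.

```r
# Toy training and validation sets (hypothetical, numeric only)
train <- data.frame(a = c(0, 1, 4), b = c(2, 2, 0))
valid <- data.frame(a = c(1, 3),    b = c(0, 1))

# Manhattan distance from every validation row to every training row.
# Rows of the result index validation observations, columns index
# training observations.
cross_manhattan <- function(valid, train) {
  va <- as.matrix(valid)
  tr <- as.matrix(train)
  d <- matrix(0, nrow = nrow(va), ncol = nrow(tr))
  for (i in seq_len(nrow(va))) {
    # sweep() subtracts validation row i from each training row
    d[i, ] <- rowSums(abs(sweep(tr, 2, va[i, ])))
  }
  d
}

d <- cross_manhattan(valid, train)
dim(d)  # 2 validation rows by 3 training columns
```

Under this layout, `d[i, j]` is the distance between the i-th validation observation and the j-th training observation.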

Arguments

x

A data frame or tibble containing continuous (coded as numeric), categorical (coded as factors), or mixed-type variables.

validate_x

Optional validation data with the same structure as x. If provided, distances are computed between observations in validate_x and x. Default is NULL.

response

An optional factor for supervised distance calculation in categorical variables, applied only if distance_cat = "supervised". Default is NULL.

distance_cont

Character string specifying the distance metric for continuous variables. Options include "manhattan" (default) and "euclidean".

distance_cat

Character string specifying the distance metric for categorical variables. Options include "tot_var_dist" (default), "HL", "HLeucl", "cat_dis", "mca", "st_dev", "matching", "eskin", "iof", "of", "goodall_3", "goodall_4", "gifi_chi2", "lin", "var_entropy", "var_mutability", "supervised", "supervised_full", "le_and_ho", and all the options available in the philentropy package.

commensurable

Logical. If TRUE, the function adjusts each variable's contribution to ensure equal average influence in the overall distance. Default is FALSE.
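
One way to picture "equal average influence" is to rescale each variable's distance matrix so that its mean pairwise distance equals 1 before summing. The base R sketch below works under that assumption; it is an illustration of the idea, not the adjustment actually coded in the package, and the data and the helper `mean_offdiag()` are hypothetical.

```r
# Two numeric variables on very different scales (hypothetical data)
x <- data.frame(
  small = c(0.1, 0.2, 0.4, 0.3),
  large = c(100, 300, 200, 500)
)

# One pairwise distance matrix per variable (absolute differences)
per_var <- lapply(x, function(col) as.matrix(dist(col, method = "manhattan")))

# Without any adjustment, the sum is dominated by `large`
raw_total <- Reduce(`+`, per_var)

# Assumed adjustment: divide each matrix by its mean pairwise distance
mean_offdiag <- function(m) mean(m[upper.tri(m)])
scaled <- lapply(per_var, function(m) m / mean_offdiag(m))
comm_total <- Reduce(`+`, scaled)

# After rescaling, every variable has the same average contribution
sapply(scaled, mean_offdiag)
```

In `raw_total` the variable `large` accounts for almost all of the distance; in `comm_total` both variables contribute the same amount on average.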

scaling

Character string specifying the scaling method for continuous variables. Options include "none" (default), "std", "range", "pc_scores", and "robust".

ncomp

Integer specifying the number of components to retain when scaling = "pc_scores". Default is ncol(x).

threshold

Numeric value specifying the share of total variance that the retained components must explain when scaling = "pc_scores", given as a proportion (for instance, threshold = 0.85 in Example 5). Overrides ncomp if specified. Default is NULL.
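
How a variance threshold can translate into a number of retained components is sketched below with base R's prcomp() on the built-in iris data. This is an assumption-laden illustration: it treats the threshold as a proportion between 0 and 1 (as Example 5 suggests) and standardizes the variables before the PCA, neither of which is confirmed here as the package's internal behaviour.

```r
# Numeric columns of a built-in dataset
x_num <- iris[, 1:4]

# PCA on standardized variables (assumed preprocessing)
pca <- prcomp(x_num, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.85
ncomp <- which(cum_var >= threshold)[1]

# Distances are then computed on the retained component scores
scores <- pca$x[, seq_len(ncomp), drop = FALSE]
d <- as.matrix(dist(scores, method = "euclidean"))
dim(d)
```

With this rule, a threshold always selects the fewest components that reach the requested share, which is why it can replace an explicit ncomp.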

preset

Character string specifying pre-defined combinations of arguments. Options include:

"custom" (default): Use specified distance metrics and parameters
"gower": Gower's distance for mixed data
"unbiased_dependent": Total variation distance for categorical and Manhattan for standardized continuous
"euclidean_onehot": Euclidean distance on one-hot encoded categorical and standardized continuous
"catdissim": Matching distance for categorical and Manhattan for standardized continuous

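For orientation, the textbook form of Gower's distance named in the "gower" preset can be written in a few lines of base R: range-normalized absolute differences for numeric variables, simple matching for categorical ones, averaged over variables. The helper `gower_sketch()` and the toy data are hypothetical, and the preset's exact internals (for example, whether contributions are averaged or summed) are not stated above, so this is a sketch of the general definition only.

```r
# Toy mixed-type data (hypothetical)
df <- data.frame(
  height = c(150, 160, 180),
  colour = factor(c("red", "blue", "red"))
)

gower_sketch <- function(df) {
  n <- nrow(df)
  total <- matrix(0, n, n)
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.numeric(col)) {
      # Range-normalized absolute difference (assumes a non-zero range)
      part <- abs(outer(col, col, "-")) / diff(range(col))
    } else {
      # Simple matching: 0 for equal categories, 1 otherwise
      lab <- as.character(col)
      part <- 1 * outer(lab, lab, "!=")
    }
    total <- total + part
  }
  # Average over variables so the result lies in [0, 1]
  total / ncol(df)
}

g <- gower_sketch(df)
round(g, 3)
```

Each variable contributes a value between 0 and 1, which is what makes numeric and categorical parts directly comparable in this definition.
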
References

van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.

Examples

library(manydist)        # provides mdist()
library(palmerpenguins)  # penguins data
library(rsample)         # initial_split(), training(), testing()

# Prepare complete data
pengmix <- palmerpenguins::penguins[complete.cases(palmerpenguins::penguins), ]

# Create training-test split
set.seed(123)
pengmix_split <- initial_split(pengmix, prop = 0.8)
tr_pengmix <- training(pengmix_split)
ts_pengmix <- testing(pengmix_split)

# Example 1: Basic usage with validation data
dist_matrix <- mdist(x = tr_pengmix, 
                    validate_x = ts_pengmix)

# Example 2: Gower preset with validation
dist_gower <- mdist(x = tr_pengmix, 
                   validate_x = ts_pengmix,
                   preset = "gower", 
                   commensurable = TRUE)

# Example 3: Euclidean one-hot preset with validation
dist_onehot <- mdist(x = tr_pengmix, 
                    validate_x = ts_pengmix,
                    preset = "euclidean_onehot")

# Example 4: Custom preset with standardization
dist_custom <- mdist(x = tr_pengmix,
                    validate_x = ts_pengmix,
                    preset = "custom",
                    distance_cont = "manhattan",
                    distance_cat = "matching",
                    commensurable = TRUE,
                    scaling = "std")

# Example 5: PCA-based scaling with threshold
dist_pca <- mdist(x = tr_pengmix,
                 validate_x = ts_pengmix,
                 distance_cont = "euclidean",
                 scaling = "pc_scores",
                 threshold = 0.85)

# Example 6: Categorical variables only
cat_vars <- c("species", "island", "sex")
dist_cat <- mdist(tr_pengmix[, cat_vars],
                 validate_x = ts_pengmix[, cat_vars],
                 distance_cat = "tot_var_dist")

# Example 7: Continuous variables only
num_vars <- c("bill_length_mm", "bill_depth_mm", 
              "flipper_length_mm", "body_mass_g")
dist_cont <- mdist(tr_pengmix[, num_vars],
                  validate_x = ts_pengmix[, num_vars],
                  distance_cont = "manhattan",
                  scaling = "std")

# Example 8: Supervised distance with response
response_tr <- tr_pengmix$body_mass_g
dist_sup <- mdist(tr_pengmix,
                 validate_x = ts_pengmix,
                 response = response_tr,
                 distance_cat = "supervised")