cdist: Calculation of Pairwise Distances for Categorical Data

Description

Computes a distance matrix for categorical variables, with support for validation data, multiple distance metrics, and variable weighting. The function implements the distance calculations described in van de Velden et al. (2024), including commensurable distances and supervised options when a response variable is provided.

Usage

cdist(x, response = NULL, validate_x = NULL, method = "tot_var_dist",
      commensurable = FALSE, weights = 1)

Value

A list containing:

distance_mat: Matrix of pairwise distances. If validate_x is provided, rows correspond to validation observations and columns to training observations.
delta: Matrix or list of matrices containing level-wise distances for each variable.
delta_names: Vector of level names used in the delta matrices.
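
To make the relationship between delta and distance_mat concrete, the base-R sketch below builds a level-wise matrix for a single toy factor under simple matching (0 for equal levels, 1 otherwise) and expands it to observations by indexing. The data and object names are illustrative only, and this is not the package's internal code.

```r
# Toy factor with three levels (hypothetical data)
x <- factor(c("a", "b", "a", "c"))

# Level-wise distances under simple matching: 0 on the diagonal, 1 elsewhere
lev <- levels(x)
delta <- 1 - diag(length(lev))
dimnames(delta) <- list(lev, lev)

# Expand to an observation-by-observation matrix by indexing with level codes
codes <- as.integer(x)
distance_mat <- delta[codes, codes]
dimnames(distance_mat) <- NULL
```

Here distance_mat has one row and one column per observation, while delta has one row and one column per level.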

Arguments

x

A data frame or matrix of categorical variables (factors).

response

Optional response variable for supervised distance calculations. Default is NULL.

validate_x

Optional validation data frame or matrix. If provided, distances are computed between observations in validate_x and x. Default is NULL.

method

Character string or vector specifying the distance metric(s). Options include:

"tot_var_dist": Total variation distance (default)
"HL", "HLeucl": Hennig-Liao distance
"cat_dis": Category-based dissimilarity
"mca": Multiple correspondence analysis based
"st_dev": Standard deviation based
"matching", "eskin", "iof", "of": Simple matching, Eskin, inverse occurrence frequency, and occurrence frequency coefficients
"goodall_3", "goodall_4": Goodall-based distances
"gifi_chi2": Gifi chi-square distance
"lin": Lin's similarity measure
"var_entropy", "var_mutability": Variability-based measures
"supervised", "supervised_full": Response-guided distances
"le_and_ho": Le and Ho distance
Additional methods from the philentropy package

A single string applies the same method to every variable; a vector specifies a different method for each variable.

commensurable

Logical. If TRUE, standardizes each variable's distance matrix by dividing by its mean distance. Default is FALSE.
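
As a rough illustration of this scaling, the base-R sketch below computes simple matching distances for two toy factors and divides each matrix by its mean. Taking the mean over all entries of the matrix is an assumption made here for simplicity; the package may define the mean distance differently. The helper match_dist is hypothetical.

```r
# Per-variable observation distances for two toy factors (simple matching)
x1 <- factor(c("a", "b", "a", "c"))
x2 <- factor(c("u", "u", "v", "v"))

# Hypothetical helper: 1 if two observations differ on the factor, 0 otherwise
match_dist <- function(f) outer(as.integer(f), as.integer(f), FUN = "!=") * 1

d1 <- match_dist(x1)
d2 <- match_dist(x2)

# Divide each matrix by its mean distance (assumption: mean over all entries)
d1_comm <- d1 / mean(d1)
d2_comm <- d2 / mean(d2)

# After scaling, both variables contribute the same average distance
c(mean(d1_comm), mean(d2_comm))
```

Before scaling, the two variables have different mean distances, so one would dominate a sum; after scaling, each has mean 1.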

weights

Numeric vector or matrix of weights. If a vector, its length must equal the number of variables. If a matrix, its dimensions must match those of the level-wise distances. Default is 1 (equal weighting).
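
The sketch below shows one way a weight vector could act on per-variable distances, assuming the overall distance is a weighted sum of the per-variable matrices. That aggregation rule, the helper match_dist, and the toy data are assumptions for illustration, not a description of the package internals.

```r
# Per-variable matching distances for two toy factors
x1 <- factor(c("a", "b", "a", "c"))
x2 <- factor(c("u", "u", "v", "v"))

# Hypothetical helper: 1 if two observations differ on the factor, 0 otherwise
match_dist <- function(f) outer(as.integer(f), as.integer(f), FUN = "!=") * 1
per_var <- list(match_dist(x1), match_dist(x2))

# One weight per variable; the vector length must equal the number of variables
w <- c(2, 1)
stopifnot(length(w) == length(per_var))

# Weighted sum of the per-variable matrices (assumed aggregation rule)
weighted <- Reduce(`+`, Map(function(d, wj) wj * d, per_var, w))
```

With these weights, a mismatch on the first variable counts twice as much as a mismatch on the second.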

Details

The cdist function provides a comprehensive framework for categorical distance calculations:

Supports multiple distance calculation methods that can be specified globally or per variable
Handles validation data through the validate_x argument
Implements supervised distances when a response variable is provided
Supports commensurable distances for better comparability across variables
Provides flexible weighting schemes at variable and level granularity

Important notes:

Input variables are automatically converted to factors with dropped unused levels
Specifying different methods per variable is not supported for "none", "st_dev", "HL", "cat_dis", "HLeucl", "mca"
Weight vector length must match the number of variables when specified as a vector
Variables should be factors; numeric variables will cause errors
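
The factor conversion described in the first note can also be done by hand before calling cdist. The following base-R sketch, using a hypothetical toy data frame, converts every column to a factor and drops levels that do not occur.

```r
# Toy data frame: one character column and one factor with an unused level
df <- data.frame(
  colour = c("red", "blue", "red"),
  size   = factor(c("S", "M", "S"), levels = c("S", "M", "L")),
  stringsAsFactors = FALSE
)

# Convert every column to a factor and drop levels that do not occur
df[] <- lapply(df, function(v) droplevels(as.factor(v)))

# The unused level "L" is gone
levels(df$size)
```

Preparing the data this way makes the set of levels explicit, which helps when checking the dimensions of a weight matrix against the level-wise distances.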

References

van de Velden, M., Iodice D'Enza, A., Markos, A., Cavicchia, C. (2024). (Un)biased distances for mixed-type data. arXiv preprint. Retrieved from https://arxiv.org/abs/2411.00429.

Examples

library(manydist)        # assumed to be the package that provides cdist()
library(palmerpenguins)
library(rsample)

# Prepare data with complete cases for both categorical variables and response
complete_vars <- c("species", "island", "sex", "body_mass_g")
penguins_complete <- penguins[complete.cases(penguins[, complete_vars]), ]
penguins_cat <- penguins_complete[, c("species", "island", "sex")]
response <- penguins_complete$body_mass_g

# Create training-test split
set.seed(123)
penguins_split <- initial_split(penguins_cat, prop = 0.8)
tr_penguins <- training(penguins_split)
ts_penguins <- testing(penguins_split)
response_tr <- response[penguins_split$in_id]
response_ts <- response[-penguins_split$in_id]

# Basic usage
result <- cdist(tr_penguins)

# With validation data
val_result <- cdist(x = tr_penguins,
                   validate_x = ts_penguins,
                   method = "tot_var_dist")

# ...and commensurability
val_result_COMM <- cdist(x = tr_penguins,
                        validate_x = ts_penguins,
                        method = "tot_var_dist",
                        commensurable = TRUE)

# Supervised distance with response variable
sup_result <- cdist(x = tr_penguins, 
                   response = response_tr,
                   method = "supervised")

# Supervised with validation data
sup_val_result <- cdist(x = tr_penguins,
                       validate_x = ts_penguins,
                       response = response_tr,
                       method = "supervised")

# Commensurable distances with custom weights
comm_result <- cdist(tr_penguins,
                    commensurable = TRUE,
                    weights = c(2, 1, 1))

# Different methods per variable
multi_method <- cdist(tr_penguins,
                     method = c("matching", "goodall_3", "tot_var_dist"))
