dist_categorical: Compute pairwise distances for categorical data

Description

Internal helper function to compute distances between observations based on the matching coefficient, which measures the proportion of matching attributes between two categorical vectors. This approach is particularly useful for multiclass categorical variables.

Usage

dist_categorical(x, method = "matching_coefficient")

Value

A symmetric numeric matrix of pairwise distances. Each distance lies in the range [0, 1], where 0 indicates complete agreement and 1 indicates complete disagreement. NA is returned for pairs with no valid comparisons, that is, pairs in which no attribute is non-missing in both observations.

Arguments

x: A data frame or matrix containing only categorical variables (factor or character).
method: Currently only "matching_coefficient" is supported.

Details

The distance between two observations $i$ and $j$ is defined as: $$d(i, j) = 1 - \frac{\alpha}{p'}$$ where $\alpha$ is the number of matching attributes (agreements) and $p'$ is the number of non-missing comparisons between the two observations.

Only categorical columns (factor or character) are supported; numeric columns must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing for a given pair, the distance is returned as NA.
This distance is equivalent to the normalized Hamming distance when applied to binary variables.
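The distance derived from the matching coefficient satisfies the metric properties when no values are missing (pairwise deletion of NA can break the triangle inequality), and it can be used as a building block for mixed-type distances (e.g., combined with quantitative distances via Gower's similarity).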
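
The definition above can be sketched in base R as follows. This is only an illustration of the formula and of the pairwise handling of NA, not the dbrobust implementation; the function name matching_dist_sketch and the example data are hypothetical.

```r
# Hypothetical base-R sketch of the matching-coefficient distance.
# Illustrative only; this is NOT the code used by dbrobust.
matching_dist_sketch <- function(x) {
  # Compare values as character strings so factor level sets do not matter
  m <- as.matrix(data.frame(lapply(as.data.frame(x), as.character),
                            stringsAsFactors = FALSE))
  n <- nrow(m)
  d <- matrix(NA_real_, n, n)  # pairs with no valid comparison stay NA
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(m[i, ]) & !is.na(m[j, ])  # attributes observed in both rows
      p_valid <- sum(ok)                     # p': non-missing comparisons
      if (p_valid > 0) {
        alpha <- sum(m[i, ok] == m[j, ok])   # alpha: number of agreements
        d[i, j] <- 1 - alpha / p_valid
      }
    }
  }
  d
}

# Made-up data: row 3 has A missing, row 4 has only A observed
toy <- data.frame(
  A = c("red", "blue", NA, "red"),
  B = c("circle", "circle", "square", NA)
)
d_toy <- matching_dist_sketch(toy)
```

In this toy data, rows 1 and 2 agree on B but not on A, so their distance is 1 - 1/2 = 0.5. Rows 1 and 3 can only be compared on B, where they disagree, giving 1. Rows 3 and 4 share no attribute observed in both, so that entry is NA.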

Examples

# Small categorical dataset
df <- data.frame(
  A = factor(c("red", "blue", "red")),
  B = factor(c("circle", "circle", "square"))
)
# Compute matching coefficient
dbrobust::dist_categorical(df)
