dist_categorical: Compute pairwise distances for categorical data
Description
Internal helper function to compute distances between observations based on
the matching coefficient, which measures the proportion of matching attributes
between two categorical vectors. This approach is particularly useful for
multiclass categorical variables.
Value
A symmetric numeric matrix of pairwise distances. Distance is in the
range [0, 1], where 0 indicates complete agreement and 1 indicates
complete disagreement. NA is returned for pairs with no valid comparisons
(all NA entries).
Arguments
x
A data frame or matrix containing only categorical variables (factor or character columns).
method
Currently only "matching_coefficient" is supported.
Details
The distance between two observations \(i\) and \(j\) is defined as:
$$d(i, j) = 1 - \frac{\alpha}{p'}$$
where \(\alpha\) is the number of matching attributes (agreements) and \(p'\)
is the number of non-missing comparisons between the two observations.
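As a rough illustration of the formula, the base R sketch below computes the distance for a single pair of observations. The helper name `matching_distance_pair` is hypothetical and is not part of the package; it only mirrors the definition above.

```r
# Hypothetical helper (not part of the package): distance between two
# categorical vectors, following d(i, j) = 1 - alpha / p'.
matching_distance_pair <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  valid <- !is.na(a) & !is.na(b)      # comparisons where both values are present
  p_prime <- sum(valid)               # p': number of non-missing comparisons
  if (p_prime == 0) return(NA_real_)  # nothing to compare
  alpha <- sum(a[valid] == b[valid])  # alpha: number of agreements
  1 - alpha / p_prime
}

obs_i <- c("red", "small", "round")
obs_j <- c("red", "large", "round")
matching_distance_pair(obs_i, obs_j)  # two of three attributes agree: 1 - 2/3
```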
Only categorical columns (factor or character) are supported; numeric columns
must be converted prior to using this function.
Missing values (NA) are ignored pairwise. If all attributes are missing
for a given pair, the distance is returned as NA.
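The pairwise handling of missing values can be sketched as follows. This is a minimal base R illustration under the assumptions stated in the comments, not the package's actual implementation; `matching_distance_matrix` and the example data are hypothetical.

```r
# Hypothetical sketch (not the package's implementation): pairwise matching
# distances with pairwise deletion of missing values.
matching_distance_matrix <- function(x) {
  m <- as.matrix(x)  # factor/character columns become a character matrix
  n <- nrow(m)
  d <- matrix(NA_real_, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(i)) {
      # keep only attributes observed in both rows
      valid <- !is.na(m[i, ]) & !is.na(m[j, ])
      if (any(valid)) {
        d[i, j] <- d[j, i] <- 1 - mean(m[i, valid] == m[j, valid])
      }
      # otherwise the entry stays NA: no valid comparisons for this pair
    }
  }
  d
}

df <- data.frame(
  colour = c("red", "red", NA),
  size   = c("small", "large", NA),
  shape  = c("round", NA, NA),
  stringsAsFactors = TRUE
)
d <- matching_distance_matrix(df)
d[1, 2]  # colour matches, size differs, shape is skipped: 1 - 1/2
d[1, 3]  # row 3 is entirely missing, so the distance is NA
```

In this sketch a row that is entirely missing also yields NA on the diagonal, since it has no valid comparison even with itself.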
This distance is equivalent to the normalized Hamming distance when
applied to binary variables.
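The resulting distance (one minus the matching coefficient) satisfies the
metric properties on complete data; with pairwise deletion of missing values
the triangle inequality is no longer guaranteed. It can be used as a building
block for mixed-type distances (e.g., combined with quantitative distances
via Gower's similarity).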
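To make the mixed-type idea concrete, the sketch below combines a 0/1 mismatch for categorical columns with a range-scaled absolute difference for numeric columns, in the spirit of Gower's coefficient. It assumes equal weights and no missing values; `gower_pair` and the example data are hypothetical and not part of the package.

```r
# Hypothetical sketch: Gower-style dissimilarity for one pair of rows in a
# mixed-type data frame (equal weights, no missing values assumed).
gower_pair <- function(df, i, j) {
  per_variable <- vapply(df, function(col) {
    if (is.numeric(col)) {
      rng <- diff(range(col))
      # range-scaled absolute difference; constant columns contribute 0
      if (rng == 0) 0 else abs(col[i] - col[j]) / rng
    } else {
      # categorical column: 0 if the values match, 1 otherwise
      as.numeric(as.character(col[i]) != as.character(col[j]))
    }
  }, numeric(1))
  mean(per_variable)
}

mixed <- data.frame(
  colour = c("red", "blue", "red"),
  height = c(10, 20, 30)
)
gower_pair(mixed, 1, 2)  # (1 + 10/20) / 2
gower_pair(mixed, 1, 3)  # (0 + 20/20) / 2
```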