distmix: Distances for mixed variables data set

Description

This function computes a distance matrix for a mixed variable data set applying various methods.

Usage

distmix(data, method = "gower", idnum = NULL, idbin = NULL, idcat = NULL)

Value

Function returns a distance matrix (n x n).

Arguments

data: A data frame or matrix object.
method: A method to calculate the mixed variables distance (see Details).
idnum: A vector of column indices of the numerical variables.
idbin: A vector of column indices of the binary variables.
idcat: A vector of column indices of the categorical variables.

Author

Weksi Budiaji
Contact: budiaji@untirta.ac.id

Details

Six methods are available to calculate the mixed variable distance: gower, wishart, podani, huang, harikumar, and ahmad.

gower

The Gower (1971) distance is the most common distance for a mixed variable data set. Although the Gower distance accommodates missing values, this function does not allow them. If the data contain missing values, the Gower distance from the daisy function in the cluster package can be applied instead. The Gower distance between objects i and j is computed by $d_{ij} = 1 - s_{ij}$, where $$s_{ij} = \frac{\sum_{l=1}^p \omega_{ijl} s_{ijl}} {\sum_{l=1}^p \omega_{ijl}}$$ and $\omega_{ijl}$ is the weight of variable l, usually 1, or 0 for a missing value. If variable l is numerical, $$s_{ijl} = 1- \frac{|x_{il} - x_{jl}|}{R_l}$$ where $R_l$ is the range of variable l. If variable l is binary or categorical, $s_{ijl} \in \{0, 1\}$.
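
The formula above can be sketched in base R for the case without missing values, where every weight $\omega_{ijl}$ equals 1. This is only an illustration, not the code used by distmix: the helper name gower_sketch and the toy data are hypothetical, binary and categorical columns are handled together, and $s_{ijl}$ is assumed to be 1 for a match and 0 for a mismatch.

```r
# Hypothetical helper (not part of kmed): Gower distance with all weights = 1.
# ASSUMPTION: s_ijl = 1 when two objects match on a binary/categorical
# variable and 0 otherwise; numerical columns must not be constant (R_l > 0).
gower_sketch <- function(data, idnum, idcat) {
  data <- as.data.frame(data)
  n <- nrow(data)
  p <- length(idnum) + length(idcat)
  total <- matrix(0, n, n)  # accumulates 1 - s_ijl over all variables
  for (l in idnum) {
    x <- as.numeric(data[[l]])
    R <- diff(range(x))     # range R_l of variable l
    total <- total + abs(outer(x, x, "-")) / R
  }
  for (l in idcat) {
    x <- as.character(data[[l]])
    total <- total + outer(x, x, "!=")  # 1 for a mismatch, 0 for a match
  }
  total / p                 # d_ij = 1 - s_ij
}

# Toy data: one numerical and one categorical variable
toy <- data.frame(height = c(1, 3, 5), colour = c("a", "a", "b"))
d_gower <- gower_sketch(toy, idnum = 1, idcat = 2)
d_gower[1, 2]  # (2/4 + 0) / 2 = 0.25
d_gower[1, 3]  # (4/4 + 1) / 2 = 1
```

Because every weight is 1, the denominator of $s_{ij}$ reduces to the number of variables p, so the distance is simply the average of the per-variable dissimilarities.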

wishart

Wishart (2003) proposed a measure that differs from Gower (1971) in the numerical variable part. Instead of the range, it uses the variance of the numerical variable, such that the distance becomes $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where $\delta_{ijl} = s_l$, the standard deviation of variable l, when l is a numerical variable and $\delta_{ijl} \in \{0, 1\}$ when l is a binary or categorical variable.

podani

Podani (1999) suggested a different method to compute a distance for a mixed variable data set. The Podani distance is calculated by $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where $\delta_{ijl} = R_l$, the range of variable l, when l is a numerical variable and $\delta_{ijl} \in \{0, 1\}$ when l is a binary or categorical variable.
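
The Wishart and Podani formulas share the same form and differ only in how a numerical variable is scaled, which the following base R sketch tries to make concrete. It is a reading of the formulas as written here, not the implementation inside distmix. The helper name scaled_mixed_dist, its scaling argument, and the toy data are hypothetical, all weights are taken as 1, and a binary or categorical variable is assumed to contribute 1 for a mismatch and 0 for a match.

```r
# Hypothetical helper (not part of kmed): root of summed, scaled squared
# differences. scaling = "range" follows the Podani form, "sd" the Wishart form.
# ASSUMPTION: binary/categorical variables add 1 for a mismatch, 0 for a match.
scaled_mixed_dist <- function(data, idnum, idcat, scaling = c("range", "sd")) {
  scaling <- match.arg(scaling)
  data <- as.data.frame(data)
  n <- nrow(data)
  total <- matrix(0, n, n)
  for (l in idnum) {
    x <- as.numeric(data[[l]])
    delta <- if (scaling == "range") diff(range(x)) else sd(x)
    total <- total + (outer(x, x, "-") / delta)^2
  }
  for (l in idcat) {
    x <- as.character(data[[l]])
    total <- total + outer(x, x, "!=")
  }
  sqrt(total)
}

# Toy data: one numerical and one categorical variable
toy <- data.frame(height = c(1, 3, 5), colour = c("a", "a", "b"))
d_podani  <- scaled_mixed_dist(toy, idnum = 1, idcat = 2, scaling = "range")
d_wishart <- scaled_mixed_dist(toy, idnum = 1, idcat = 2, scaling = "sd")
d_podani[1, 2]   # sqrt((2/4)^2 + 0) = 0.5
d_wishart[1, 2]  # sqrt((2/2)^2 + 0) = 1
```

Here the range of height is 4 and its standard deviation is 2, so the same pair of objects is twice as far apart under the standard deviation scaling.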

huang

The Huang (1997) distance between objects i and j is computed by $$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \gamma \sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$ where $P_n$ and $P_c$ are the number of numerical and categorical variables, respectively, $$\gamma = \frac{\sum_{r=1}^{P_n} s_{r}^2}{P_n} $$ and $\delta_c(x_{is}, x_{js})$ is the mismatch (simple matching) distance (see matching) between object i and object j in the variable s.
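
As a rough illustration of how the two parts of the Huang formula combine, the base R sketch below follows the equation as written in this section: squared differences for the numerical variables plus $\gamma$ times the number of mismatches, with $\gamma$ taken as the mean variance of the numerical variables. The helper name huang_sketch and the toy data are hypothetical, and the internal computation in distmix may differ in detail.

```r
# Hypothetical helper (not part of kmed): Huang-style distance following the
# formula in this section. ASSUMPTION: delta_c is 1 for a mismatch and 0 for
# a match, and gamma is the mean of the numerical variables' variances.
huang_sketch <- function(data, idnum, idcat) {
  data <- as.data.frame(data)
  n <- nrow(data)
  num_part <- matrix(0, n, n)
  for (r in idnum) {
    x <- as.numeric(data[[r]])
    num_part <- num_part + outer(x, x, "-")^2   # squared differences
  }
  cat_part <- matrix(0, n, n)
  for (s in idcat) {
    x <- as.character(data[[s]])
    cat_part <- cat_part + outer(x, x, "!=")    # count of mismatches
  }
  gamma <- mean(vapply(data[idnum], var, numeric(1)))
  num_part + gamma * cat_part
}

# Toy data: one numerical and one categorical variable
toy <- data.frame(height = c(1, 3, 5), colour = c("a", "a", "b"))
d_huang <- huang_sketch(toy, idnum = 1, idcat = 2)
d_huang[1, 2]  # 2^2 + 4 * 0 = 4
d_huang[1, 3]  # 4^2 + 4 * 1 = 20
```

The variance of height is 4, so $\gamma = 4$ here. The weight $\gamma$ puts the categorical mismatches on a scale comparable to the unstandardized numerical part.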

harikumar

Harikumar and PV (2015) proposed a distance for a mixed variable data set: $$ d_{ij} = \sum_{r=1}^{P_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js}) + \sum_{t=1}^{P_b} \delta_b (x_{it}, x_{jt})$$ where $P_b$ is the number of binary variables, $\delta_c (x_{is}, x_{js})$ is the co-occurrence distance (see cooccur), and $\delta_b (x_{it}, x_{jt})$ is the Hamming distance.

ahmad

Ahmad and Dey (2007) computed a distance for a mixed variable data set via $$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$ where $\delta_c (x_{is}, x_{js})$ is the co-occurrence distance (see cooccur). In the Ahmad and Dey distance, the binary and categorical variables are not treated separately: the co-occurrence distance is computed on the two classes combined. Note that this function applies the standard version of the squared Euclidean distance, i.e., without any weight.

At least two of the arguments idnum, idbin, and idcat have to be provided because this function calculates a mixed distance. If the method is harikumar, there have to be at least two categorical variables so that the co-occurrence distance can be computed. A similar requirement applies when method = "ahmad": idbin and idcat together have to cover more than one column. Otherwise, the function returns an error message.

References

Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.

Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871.

Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.

Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.

Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.

Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 216-226.

Examples

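library(kmed)  # provides distmix()

set.seed(1)
# Three binary columns (values 1/2) and three categorical columns (values 1/2/3)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
# Combine with three numerical columns from the first seven rows of iris
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste(c("num"), 1:3, sep = ""),
                       paste(c("bin"), 1:3, sep = ""),
                       paste(c("cat"), 1:3, sep = ""))
# Gower distance matrix (7 x 7) for the mixed data
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)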