This function computes a distance matrix for a mixed variable data set applying various methods.
distmix(data, method = "gower", idnum = NULL, idbin = NULL, idcat = NULL)
Function returns a distance matrix (n x n).
data
A data frame or matrix object.
method
A method to calculate the mixed variables distance (see Details).
idnum
A vector of column indices of the numerical variables.
idbin
A vector of column indices of the binary variables.
idcat
A vector of column indices of the categorical variables.
Weksi Budiaji
Contact: budiaji@untirta.ac.id
There are six methods available to calculate the mixed variable
distance: gower, wishart, podani, huang, harikumar, and ahmad.
gower
The Gower (1971) distance is the most common distance for a mixed variable
data set. Although the Gower distance accommodates missing values, this
function does not allow them. If the data contain missing values, the Gower
distance from the daisy function in the cluster package can be applied
instead. The Gower distance between objects i and j is computed by
\(d_{ij} = 1 - s_{ij}\), where
$$s_{ij} = \frac{\sum_{l=1}^p \omega_{ijl} s_{ijl}}
{\sum_{l=1}^p \omega_{ijl}}$$
Here \(\omega_{ijl}\) is the weight of variable l, usually 1, or 0
for a missing value. If the variable l is a numerical variable,
$$s_{ijl} = 1- \frac{|x_{il} - x_{jl}|}{R_l}$$
where \(R_l\) is the range of variable l. If the variable l is a binary/
categorical variable, \(s_{ijl} \in\) {0, 1}: 1 when objects i and j
match in variable l and 0 otherwise.
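The formula above can be traced by hand on a tiny data set. The following is a minimal base R sketch, not the code used by distmix: gower_sketch and toy are hypothetical names, all weights are assumed to be 1 (no missing values), binary and categorical columns are both compared by simple matching, and every numerical column is assumed to have a non-zero range.

```r
# Hypothetical helper (not part of any package): Gower distance for complete data.
gower_sketch <- function(data, idnum, idcat) {
  num <- as.matrix(data[, idnum, drop = FALSE])
  ctg <- as.matrix(data[, idcat, drop = FALSE])
  p <- ncol(num) + ncol(ctg)                             # all weights equal 1
  ranges <- apply(num, 2, function(v) diff(range(v)))    # R_l for each numerical variable
  n <- nrow(data)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      s_num <- 1 - abs(num[i, ] - num[j, ]) / ranges     # numerical similarities
      s_cat <- as.numeric(ctg[i, ] == ctg[j, ])          # 1 for a match, 0 otherwise
      d[i, j] <- 1 - (sum(s_num) + sum(s_cat)) / p       # d_ij = 1 - s_ij
    }
  }
  d
}

# Toy data: two numerical variables and one categorical variable
toy <- data.frame(num1 = c(1, 2, 5), num2 = c(10, 30, 20), cat1 = c("a", "b", "a"))
d_gower <- gower_sketch(toy, idnum = 1:2, idcat = 3)
d_gower
```

For objects 1 and 3, the similarities are 0, 0.5, and 1, so the distance is 1 - 1.5/3 = 0.5.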
wishart
Wishart (2003) proposed a measure that differs from Gower (1971) in the numerical variable part. Instead of the range, it uses the variance of the numerical variable, such that the distance becomes $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where \(\delta_{ijl} = s_l\) when l is a numerical variable (so each squared difference is divided by the variance \(s_l^2\)) and \(\delta_{ijl} \in\) {0, 1} when l is a binary/ categorical variable.
podani
Podani (1999) suggested a different method to compute a distance for a mixed variable data set. The Podani distance is calculated by $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where \(\delta_{ijl} = R_l\), the range of variable l, when l is a numerical variable and \(\delta_{ijl} \in\) {0, 1} when l is a binary/ categorical variable.
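The wishart and podani formulas share the same form and differ only in how a numerical variable is scaled. The base R sketch below illustrates that one difference under stated assumptions: scaled_mixed_sketch and toy are hypothetical names, all weights are 1, and a binary/categorical variable is assumed to add 1 to the sum for a mismatch and 0 for a match. It is one reading of the formula, so its values may differ from distmix, for example by a normalizing constant.

```r
# Hypothetical helper: scaled Euclidean-type distance for mixed data.
# scaling = "sd" follows the wishart scaling, scaling = "range" the podani scaling.
scaled_mixed_sketch <- function(data, idnum, idcat, scaling = c("sd", "range")) {
  scaling <- match.arg(scaling)
  num <- as.matrix(data[, idnum, drop = FALSE])
  ctg <- as.matrix(data[, idcat, drop = FALSE])
  delta <- if (scaling == "sd") {
    apply(num, 2, sd)                                  # s_l
  } else {
    apply(num, 2, function(v) diff(range(v)))          # R_l
  }
  n <- nrow(data)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      num_part <- sum(((num[i, ] - num[j, ]) / delta)^2)
      cat_part <- sum(ctg[i, ] != ctg[j, ])            # assumption: a mismatch adds 1
      d[i, j] <- sqrt(num_part + cat_part)
    }
  }
  d
}

toy <- data.frame(num1 = c(1, 2, 5), num2 = c(10, 30, 20), cat1 = c("a", "b", "a"))
d_wishart <- scaled_mixed_sketch(toy, idnum = 1:2, idcat = 3, scaling = "sd")
d_podani  <- scaled_mixed_sketch(toy, idnum = 1:2, idcat = 3, scaling = "range")
```

Because a range is never smaller than the corresponding standard deviation, the numerical part under the range scaling is never larger than under the standard deviation scaling.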
huang
The Huang (1997) distance between objects i and j is computed
by
$$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \gamma
\sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$
where \(P_n\) and \(P_c\) are the number of numerical and categorical
variables, respectively,
$$\gamma = \frac{\sum_{r=1}^{P_n} s_{r}^2}{P_n} $$
and \(\delta_c(x_{is}, x_{js})\) is the mismatch/ simple matching distance
(see matching) between object i and object j in the variable s.
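As an illustration of how the two parts of the Huang distance combine, here is a minimal base R sketch. It is not the implementation in distmix: huang_sketch and toy are hypothetical names, \(s_r^2\) is assumed to be the sample variance of numerical variable r, and the columns in idcat are compared by simple matching (1 for a mismatch, 0 for a match).

```r
# Hypothetical helper: Huang-type distance on complete data.
huang_sketch <- function(data, idnum, idcat) {
  num <- as.matrix(data[, idnum, drop = FALSE])
  ctg <- as.matrix(data[, idcat, drop = FALSE])
  gamma <- mean(apply(num, 2, var))     # assumption: s_r^2 is the sample variance
  n <- nrow(data)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      sq_euclid <- sum((num[i, ] - num[j, ])^2)    # numerical part
      mismatch  <- sum(ctg[i, ] != ctg[j, ])       # simple matching part
      d[i, j] <- sq_euclid + gamma * mismatch
    }
  }
  d
}

toy <- data.frame(num1 = c(1, 2, 5), num2 = c(10, 30, 20), cat1 = c("a", "b", "a"))
d_huang <- huang_sketch(toy, idnum = 1:2, idcat = 3)
d_huang
```

Objects 1 and 3 share the same category, so their distance is the squared Euclidean part alone, 16 + 100 = 116. Objects 1 and 2 differ in the category, so \(\gamma\) is added once to their squared Euclidean part.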
harikumar
Harikumar-PV (2015) proposed a distance for a mixed variable data set:
$$ d_{ij} = \sum_{r=1}^{P_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{P_c}
\delta_c (x_{is}, x_{js}) + \sum_{t=1}^{P_b} \delta_b (x_{it}, x_{jt})$$
where \(P_b\) is the number of binary variables,
\(\delta_c (x_{is}, x_{js})\) is the co-occurrence distance (see
cooccur), and \(\delta_b (x_{it}, x_{jt})\) is the Hamming distance.
ahmad
Ahmad and Dey (2007) compute the distance of a mixed variable data set via
$$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 +
\sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$
where \(\delta_c (x_{is}, x_{js})\) is the co-occurrence distance
(see cooccur). In the Ahmad and Dey distance, the binary and categorical
variables are not separable, so the co-occurrence distance is based on
these two classes combined. Note that this function applies the
standard version of the squared Euclidean distance, i.e. without any weight.
At least two of the idnum, idbin, and idcat arguments have to be provided
because this function calculates the mixed distance. If the method is
harikumar, there have to be at least two categorical variables so that
the co-occurrence distance can be computed. The same applies when
method = "ahmad": idbin + idcat has to cover more than 1 column.
Otherwise, the function returns an error message.
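The argument rules above can be written down as a small check. The sketch below only restates those rules in base R; check_ids_sketch is a hypothetical helper, and its error messages are illustrative, not the messages produced by distmix.

```r
# Hypothetical helper: restates the documented argument rules.
check_ids_sketch <- function(method, idnum = NULL, idbin = NULL, idcat = NULL) {
  supplied <- sum(!is.null(idnum), !is.null(idbin), !is.null(idcat))
  if (supplied < 2) {
    stop("at least two of idnum, idbin, and idcat have to be provided")
  }
  if (method == "harikumar" && length(idcat) < 2) {
    stop("harikumar needs at least two categorical variables")
  }
  if (method == "ahmad" && length(idbin) + length(idcat) < 2) {
    stop("ahmad needs more than 1 binary/categorical column in total")
  }
  invisible(TRUE)
}

# Passes: numerical and binary columns are both given
ok <- check_ids_sketch("gower", idnum = 1:3, idbin = 4:6)

# Fails: only one categorical column for harikumar
msg <- tryCatch(check_ids_sketch("harikumar", idnum = 1:3, idcat = 7),
                error = function(e) conditionMessage(e))
```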
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871.
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.
Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.
Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 216-226.
library(kmed)

# Build a small mixed data set: 3 numerical, 3 binary, and 3 categorical columns
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)   # binary variables
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)  # categorical variables
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste(c("num"), 1:3, sep = ""),
                       paste(c("bin"), 1:3, sep = ""),
                       paste(c("cat"), 1:3, sep = ""))

# Gower distance matrix (7 x 7)
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)