This function computes a distance matrix for a mixed variable data set using one of several methods.
distmix(data, method = "gower", idnum = NULL, idbin = NULL,
idcat = NULL)
data: A data frame or matrix object.
method: A method to calculate the mixed variable distance (see Details).
idnum: A vector of column indices of the numerical variables.
idbin: A vector of column indices of the binary variables.
idcat: A vector of column indices of the categorical variables.
The function returns an n x n distance matrix.
There are six methods available to calculate the mixed variable
distance: gower, wishart, podani, huang, harikumar, and ahmad.
gower
The Gower (1971) distance is the most common distance for a mixed variable
data set. Although the Gower distance can accommodate missing values, missing
values are not allowed in this function. If the data contain missing values, the
Gower distance from the daisy function in the cluster package can be
applied instead. The Gower distance between objects i and j is
computed by
\(d_{ij} = 1 - s_{ij}\), where
$$s_{ij} = \frac{\sum_{l=1}^p \omega_{ijl} s_{ijl}}
{\sum_{l=1}^p \omega_{ijl}}$$
and \(\omega_{ijl}\) is the weight of variable l, which is usually 1, or 0
for a missing value. If variable l is a numerical variable,
$$s_{ijl} = 1- \frac{|x_{il} - x_{jl}|}{R_l}$$
where \(R_l\) is the range of variable l. If variable l is a binary/categorical
variable, \(s_{ijl} \in \{0, 1\}\): 1 when objects i and j match in variable l
and 0 otherwise.
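As a rough illustration of the formula above, the following base R sketch computes the Gower distance for complete data, so every weight \(\omega_{ijl}\) is 1. The function name gower_sketch and the toy data are hypothetical, and this is not the code that distmix runs internally.

```r
# Sketch of the Gower distance as written in the formula above.
# Assumes no missing values (all weights are 1) and no constant numerical column.
gower_sketch <- function(num, catg) {
  num <- as.matrix(num)
  catg <- as.matrix(catg)
  n <- nrow(num)
  p <- ncol(num) + ncol(catg)                           # total number of variables
  ranges <- apply(num, 2, function(x) diff(range(x)))   # R_l for each numerical variable
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      s_num <- 1 - abs(num[i, ] - num[j, ]) / ranges    # numerical similarities
      s_cat <- as.numeric(catg[i, ] == catg[j, ])       # 1 for a match, 0 otherwise
      d[i, j] <- 1 - (sum(s_num) + sum(s_cat)) / p
    }
  }
  d
}

# Toy data: two numerical variables and one categorical variable
num <- matrix(c(1, 3, 5, 10, 20, 40), nrow = 3)
catg <- matrix(c("a", "a", "b"), ncol = 1)
d <- gower_sketch(num, catg)
d
```

Objects 1 and 3 sit at opposite ends of both numerical ranges and differ in the categorical variable, so their distance reaches the maximum of 1.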
wishart
Wishart (2003) proposed a measure that differs from Gower (1971) in the numerical variable part. Instead of the range, it uses the variance of the numerical variable, such that the distance becomes $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where \(\delta_{ijl} = s_l\), the standard deviation of variable l, when l is a numerical variable and \(\delta_{ijl} \in \{0, 1\}\) when l is a binary/categorical variable.
podani
Podani (1999) suggested a different method to compute a distance for a mixed variable data set. The Podani distance is calculated by $$d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}} {\delta_{ijl}}\right)^2}$$ where \(\delta_{ijl} = R_l\), the range of variable l, when l is a numerical variable and \(\delta_{ijl} \in \{0, 1\}\) when l is a binary/categorical variable.
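The Wishart and Podani formulas differ only in how a numerical difference is scaled, which the following base R sketch tries to make concrete. It assumes complete data (all weights equal to 1) and, as an interpretation of the binary/categorical case, that each such variable adds 0 to the sum for a match and 1 for a mismatch. The function name scaled_mixed_dist and the toy data are hypothetical, and this is not the implementation used by distmix.

```r
# Sketch of the Wishart ("sd") and Podani ("range") formulas above.
# Assumption: a binary/categorical variable contributes 0 (match) or 1 (mismatch).
scaled_mixed_dist <- function(num, catg, scale = c("sd", "range")) {
  scale <- match.arg(scale)
  num <- as.matrix(num)
  catg <- as.matrix(catg)
  delta <- if (scale == "sd") {
    apply(num, 2, sd)                               # s_l (Wishart)
  } else {
    apply(num, 2, function(x) diff(range(x)))       # R_l (Podani)
  }
  n <- nrow(num)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      num_part <- sum(((num[i, ] - num[j, ]) / delta)^2)
      cat_part <- sum(catg[i, ] != catg[j, ])
      d[i, j] <- sqrt(num_part + cat_part)
    }
  }
  d
}

# Toy data: two numerical variables and one categorical variable
num <- matrix(c(1, 3, 5, 10, 20, 40), nrow = 3)
catg <- matrix(c("a", "a", "b"), ncol = 1)
d_wishart <- scaled_mixed_dist(num, catg, scale = "sd")
d_podani <- scaled_mixed_dist(num, catg, scale = "range")
```

Scaling by the range bounds each squared numerical term by 1, while scaling by the standard deviation does not, so the two variants can rank pairs of objects differently.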
huang
The Huang (1997) distance between objects i and j is computed
by
$$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \gamma
\sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$
where \(P_n\) and \(P_c\) are the number of numerical and categorical
variables, respectively,
$$\gamma = \frac{\sum_{r=1}^{P_n} s_{r}^2}{P_n} $$
and \(\delta_c(x_{is}, x_{js})\) is the mismatch/simple matching distance
(see matching) between object i and object j in the variable s.
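The Huang formula above can be sketched in base R as follows, with \(\gamma\) taken as the average variance of the numerical variables, as defined above. The function name huang_sketch and the toy data are hypothetical, and this is a reading of the formula rather than the code distmix uses.

```r
# Sketch of the Huang distance: squared Euclidean distance on the numerical
# variables plus gamma times the number of categorical mismatches.
huang_sketch <- function(num, catg) {
  num <- as.matrix(num)
  catg <- as.matrix(catg)
  gamma_w <- mean(apply(num, 2, var))   # average variance of the numerical variables
  n <- nrow(num)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      num_part <- sum((num[i, ] - num[j, ])^2)
      cat_part <- sum(catg[i, ] != catg[j, ])   # simple matching distance
      d[i, j] <- num_part + gamma_w * cat_part
    }
  }
  d
}

# Toy data: two numerical variables and one categorical variable
num <- matrix(c(1, 3, 5, 10, 20, 40), nrow = 3)
catg <- matrix(c("a", "a", "b"), ncol = 1)
d_huang <- huang_sketch(num, catg)
```

Because the numerical part is not rescaled, \(\gamma\) is what keeps a single categorical mismatch on a scale comparable to the numerical differences.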
harikumar
Harikumar and PV (2015) proposed a distance for a mixed variable data set:
$$ d_{ij} = \sum_{r=1}^{P_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{P_c}
\delta_c (x_{is}, x_{js}) + \sum_{t=1}^{P_b} \delta_b (x_{it}, x_{jt})$$
where \(P_b\) is the number of binary variables,
\(\delta_c (x_{is}, x_{js})\) is the co-occurrence distance (see
cooccur), and \(\delta_b (x_{it}, x_{jt})\) is the
Hamming distance.
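The three terms of the Harikumar and PV formula simply add up, as the following base R sketch shows. To keep it short, the co-occurrence part is assumed to be supplied as a precomputed n x n matrix (the argument cooc_dist), since computing it is a separate step. The function name harikumar_sketch, its arguments, and the toy data are hypothetical.

```r
# Sketch: Manhattan distance (numerical) + co-occurrence distance (categorical,
# passed in precomputed) + Hamming distance (binary).
harikumar_sketch <- function(num, bin, cooc_dist) {
  manhattan <- unname(as.matrix(dist(num, method = "manhattan")))
  bin <- as.matrix(bin)
  n <- nrow(bin)
  hamming <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # number of binary variables on which objects i and j differ
      hamming[i, j] <- sum(bin[i, ] != bin[j, ])
    }
  }
  manhattan + cooc_dist + hamming
}

# Toy data: two numerical and two binary variables; the co-occurrence part is
# set to zero here purely as a placeholder.
num <- matrix(c(1, 3, 5, 10, 20, 40), nrow = 3)
bin <- matrix(c(0, 1, 1, 1, 1, 0), nrow = 3)
cooc_placeholder <- matrix(0, 3, 3)
d_hk <- harikumar_sketch(num, bin, cooc_placeholder)
```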
ahmad
Ahmad and Dey (2007) compute the distance for a mixed variable data set via
$$ d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 +
\sum_{s=1}^{P_c} \delta_c (x_{is}, x_{js})$$
where \(\delta_c (x_{is}, x_{js})\) is the co-occurrence distance
(see cooccur). In the Ahmad and Dey distance,
the binary and categorical variables are not treated separately, such that
the co-occurrence distance is based on these two classes combined,
i.e. binary and categorical variables.
At least two of the arguments idnum, idbin, and idcat have to be provided
because this function calculates a mixed distance. If the method is harikumar,
there have to be at least two categorical variables such
that the co-occurrence distance can be computed. The same applies when
method = "ahmad": idbin and idcat together have to cover more than 1 column.
Otherwise, the function returns an error message.
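The rules in the paragraph above can be restated as a small pre-check that might be run before calling the function. This is only a sketch of the stated rules: the helper name check_mixed_ids and its error messages are hypothetical and are not taken from the package.

```r
# Sketch of the argument rules described above (hypothetical helper).
check_mixed_ids <- function(method, idnum = NULL, idbin = NULL, idcat = NULL) {
  provided <- sum(!is.null(idnum), !is.null(idbin), !is.null(idcat))
  if (provided < 2) {
    stop("At least two of idnum, idbin, and idcat have to be provided.")
  }
  if (method == "harikumar" && length(idcat) < 2) {
    stop("method = 'harikumar' needs at least two categorical variables.")
  }
  if (method == "ahmad" && length(idbin) + length(idcat) < 2) {
    stop("method = 'ahmad' needs more than one binary/categorical column.")
  }
  invisible(TRUE)
}

# Passes: numerical and binary variables are both supplied
ok <- check_mixed_ids("gower", idnum = 1:3, idbin = 4:6)

# Fails: only one categorical variable for the harikumar method
res <- try(check_mixed_ids("harikumar", idnum = 1:3, idcat = 7), silent = TRUE)
```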
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871.
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.
Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.
Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 216-226.
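library(kmed)  # distmix is provided by the kmed package
set.seed(1)
# Three binary columns (values 1 or 2) and three categorical columns (values 1 to 3)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
# Combine three numerical iris columns with the binary and categorical columns
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste0("num", 1:3), paste0("bin", 1:3), paste0("cat", 1:3))
# Gower distance, with each variable type identified by its column indices
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)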