gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE)
Columns of mode numeric will be considered as interval scaled variables; columns of mode character or class factor will be considered as categorical nominal variables; columns of class ordered will be considered as categorical ordinal variables; and columns of mode logical will be considered as binary asymmetric variables (see Details for further information). Missing values (NA) are allowed.
If only data.x is supplied, the dissimilarities between rows of data.x will be computed.
Dissimilarities between rows of data.x and rows of data.y will be computed. If data.y is not provided, it is assumed equal to data.x and only dissimilarities between rows of data.x will be computed.
In rngs, just put 1 or NA in the positions corresponding to nonnumeric variables. When rngs=NULL (default), the range of a numeric variable is estimated by jointly considering the values of the variable in data.x and those in data.y. Therefore, assuming rngs=NULL, for a variable "X1":
rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) -
min(data.x[,"X1"], data.y[,"X1"]).
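The joint range estimate can be illustrated with a small base R sketch. The data frames `dx` and `dy` and their values are hypothetical, made up only for illustration; the computation simply mirrors the expression above and is not taken from the package source.

```r
# Hypothetical toy data; dx, dy and their values are illustrative only
dx <- data.frame(X1 = c(1.0, 4.0, 2.5))
dy <- data.frame(X1 = c(0.5, 3.0))

# Range of X1 estimated jointly from both data sets, as described for rngs=NULL
rng_X1 <- max(dx[, "X1"], dy[, "X1"]) - min(dx[, "X1"], dy[, "X1"])
rng_X1  # 4.0 - 0.5 = 3.5
```

Supplying a value through rngs instead would replace this estimate, for example when the theoretical range of the variable is known and wider than the observed one.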
When KR.corr=TRUE (default), the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when KR.corr=FALSE, Gower's (1971) formula is used.
A matrix object with distances between rows of data.x and those of data.y.
By default (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used. The final dissimilarity between the ith and jth unit is obtained as a weighted sum of dissimilarities for each variable: $$d(i,j) = \frac{\sum_k{\delta_{ijk} d_{ijk}}}{\sum_k{\delta_{ijk}}}$$
In particular, $d_{ijk}$ represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable:
logical columns are considered as asymmetric binary variables; in this case $d_{ijk} = 0$ if $x_{ik} = x_{jk} = \mathrm{TRUE}$, 1 otherwise;
factor or character columns are considered as categorical nominal variables and $d_{ijk} = 0$ if $x_{ik} = x_{jk}$, 1 otherwise;
numeric columns are considered as interval-scaled variables and
$$d_{ijk}=\frac{\left|x_{ik}-x_{jk}\right|}{R_k}$$
where $R_k$ is the range of the kth variable. The range is either the one supplied with the argument rngs (rngs[k]) or the one computed on the available data (when rngs=NULL);
ordered columns are considered as categorical ordinal variables and the values are substituted with the corresponding position index, $r_{ik}$, in the factor levels. When KR.corr=FALSE these position indexes (which are different from the output of the R function rank) are transformed in the following manner:
$$z_{ik}=\frac{r_{ik}-1}{\max\left(r_{ik}\right) - 1}$$
These new values, $z_{ik}$, are treated as observations of an interval scaled variable.
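The position-index transformation for ordered columns can be sketched in base R as follows. The factor `x` and its levels are hypothetical, and `max(r)` is taken over the observed indexes, which is an assumption about how the maximum in the formula is evaluated. The last line shows how the position indexes differ from what rank returns.

```r
# Hypothetical ordered factor with four levels (values are illustrative only)
x <- factor(c("low", "high", "mid", "top", "low"),
            levels = c("low", "mid", "high", "top"), ordered = TRUE)

r <- as.integer(x)           # position index in the factor levels: 1 3 2 4 1
z <- (r - 1) / (max(r) - 1)  # rescaled to [0, 1]: 0, 2/3, 1/3, 1, 0

rank(r)                      # 1.5 4.0 3.0 5.0 1.5, not the same as r
```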
As far as the weight $\delta_{ijk}$ is concerned, in practice NAs and pairs of cases with $x_{ik} = x_{jk} = \mathrm{FALSE}$ do not contribute to the distance computation.
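To make the weighted formula concrete, here is a minimal base R sketch that evaluates $d(i,j)$ by hand for one pair of units. The units `xi` and `xj`, the variable names and the range `R_size` are all hypothetical, and the weight is assumed to be 1 whenever a pair of values contributes. It follows the formulas in this section and is not the code used inside gower.dist.

```r
# Hypothetical pair of units (i, j); variable names and values are made up
xi <- list(flag = TRUE,  colour = "red", size = 2.0)
xj <- list(flag = FALSE, colour = "red", size = 5.0)
R_size <- 10  # assumed range R_k of the numeric variable

# Per-variable distances d_ijk
d <- c(
  flag   = if (isTRUE(xi$flag) && isTRUE(xj$flag)) 0 else 1,  # asymmetric binary
  colour = if (identical(xi$colour, xj$colour)) 0 else 1,     # nominal
  size   = abs(xi$size - xj$size) / R_size                    # interval scaled
)

# Weights delta_ijk: 0 when a value is NA or when both logicals are FALSE,
# assumed to be 1 in every other case
w <- c(
  flag   = if (is.na(xi$flag) || is.na(xj$flag) || (!xi$flag && !xj$flag)) 0 else 1,
  colour = if (is.na(xi$colour) || is.na(xj$colour)) 0 else 1,
  size   = if (is.na(xi$size) || is.na(xj$size)) 0 else 1
)

# Weighted average of the per-variable distances
d_ij <- sum(w * d) / sum(w)
d_ij  # (1 + 0 + 0.3) / 3
```

If both `flag` values were FALSE, the first weight would drop to 0 and the dissimilarity would be averaged over the remaining two variables only.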
Gower, J. C. (1971), A general coefficient of similarity and some of its properties. Biometrics, 27, 623--637.
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
daisy,
dist
x1 <- as.logical(rbinom(10,1,0.5))
x2 <- sample(letters, 10, replace=TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
# matrix of distances among observations in xx
gower.dist(xx)
# matrix of distances between the first three obs. in xx
# and the remaining ones
gower.dist(data.x=xx[1:3,], data.y=xx[4:10,])