`gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE)`

data.x

A matrix or a data frame containing the variables to be used in the computation of the distance. Columns of mode `numeric` are treated as interval-scaled variables; columns of mode `character` or class `factor` are treated as categorical nominal variables.

data.y

A numeric matrix or data frame with the same variables, of the same type, as those in `data.x`. Dissimilarities between the rows of `data.x` and the rows of `data.y` are computed. If not provided, it is assumed equal to `data.x`.

rngs

A vector with the ranges used to scale the variables. Its length must equal the number of variables in `data.x`. For non-numeric variables, put 1 or `NA` in the corresponding position. When `rngs=NULL` (default) the range of a numeric variable is computed on the available data.

KR.corr

When `TRUE` (default), the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. When `KR.corr=FALSE`, Gower's (1971) formula is used.

The function returns a `matrix` object with the distances between the rows of `data.x` and those of `data.y`.

By default (`KR.corr=TRUE`) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used. The final dissimilarity between the *i*th and *j*th units is obtained as a weighted sum of the dissimilarities for each variable:
$$d(i,j) = \frac{\sum_k{\delta_{ijk} d_{ijk}}}{\sum_k{\delta_{ijk}}}$$

In particular, $d_{ijk}$ represents the distance between the *i*th and *j*th unit computed considering the *k*th variable. It depends on the nature of the variable:

- `logical` columns are considered asymmetric binary variables; in this case $d_{ijk}=0$ if $x_{ik} = x_{jk} = \text{TRUE}$, 1 otherwise;
- `factor` or `character` columns are considered categorical nominal variables, with $d_{ijk}=0$ if $x_{ik}=x_{jk}$, 1 otherwise;
- `numeric` columns are considered interval-scaled variables and $$d_{ijk}=\frac{\left|x_{ik}-x_{jk}\right|}{R_k}$$ where $R_k$ is the range of the *k*th variable. The range is the one supplied with the argument `rngs` (`rngs[k]`) or the one computed on the available data (when `rngs=NULL`);
- `ordered` columns are considered categorical ordinal variables, and the values are replaced with the corresponding position index $r_{ik}$ in the factor levels. When `KR.corr=FALSE` these position indexes (which differ from the output of the R function `rank`) are transformed as $$z_{ik}=\frac{r_{ik}-1}{\max\left(r_{ik}\right) - 1}$$ and the new values $z_{ik}$ are treated as observations of an interval-scaled variable.

As far as the weight $\delta_{ijk}$ is concerned:

- $\delta_{ijk}=0$ if $x_{ik} = \text{NA}$ or $x_{jk} = \text{NA}$;
- $\delta_{ijk}=0$ if the variable is asymmetric binary and $x_{ik}=x_{jk}=0$ or $x_{ik} = x_{jk} = \text{FALSE}$;
- $\delta_{ijk}=1$ in all other cases.

In practice, `NA`s and pairs of cases with $x_{ik}=x_{jk}=\text{FALSE}$ do not contribute to the distance computation.
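As a small worked illustration of the weighting, the following base R sketch combines per-variable distances for a single pair of units. The variable names and values are hypothetical, and the code mirrors the formula for $d(i,j)$ rather than the internals of `gower.dist`.

```r
# Hypothetical per-variable distances d_ijk for one pair of units (i, j).
# 'size' is NA because one of the two units has a missing value there.
d <- c(flag = 1, colour = 1, size = NA, grade = 0.5)

# TRUE where the variable is logical and both units are FALSE.
both_false <- c(flag = TRUE, colour = FALSE, size = FALSE, grade = FALSE)

# Weights delta_ijk: 0 for a missing value or a FALSE/FALSE pair, 1 otherwise.
delta <- as.numeric(!is.na(d) & !both_false)
names(delta) <- names(d)

# Weighted combination; terms with zero weight are dropped entirely.
d_ij <- sum(d[delta == 1]) / sum(delta)
d_ij   # (1 + 0.5) / 2 = 0.75
```

If every weight is zero for a pair, the denominator is zero and this sketch yields `NaN`; how `gower.dist` reports that case is not covered here.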

Kaufman, L. and Rousseeuw, P.J. (1990), *Finding Groups in Data: An Introduction to Cluster Analysis.* Wiley, New York.

`daisy`, `dist`

    x1 <- as.logical(rbinom(10, 1, 0.5))
    x2 <- sample(letters, 10, replace = TRUE)
    x3 <- rnorm(10)
    x4 <- ordered(cut(x3, -4:4, include.lowest = TRUE))
    xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

    # matrix of distances among all observations in xx
    gower.dist(xx)

    # matrix of distances between the first three observations in xx
    # and the remaining ones
    gower.dist(data.x = xx[1:3, ], data.y = xx[4:10, ])
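A further sketch, assuming the `StatMatch` package (which provides `gower.dist`) may or may not be installed: it builds a `rngs` vector with one entry per column, `NA` for the non-numeric ones, and passes it together with `KR.corr=FALSE`. The data frame `df` and its column names are made up, and the call is guarded so the snippet still runs without the package.

```r
# Hypothetical data; column names are illustrative only.
set.seed(1)
df <- data.frame(
  x1 = as.logical(rbinom(10, 1, 0.5)),
  x2 = sample(letters, 10, replace = TRUE),
  x3 = rnorm(10),
  stringsAsFactors = FALSE
)

# One entry per column: the observed range for numeric columns, NA elsewhere.
rngs <- sapply(df, function(v) {
  if (is.numeric(v)) diff(range(v, na.rm = TRUE)) else NA_real_
})

# Only call gower.dist when StatMatch is available.
if (requireNamespace("StatMatch", quietly = TRUE)) {
  d_custom <- StatMatch::gower.dist(df, rngs = rngs, KR.corr = FALSE)
  print(dim(d_custom))
}
```

Supplying `rngs` explicitly is mainly useful when the scaling should come from a reference population rather than from the rows at hand.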
