This function computes Gower's distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.

```
gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE, var.weights = NULL,
robcb=NULL)
```

A `matrix` object with distances between rows of `data.x` and those of `data.y`.

- data.x
A matrix or a data frame containing the variables to be used in computing the distance.

Columns of mode `numeric` are treated as interval-scaled variables; columns of mode `character` or class `factor` as categorical nominal variables; columns of class `ordered` as categorical ordinal variables; and columns of mode `logical` as asymmetric binary variables (see Details for further information). Missing values (`NA`) are allowed.

If only `data.x` is supplied, the dissimilarities between rows of `data.x` are computed.

- data.y
A numeric matrix or data frame with the same variables, of the same type, as those in `data.x`. Dissimilarities between rows of `data.x` and rows of `data.y` are computed. If not provided, it defaults to `data.x` and only dissimilarities between rows of `data.x` are computed.

- rngs
A vector with the ranges used to scale the variables. Its length must equal the number of variables in `data.x`. For non-numeric variables, put 1 or `NA` in the corresponding position. When `rngs=NULL` (default), the range of a numeric variable is estimated by jointly considering its values in `data.x` and in `data.y`. Therefore, with `rngs=NULL`, the range for a variable `"X1"` is: `rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) - min(data.x[,"X1"], data.y[,"X1"])`.

- KR.corr
When `TRUE` (default), the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when `KR.corr=FALSE`, Gower's (1971) formula is used.

- var.weights
By default (`NULL`) each variable has the same weight (value 1) when calculating the overall distance (a weighted average of the distances on the single variables; see Details). The user can specify different weights by providing a numeric value for each of the variables contributing to the distance. In other words, `var.weights` should be a numeric vector whose length equals the number of variables considered in calculating the distance. The supplied weights are rescaled to sum to 1.

- robcb
`NULL` by default. If `robcb="boxp"`, the Manhattan distance is scaled by the difference between the upper and lower fences of the boxplot with k=3. Alternatively, if `robcb="asyboxp"`, the Manhattan distance is scaled by the difference between the upper and lower fences of the modified boxplot that accounts for slight skewness. In this case scaled distances greater than 1 are set equal to 1. This option is suggested in the presence of outliers in the continuous variables.
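The default handling of `rngs` and `var.weights` described above can be sketched in base R. This is only an illustration on hypothetical toy data: the object names (`rng.X1`, `w.scaled`) are made up here, and the rescaling of the weights is assumed to be a plain division by their total, which the text implies but does not spell out.

```r
# Hypothetical toy data: X1 is a numeric variable present in both datasets
data.x <- data.frame(X1 = c(1.0, 4.0, 2.5))
data.y <- data.frame(X1 = c(0.5, 6.0))

# Range estimated jointly on data.x and data.y (the rngs=NULL behaviour)
rng.X1 <- max(data.x[, "X1"], data.y[, "X1"]) -
  min(data.x[, "X1"], data.y[, "X1"])
rng.X1  # 6.0 - 0.5 = 5.5

# User weights rescaled so that they sum to 1
# (assumption: simple division by the total)
w <- c(1, 2, 5, 2)
w.scaled <- w / sum(w)
w.scaled  # 0.1 0.2 0.5 0.2
```

Because the range is estimated on the pooled values, passing an explicit `rngs` is the way to keep the scaling fixed when `data.y` changes between calls.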

Marcello D'Orazio mdo.statmatch@gmail.com

This function computes distances between records when variables of different types (categorical and continuous) have been observed. To handle the different types of variables, Gower's dissimilarity coefficient (Gower, 1971) is used. By default (`KR.corr=TRUE`) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used.

The final dissimilarity between the *i*th and *j*th unit is obtained as a weighted average of the dissimilarities for each variable:
$$d(i,j) = \frac{\sum_k{\delta_{ijk} d_{ijk} w_k}}{\sum_k{\delta_{ijk} w_k}}$$

In particular, \(d_{ijk}\) represents the distance between the *i*th and *j*th unit computed on the *k*th variable, while \(w_k\) is the weight assigned to variable *k* (by default 1 for all variables, unless different weights are provided by the user with the argument `var.weights`). The distance depends on the nature of the variable:

- `logical` columns are considered as asymmetric binary variables; in this case \(d_{ijk}=0\) if \(x_{ik} = x_{jk} = \text{TRUE}\), 1 otherwise;

- `factor` or `character` columns are considered as categorical nominal variables and \(d_{ijk}=0\) if \(x_{ik}=x_{jk}\), 1 otherwise;

- `numeric` columns are considered as interval-scaled variables and $$d_{ijk}=\frac{\left|x_{ik}-x_{jk}\right|}{R_k}$$ where \(R_k\) is the range of the *k*th variable. The range is the one supplied with the argument `rngs` (`rngs[k]`) or the one computed on the available data (when `rngs=NULL`);

- `ordered` columns are considered as categorical ordinal variables and the values are replaced with the corresponding position index, \(r_{ik}\), in the factor levels. When `KR.corr=FALSE` these position indexes (which are different from the output of the R function `rank`) are transformed in the following manner: $$z_{ik}=\frac{r_{ik}-1}{\max\left(r_{ik}\right) - 1}$$ These new values, \(z_{ik}\), are treated as observations of an interval-scaled variable.
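The per-variable rules above can be written as small base R helpers. This is a minimal sketch, not the package's internal code: the helper names (`d.logical`, `d.nominal`, `d.numeric`, `z.ordinal`) are hypothetical, and \(\max(r_{ik})\) is assumed to be the largest position index actually observed.

```r
# Asymmetric binary (logical): distance 0 only when both values are TRUE
d.logical <- function(a, b) ifelse(a & b, 0, 1)

# Nominal (factor or character): distance 0 when the categories coincide
d.nominal <- function(a, b) ifelse(a == b, 0, 1)

# Interval scaled (numeric): absolute difference divided by the range R
d.numeric <- function(a, b, R) abs(a - b) / R

# Ordinal: position index r in the factor levels, rescaled to [0, 1]
# (assumption: max(r) is the largest index observed in the data)
z.ordinal <- function(f) {
  r <- as.integer(f)
  (r - 1) / (max(r, na.rm = TRUE) - 1)
}

d.logical(TRUE, TRUE)     # 0
d.logical(TRUE, FALSE)    # 1
d.nominal("a", "b")       # 1
d.numeric(2, 5, R = 6)    # 0.5

f <- ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))
z.ordinal(f)              # 0.0 1.0 0.5
```

Note that `as.integer()` on an ordered factor returns the position of each value in the levels, which is not the same as `rank()` of the values.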

As far as the weight \(\delta_{ijk}\) is concerned:

- \(\delta_{ijk}=0\) if \(x_{ik} = \text{NA}\) or \(x_{jk} = \text{NA}\);

- \(\delta_{ijk}=0\) if the variable is asymmetric binary and \(x_{ik}=x_{jk}=0\) or \(x_{ik} = x_{jk} = \text{FALSE}\);

- \(\delta_{ijk}=1\) in all the other cases.

In practice, `NA`s and pairs of cases with \(x_{ik}=x_{jk}=\text{FALSE}\) do not contribute to the distance computation.
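To make the role of \(\delta_{ijk}\) concrete, here is a small worked sketch of the weighted average for a single pair of units. The per-variable distances and weights are hypothetical values chosen for illustration; only the \(\delta\) rule for missing values is shown.

```r
# Hypothetical per-variable distances d_ijk for one pair (i, j);
# NA means the comparison on x3 is unavailable (a missing value)
d <- c(x1 = 1, x2 = 0, x3 = NA, x4 = 0.25)

# delta_ijk = 0 where the comparison cannot be used, 1 elsewhere
delta <- as.numeric(!is.na(d))

# Variable weights w_k (hypothetical)
w <- c(1, 2, 5, 2)

# Weighted average over the usable variables only;
# na.rm = TRUE is needed because 0 * NA is NA in R
d.ij <- sum(delta * d * w, na.rm = TRUE) / sum(delta * w)
d.ij  # (1*1 + 0*2 + 0.25*2) / (1 + 2 + 2) = 0.3
```

The excluded variable drops out of both the numerator and the denominator, so its weight of 5 has no effect on this pair.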

Gower, J. C. (1971), “A general coefficient of similarity and some of its properties”. *Biometrics*, **27**, 623–637.

Kaufman, L. and Rousseeuw, P.J. (1990), *Finding Groups in Data: An Introduction to Cluster Analysis.* Wiley, New York.

`daisy`, `dist`

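```
library(StatMatch)  # provides gower.dist()

# toy data with one variable of each supported type
x1 <- as.logical(rbinom(10, 1, 0.5))                  # asymmetric binary
x2 <- sample(letters, 10, replace = TRUE)             # categorical nominal
x3 <- rnorm(10)                                       # interval scaled
x4 <- ordered(cut(x3, -4:4, include.lowest = TRUE))   # categorical ordinal
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# matrix of distances between observations in xx
dx <- gower.dist(xx)
head(dx)

# matrix of distances between the first six obs. in xx
# and the remaining four, with user-supplied variable weights
gower.dist(data.x = xx[1:6, ], data.y = xx[7:10, ], var.weights = c(1, 2, 5, 2))
```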
