# clara

##### Clustering Large Applications

Computes a `"clara"` object, a list representing a clustering of the data into `k` clusters.

- Keywords: cluster

##### Usage

```
clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
      sampsize = 40 + 2 * k)
```

##### Arguments

- `x`
- data matrix or data frame; each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (`NA`s) are allowed.
- `k`
- integer, the number of clusters. It is required that $0 < k < n$, where $n$ is the number of observations (i.e., $n$ = `nrow(x)`).
- `metric`
- character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are `"euclidean"` and `"manhattan"`. Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences.
- `stand`
- logical, indicating whether the measurements in `x` are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column) by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.
- `samples`
- integer, the number of samples to be drawn from the dataset.
- `sampsize`
- integer, the number of observations in each sample. `sampsize` should be higher than the number of clusters (`k`) and at most the number of observations ($n$ = `nrow(x)`).

##### Details

`clara` is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as `pam`, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (`sampsize`) such that the time and storage requirements become linear in $n$ rather than quadratic.

Each sub-dataset is partitioned into `k` clusters using the same algorithm as in `pam`. Once `k` representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.

The sum of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which this sum is minimal is retained. A further analysis is then carried out on the final partition.
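The quality criterion described above can be sketched in a few lines of base R. Note that `clara_cost` is a hypothetical helper written here for illustration, not part of the cluster package; it assumes the `"euclidean"` metric.

```r
## Hypothetical helper: sum of Euclidean dissimilarities from each
## observation to its nearest medoid -- the criterion that clara
## minimises over the sampled sub-datasets.
clara_cost <- function(x, medoids) {
  ## n x k matrix: distance from every row of x to every medoid row
  d <- apply(medoids, 1, function(m) sqrt(rowSums(sweep(x, 2, m)^2)))
  ## assign each observation to its closest medoid and sum the distances
  sum(apply(d, 1, min))
}

set.seed(1)
x <- rbind(matrix(rnorm(20), ncol = 2), matrix(rnorm(20, mean = 5), ncol = 2))
clara_cost(x, medoids = x[c(1, 11), , drop = FALSE])
```

Among the candidate sets of `k` medoids (one set per sample), the set with the smallest such cost over the *entire* dataset wins.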

Each sub-dataset is forced to contain the medoids obtained from the best sub-dataset so far. Randomly drawn observations are added to this set until `sampsize` has been reached.

##### Value

- an object of class `"clara"` representing the clustering. See `clara.object` for details.

##### Note

The random sampling is implemented with a *very* simple scheme
(with period $2^{16} = 65536$) inside the Fortran code,
independently of R's random number generation, and as a matter of
fact, deterministically.

The storage requirement of the `clara` computation (for small `k`) is about $O(n \times p) + O(j^2)$, where $j$ = `sampsize` and $(n, p)$ = `dim(x)`. The CPU computing time (again neglecting small `k`) is about $O(n \times p \times j^2 \times N)$, where $N$ = `samples`.

For "small" datasets, the function `pam` can be used directly. What can be considered *small* is really a function of available computing power, both memory (RAM) and speed. Originally (1990), "small" meant less than 100 observations; later, the authors said *"small (say with fewer than 200 observations)"*.

##### See Also

`agnes` for background and references; `clara.object`, `pam`, `partition.object`, `plot.partition`.

##### Examples

```
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
clarax <- clara(x, 2)
clarax
clarax$clusinfo
plot(clarax)

## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
## objects each.
data(xclara)
## Plot similar to Figure 5 in Struyf et al (1996)
plot(clara(xclara, 3), ask = TRUE)
```

*Documentation reproduced from package cluster, version 1.4-1, License: GPL version 2 or later*