Clustering Large Applications

Computes a "clara" object, a list representing a clustering of the data into k clusters.

clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
      sampsize = 40 + 2 * k)
Arguments

x: data matrix or data frame; each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed.

k: integer, the number of clusters. It is required that $0 < k < n$ where $n$ is the number of observations (i.e., n = nrow(x)).

metric: character string specifying the metric to be used for calculating dissimilarities between observations. The currently available options are "euclidean" and "manhattan". Euclidean distances are root sum-of-squares of differences, and manhattan distances are the sum of absolute differences.

stand: logical, indicating if the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation.

samples: integer, the number of samples to be drawn from the dataset.

sampsize: integer, the number of observations in each sample. sampsize should be larger than the number of clusters (k) and at most the number of observations (n = nrow(x)).

Details

clara is fully described in chapter 3 of Kaufman and Rousseeuw (1990). Compared to other partitioning methods such as pam, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in $n$ rather than quadratic.

Each sub-dataset is partitioned into k clusters using the same algorithm as in pam. Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid.

The sum of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which this sum is minimal is retained. A further analysis is then carried out on the final partition.

Each sub-dataset is forced to contain the medoids obtained from the best sub-dataset until then. Randomly drawn observations are added to this set until sampsize has been reached.
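The loop described above (draw a sub-dataset, partition it with a pam-style search, assign the entire dataset to the resulting medoids, keep the best result, and seed later samples with the best medoids so far) can be sketched as follows. This is an illustrative toy in Python, not the package's Fortran implementation; in particular, `pam_like` is a deliberately crude greedy stand-in for the real pam algorithm.

```python
import random

def dist(a, b):
    # Euclidean distance between two points (tuples of floats).
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def assign_and_cost(data, medoids):
    # Assign every observation to its nearest medoid; return the labels and
    # the sum of dissimilarities (the quality measure clara minimizes).
    labels, cost = [], 0.0
    for p in data:
        d, i = min((dist(p, m), i) for i, m in enumerate(medoids))
        labels.append(i)
        cost += d
    return labels, cost

def pam_like(sample, k):
    # Crude stand-in for pam on the sub-dataset: greedy medoid swaps
    # until no single swap lowers the cost any further.
    medoids = sample[:k]
    improved = True
    while improved:
        improved = False
        _, best = assign_and_cost(sample, medoids)
        for i in range(k):
            for cand in sample:
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                _, c = assign_and_cost(sample, trial)
                if c < best - 1e-12:
                    medoids, best, improved = trial, c, True
    return medoids

def clara_sketch(data, k, samples=5, sampsize=None, seed=0):
    rng = random.Random(seed)
    sampsize = sampsize or min(len(data), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(samples):
        # Later sub-datasets are forced to contain the best medoids so far;
        # randomly drawn observations top the set up to sampsize.
        keep = best_medoids or []
        pool = [p for p in data if p not in keep]
        sub = keep + rng.sample(pool, sampsize - len(keep))
        medoids = pam_like(sub, k)
        # Quality is measured over the *entire* dataset, not the sample.
        _, cost = assign_and_cost(data, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels, _ = assign_and_cost(data, best_medoids)
    return best_medoids, labels
```

Because each pam-style search only ever sees sampsize observations, the expensive quadratic work is confined to the sub-dataset, which is what makes the overall procedure linear in $n$.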


Value

an object of class "clara" representing the clustering. See clara.object for details.


Note

The random sampling is implemented with a very simple scheme (with period $2^{16} = 65536$) inside the Fortran code, independently of R's random number generator, and is in fact deterministic.

The storage requirement of the clara computation (for small k) is about $O(n \times p) + O(j^2)$, where $j$ = sampsize and $(n, p)$ = dim(x). The CPU computing time (again neglecting small k) is about $O(n \times p \times j^2 \times N)$, where $N$ = samples.
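A back-of-envelope sketch of these growth rates, with invented constant factors; the quadratic figure for pam is an assumption based on the full dissimilarity matrix it stores (as noted in the Details above):

```python
# Cost models for the big-O claims above; only the growth rates matter.

def clara_time(n, p, j, N):
    # CPU model from the text: O(n * p * j^2 * N) -- linear in n.
    return n * p * j ** 2 * N

def pam_storage(n):
    # pam stores the full dissimilarity matrix: roughly n*(n-1)/2 entries,
    # i.e. quadratic in n -- the reason clara exists for large datasets.
    return n * (n - 1) // 2

# Doubling n doubles clara's cost but roughly quadruples pam's storage:
k = 2
j = 40 + 2 * k          # the default sampsize
for n in (1_000, 2_000):
    print(n, clara_time(n, p=2, j=j, N=5), pam_storage(n))
```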

For "small" datasets, the function pam can be used directly. What can be considered small is really a function of available computing power, both memory (RAM) and speed. Originally (1990), "small" meant fewer than 100 observations; later, the authors said "small (say with fewer than 200 observations)".

See Also

agnes for background and references; clara.object, pam, partition.object, plot.partition.

Examples
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200,0,8), rnorm(200,0,8)),
           cbind(rnorm(300,50,8), rnorm(300,50,8)))
clarax <- clara(x, 2)

## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
## objects each.
## Plot similar to Figure 5 in Struyf et al (1996)
plot(clara(xclara, 3), ask = TRUE)
Documentation reproduced from package cluster, version 1.4-1, License: GPL version 2 or later
