Computes the prediction strength of a clustering of a dataset into different numbers of components. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. See details.

Various clustering methods can be used, see argument `clustermethod`.
In Tibshirani and Walther (2005), only classification to the nearest
centroid is discussed, but more methods are offered here, see argument
`classification`.

```
prediction.strength(xdata, Gmin=2, Gmax=10, M=50,
clustermethod=kmeansCBI,
classification="centroid", centroidname = NULL,
cutoff=0.8,nnk=1,
distances=inherits(xdata,"dist"),count=FALSE,...)
# S3 method for predstr
print(x, ...)
```

xdata

data (something that can be coerced into a matrix).

Gmin

integer. Minimum number of clusters. Note that the
prediction strength for 1 cluster is trivially 1, which is
automatically included if `Gmin>1`. Therefore `Gmin<2` is
useless.

Gmax

integer. Maximum number of clusters.

M

integer. Number of times the dataset is divided into two halves.

clustermethod

an interface function (the function name, not a
string containing the name, has to be provided!). This defines the
clustering method. See the "Details"-section of `clusterboot` and
`kmeansCBI` for the format. Clustering methods for
`prediction.strength` must have a `k`-argument for the number of
clusters, must operate on n times p data matrices
and must otherwise follow the specifications in `clusterboot`.
Note that `prediction.strength` won't work
with CBI-functions that implicitly already estimate the number of
clusters such as `pamkCBI`; use `claraCBI`
if you want to run it for pam/clara clustering.

classification

string.
This determines how non-clustered points are classified to given
clusters. Options are explained in `classifnp` and
`classifdist`, the latter for dissimilarity data.
Certain classification methods are connected to certain clustering
methods. `classification="averagedist"` is recommended for
average linkage, `classification="centroid"` is recommended for
k-means, clara and pam (with distances it will work with
`claraCBI` only), `classification="knn"` with `nnk=1`
is recommended for single linkage and `classification="qda"`
is recommended for Gaussian mixtures
with flexible covariance matrices.

centroidname

string. Indicates the name of the component of
`CBIoutput$result` that contains the cluster centroids in case of
`classification="centroid"`, where `CBIoutput` is the
output object of `clustermethod`. If `clustermethod` is
`kmeansCBI` or `claraCBI`, centroids are recognised
automatically if `centroidname=NULL`. If
`centroidname=NULL` and `distances=FALSE`, cluster means
are computed as the cluster centroids.

cutoff

numeric between 0 and 1. The optimal number of clusters
is the maximum one with prediction strength above `cutoff`.

nnk

number of nearest neighbours if
`classification="knn"`, see `classifnp`.

distances

logical. If `TRUE`, data will be interpreted as a
dissimilarity matrix, passed on to clustering methods as a
`"dist"`-object, and `classifdist` will be used for
classification.

count

logical. `TRUE` will print the current number of
clusters and simulation run number on the screen.

x

object of class `predstr`.

...

arguments to be passed on to the clustering method.
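
As an illustration of the interface-function format that `clustermethod` expects, here is a minimal sketch in base R. The function `wardCBI` is hypothetical and not part of any package; it assumes that the output format described in the "Details"-section of `clusterboot` is a list with the components `result`, `nc`, `clusterlist`, `partition` and `clustermethod`, and it takes the required `k`-argument and an n times p data matrix.

```r
# Hypothetical interface function (an assumption for illustration, not part
# of any package): Ward clustering via hclust, cut into exactly k clusters.
wardCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  tree <- hclust(dist(data), method = "ward.D2")
  partition <- as.integer(cutree(tree, k = k))
  list(
    result = tree,                      # full output of the clustering method
    nc = k,                             # number of clusters
    clusterlist = lapply(seq_len(k),    # one logical membership vector per cluster
                         function(j) partition == j),
    partition = partition,              # cluster label for every observation
    clustermethod = "hclust/ward.D2"    # string identifying the method
  )
}

# Small artificial dataset with two well separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
out <- wardCBI(x, k = 2)
table(out$partition)
```

Under these assumptions, such a function would be supplied as `clustermethod=wardCBI` (the function itself, not a string). Since its `result` component has no centroid entry, `classification="centroid"` with `centroidname=NULL` would rely on cluster means being computed as centroids.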

`prediction.strength` returns an object of class `predstr`, which is a
list with components

predcorr

list of vectors of length `M` with relative
frequencies of correct predictions (clusterwise minimum). Every list
entry refers to a certain number of clusters.

mean.pred

means of `predcorr` for all numbers of
clusters.

optimalk

optimal number of clusters.

cutoff

see above.

method

a string identifying the clustering method.

Gmax

see above.

M

see above.
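
The rule given for `cutoff` can be sketched in a few lines of base R. The numbers below are made up for illustration, and the sketch assumes that the vector of means of `predcorr` is indexed by the number of clusters, starting with the trivial value 1 for a single cluster.

```r
# Made-up mean prediction strengths for k = 1, ..., 5 (illustration only).
# Entry 1 is the trivial value 1 for a single cluster.
mean_pred <- c(1, 0.93, 0.85, 0.62, 0.48)
cutoff <- 0.8

# Largest number of clusters whose mean prediction strength is above cutoff.
optimal_k <- max(which(mean_pred > cutoff))
optimal_k
```

With these values the sketch selects 3 clusters, because k = 4 and k = 5 fall below the cutoff.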

The prediction strength for a certain number of clusters k under a
random partition of the dataset in halves A and B is defined as
follows. Both halves are clustered with k clusters. Then the points of
A are classified to the clusters of B. In the original paper
this is done by assigning every
observation in A to the closest cluster centroid in B (corresponding
to `classification="centroid"`), but other methods are possible,
see `classifnp`. A pair of points of A in
the same A-cluster is defined to be correctly predicted if both points
are classified into the same cluster on B. The same is done with the
points of B relative to the clustering on A. The prediction strength
for each of the clusterings is the minimum (taken over all clusters)
relative frequency of correctly predicted pairs of points of that
cluster. The final mean prediction strength statistic is the mean over
all 2M clusterings.
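
The definition above can be sketched in base R with `kmeans` and nearest-centroid classification. This is a simplified illustration, not the code of `prediction.strength`: the helper names `nearest_centroid`, `min_pair_agreement` and `prediction_strength_sketch` are made up here, and treating a cluster with fewer than two points as perfectly predicted is an assumption of the sketch.

```r
# Assign every row of `points` to the closest row of `centers`
# (squared Euclidean distance).
nearest_centroid <- function(points, centers) {
  d <- matrix(0, nrow(points), nrow(centers))
  for (j in seq_len(nrow(centers))) {
    d[, j] <- rowSums(sweep(points, 2, centers[j, ])^2)
  }
  max.col(-d, ties.method = "first")
}

# Minimum over clusters of the relative frequency of pairs from the same
# cluster (`own`) that are classified into the same cluster (`predicted`).
min_pair_agreement <- function(own, predicted, k) {
  per_cluster <- sapply(seq_len(k), function(j) {
    members <- predicted[own == j]
    n_j <- length(members)
    if (n_j < 2) return(1)  # assumption: no pairs means nothing is mispredicted
    counts <- as.numeric(table(members))
    sum(counts * (counts - 1)) / (n_j * (n_j - 1))
  })
  min(per_cluster)
}

# Mean over the 2M clusterings obtained from M random splits into halves.
prediction_strength_sketch <- function(x, k, M = 10) {
  x <- as.matrix(x)
  n <- nrow(x)
  values <- numeric(0)
  for (m in seq_len(M)) {
    idx <- sample(n)
    halves <- list(x[idx[1:(n %/% 2)], , drop = FALSE],
                   x[idx[(n %/% 2 + 1):n], , drop = FALSE])
    fits <- lapply(halves, kmeans, centers = k, nstart = 10)
    for (i in 1:2) {
      j <- 3 - i
      # classify the points of half i to the centroids found on half j
      predicted <- nearest_centroid(halves[[i]], fits[[j]]$centers)
      values <- c(values, min_pair_agreement(fits[[i]]$cluster, predicted, k))
    }
  }
  mean(values)
}

set.seed(98765)
x <- as.matrix(iris[, -5])
ps <- sapply(2:4, function(k) prediction_strength_sketch(x, k, M = 5))
names(ps) <- 2:4
ps
```

As in the original examples, `M = 5` only keeps the run short; a larger `M` gives a more stable mean.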

Tibshirani, R. and Walther, G. (2005) Cluster Validation by
Prediction Strength, *Journal of Computational and Graphical
Statistics*, 14, 511-528.

```
library(fpc)
options(digits=3)
set.seed(98765)
iriss <- iris[sample(150,20),-5]
prediction.strength(iriss,2,3,M=3)
prediction.strength(iriss,2,3,M=3,clustermethod=claraCBI)
# The examples are fast, but of course M should really be larger.
```