classifdist: Classification of unclustered points

Description

Various methods for classification of unclustered points from clustered points for use within functions nselectboot and prediction.strength.

Usage

classifdist(cdist,clustering,
                      method="averagedist",
                      centroids=NULL,nnk=1)
classifnp(data,clustering,
                      method="centroid",cdist=NULL,
                      centroids=NULL,nnk=1)

Value

An integer vector giving cluster numbers for all observations; those for the observations already clustered in the input are the same as in the input.

Arguments

cdist: dissimilarity matrix or dist-object. Necessary for classifdist but optional for classifnp and there only used if method="averagedist" (if not provided, dist is applied to data).
data: something that can be coerced into a an n*p-data matrix.
clustering: integer vector. Gives the cluster number (between 1 and k for k clusters) for clustered points and should be -1 for points to be classified.
method: one of "averagedist", "centroid", "qda", "knn". See details.
centroids: for classifnp a k times p matrix of cluster centroids. For classifdist a vector of numbers of centroid objects as provided by pam. Only used if method="centroid"; in that case mandatory for classifdist but optional for classifnp, where cluster mean vectors are computed if centroids=NULL.
nnk: number of nearest neighbours if method="knn".

Author

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

Details

classifdist is for data given as dissimilarity matrix, classifnp is for data given as n times p data matrix. The following methods are supported:

"centroid": assigns observations to the cluster with closest cluster centroid as specified in argument centroids (this is associated to k-means and pam/clara-clustering).
"qda": only in classifnp. Classifies by quadratic discriminant analysis (this is associated to Gaussian clusters with flexible covariance matrices), calling qda with default settings. If qda gives an error (usually because a class was too small), lda is used.
"lda": only in classifnp. Classifies by linear discriminant analysis (this is associated to Gaussian clusters with equal covariance matrices), calling lda with default settings.
"averagedist": assigns to the cluster to which an observation has the minimum average dissimilarity to all points in the cluster (this is associated with average linkage clustering).
"knn": classifies by nnk nearest neighbours (for nnk=1, this is associated with single linkage clustering). Calls knn in classifnp.
"fn": classifies by the minimum distance to the farthest neighbour. This is associated with complete linkage clustering).

Examples

Run this code

set.seed(20000)
x1 <- rnorm(50)
y <- rnorm(100)
x2 <- rnorm(40,mean=20)
x3 <- rnorm(10,mean=25,sd=100)
x <-cbind(c(x1,x2,x3),y)
truec <- c(rep(1,50),rep(2,40),rep(3,10))
topredict <- c(1,2,51,52,91)
clumin <- truec
clumin[topredict] <- -1

classifnp(x,clumin, method="averagedist")
classifnp(x,clumin, method="qda")
classifdist(dist(x),clumin, centroids=c(3,53,93),method="centroid")
classifdist(dist(x),clumin,method="knn")

Run the code above in your browser using DataLab