clValid (version 0.6-2)

clValid: Validate Cluster Results

Description

clValid reports validation measures for clustering results. The function returns an object of class "clValid", which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: "internal", "stability", and "biological".

Usage

clValid(obj, nClust, clMethods = "hierarchical", validation = "stability",
        maxitems = 600, metric = "euclidean", method = "average",
        neighbSize = 10, annotation = NULL, GOcategory = "all",
        goTermFreq = 0.05, dropEvidence = NULL, verbose = FALSE, ...)

Arguments

obj
Either a numeric matrix, a data frame, or an ExpressionSet object. Data frames must contain all numeric columns. In all cases, the rows are the items to be clustered (e.g., genes), and the columns are the samples.
nClust
A numeric vector giving the numbers of clusters to be evaluated; e.g., 4:6 would evaluate 4, 5, and 6 clusters.
clMethods
A character vector giving the clustering methods. Available options are "hierarchical", "kmeans", "diana", "fanny", "som", "model", "sota", "pam", "clara", and "agnes", with multiple choices allowed.
validation
A character vector giving the type of validation measures to use. Available options are "internal", "stability", and "biological", with multiple choices allowed.
maxitems
The maximum number of items (rows in matrix) which can be clustered.
metric
The metric used to determine the distance matrix. Possible choices are "euclidean", "correlation", and "manhattan".
method
For hierarchical clustering (hclust and agnes), the agglomeration method used. Available choices are "ward", "single", "complete", and "average".
neighbSize
For internal validation, an integer giving the neighborhood size used for the connectivity measure.
annotation
For biological validation, either a character string naming the Bioconductor annotation package for mapping genes to GO categories, or a list with the names of the functional classes and the observations belonging to each class.
GOcategory
For biological validation, the GO categories to use. Can be one of "BP", "MF", "CC", or "all".
goTermFreq
For the BSI, what threshold frequency of GO terms to use for functional annotation.
dropEvidence
Which GO evidence codes to omit. Either NULL or a character vector, see 'Details' below.
verbose
Logical. If TRUE, detailed output on the progress of cluster validation is produced.
...
Additional arguments to pass to the clustering functions.
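
As a rough illustration of how these arguments combine, the sketch below clusters a small synthetic matrix. The data, the object names (x, fit), and the particular argument values are assumptions made for the example, not recommendations from the package. The call is guarded so that it only runs when clValid is installed.

```r
# Synthetic data (an assumption for illustration): 30 items (rows) x 6 samples (columns).
set.seed(1)
x <- matrix(rnorm(30 * 6), nrow = 30,
            dimnames = list(paste0("item", 1:30), paste0("s", 1:6)))

if (require("clValid")) {
  fit <- clValid(x,
                 nClust     = 2:4,                       # evaluate 2, 3, and 4 clusters
                 clMethods  = c("hierarchical", "pam"),  # multiple methods are allowed
                 validation = c("internal", "stability"),
                 metric     = "correlation",             # metric for the distance matrix
                 method     = "complete",                # agglomeration method (hierarchical)
                 neighbSize = 5)                         # neighborhood size for connectivity
  summary(fit)
}
```

Keeping the number of rows below maxitems (600 by default) stays within the stated limit on how many items can be clustered.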

Value

  • clValid returns an object of class "clValid". See the help file for the class description.
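
The returned object can be inspected with the accessor methods described in the class help file. The sketch below assumes that the accessors clusterMethods, nClusters, measNames, measures, and clusters behave as documented there, and it reuses the mouse data from the Examples. The object name res is hypothetical. Treat it as a starting point rather than a reference.

```r
if (require("clValid")) {
  data(mouse)
  express <- mouse[1:25, c("M1", "M2", "M3", "NC1", "NC2", "NC3")]
  rownames(express) <- mouse$ID[1:25]
  res <- clValid(express, 2:4, clMethods = c("hierarchical", "kmeans"),
                 validation = "internal")

  # print() is needed because values are not auto-printed inside a braced block
  print(clusterMethods(res))  # clustering methods that were run
  print(nClusters(res))       # numbers of clusters that were evaluated
  print(measNames(res))       # names of the validation measures computed
  print(measures(res))        # the validation measures themselves

  hc <- clusters(res, "hierarchical")  # underlying clustering result for one method
  print(class(hc))
}
```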

Details

This function calculates validation measures for a given set of clustering algorithms and numbers of clusters. A variety of clustering algorithms are available, including hierarchical, self-organizing maps (SOM), K-means, self-organizing tree algorithm (SOTA), and model-based. The available validation measures fall into the three general categories of "internal", "stability", and "biological". For details on each measure, refer to the package vignette and the references.
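
To make the idea of an internal measure concrete, here is a minimal base-R sketch of the Dunn index (the quantity behind dunn, listed under See Also): the smallest distance between observations in different clusters divided by the largest distance between observations in the same cluster. The helper name dunn_sketch and the toy data are assumptions for illustration; this is not the package's implementation.

```r
# Hypothetical helper, not part of clValid.
dunn_sketch <- function(dist_obj, labels) {
  d <- as.matrix(dist_obj)
  same <- outer(labels, labels, "==")  # TRUE where two items share a cluster
  pairs <- upper.tri(d)                # count each pair of items once
  min(d[pairs & !same]) / max(d[pairs & same])
}

# Toy data: two tight, well-separated groups of two points each.
pts <- rbind(c(0, 0), c(0, 1), c(10, 0), c(10, 1))
labels <- cutree(hclust(dist(pts), method = "average"), k = 2)
dunn_sketch(dist(pts), labels)  # 10: nearest between-cluster pair is 10 apart, widest cluster spans 1
```

Larger values indicate compact, well-separated clusters. The sketch assumes at least two clusters and at least one cluster with two or more members.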

References

Brock, G., Pihur, V., Datta, S. and Datta, S. (2008) clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4). http://www.jstatsoft.org/v25/i04

Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.

Datta, S. and Datta, S. (2006) Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 7:397. http://www.biomedcentral.com/1471-2105/7/397

Handl, J., Knowles, K., and Kell, D. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.

See Also

For a description of the class 'clValid' and all available methods see clValidObj or clValid-class. For help on the clustering methods see hclust and kmeans in package stats, agnes, clara, diana, fanny, and pam in package cluster, som in package kohonen, Mclust in package mclust, and sota (in this package). For additional help on the validation measures see connectivity, dunn, stability, BHI, and BSI.

Examples

library(clValid)
data(mouse)

## internal validation
express <- mouse[1:25,c("M1","M2","M3","NC1","NC2","NC3")]
rownames(express) <- mouse$ID[1:25]
intern <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                  validation="internal")

## view results
summary(intern)
optimalScores(intern)
plot(intern)

## stability measures
stab <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                validation="stability")
optimalScores(stab)
plot(stab)

## biological measures
## first way - functional classes predetermined
fc <- tapply(rownames(express),mouse$FC[1:25], c)
fc <- fc[-match( c("EST","Unknown"), names(fc))]
bio <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
               validation="biological", annotation=fc)
optimalScores(bio)
plot(bio)

## second way - using Bioconductor
if(require("Biobase") && require("annotate") && require("GO.db") && require("moe430a.db")) {
  bio2 <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                  validation="biological",annotation="moe430a.db",GOcategory="all")
  print(optimalScores(bio2))  # explicit print: values are not auto-printed inside if { }
  plot(bio2)
}
