clValid: Validate Cluster Results

Description

clValid reports validation measures for clustering results. The function returns an object of class "clValid", which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: "internal", "stability", and "biological".

Usage

clValid(obj, nClust, clMethods = "hierarchical", validation =
"stability", maxitems = 600, metric = "euclidean", method = "average",
neighbSize = 10, annotation = NULL, GOcategory = "all",
goTermFreq=0.05, dropEvidence=NULL, verbose=FALSE, ...)

Arguments

obj

Either a numeric matrix, a data frame, or an ExpressionSet object. Data frames must contain all numeric columns. In all cases, the rows are the items to be clustered (e.g., genes), and the columns are the samples.

nClust

A numeric vector giving the numbers of clusters to be evaluated. For example, 4:6 evaluates 4, 5, and 6 clusters.

clMethods

A character vector giving the clustering methods. Available options are "hierarchical", "kmeans", "diana", "fanny", "som", "model", "sota", "pam", "clara", and "agnes", with multiple choices allowed.

validation

A character vector giving the type of validation measures to use. Available options are "internal", "stability", and "biological", with multiple choices allowed.

maxitems

The maximum number of items (rows of obj) that can be clustered.

metric

The metric used to determine the distance matrix. Possible choices are "euclidean", "correlation", and "manhattan".
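As a rough illustration of the three metric choices, the sketch below builds a distance object for each one using base R only. It assumes that "correlation" means one minus the Pearson correlation between rows; that is an assumption made for illustration, and the helper row_distance is hypothetical, not a function from clValid.

```r
# Hypothetical helper (not part of clValid): distances between the rows
# of a numeric matrix for the three metric names listed above.
row_distance <- function(x, metric = c("euclidean", "correlation", "manhattan")) {
  metric <- match.arg(metric)
  if (metric == "correlation") {
    # Assumption: correlation distance taken as 1 - Pearson correlation of rows.
    as.dist(1 - cor(t(x)))
  } else {
    dist(x, method = metric)
  }
}

set.seed(1)
# Toy data: 5 items (rows) measured on 6 samples (columns).
x <- matrix(rnorm(30), nrow = 5, dimnames = list(paste0("g", 1:5), NULL))

d_euc <- row_distance(x, "euclidean")
d_cor <- row_distance(x, "correlation")
d_man <- row_distance(x, "manhattan")
```

Each result is a "dist" object over the rows, so it can be passed to functions such as hclust() that accept a dissimilarity structure.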

method

For hierarchical clustering (hclust and agnes), the agglomeration method used. Available choices are "ward", "single", "complete", and "average".

neighbSize

For internal validation, an integer giving the neighborhood size used for the connectivity measure.

annotation

For biological validation, either a character string naming the Bioconductor annotation package for mapping genes to GO categories, or a list with the names of the functional classes and the observations belonging to each class.
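The list form of annotation can be pictured as one named element per functional class, each holding the identifiers of the items in that class. The sketch below builds such a list in base R; the gene identifiers and class names are hypothetical, and it is assumed (as in the Examples section) that the identifiers correspond to the row names of obj.

```r
# Hypothetical gene identifiers and class labels, for illustration only.
genes <- paste0("gene", 1:6)
class_of_gene <- c("Growth", "Growth", "Transport",
                   "Transport", "Transport", "Unknown")

# One list element per functional class, holding the items in that class.
fc <- split(genes, class_of_gene)

# Drop classes that carry no functional information.
fc <- fc[setdiff(names(fc), "Unknown")]
str(fc)
```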

GOcategory

For biological validation, the GO categories to use. Can be one of "BP", "MF", "CC", or "all".

goTermFreq

For the BSI (biological stability index), the threshold frequency of GO terms to use for functional annotation.

dropEvidence

Which GO evidence codes to omit. Either NULL or a character vector, see 'Details' below.

verbose

Logical. If TRUE, detailed output on the progress of cluster validation is produced.

...

Additional arguments to pass to the clustering functions.

Value

clValid returns an object of class "clValid". See the help file for the class description.
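As a sketch of how the returned object might be inspected, the code below runs a small internal validation on the mouse data and calls the accessor methods of the "clValid" class. It is guarded so that it only runs when the clValid package is installed; the variable names res, m, and hc are illustrative.

```r
if (requireNamespace("clValid", quietly = TRUE)) {
  library(clValid)
  data(mouse)

  # Small subset of the mouse data, as in the Examples section.
  express <- mouse[1:25, c("M1", "M2", "M3", "NC1", "NC2", "NC3")]
  rownames(express) <- mouse$ID[1:25]

  res <- clValid(express, 2:4, clMethods = c("hierarchical", "kmeans"),
                 validation = "internal")

  clusterMethods(res)                  # clustering methods that were run
  nClusters(res)                       # numbers of clusters evaluated
  measNames(res)                       # names of the validation measures
  m <- measures(res)                   # array of validation measures
  dim(m)
  hc <- clusters(res, "hierarchical")  # clustering result for one method
}
```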

Details

This function calculates validation measures for a given set of clustering algorithms and numbers of clusters. A variety of clustering algorithms are available, including hierarchical, self-organizing maps (SOM), K-means, self-organizing tree algorithm (SOTA), and model-based. The available validation measures fall into the three general categories of "internal", "stability", and "biological". For details on each measure, refer to the package vignette and the references.
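The general idea of scoring several algorithms over several numbers of clusters can be sketched without clValid. The code below is an illustration only, not the package's implementation: it uses toy data, two algorithms from base R, and the average silhouette width from the cluster package as an example of an internal-type measure. The names x, ks, cluster_with, and sil are hypothetical.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

set.seed(42)
# Toy data (hypothetical): two well-separated groups of 10 items, 4 samples each.
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 5), ncol = 4))
rownames(x) <- paste0("item", seq_len(nrow(x)))
d <- dist(x)  # Euclidean distances between items (rows)

ks <- 2:4
# Each entry maps a number of clusters k to a vector of cluster labels.
cluster_with <- list(
  hierarchical = function(k) cutree(hclust(d, method = "average"), k),
  kmeans       = function(k) kmeans(x, centers = k, nstart = 10)$cluster
)

# Rows: numbers of clusters; columns: algorithms;
# entries: average silhouette width of the resulting partition.
sil <- sapply(cluster_with, function(f)
  sapply(ks, function(k) mean(silhouette(f(k), d)[, "sil_width"])))
rownames(sil) <- ks
sil
```

Larger average silhouette widths indicate more compact, better separated clusters, so the table can be read column by column to compare numbers of clusters for each algorithm.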

References

Brock, G., Pihur, V., Datta, S. and Datta, S. (2008) clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4). http://www.jstatsoft.org/v25/i04

Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.

Datta, S. and Datta, S. (2006) Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 7:397. http://www.biomedcentral.com/1471-2105/7/397

Handl, J., Knowles, J., and Kell, D. (2005) Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.

Examples

library(clValid)
data(mouse)

## internal validation
express <- mouse[1:25,c("M1","M2","M3","NC1","NC2","NC3")]
rownames(express) <- mouse$ID[1:25]
intern <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                  validation="internal")

## view results
summary(intern)
optimalScores(intern)
plot(intern)

## stability measures
stab <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                validation="stability")
optimalScores(stab)
plot(stab)

## biological measures
## first way - functional classes predetermined
fc <- tapply(rownames(express), mouse$FC[1:25], c)
## drop uninformative classes (safe even if one of them is absent)
fc <- fc[!names(fc) %in% c("EST", "Unknown")]
bio <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
               validation="biological", annotation=fc)
optimalScores(bio)
plot(bio)

## second way - using Bioconductor
if(require("Biobase") && require("annotate") && require("GO.db") && require("moe430a.db")) {
  bio2 <- clValid(express, 2:6, clMethods=c("hierarchical","kmeans","pam"),
                  validation="biological",annotation="moe430a.db",GOcategory="all")
  optimalScores(bio2)
  plot(bio2)
}
