optCluster (version 1.0.0)

optCluster: Determine Optimal Clustering Algorithm and Number of Clusters

Description

optCluster performs statistical and/or biological validation of clustering results and determines the optimal clustering algorithm and number of clusters through rank aggregation. The function returns an object of class "optCluster".

Usage

optCluster(obj, nClust, clMethods = "all", countData = FALSE,
  validation = c("internal", "stability"), hierMethod = "average",
  annotation = NULL, clVerbose = FALSE, rankMethod = "CE", 
  distance = "Spearman", rankVerbose = FALSE, ...)

Arguments

obj
The dataset to be evaluated as either a data frame, a numeric matrix, or an ExpressionSet object. The items to be clustered (e.g. genes) are the rows and the samples are the columns. I
nClust
A numeric vector providing the range of clusters to be evaluated (e.g. to evaluate the number of clusters ranging from 2 to 4, input 2:4). A single number can also be provided.
clMethods
A character vector providing the names of the clustering algorithms to be used. The available algorithms are: "agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", "sota", "em.nbinom", "da.nbinom", "sa.nbinom",
countData
A logical argument indicating whether the data are count-based. Can also be used in conjunction with the "all" option for the 'clMethods' argument. If TRUE and 'clMethods' = "all", all of the clustering algorithms for count data are select
validation
A character vector providing the names of the types of validation measures to be used. The options of "internal", "stability", "biological", and "all" are available. Any number or combination of choices is allowed.
hierMethod
A character string, providing the agglomeration method to be used by the hierarchical clustering options (hclust and agnes). Available choices are "average", "complete", "single", and "ward".
annotation
Used in biological validation. Either a character string providing the name of the Bioconductor annotation package for mapping genes to GO categories, or the names of each functional class and the observations that belong to them in either a list
clVerbose
If TRUE, the progress of cluster validation will be produced as output.
rankMethod
A character string providing the method to be used for rank aggregation. The two options are the cross-entropy Monte Carlo algorithm ("CE") and the genetic algorithm ("GA"). Only one method may be selected.
distance
A character string providing the type of distance to be used for measuring the similarity between ordered lists in rank aggregation. The two available methods are the weighted Spearman footrule distance ("Spearman") or the weighted Kendall's tau
rankVerbose
If TRUE, current rank aggregation results are displayed at each iteration.
...
Additional arguments that can be passed to internal functions. The internal functions include: clValid, RankAggreg, and all clustering algorithm functions.
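As a rough illustration of what the 'hierMethod' argument controls, the sketch below calls hclust from package stats directly with method = "average" on a small made-up matrix. The data, the Euclidean distance, and the choice of two clusters are assumptions made for illustration only; this is not the call that optCluster makes internally.

```r
## Hypothetical toy data: four items (rows) measured on two samples (columns)
x <- rbind(g1 = c(0, 0), g2 = c(0, 1), g3 = c(10, 10), g4 = c(10, 11))

## Average-linkage agglomeration, the method named by hierMethod = "average"
## (Euclidean distance is an assumption of this sketch)
tree <- hclust(dist(x), method = "average")

## Cut the tree into two clusters and inspect the assignments
groups <- cutree(tree, k = 2)
print(groups)
```

With points this well separated, g1 and g2 fall in one cluster and g3 and g4 in the other. Replacing "average" with "complete", "single", or a Ward method changes only the linkage rule used to merge clusters.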

Value

  • optCluster returns an object of class "optCluster". The class description is provided in the help file.

Details

This function has been created as an extension of the clValid function. In addition to the validation measures and clustering algorithms available in the clValid function, six clustering algorithms for count data are included in the optCluster function. This function also determines a unique solution for the optimal clustering algorithm and number of clusters through rank aggregation of validation measure lists. A brief description of the available clustering algorithms, validation measures, and rank aggregation algorithms is provided below. For more details, please refer to the references.
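To make the idea of rank aggregation concrete, here is a minimal base-R sketch. It is a simplified stand-in, not the RankAggreg implementation: it uses the unweighted Spearman footrule distance and a brute-force search over all orderings, whereas the package uses weighted distances and a cross-entropy Monte Carlo or genetic algorithm search. The measure names and candidate labels are hypothetical.

```r
## Each row is one hypothetical validation measure's ordering of three
## hypothetical algorithm/cluster-number candidates, best first.
lists <- rbind(
  measureA = c("kmeans-2", "pam-3", "hier-2"),
  measureB = c("pam-3", "kmeans-2", "hier-2"),
  measureC = c("kmeans-2", "hier-2", "pam-3")
)

## Unweighted Spearman footrule distance: sum of absolute differences
## between each item's position in the two orderings.
footrule <- function(a, b) sum(abs(match(a, a) - match(a, b)))

## All orderings of a vector (feasible only for a handful of items).
permutations <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in permutations(x[-i])) {
      out[[length(out) + 1]] <- c(x[i], p)
    }
  }
  out
}

## Score every candidate ordering by its total distance to the input lists
## and keep the one with the smallest total.
candidates <- permutations(lists[1, ])
total_distance <- sapply(candidates, function(cand) {
  sum(apply(lists, 1, footrule, b = cand))
})
consensus <- candidates[[which.min(total_distance)]]
print(consensus)
```

For these three lists the consensus ordering is "kmeans-2", "pam-3", "hier-2", with a total footrule distance of 4. Brute force grows factorially with the number of candidates, which is why a stochastic search is used in practice.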

References

Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), http://www.jstatsoft.org/v25/i04.

Datta, S. and Datta, S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.

Pihur, V., Datta, S. and Datta, S. (2007). Weighted rank aggregation of cluster validation measures: A Monte Carlo cross-entropy approach. Bioinformatics 23(13): 1607-1615.

Pihur, V., Datta, S. and Datta, S. (2009). RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10:62, http://www.biomedcentral.com/1471-2105/10/62.

Sekula, M. (2015). optCluster: An R Package for Determining the Optimal Clustering Algorithm and Optimal Number of Clusters. MS Thesis, University of Louisville.

Si, Y., Liu, P., Li, P., & Brutnell, T. (2014). Model-based clustering for RNA-seq data. Bioinformatics 30(2): 197-205.

See Also

For a description of the clValid function, including all available arguments that can be passed to it, see clValid in the clValid package.

For a description of the RankAggreg function, including all available arguments that can be passed to it, see RankAggreg in the RankAggreg package.

For details on the clustering algorithm functions for transformed data see agnes, clara, diana, fanny, and pam in package cluster, hclust and kmeans in package stats, som in package kohonen, Mclust in package mclust, and sota in package clValid.

For details on the clustering algorithm functions for count data see Cluster.RNASeq in package MBCluster.Seq.

For details on the validation measure functions see BHI, BSI, stability, connectivity and dunn in package clValid and silhouette in package cluster.

Examples

## These examples may each take a few minutes to compute
	## Obtain Dataset	
	data(arabid)	
		
	## Analysis of Count Data using Internal and Stability Validation Measures
	count1 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE)
	summary(count1)
	
	## Analysis of Count Data using All Validation Measures
	if(require("Biobase") && require("annotate") && require("GO.db") && 
		require("org.At.tair.db")){
	count2 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE, validation = "all", 
					annotation = "org.At.tair.db")
	summary(count2)	
	}
	
	## Normalize Data with Respect to Library Size	
	obj <- t(t(arabid)/colSums(arabid))
		
	## Analysis of Normalized Data using Internal and Stability Validation Measures
	norm1 <- optCluster(obj, 2:4, clMethods = "all")
	summary(norm1)
	
	## Analysis of Normalized Data using All Validation Measures
	if(require("Biobase") && require("annotate") && require("GO.db") && 
		require("org.At.tair.db")){
	norm2 <- optCluster(obj, 2:4, clMethods = "all", validation = "all", 
					annotation = "org.At.tair.db")
	summary(norm2)	
	}
	
	## Analysis with Only UPGMA using Internal and Stability Validation Measures
	hier1 <- optCluster(obj, 2:10, clMethods = "hierarchical")
	summary(hier1)
	
	## Analysis with Only UPGMA using All Validation Measures
	if(require("Biobase") && require("annotate") && require("GO.db") && 
		require("org.At.tair.db")){
	hier2 <- optCluster(obj, 2:10, clMethods = "hierarchical", validation = "all", 
					annotation = "org.At.tair.db")
	summary(hier2)	
	}
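The library-size normalization step used in the examples, t(t(arabid)/colSums(arabid)), divides each sample (column) by its total count. The sketch below checks this on a small made-up count matrix (not the arabid data) and compares it with an equivalent sweep call from base R.

```r
## Hypothetical 3-gene x 2-sample count matrix
counts <- matrix(c(10, 30, 60,
                    5,  5, 10),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

## Same idiom as in the examples: transpose so samples are rows, divide each
## row by its library size, then transpose back
norm <- t(t(counts) / colSums(counts))

## Every column of the normalized matrix now sums to 1
print(colSums(norm))

## Equivalent formulation: divide each column by its own total
norm_sweep <- sweep(counts, 2, colSums(counts), "/")
print(all.equal(norm, norm_sweep))
```

The double transpose is needed because R recycles the divisor down columns; dividing counts by colSums(counts) directly would pair library sizes with the wrong entries.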
