optCluster performs statistical and/or biological validation of
clustering results and determines the optimal clustering algorithm and
number of clusters through rank aggregation. The function returns an
object of class "optCluster".
optCluster(obj, nClust, clMethods = "all", countData = FALSE,
  validation = c("internal", "stability"), hierMethod = "average",
  annotation = NULL, clVerbose = FALSE, rankMethod = "CE",
  distance = "Spearman", importance = NULL, rankVerbose = FALSE, ...)
ExpressionSet object. The items to be clustered (e.g. genes)
are the rows and the samples are the columns. In the case of data frames,
all columns must be numeric.
hclust and agnes).
Available choices are "average", "complete", "single", and "ward".
clValid or RankAggreg:
Additional clValid arguments:
metric - Metric used to determine distance matrix in validation measures. Possible choices are:
"euclidean" (default), "correlation", and "manhattan".
neighbSize - Integer giving neighborhood size used in "connectivity" validation measure.
GOcategory - For biological validation, a character string providing which GO category to use. Options include:
"BP", "MF", "CC", or "all" (default).
goTermFreq - For BSI validation, the threshold frequency of GO terms to use for functional annotation.
dropEvidence - For biological validation, either NULL or a character vector of GO evidence codes to omit.
Additional RankAggreg arguments:
k - Size of top-k list in aggregation.
convIN - Stopping criteria for CE and GA algorithms. The algorithm converges once the "best" solution does not
change after convIN iterations. Default: 7 for CE and 30 for GA.
N - Number of samples generated by MCMC in the CE algorithm. Default = 10*k^2
rho - For CE algorithm, (rho*N) is the quantile of candidate list sorted by function values.
weight - For CE algorithm, the learning factor used in the probability update feature. Default = 0.25
popSize - For GA algorithm, the population size in each generation. Default = 100
CP - For GA algorithm, the cross-over probability. Default = 0.4
MP - For GA algorithm, the mutation probability. Default = 0.01
optCluster returns an object of class "optCluster". The class description
is provided in the help file.
clValid function. In addition to the validation
measures and clustering algorithms available in the clValid function, six clustering algorithms
for count data are included in the optCluster function. This function also determines a
unique solution for the optimal clustering algorithm and number of clusters through rank aggregation of
validation measure lists. A brief description of the available clustering algorithms, validation measures,
and rank aggregation algorithms is provided below. For more details, please refer to the references.
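To make the idea of rank aggregation concrete, the following is a minimal base R sketch, not the package's implementation. It assumes three made-up ordered lists (one per validation measure, best candidate first), uses an unweighted Spearman footrule distance, and finds the aggregate ordering by exhaustive search instead of the CE or GA algorithms. The names lists, footrule, permutations, and score are hypothetical helpers introduced only for this illustration.

```r
# Hypothetical ordered lists, one per validation measure (best first).
lists <- list(
  m1 = c("kmeans-2", "pam-2", "hier-2"),
  m2 = c("pam-2", "kmeans-2", "hier-2"),
  m3 = c("kmeans-2", "hier-2", "pam-2")
)

# Spearman footrule: sum of absolute differences between the position of
# each item in the candidate list and its position in an ordered list.
footrule <- function(candidate, ordered) {
  sum(abs(seq_along(candidate) - match(candidate, ordered)))
}

# All permutations of a vector (fine for very short lists only).
permutations <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in permutations(x[-i])) {
      out[[length(out) + 1]] <- c(x[i], p)
    }
  }
  out
}

# Equal importance for every list (assumption for this sketch).
importance <- rep(1, length(lists))

# Objective: importance-weighted sum of distances to all ordered lists.
score <- function(candidate) {
  sum(importance * vapply(lists, function(l) footrule(candidate, l), numeric(1)))
}

cands  <- permutations(lists[[1]])
scores <- vapply(cands, score, numeric(1))
best   <- cands[[which.min(scores)]]
print(best)
```

The first element of best is the candidate that is, on aggregate, closest to the top of all lists. Exhaustive search grows factorially with the list length, so it is practical only for very short lists.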
clValid:
"agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", and "sota".
optCluster function by using the argument
'Normalizer'.
importance argument. The default value of equal weights (NULL) is
represented by rep(1, length(x)), where x is the character vector of validation measure names.
To manually change the weights, the order of the validation measures selected needs to be known.
The order of validation measures used in optCluster is provided below:
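As a minimal sketch of building a non-default weight vector in base R: the measure names used here are clValid validation measures, but the order in which they are listed is an assumption made for illustration and should be checked against the order documented for the chosen validation argument. The variable names x, custom_w, and importance_weights are hypothetical.

```r
# ASSUMPTION: measure order for validation = c("internal", "stability").
# Verify the actual order before passing weights to optCluster.
x <- c("APN", "AD", "ADM", "FOM", "Connectivity", "Dunn", "Silhouette")

# Default behaviour described in the text: equal weights.
equal_w <- rep(1, length(x))

# Name the weights so that individual measures can be changed safely,
# then double the weight of two measures (arbitrary choice for the sketch).
custom_w <- setNames(equal_w, x)
custom_w[c("Dunn", "Silhouette")] <- 2

# Plain numeric vector in the same order as x.
importance_weights <- unname(custom_w)
print(importance_weights)
```

A vector built this way could then be supplied through the importance argument, provided its length and order match the validation measures actually used.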
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), http://www.jstatsoft.org/v25/i04.
Datta, S. and Datta, S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.
Pihur, V., Datta, S. and Datta, S. (2007). Weighted rank aggregation of cluster validation measures: A Monte Carlo cross-entropy approach. Bioinformatics 23(13): 1607-1615.
Pihur, V., Datta, S. and Datta, S. (2009). RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10:62, http://www.biomedcentral.com/1471-2105/10/62.
Sekula, M. (2015). optCluster: An R package for Determining the Optimal Clustering Algorithm and Optimal Number of Clusters. Electronic Theses and Dissertations. Paper 2147. http://ir.library.louisville.edu/etd/2147
Si, Y., Liu, P., Li, P. and Brutnell, T. (2014). Model-based clustering for RNA-seq data. Bioinformatics 30(2): 197-205.
For a description of the clValid function, including all available arguments that can be
passed to it, see clValid in the clValid package.
For a description of the RankAggreg function, including all available arguments that can be
passed to it, see RankAggreg in the RankAggreg package.
For details on the clustering algorithm functions for continuous data see
agnes, clara, diana,
fanny, and pam in package cluster,
hclust and kmeans in package stats,
som in package kohonen,
Mclust in package mclust,
and sota in package clValid.
For details on the clustering algorithm functions for count data see
Cluster.RNASeq in package MBCluster.Seq.
For details on the validation measure functions see
BHI, BSI,
stability, connectivity and dunn
in package clValid
and silhouette in package cluster.
## These examples may each take a few minutes to compute
## Obtain Dataset
data(arabid)
## Analysis of Count Data using Internal and Stability Validation Measures
count1 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE)
summary(count1)
## Analysis of Count Data using All Validation Measures
if(require("Biobase") && require("annotate") && require("GO.db") &&
   require("org.At.tair.db")){
  count2 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE,
                       validation = "all", annotation = "org.At.tair.db")
  summary(count2)
}
## Normalize Data with Respect to Library Size
obj <- t(t(arabid)/colSums(arabid))
## Analysis of Normalized Data using Internal and Stability Validation Measures
norm1 <- optCluster(obj, 2:4, clMethods = "all")
summary(norm1)
## Analysis of Normalized Data using All Validation Measures
if(require("Biobase") && require("annotate") && require("GO.db") &&
   require("org.At.tair.db")){
  norm2 <- optCluster(obj, 2:4, clMethods = "all", validation = "all",
                      annotation = "org.At.tair.db")
  summary(norm2)
}
## Analysis with Only UPGMA using Internal and Stability Validation Measures
hier1 <- optCluster(obj, 2:10, clMethods = "hierarchical")
summary(hier1)
## Analysis with Only UPGMA using All Validation Measures
if(require("Biobase") && require("annotate") && require("GO.db") &&
   require("org.At.tair.db")){
  hier2 <- optCluster(obj, 2:10, clMethods = "hierarchical", validation = "all",
                      annotation = "org.At.tair.db")
  summary(hier2)
}
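The library-size normalization used in the examples, t(t(arabid)/colSums(arabid)), can be checked on a small matrix. The sketch below uses a made-up 3 x 2 count matrix (the object counts and its values are hypothetical, not taken from arabid) and shows that the double transpose divides each column by its own total, which is equivalent to a column-wise sweep().

```r
# Hypothetical counts: rows are genes, columns are samples.
counts <- matrix(c(10, 30, 60,
                    5, 15, 30),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Same idiom as in the examples: transpose so that the recycling of
# colSums() runs along samples, divide, then transpose back.
norm_t <- t(t(counts) / colSums(counts))

# Equivalent formulation that states the intent more explicitly.
norm_sweep <- sweep(counts, 2, colSums(counts), "/")

print(norm_t)
print(colSums(norm_t))  # each sample now sums to 1
```

Dividing counts directly by colSums(counts) without the transposes would recycle the totals down the rows instead of across the columns, which is why the transpose (or sweep) is needed.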