optCluster performs statistical and/or biological validation of
clustering results and determines the optimal clustering algorithm and
number of clusters through rank aggregation. The function returns an
object of class "optCluster".
optCluster(obj, nClust, clMethods = "all", countData = FALSE, validation = c("internal", "stability"), hierMethod = "average", annotation = NULL, clVerbose = FALSE, rankMethod = "CE", distance = "Spearman", importance = NULL, rankVerbose = FALSE, ...)
ExpressionSet object. The items to be clustered (e.g. genes)
are the rows and the samples are the columns. In the case of data frames,
all columns must be numeric.
hclust and agnes). Available choices are "average", "complete", "single", and "ward".
clValid or RankAggreg:
Additional clValid arguments:
metric - Metric used to determine the distance matrix in validation measures. Possible choices are:
"euclidean" (default), "correlation", and "manhattan".
neighbSize - Integer giving the neighborhood size used in the "connectivity" validation measure.
GOcategory - For biological validation, a character string specifying which GO category to use. Options include:
"BP", "MF", "CC", or "all" (default).
goTermFreq - For BSI validation, the threshold frequency of GO terms to be used for functional annotation.
dropEvidence - For biological validation, either NULL or a character vector of GO evidence codes to omit.
Additional RankAggreg arguments:
k - Size of the top-k list in aggregation.
convIN - Stopping criterion for the CE and GA algorithms. The algorithm converges once the "best" solution does not
change after convIN iterations. Default: 7 for CE and 30 for GA.
N - Number of samples generated by MCMC in the CE algorithm. Default = 10*k^2
rho - For the CE algorithm, (rho*N) is the quantile of the candidate list sorted by function values.
weight - For the CE algorithm, the learning factor used in the probability update feature. Default = 0.25
popSize - For the GA algorithm, the population size in each generation. Default = 100
CP - For the GA algorithm, the cross-over probability. Default = 0.4
MP - For the GA algorithm, the mutation probability. Default = 0.01
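As a rough arithmetic illustration of the CE-related quantities listed above, the sketch below computes the default N from a chosen k and the value of rho*N. The values of k and rho are assumptions picked only for the example, not defaults stated here.

```r
# Illustrative values only: k and rho are assumptions for this example
k <- 5               # size of the top-k list
N <- 10 * k^2        # default number of MCMC samples, per the list above
rho <- 0.1           # assumed value, used only to show the arithmetic
rho_N <- rho * N     # the (rho*N) quantity referred to in the rho description

cat("N =", N, "; rho * N =", rho_N, "\n")
```

Larger values of k increase N quadratically, so the run time of the CE algorithm can grow quickly with the length of the aggregated list.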
optCluster returns an object of class "optCluster". The class description
is provided in the help file.
clValid function. In addition to the validation
measures and clustering algorithms available in the clValid function, six clustering algorithms
for count data are included in the optCluster function. This function also determines a
unique solution for the optimal clustering algorithm and number of clusters through rank aggregation of
validation measure lists. A brief description of the available clustering algorithms, validation measures,
and rank aggregation algorithms is provided below. For more details, please refer to the references.
clValid:
"agnes", "clara", "diana", "fanny", "hierarchical", "kmeans", "model", "pam", "som", and "sota".
optCluster function by using the argument 'Normalizer'.
importance argument. The default value of equal weights (NULL) is
represented by rep(1, length(x)), where x is the character vector of validation measure names.
To manually change the weights, the order of the validation measures selected needs to be known.
The order of validation measures used in optCluster is provided below:
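Independent of the specific ordering, the relationship between the default weights and a manually specified weight vector can be sketched in base R as follows. The measure names here are hypothetical placeholders, not the names or order used by optCluster.

```r
# Hypothetical measure names, for illustration only; the real names and
# their order are determined by optCluster and the validation types chosen
x <- c("measure1", "measure2", "measure3")

# Equal weights, the representation described for importance = NULL
equal_weights <- rep(1, length(x))

# Manually give the second measure twice the weight of the others
custom_weights <- equal_weights
custom_weights[x == "measure2"] <- 2

# A weight vector must have one entry per validation measure
stopifnot(length(custom_weights) == length(x))

# Names are attached only to make the printed output readable
print(setNames(custom_weights, x))
```

A vector such as custom_weights would then be supplied through the importance argument, with its entries in the same order as the validation measures.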
Brock, G., Pihur, V., Datta, S. and Datta, S. (2008). clValid: An R Package for Cluster Validation. Journal of Statistical Software 25(4), http://www.jstatsoft.org/v25/i04.
Datta, S. and Datta, S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.
Pihur, V., Datta, S. and Datta, S. (2007). Weighted rank aggregation of cluster validation measures: A Monte Carlo cross-entropy approach. Bioinformatics 23(13): 1607-1615.
Pihur, V., Datta, S. and Datta, S. (2009). RankAggreg, an R package for weighted rank aggregation. BMC Bioinformatics, 10:62, http://www.biomedcentral.com/1471-2105/10/62.
Sekula, M. (2015). optCluster: An R Package for Determining the Optimal Clustering Algorithm and Optimal Number of Clusters. Electronic Theses and Dissertations. Paper 2147. http://ir.library.louisville.edu/etd/2147
Si, Y., Liu, P., Li, P. and Brutnell, T. (2014). Model-based clustering for RNA-seq data. Bioinformatics 30(2): 197-205.
For a description of the clValid function, including all available arguments that can be
passed to it, see clValid in the clValid package. For a description of the RankAggreg
function, including all available arguments that can be
passed to it, see RankAggreg in the RankAggreg package.
For details on the clustering algorithm functions for continuous data see
agnes, clara, diana, fanny, and pam in package cluster,
hclust and kmeans in package stats,
som in package kohonen,
Mclust in package mclust,
and sota in package clValid.
For details on the clustering algorithm functions for count data see
Cluster.RNASeq in package MBCluster.Seq.
For details on the validation measure functions see
BHI, BSI, stability, connectivity and dunn in package clValid
and silhouette in package cluster.
## These examples may each take a few minutes to compute
## Obtain Dataset
data(arabid)
## Analysis of Count Data using Internal and Stability Validation Measures
count1 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE)
summary(count1)
## Analysis of Count Data using All Validation Measures
if (require("Biobase") && require("annotate") && require("GO.db") &&
    require("org.At.tair.db")) {
  count2 <- optCluster(arabid, 2:4, clMethods = "all", countData = TRUE,
                       validation = "all", annotation = "org.At.tair.db")
  summary(count2)
}
## Normalize Data with Respect to Library Size
obj <- t(t(arabid)/colSums(arabid))
## Analysis of Normalized Data using Internal and Stability Validation Measures
norm1 <- optCluster(obj, 2:4, clMethods = "all")
summary(norm1)
## Analysis of Normalized Data using All Validation Measures
if (require("Biobase") && require("annotate") && require("GO.db") &&
    require("org.At.tair.db")) {
  norm2 <- optCluster(obj, 2:4, clMethods = "all", validation = "all",
                      annotation = "org.At.tair.db")
  summary(norm2)
}
## Analysis with Only UPGMA using Internal and Stability Validation Measures
hier1 <- optCluster(obj, 2:10, clMethods = "hierarchical")
summary(hier1)
## Analysis with Only UPGMA using All Validation Measures
if (require("Biobase") && require("annotate") && require("GO.db") &&
    require("org.At.tair.db")) {
  hier2 <- optCluster(obj, 2:10, clMethods = "hierarchical", validation = "all",
                      annotation = "org.At.tair.db")
  summary(hier2)
}
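The library-size normalization used in the examples above, t(t(arabid)/colSums(arabid)), divides each column (sample) by its total count. A minimal sketch on a small made-up matrix shows the effect; the toy counts and dimension names are assumptions for illustration and are not taken from the arabid data.

```r
# Toy count matrix (made-up values): rows are genes, columns are samples
counts <- matrix(c(10, 30, 60,
                    5, 15, 30),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("sampleA", "sampleB")))

# Same idiom as in the examples: transpose so samples are rows, divide each
# row by that sample's total, then transpose back
norm <- t(t(counts) / colSums(counts))

print(norm)

# After normalization every sample (column) sums to 1
stopifnot(isTRUE(all.equal(unname(colSums(norm)), c(1, 1))))
```

The double transpose is needed because R recycles the divisor down the rows of a matrix; dividing counts directly by colSums(counts) would apply the sample totals along the wrong dimension.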