COMMUNAL: Run clustering algorithms and evaluate validation metrics.

Description

This function runs various (user-specified) clustering algorithms on the data, for each potential number of clusters k. It then runs internal validation measures to quantify the fit of each clustering. The returned object is of class "COMMUNAL", and can be used to identify 'core' clusters in the data. Currently supported clustering algorithms are those in packages "clValid", "NMF", and "ConsensusClusterPlus".

The COMMUNAL algorithm is designed to be run through clusterRange rather than through a direct call to COMMUNAL(), although a direct call may still be useful to some researchers. After running clusterRange, use getGoodAlgs and getNonCorrNonMonoMeasures to select locally optimized clustering algorithms and validity measures.

To determine the optimal number of clusters, use the plotRange3D function.

Usage

COMMUNAL(data, ks,
         clus.methods = c("hierarchical", "kmeans", "diana", "som", "sota",
                          "pam", "clara", "agnes"),
         validation = c("Connectivity", "dunn", "wb.ratio", "g3", "g2",
                        "pearsongamma", "avg.silwidth", "sindex"),
         dist.metric = "euclidean", aggl.method = "ward",
         neighb.size = 10, seed = NULL, parallel = F, gapBoot = 20,
         verbose = F, mc.cores = NULL, ...)

Arguments

data

The data to cluster (numeric matrix or data frame). Columns (samples) are clustered; rows are features. If using the clustering method "nmf", all entries must be non-negative.
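
As a minimal sketch of the expected orientation (the object and gene names below are illustrative and not part of the package), a table with samples in rows can be transposed so that samples become columns, and the non-negativity requirement for "nmf" can be checked before clustering:

```r
# Hypothetical example data: 6 samples (rows) measured on 3 features (columns).
set.seed(1)
samples_in_rows <- data.frame(
  geneA = abs(rnorm(6, mean = 5)),
  geneB = abs(rnorm(6, mean = 10)),
  geneC = abs(rnorm(6, mean = 20))
)
rownames(samples_in_rows) <- paste0("Sample", 1:6)

# COMMUNAL clusters the columns, so transpose to features x samples.
data <- t(as.matrix(samples_in_rows))

# Sanity checks before clustering.
stopifnot(is.numeric(data))
nmf_ok <- all(data >= 0)   # only required if "nmf" is in clus.methods
dim(data)                  # features first, samples second
```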

ks

A numeric vector of integers greater than 1, for the number of clusters to consider. For example, 2:4 tells the function to try clusterings with 2, 3, and 4 clusters.

clus.methods

Character vector of which clustering methods to use. Valid options: "hierarchical", "kmeans", "diana", "fanny", "som", "model", "sota", "pam", "clara", "agnes", "ccp-hc", "ccp-km", "ccp-pam", "nmf". In this list, "nmf" corresponds to "nmf" in package NMF, "ccp-xx" corresponds to "xx" in package ConsensusClusterPlus, and the rest match the method of the same name in package clValid.

validation

A character vector of the validation measures to consider. Valid options: "Connectivity", "average.between", "g2", "ch", "sindex", "avg.silwidth", "average.within", "dunn", "widestgap", "wb.ratio", "entropy", "dunn2", "pearsongamma", "g3", "within.cluster.ss", "min.separation", "max.diameter", "gapStatistic". With the exception of "Connectivity", which is calculated by clValid::connectivity, and "gapStatistic", which is implemented by COMMUNAL based on cluster::clusGap(), these are calculated with fpc::cluster.stats.
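
To give a feel for what an internal validation measure does, the sketch below computes the average silhouette width (the quantity behind "avg.silwidth") for several values of k. It uses cluster::silhouette and stats::kmeans directly rather than COMMUNAL or fpc::cluster.stats, and the simulated data are an assumption made for illustration only:

```r
library(cluster)  # recommended package shipped with most R installations

set.seed(42)
# Features in rows, samples in columns (COMMUNAL's convention):
# 2 features, 90 samples drawn around three well-separated centers.
group_means <- c(0, 10, 20)
data <- rbind(
  f1 = unlist(lapply(group_means, function(m) rnorm(30, mean = m))),
  f2 = unlist(lapply(group_means, function(m) rnorm(30, mean = m)))
)

d <- dist(t(data))  # Euclidean distances between samples
ks <- 2:4

# Average silhouette width for each candidate number of clusters.
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(t(data), centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]  # k with the highest average silhouette
```

Higher silhouette values indicate tighter, better separated clusters, so a single measure like this gives one vote for k; COMMUNAL combines several such measures across algorithms.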

dist.metric

Which metric to use when calculating the distance matrix. Used by clValid clustering algorithms, and in calculating validation measures. Available choices are "euclidean", "correlation", "manhattan".
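
The three choices can be sketched in base R as follows. Because samples are columns, distances are taken between columns. Defining the correlation distance as one minus the Pearson correlation is an assumption here; the source does not state the exact formula used internally:

```r
set.seed(1)
data <- matrix(rnorm(40), nrow = 8,
               dimnames = list(paste0("Gene", 1:8), paste0("Sample", 1:5)))

# dist() works on rows, so transpose to get sample-to-sample distances.
d_euclidean <- dist(t(data), method = "euclidean")
d_manhattan <- dist(t(data), method = "manhattan")

# Assumption: correlation distance = 1 - Pearson correlation between samples.
d_correlation <- as.dist(1 - cor(data))
```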

aggl.method

The agglomeration method to use for "hclust" and "agnes" (if specified in clus.methods). Available choices are "ward", "ward.D", "ward.D2", "single", "complete", "average". The ward methods have not been implemented in clValid as of this writing.
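
The effect of the agglomeration method can be explored with stats::hclust on its own. This is a sketch on simulated data, not a call into COMMUNAL; note that recent versions of hclust treat "ward" as an alias of "ward.D" and print a message saying so, which is why it is left out of the loop below:

```r
set.seed(1)
# Two well-separated groups of 10 samples each; samples in columns.
data <- cbind(matrix(rnorm(50, mean = 0), nrow = 5),
              matrix(rnorm(50, mean = 8), nrow = 5))
colnames(data) <- paste0("Sample", 1:20)

d <- dist(t(data))
methods <- c("ward.D", "ward.D2", "single", "complete", "average")

# One column of cluster labels per agglomeration method, cut at k = 2.
assignments <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 2))
```

On data this cleanly separated every method recovers the same two groups; differences between methods show up on noisier or elongated clusters.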

neighb.size

Numeric value. The neighborhood size used for calculating the Connectivity validation measure.
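
To show how the neighborhood size enters the calculation, here is a base R sketch of the usual connectivity definition: each sample is penalized by 1/j whenever its j-th nearest neighbor is in a different cluster. The function name connectivity_sketch is hypothetical, and this is not the clValid::connectivity implementation:

```r
# Hypothetical helper illustrating the connectivity measure (lower is better).
connectivity_sketch <- function(d, clusters, neighb.size = 10) {
  dm <- as.matrix(d)
  n <- nrow(dm)
  L <- min(neighb.size, n - 1)      # cannot have more neighbors than n - 1
  penalties <- 1 / seq_len(L)
  total <- 0
  for (i in seq_len(n)) {
    nn <- order(dm[i, ])            # indices sorted by distance to sample i
    nn <- nn[nn != i][seq_len(L)]   # drop the sample itself, keep L nearest
    total <- total + sum(penalties[clusters[nn] != clusters[i]])
  }
  total
}

# Two groups of 10 points on a line, far apart from each other.
pos <- c(1:10, 101:110)
d <- dist(pos)
good <- rep(1:2, each = 10)    # labels that match the two groups
bad  <- rep(1:2, times = 10)   # alternating labels that ignore the groups

conn_good <- connectivity_sketch(d, good, neighb.size = 5)
conn_wide <- connectivity_sketch(d, good, neighb.size = 10)
conn_bad  <- connectivity_sketch(d, bad,  neighb.size = 5)
```

With neighb.size = 5 the correct labels incur no penalty, while neighb.size = 10 exceeds the 9 same-group neighbors each point has, so even the correct labels pick up a small penalty. This is worth keeping in mind when expected clusters are small relative to neighb.size.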

seed

Numeric value. Random seed to use in ConsensusClusterPlus and NMF.

parallel

Logical. If TRUE, the gap statistic bootstraps are computed in parallel. Not supported on Windows machines.

gapBoot

The number of gap statistic bootstraps to perform. This recursively calls COMMUNAL for each bootstrap, though the other validation measures do not have to be calculated for each call.
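
Since the description above says the gap statistic is based on cluster::clusGap(), a direct call to that function may help clarify what the bootstrap count controls. In this sketch the argument B of clusGap plays the role of gapBoot; the simulated data and the choice of kmeans are assumptions for illustration:

```r
library(cluster)

set.seed(123)
# Samples in rows here, because clusGap() clusters rows.
x <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
           matrix(rnorm(60, mean = 6),  ncol = 2),
           matrix(rnorm(60, mean = 12), ncol = 2))

# B is the number of bootstrap reference data sets (compare gapBoot = 20).
gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 20,
               verbose = FALSE, nstart = 10)
gap$Tab   # columns: logW, E.logW, gap, SE.sim

# One common rule for choosing k from the gap curve and its standard errors.
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```

More bootstraps give a smoother estimate of the reference distribution at the cost of run time, which is the trade-off gapBoot exposes.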

verbose

Logical. If TRUE, prints progress output, mostly regarding the clustering algorithms and the gap statistic.

mc.cores

Number of cores to use for parallel computation. If NULL, uses detectCores(). Ignored if parallel=F.
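
The argument name suggests fork-based parallelism in the style of parallel::mclapply, which would also explain the Windows limitation, but that is an assumption and not something the source confirms. Under that assumption, the core-count fallback could look roughly like this, with one_bootstrap as a hypothetical stand-in for a single gap statistic bootstrap:

```r
library(parallel)

# Assumption: this mirrors the documented behavior, not COMMUNAL's internals.
mc.cores <- NULL
if (is.null(mc.cores)) mc.cores <- detectCores()
if (is.na(mc.cores)) mc.cores <- 1L                 # detectCores() can return NA
if (.Platform$OS.type == "windows") mc.cores <- 1L  # mclapply cannot fork on Windows

# Hypothetical stand-in for one bootstrap: a cheap, reproducible statistic.
one_bootstrap <- function(b) {
  set.seed(b)
  mean(rnorm(100))
}

# Distribute 20 bootstraps across the available cores.
boot_stats <- unlist(mclapply(1:20, one_bootstrap, mc.cores = mc.cores))
```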

...

Other arguments to pass down to ConsensusClusterPlus, NMF, and clValid.

Value

Returns an object of class COMMUNAL. The class has a getClustering method that extracts a data frame of cluster assignments. Alternatively, the functions clusterKeys and returnCore identify core clusters. See the examples below.

Examples

## Not run: 
## create artificial data set with 3 distinct clusters
set.seed(1)
V1 <- c(abs(rnorm(100, 2)), abs(rnorm(100, 50)), abs(rnorm(100, 140)))
V2 <- c(abs(rnorm(100, 2, 8)), abs(rnorm(100, 55, 4)), abs(rnorm(100, 105, 1)))
data <- t(data.frame(V1, V2))
colnames(data) <- paste("Sample", 1:ncol(data), sep="")
rownames(data) <- paste("Gene", 1:nrow(data), sep="")

## run COMMUNAL
result <- COMMUNAL(data=data, ks=seq(2,5))  # result is a COMMUNAL object
k <- 3                                # suppose optimal cluster number is 3
clusters <- result$getClustering(k)   # method to extract clusters
mat.key <- clusterKeys(clusters)      # get core clusters
examineCounts(mat.key)                # help decide agreement.thresh
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters (all algs agree)
table(core) # the 'core' cluster sizes
## Note: could try a different value for k to
##  see clusters with sub-optimal k

## Can specify clustering methods and validation measures
result <- COMMUNAL(data = data, ks=c(2,3),
                   clus.methods = c("diana", "som", "pam", "kmeans", "ccp-hc", "nmf"),
                   validation=c('pearsongamma', 'avg.silwidth'))
clusters <- result$getClustering(k=3)
mat.key <- clusterKeys(clusters)
examineCounts(mat.key)
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters
table(core) # the 'core' clusters

## Additional arguments are passed down to clValid, NMF, ConsensusClusterPlus
result <- COMMUNAL(data=data, ks=2:5,
                   clus.methods=c("diana", "ccp-hc", "nmf"), reps=20, nruns=2)
## End(Not run)
