COMMUNAL: Run clustering algorithms and evaluate validation metrics.

Description

This function runs various (user-specified) clustering algorithms on the data, for each potential number of clusters k. It then runs internal validation measures that quantify the fit of each clustering. The returned object is of class "COMMUNAL", and can be used to identify 'core' clusters in the data. Currently supported clustering algorithms are those in packages "clValid", "NMF", and "ConsensusClusterPlus".

To determine the optimal number of clusters, use the clusterRange and plotRange3D functions.

Usage

COMMUNAL(data, ks = 2:10, clus.methods = c("hierarchical", "kmeans"),
         validation = c("Connectivity", "dunn", "wb.ratio", "g3", "g2",
                        "pearsongamma", "avg.silwidth", "sindex"), 
         dist.metric = "euclidean", aggl.method = "average", 
         neighb.size = 10, seed = NULL, ...)

Arguments

data

The data to cluster (numeric matrix or data frame). Columns are the samples to be clustered; rows are features. If the "nmf" clustering method is used, all entries must be non-negative.
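The orientation and non-negativity requirements above can be checked before calling COMMUNAL. The following is a minimal base-R sketch, not part of the package: the object names (samples_by_features, features_by_samples, nmf_ready, shifted) are hypothetical, and shifting each feature so its minimum is zero is only one assumed way to make data non-negative.

```r
# Hypothetical input in the "wrong" orientation: samples in rows, features in columns.
set.seed(1)
samples_by_features <- data.frame(
  GeneA = rnorm(6, mean = 5),
  GeneB = rnorm(6, mean = -1)
)
rownames(samples_by_features) <- paste0("Sample", 1:6)

# Transpose so that columns are samples (the items clustered) and rows are features.
features_by_samples <- t(as.matrix(samples_by_features))

# Check whether the matrix satisfies the non-negativity requirement of "nmf".
nmf_ready <- is.numeric(features_by_samples) && all(features_by_samples >= 0)

# One possible fix (an assumption, not a COMMUNAL feature): subtract each
# feature's (row's) minimum so that every entry is >= 0.
shifted <- features_by_samples - apply(features_by_samples, 1, min)
```

Whether such a shift is appropriate depends on the data; it changes the scale that NMF sees, so it should be treated as an illustration only.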

ks

A numeric vector of integers greater than 1, for the number of clusters to consider. For example, 2:4 tells the function to try clusterings with 2, 3, and 4 clusters.

clus.methods

Character vector of which clustering methods to use. Valid options: "hierarchical", "kmeans", "diana", "fanny", "som", "model", "sota", "pam", "c

validation

A character vector of the validation measures to consider. Valid options: "Connectivity", "average.between", "g2", "ch", "sindex","avg.silwidth", "average.within",

dist.metric

Which metric to use when calculating the distance matrix. Used by clValid clustering algorithms, and in calculating validation measures. Available choices are "euclidean", "correlation", "manhattan".
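To make the three choices concrete, here is a small base-R sketch of sample-to-sample distances on a toy matrix. It does not call COMMUNAL, and the definition of the correlation distance as 1 minus the Pearson correlation is an assumption made for illustration; the package's internal computation may differ.

```r
# Toy matrix: 3 features (rows) x 4 samples (columns), filled column by column.
x <- matrix(c(1, 2, 3,
              2, 4, 6,
              3, 2, 1,
              1, 2, 4),
            nrow = 3,
            dimnames = list(paste0("Gene", 1:3), paste0("Sample", 1:4)))

# dist() measures distances between rows, so transpose to compare samples.
d_euclidean <- dist(t(x), method = "euclidean")
d_manhattan <- dist(t(x), method = "manhattan")

# Assumed definition of correlation distance: 1 - Pearson correlation
# between columns (cor() correlates the columns of a matrix).
d_correlation <- as.dist(1 - cor(x))
```

In this toy example Sample1 and Sample2 are proportional, so their correlation distance is 0 even though their Euclidean distance is not, which is the main practical difference between the metrics.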

aggl.method

The agglomeration method to use for "hclust" and "agnes" (if specified in clus.methods). Available choices are "ward", "single", "complete", "average".
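As an illustration of what the agglomeration method controls, the sketch below runs base R's hclust directly with two linkage choices and cuts each tree into two clusters. It is not how COMMUNAL is called, and the data and object names are hypothetical.

```r
set.seed(1)
# Two well-separated groups of 5 samples each; 2 features (rows).
x <- cbind(matrix(rnorm(10, mean = 0), nrow = 2),
           matrix(rnorm(10, mean = 20), nrow = 2))
colnames(x) <- paste0("Sample", 1:10)

# Sample-to-sample Euclidean distances (dist() works on rows, hence t()).
d <- dist(t(x))

# Same distances, two different agglomeration (linkage) methods.
tree_average  <- hclust(d, method = "average")
tree_complete <- hclust(d, method = "complete")

# Cut each tree into k = 2 clusters.
labels_average  <- cutree(tree_average, k = 2)
labels_complete <- cutree(tree_complete, k = 2)
```

On cleanly separated data like this, the linkage choice does not change the partition; on noisier data the methods can disagree, which is one reason to compare several settings.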

neighb.size

Numeric value. The neighborhood size used for calculating the Connectivity validation measure.

seed

Numeric value. Random seed to use in ConsensusClusterPlus and NMF.

...

Other arguments to pass down to ConsensusClusterPlus, NMF, and clValid.

Value

Returns an object of class COMMUNAL. The class has a getClustering method to extract a data frame of cluster assignments. Alternatively, the functions clusterKeys and returnCore are provided to identify core clusters. See the examples below.

Examples

## create artificial data set with 3 distinct clusters
set.seed(1)
V1 = c(abs(rnorm(100, 2)), abs(rnorm(100, 50)), abs(rnorm(100, 140)))
V2 = c(abs(rnorm(100, 2, 8)), abs(rnorm(100, 55, 4)), abs(rnorm(100, 105, 1)))
data <- t(data.frame(V1, V2))
colnames(data) <- paste("Sample", 1:ncol(data), sep="")
rownames(data) <- paste("Gene", 1:nrow(data), sep="")

## run COMMUNAL
result <- COMMUNAL(data=data, ks=seq(2,5))  # result is a COMMUNAL object
k <- 3                                # suppose optimal cluster number is 3
clusters <- result$getClustering(k)   # method to extract clusters
mat.key <- clusterKeys(clusters, k=k) # get core clusters
examineCounts(mat.key)                # help decide agreement.thresh
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters (all algs agree)
table(core) # the 'core' cluster sizes
## Note: could try a different value for k to
##  see clusters with sub-optimal k

## Can specify clustering methods and validation measures
result <- COMMUNAL(data = data, ks=c(2,3),
                      clus.methods = c("diana", "som", "pam", "kmeans", "ccp-hc", "nmf"),
                      validation=c('pearsongamma', 'avg.silwidth'))
clusters <- result$getClustering(k=3)
mat.key <- clusterKeys(clusters, k=3)
examineCounts(mat.key)
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters
table(core) # the 'core' clusters

## Additional arguments are passed down to clValid, NMF, ConsensusClusterPlus
result <- COMMUNAL(data=data, ks=2:5,
                      clus.methods=c("diana", "ccp-hc", "nmf"), reps=20, nruns=2)
