Runs consensus clustering across subsamples, algorithms, and number of clusters (k).
Usage

dice(data, nk, reps = 10, algorithms = NULL, k.method = NULL,
  nmf.method = c("brunet", "lee"), distance = "euclidean",
  cons.funs = c("kmodes", "majority", "CSPA", "LCE"),
  sim.mat = c("cts", "srs", "asrs"),
  prep.data = c("none", "full", "sampled"), min.var = 1,
  seed = 1, trim = FALSE, reweigh = FALSE, n = 5, evaluate = TRUE,
  plot = FALSE, ref.cl = NULL, progress = TRUE)

Arguments

data: data matrix with rows as samples and columns as variables
nk: number of clusters (k) requested; either a single integer or a range of integers to compute results for multiple k
reps: number of subsamples
algorithms: vector of clustering algorithms for performing consensus clustering. Must be any number of the following: "nmf", "hc", "diana", "km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A custom clustering algorithm can be used.
k.method: determines the method used to choose k when no reference class is given. When ref.cl is not NULL, k is the number of distinct classes in ref.cl; otherwise k.method chooses k. The default is to use the PAC to choose the best k(s). Specifying an integer as a user-desired k will override the best k chosen by PAC. Finally, specifying "all" will produce consensus results for all k. The "all" method is implicitly performed when only one k is requested.
nmf.method: NMF-based algorithms to run. By default the "brunet" and "lee" algorithms are called. See nmf for details.
distance: a vector of distance functions. Defaults to "euclidean". Other options are given in dist. A custom distance function can be used.
cons.funs: consensus functions to use. Current options are "kmodes" (k-modes), "majority" (majority voting), "CSPA" (Cluster-based Similarity Partitioning Algorithm), and "LCE" (linkage clustering ensemble).
sim.mat: similarity matrix to use; choices are "cts", "srs", and "asrs".
prep.data: prepare the data on the "full" dataset, the "sampled" datasets, or "none" (default).
min.var: minimum variability measure threshold used to filter the feature space for only highly variable features. Only features with a variability measure across all samples greater than min.var will be used. If type = "conventional", the standard deviation is the measure used; if type = "robust", the MAD is used.
seed: random seed to make the knn imputation reproducible
trim: logical; if TRUE, algorithms that score low on internal indices will be trimmed out
reweigh: logical; if TRUE, after trimming out poorly performing algorithms, each remaining algorithm is reweighted according to its internal indices.
n: an integer specifying the top n algorithms to keep after trimming off the poorly performing ones using rank aggregation. If the total number of algorithms is less than n, no trimming is done.
evaluate: logical; if TRUE (default), validity indices are returned. Internal validity indices are always computed. If ref.cl is not NULL, external validity indices will also be computed.
plot: logical; if TRUE, graph_all is called and a summary evaluation heatmap of ranked algorithms vs. internal validity indices is also plotted.
ref.cl: reference class labels; if supplied, sets k and enables external validity indices
progress: logical; should a progress bar be displayed?
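A minimal call sketch tying several of these arguments together (an illustration, not taken from the package documentation; it assumes the diceR package and its bundled hgsc dataset, as used in the Examples below):

```r
library(diceR)

data(hgsc)                 # bundled example dataset
dat <- hgsc[1:60, 1:30]    # small subset to keep the run fast

# Several k and two base algorithms; with the default k.method = NULL
# and no ref.cl, PAC picks the best k from nk = 2:4
obj <- dice(dat, nk = 2:4, reps = 5,
            algorithms = c("hc", "km"),
            cons.funs = "kmodes",
            progress = FALSE)
```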
Value

A list with the following elements:
E: raw clustering ensemble object
Eknn: clustering ensemble object with knn imputation used on E
Ecomp: flattened ensemble object with remaining missing entries imputed by majority voting
clusters: final clustering assignment from the diverse clustering ensemble method
indices: if evaluate = TRUE, shows cluster evaluation indices; otherwise NULL
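As a usage sketch (the element names clusters and indices are assumptions here, inferred from the descriptions above), the final assignments and the evaluation table can be read off the returned list directly:

```r
# dice.obj as produced by a dice() call such as the one in the Examples
cl <- dice.obj$clusters   # final cluster assignment(s)
ev <- dice.obj$indices    # evaluation indices, or NULL when evaluate = FALSE
```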
Details

There are three ways to handle the input data before clustering, controlled by the argument prep.data. The default is to use the raw data as-is ("none"). Alternatively, prepare_data can be applied to the full dataset ("full") or to each bootstrap-sampled dataset ("sampled").
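For instance (a sketch under the same assumptions as the Examples: diceR installed with its hgsc dataset), the three modes differ only in the prep.data argument:

```r
library(diceR)

data(hgsc)
dat <- hgsc[1:60, 1:30]

# Fit once per data-preparation mode; everything else held fixed
fits <- lapply(c("none", "full", "sampled"), function(p)
  dice(dat, nk = 3, reps = 5, algorithms = "hc",
       cons.funs = "kmodes", prep.data = p, progress = FALSE))
```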
Examples

# NOT RUN {
library(diceR)
library(dplyr)
data(hgsc)
dat <- hgsc[1:100, 1:50]
ref.cl <- strsplit(rownames(dat), "_") %>%
  purrr::map_chr(2) %>%
  factor() %>%
  as.integer()
dice.obj <- dice(dat, nk = 4, reps = 5, algorithms = "hc",
                 cons.funs = "kmodes", ref.cl = ref.cl, progress = FALSE)
str(dice.obj, max.level = 2)
# }