salso: Perform Sequentially-Allocated Latent Structure Optimization

Description

This function implements sequentially-allocated latent structure optimization (SALSO) to find a clustering or feature allocation that minimizes a chosen loss function. The SALSO method was presented at the workshop "Bayesian Nonparametric Inference: Dependence Structures and their Applications" in Oaxaca, Mexico on December 6, 2017.

Usage

salso(expectedPairwiseAllocationMatrix, structure = c("clustering",
  "featureAllocation")[1], loss = c("squaredError", "absoluteError", "binder",
  "lowerBoundVariationOfInformation")[1], nCandidates = 100,
  budgetInSeconds = 10, maxSize = 0)

Arguments

expectedPairwiseAllocationMatrix

An n-by-n symmetric matrix whose (i,j) element gives the estimated expected number of times that items i and j are in the same subset (i.e., cluster or feature).
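
As a minimal sketch of what such a matrix contains in the clustering case, the base R code below averages co-clustering indicators over a small set of sampled clusterings. The samples matrix and the helper name pairwiseProportion are hypothetical and exist only for illustration; the package's own expectedPairwiseAllocationMatrix function (see Examples) is the intended way to build this argument.

```r
# Hypothetical input: each row is one sampled clustering of 4 items.
samples <- rbind(c(1, 1, 2, 2),
                 c(1, 1, 1, 2),
                 c(1, 2, 2, 2),
                 c(1, 1, 2, 2))

# Illustrative helper (not part of the package): proportion of samples
# in which items i and j share a cluster.
pairwiseProportion <- function(samples) {
  n <- ncol(samples)
  total <- matrix(0, n, n)
  for (s in seq_len(nrow(samples))) {
    labels <- samples[s, ]
    # outer() marks every pair (i, j) whose labels match in this sample.
    total <- total + outer(labels, labels, "==")
  }
  total / nrow(samples)
}

p <- pairwiseProportion(samples)
print(p)
```

The result is symmetric with ones on the diagonal, since every item always shares a cluster with itself.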

structure

Either "clustering" or "featureAllocation" to indicate whether the optimization seeks to produce a clustering or a feature allocation.

loss

One of "squaredError", "absoluteError", "binder", or "lowerBoundVariationOfInformation" to indicate the optimization should seek to minimize squared error loss, absolute error loss, Binder loss (Binder 1978), or the lower bound of the variation of information loss (Wade & Ghahramani 2017), respectively. When structure="clustering", the first three are equivalent. When structure="featureAllocation", only the first two are valid.
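
The equivalence of squared error and absolute error for clusterings can be checked numerically. The sketch below assumes both losses are sums over pairs i < j of a function of the difference between the co-clustering indicator and the matrix entry; the matrix p and the helpers pairLoss, squaredError, and absoluteError are hypothetical, not the package's internals. Because the indicator is either 0 or 1, the two losses differ by a constant that does not depend on the clustering, so they share a minimizer. Binder loss with equal misclassification costs is understood to reduce to the same pairwise form as the absolute error.

```r
# Hypothetical 3-item pairwise matrix (symmetric, ones on the diagonal).
p <- matrix(c(1.0, 0.8, 0.1,
              0.8, 1.0, 0.3,
              0.1, 0.3, 1.0), nrow = 3, byrow = TRUE)

# Illustrative pairwise loss over pairs i < j (not the package's code).
pairLoss <- function(labels, p, f) {
  a <- outer(labels, labels, "==")   # TRUE where items share a cluster
  upper <- upper.tri(p)
  sum(f(a[upper] - p[upper]))
}
squaredError  <- function(labels, p) pairLoss(labels, p, function(d) d^2)
absoluteError <- function(labels, p) pairLoss(labels, p, abs)

# All five clusterings of three items.
candidates <- list(c(1, 1, 1), c(1, 1, 2), c(1, 2, 1), c(1, 2, 2), c(1, 2, 3))
sq <- sapply(candidates, squaredError, p = p)
ab <- sapply(candidates, absoluteError, p = p)

# The gap is the same for every candidate, so both losses rank the
# candidates identically and pick the same minimizer.
gap <- ab - sq
bestIndex <- which.min(sq)
print(candidates[[bestIndex]])
```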

nCandidates

The (maximum) number of candidates to consider. Fewer than nCandidates may be considered if the time in budgetInSeconds is exceeded. The computational cost is linear in the number of candidates and there are rapidly diminishing returns to more candidates.

budgetInSeconds

The (maximum) number of seconds to devote to the optimization. When this time is exceeded, no more candidates are considered.
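
How nCandidates and budgetInSeconds interact can be pictured with a generic budgeted search loop. This is only a sketch of the stopping rule described above, not the SALSO algorithm or the package's implementation; searchWithBudget, toyCandidates, and the toy loss are hypothetical names introduced for illustration.

```r
# Illustrative only: evaluate candidates until a count or time limit is hit.
searchWithBudget <- function(candidates, lossFn, nCandidates, budgetInSeconds) {
  start <- proc.time()[["elapsed"]]
  best <- NULL
  bestLoss <- Inf
  nConsidered <- 0
  for (candidate in head(candidates, nCandidates)) {
    # Stop taking new candidates once the time budget is spent.
    if (proc.time()[["elapsed"]] - start > budgetInSeconds) break
    nConsidered <- nConsidered + 1
    currentLoss <- lossFn(candidate)
    if (currentLoss < bestLoss) {
      best <- candidate
      bestLoss <- currentLoss
    }
  }
  list(best = best, loss = bestLoss, nConsidered = nConsidered)
}

# Hypothetical toy problem: prefer the label vector with the fewest clusters.
toyCandidates <- list(c(1, 2, 3), c(1, 1, 2), c(1, 1, 1))
result <- searchWithBudget(toyCandidates,
                           lossFn = function(x) length(unique(x)),
                           nCandidates = 2, budgetInSeconds = 10)
print(result$best)
```

In this toy run the candidate limit binds before the time limit, so the third candidate is never evaluated even though it would have a lower loss.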

maxSize

Either zero or a positive integer. If a positive integer, the optimization is constrained to produce solutions whose number of clusters or number of features is no more than the supplied value. If zero, the size is not constrained. To avoid overfitting in feature allocation estimation, it is recommended that "maxSize" be close to the mean number of features (i.e., columns) in the feature allocations that generated the expectedPairwiseAllocationMatrix.
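
The mean number of features mentioned above can be computed directly from the sampled feature allocations. The sketch below assumes each sample is a binary matrix with items in rows and features in columns; z1, z2, and featureAllocations are hypothetical data. It also shows, under the same assumption, how averaging the number of shared features per pair yields a matrix of the kind described for expectedPairwiseAllocationMatrix, although the package function may treat details such as the diagonal differently.

```r
# Hypothetical samples: binary matrices, items in rows, features in columns.
z1 <- matrix(c(1, 0,
               1, 1,
               0, 1), nrow = 3, byrow = TRUE)
z2 <- matrix(c(1, 0, 1,
               1, 0, 0,
               0, 1, 1), nrow = 3, byrow = TRUE)
featureAllocations <- list(z1, z2)

# Mean number of features (columns); a value near this is a candidate maxSize.
meanFeatures <- mean(sapply(featureAllocations, ncol))

# tcrossprod(z) is z %*% t(z): entry (i, j) counts the features shared by
# items i and j. Averaging over samples gives the expected shared count.
sharedCounts <- Reduce(`+`, lapply(featureAllocations, tcrossprod)) /
  length(featureAllocations)

print(meanFeatures)
print(sharedCounts)
```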

Value

A clustering (as a vector of cluster labels) or a feature allocation (as a binary matrix of feature indicators).

References

Wade, S. and Ghahramani, Z. (2017). Bayesian cluster analysis: Point estimation and credible balls. Bayesian Analysis.

Binder, D. (1978). Bayesian cluster analysis. Biometrika, 65: 31–38.

Examples

# NOT RUN {
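# Clustering: build the pairwise matrix from iris.clusterings, then optimize.
probabilities <- expectedPairwiseAllocationMatrix(iris.clusterings)
salso(probabilities)

# Feature allocation: same workflow, with the structure given explicitly.
expectedCounts <- expectedPairwiseAllocationMatrix(USArrests.featureAllocations)
salso(expectedCounts, structure = "featureAllocation")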
# }