boot.kmeans: Bootstrap augmented \(k\)-means algorithm for fuzzy partitions

Description

Repeatedly bootstraps the rows of a data matrix, runs kmeans on each resample (with optional seeding for given centres), tracks per-observation allocations using squared Euclidean distance, and aggregates results into out-of-bag (OOB) fuzzy memberships, hard clusters, and averaged cluster centres. Iterations can stop adaptively using a serial-correlation test on the objective trace.

Usage

boot.kmeans(
  data = NULL,
  groups = NULL,
  iterations = 500,
  nstart = 1,
  export = FALSE,
  display = FALSE,
  pval = 0.05,
  itermax = 10,
  maxsamp = 1000,
  verbose = FALSE,
  returnall = FALSE
)

Value

An object of class "BSKMeans": a list with components

U: \(n \times K\) matrix of OOB fuzzy cluster memberships.
clusters: Integer vector of length \(n\) of hard cluster labels.
centres: \(K \times p\) matrix of averaged centres over the last iterations runs.
p.value: Final Breusch–Godfrey test p-value used for stopping.
iterations: Total number of iterations actually run.
occurences: \(n \times \text{iterations}\) matrix of per-iteration allocations for all observations.
size: Number of clusters \(K\).
soslist: Numeric vector of objective values by iteration.
centrelist: (If returnall = TRUE) list of per-iteration centre matrices; otherwise NULL.
ooblist: (If returnall = TRUE) list of OOB index vectors by iteration; otherwise NULL.
kmlist: (If returnall = TRUE) list of kmeans fit objects by iteration; otherwise NULL.

Arguments

data: Numeric matrix or data frame of row observations and column variables. Required.
groups: Either and integer number of clusters \(K\); or a \(K \times p\) numeric matrix of initial centres. Required.
iterations: Initial number of bootstrap iterations to run before considering stopping (default = 500).
nstart: Passed to kmeans when groups is an integer (number of random starts, default = 1).
export: Logical; if TRUE, saves a JPEG of the objective trace at each iteration (plot<i>.jpg). Defaults to FALSE.
display: Logical; if TRUE, plots the most recent objective values during fitting. Defaults to FALSE.
pval: Significance threshold for adaptive stopping. When the Breusch–Godfrey test p-value on the last iterations objective values is not below pval, the procedure stops.
itermax: Maximum number of iterations per \(k\)-means run (passed to kmeans(iter.max = ...)).
maxsamp: Upper bound on total iterations if adaptive stopping keeps extending (default = 1000).
verbose: Logical; if TRUE, print iteration counter and latest test p-value while running. Defaults to FALSE.
returnall: Logical; if TRUE, return full per-iteration objects (centres, \(k\)-means fits, OOB lists); otherwise a smaller object of final results if return. Defaults to TRUE.

Author

Jesse S. Ghashti jesse.ghashti@ubc.ca and Jeffrey L. Andrews jeff.andrews@ubc.ca

Details

Each iteration draws a bootstrap sample of rows, runs kmeans on the resample (first using either supplied centres or nstart random starts; subsequent iterations use the previous iteration's centres), and computes squared Euclidean distances from every original observation to each current centre using mahalanobis with the identity covariance. Observations are allocated to their nearest centre and these allocations are tracked across iterations.

Out-of-bag (OOB) sets are the observations note included in a given bootstrap sample. For each observation, its OOB allocations across the most recent iterations runs are tallied to produce a fuzzy membership matrix (\(U\)) and a hard label by maximum membership.

Convergence is assessed adaptively: on the trace of summed per-observation minimum squared distances (the \(k\)-means objective) over the most recent iterations runs, a Breusch–Godfrey serial-correlation test (bgtest applied to a regression of the objective on iteration index) is computed. If the p-value is below pval and iterations < maxsamp, one more iteration is added; otherwise the loop terminates. Final centres are the elementwise mean of the centres over the last iterations runs.

References

Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented \(k\)-means algorithm for fuzzy partitions. Submitted.

Breusch, T.S. (1978). Testing for Autocorrelation in Dynamic Linear Models, Australian Economic Papers, 17, 334-355.

Godfrey, L.G. (1978). Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables', Econometrica, 46, 1293-1301.

Examples

Run this code

set.seed(1)

# basic usage
x <- as.matrix(iris[, -5])
fit <- boot.kmeans(data = x, groups = 3, iterations = 50, itermax = 20, verbose = TRUE)
table(fit$clusters, iris$Species)

# basic usage with initial cluster centres supplied
centres.init <- x[sample(nrow(x), 3), ]
fit2 <- boot.kmeans(data = x, groups = centres.init, iterations = 50)

# plot objective trace
plot(fit$soslist, type = "l", xlab = "Iteration", ylab = "Objective Function Value")

Run the code above in your browser using DataLab