Learn R Programming

bootkmeans (version 1.0.0)

boot.kmeans: Bootstrap augmented \(k\)-means algorithm for fuzzy partitions

Description

Repeatedly bootstraps the rows of a data matrix, runs kmeans on each resample (with optional seeding for given centres), tracks per-observation allocations using squared Euclidean distance, and aggregates results into out-of-bag (OOB) fuzzy memberships, hard clusters, and averaged cluster centres. Iterations can stop adaptively using a serial-correlation test on the objective trace.

Usage

boot.kmeans(
  data = NULL,
  groups = NULL,
  iterations = 500,
  nstart = 1,
  export = FALSE,
  display = FALSE,
  pval = 0.05,
  itermax = 10,
  maxsamp = 1000,
  verbose = FALSE,
  returnall = FALSE
)

Value

An object of class "BSKMeans": a list with components

U

\(n \times K\) matrix of OOB fuzzy cluster memberships.

clusters

Integer vector of length \(n\) of hard cluster labels.

centres

\(K \times p\) matrix of averaged centres over the last iterations runs.

p.value

Final Breusch–Godfrey test p-value used for stopping.

iterations

Total number of iterations actually run.

occurences

\(n \times \text{iterations}\) matrix of per-iteration allocations for all observations.

size

Number of clusters \(K\).

soslist

Numeric vector of objective values by iteration.

centrelist

(If returnall = TRUE) list of per-iteration centre matrices; otherwise NULL.

ooblist

(If returnall = TRUE) list of OOB index vectors by iteration; otherwise NULL.

kmlist

(If returnall = TRUE) list of kmeans fit objects by iteration; otherwise NULL.

Arguments

data

Numeric matrix or data frame of row observations and column variables. Required.

groups

Either and integer number of clusters \(K\); or a \(K \times p\) numeric matrix of initial centres. Required.

iterations

Initial number of bootstrap iterations to run before considering stopping (default = 500).

nstart

Passed to kmeans when groups is an integer (number of random starts, default = 1).

export

Logical; if TRUE, saves a JPEG of the objective trace at each iteration (plot<i>.jpg). Defaults to FALSE.

display

Logical; if TRUE, plots the most recent objective values during fitting. Defaults to FALSE.

pval

Significance threshold for adaptive stopping. When the Breusch–Godfrey test p-value on the last iterations objective values is not below pval, the procedure stops.

itermax

Maximum number of iterations per \(k\)-means run (passed to kmeans(iter.max = ...)).

maxsamp

Upper bound on total iterations if adaptive stopping keeps extending (default = 1000).

verbose

Logical; if TRUE, print iteration counter and latest test p-value while running. Defaults to FALSE.

returnall

Logical; if TRUE, return full per-iteration objects (centres, \(k\)-means fits, OOB lists); otherwise a smaller object of final results if return. Defaults to TRUE.

Author

Jesse S. Ghashti jesse.ghashti@ubc.ca and Jeffrey L. Andrews jeff.andrews@ubc.ca

Details

Each iteration draws a bootstrap sample of rows, runs kmeans on the resample (first using either supplied centres or nstart random starts; subsequent iterations use the previous iteration's centres), and computes squared Euclidean distances from every original observation to each current centre using mahalanobis with the identity covariance. Observations are allocated to their nearest centre and these allocations are tracked across iterations.

Out-of-bag (OOB) sets are the observations note included in a given bootstrap sample. For each observation, its OOB allocations across the most recent iterations runs are tallied to produce a fuzzy membership matrix (\(U\)) and a hard label by maximum membership.

Convergence is assessed adaptively: on the trace of summed per-observation minimum squared distances (the \(k\)-means objective) over the most recent iterations runs, a Breusch–Godfrey serial-correlation test (bgtest applied to a regression of the objective on iteration index) is computed. If the p-value is below pval and iterations < maxsamp, one more iteration is added; otherwise the loop terminates. Final centres are the elementwise mean of the centres over the last iterations runs.

References

Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented \(k\)-means algorithm for fuzzy partitions. Submitted.

Breusch, T.S. (1978). Testing for Autocorrelation in Dynamic Linear Models, Australian Economic Papers, 17, 334-355.

Godfrey, L.G. (1978). Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables', Econometrica, 46, 1293-1301.

See Also

compare.clusters, compare.tables, bootk.hardsoftvis, kmeans, bgtest

Examples

Run this code
set.seed(1)

# basic usage
x <- as.matrix(iris[, -5])
fit <- boot.kmeans(data = x, groups = 3, iterations = 50, itermax = 20, verbose = TRUE)
table(fit$clusters, iris$Species)

# basic usage with initial cluster centres supplied
centres.init <- x[sample(nrow(x), 3), ]
fit2 <- boot.kmeans(data = x, groups = centres.init, iterations = 50)

# plot objective trace
plot(fit$soslist, type = "l", xlab = "Iteration", ylab = "Objective Function Value")

Run the code above in your browser using DataLab