PVS: Pooled variable scaling for cluster analysis

Description

The function computes a scale for each variable in the data. The result can then be used to standardize a dataset before applying a clustering algorithm (such as k-means). The scale estimation is based on pooled scale estimators, which result from clustering the individual variables in the data. The method was proposed in Raymaekers and Zamar (2020) <doi:10.1093/bioinformatics/btaa243>.
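To make the idea of a pooled scale concrete, here is a minimal sketch for a single variable with a fixed number of clusters. It is not the package's implementation: the helper pooled_sd is hypothetical, the number of clusters is fixed at 2 instead of being selected by the gap or jump criterion, and the n - k denominator is the classical pooled-variance convention, which may differ from the one used in the paper.

```r
# Hypothetical helper (not part of the package): pooled standard deviation
# of one variable, given a cluster label for each observation.
pooled_sd <- function(x, labels) {
  # Sum of squared deviations of each observation from its own cluster mean
  ss <- sum(tapply(x, labels, function(v) sum((v - mean(v))^2)))
  # Classical pooled-variance denominator: n minus the number of clusters
  sqrt(ss / (length(x) - length(unique(labels))))
}

set.seed(1)
x  <- iris$Petal.Length
km <- kmeans(x, centers = 2, nstart = 20)  # cluster the single variable

overall <- sd(x)
pooled  <- pooled_sd(x, km$cluster)
c(overall = overall, pooled = pooled)
```

When a variable has clear cluster structure, the pooled estimate is smaller than the overall standard deviation, so dividing by it gives that variable more weight in a subsequent clustering.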

Usage

PVS(X, kmax = 3, dist = "euclidean",
    method = "gap", B = 1000,
    gapMethod = "firstSEmax",
    minSize = 0.05, rDist = runif,
    SE.factor = 1, refDist = NULL)

Value

A vector of length p containing the estimated scales for the variables.

Arguments

X: an n by p data matrix.
kmax: maximum number of clusters in one variable. Default is 3.
dist: "euclidean" for pooled standard deviation and "manhattan" for pooled mean absolute deviation. Default is "euclidean".
method: either "gap" or "jump" to determine the number of clusters. Default is "gap".
B: number of bootstrap samples for the reference distribution of the gap statistic. Default is 1000.
gapMethod: method to define number of clusters in the gap statistic. See cluster::maxSE for more info. Defaults to "firstSEmax".
minSize: minimum cluster size as a fraction of the total number of observations. Defaults to 0.05 (i.e. 5%).
rDist: Optional. Reference distribution (as a function) for the gap statistic. Defaults to runif, the uniform distribution.
SE.factor: factor for determining number of clusters when using the gap statistic. See cluster::maxSE for more details. Defaults to 1.
refDist: Optional. A k by 2 matrix with the mean and standard error of the reference distribution of the gap statistic in its columns. Can be used to avoid bootstrapping when repeatedly applying the function to data of the same size.
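
The gapMethod and SE.factor arguments follow the conventions of cluster::maxSE. The sketch below calls cluster::maxSE directly on made-up gap values to show how the two settings interact. The numbers are hypothetical and are not output of PVS, and how PVS passes its own gap statistics to this rule is not shown here.

```r
library(cluster)

# Hypothetical gap statistics and their standard errors for k = 1, ..., 4
gap <- c(0.10, 0.45, 0.50, 0.48)
se  <- c(0.02, 0.03, 0.04, 0.04)

# "firstSEmax": smallest k whose gap is within SE.factor standard errors
# of the first local maximum of the gap curve
k_se1  <- maxSE(gap, se, method = "firstSEmax", SE.factor = 1)
k_se2  <- maxSE(gap, se, method = "firstSEmax", SE.factor = 2)

# "globalmax": simply the k with the largest gap value
k_glob <- maxSE(gap, se, method = "globalmax")

c(SE.factor_1 = k_se1, SE.factor_2 = k_se2, globalmax = k_glob)
```

With "firstSEmax", a larger SE.factor widens the tolerance band below the first local maximum, so it can only select the same or a smaller number of clusters.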

Author

Jakob Raymaekers

References

Raymaekers, J., Zamar, R.H. (2020). Pooled variable scaling for cluster analysis. Bioinformatics, 36(12), 3849-3855. doi:10.1093/bioinformatics/btaa243

Examples

X <- iris[, -5]
y <- unclass(iris[, 5])

# Compute scales using different scale estimators.
# The pooled standard deviation is considerably smaller for variables 3 and 4:
sds     <- apply(X, 2, sd); round(sds, 2)
ranges  <- apply(X, 2, function(v) diff(range(v))); round(ranges, 2)
psds    <- PVS(X); round(psds, 2)

# Now cluster using k-means after scaling the data

nbclus <- 3
kmeans.std <- kmeans(X, nbclus, nstart = 100) # no scaling
kmeans.sd  <- kmeans(scale(X), nbclus, nstart = 100)
kmeans.rg  <- kmeans(scale(X, scale = ranges), nbclus, nstart = 100)
kmeans.psd <- kmeans(scale(X, scale = psds), nbclus, nstart = 100)

# Calculate the Adjusted Rand Index for each of the clustering outcomes
round(mclust::adjustedRandIndex(y, kmeans.std$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.sd$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.rg$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.psd$cluster), 2)
