nroKmeans: K-means clustering

Description

K-means clustering for multi-dimensional data.

Usage

nroKmeans(data, k = 3, subsample = NULL, balance = 0, metric = "euclid")

Arguments

data

A data frame or a matrix.

k

Number of centroids.

subsample

Number of randomly selected rows used during a single training cycle.

balance

Penalty parameter for size difference between clusters.

metric

Distance metric in data space, either "euclid" or "pearson".

Value

A list with four named elements: centroids is a matrix of the main results, layout contains the best-matching centroid labels and model residuals for each usable data point, history is the chronological record of training errors, and metric is the distance metric that was used.

Details

The K centroids are determined by Lloyd's algorithm with Euclidean distances or by using 1 - Pearson correlation as the distance measure. If subsample is less than the number of data rows, a random subset of the specified size is used for each training cycle.
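The assign-and-update loop described above can be illustrated with a minimal base R sketch. This is not Numero's implementation: the function `lloyd_sketch` and its `cycles` argument are hypothetical names introduced here, and the sketch ignores `subsample` and `balance` entirely. It only shows how the two distance measures fit into a plain Lloyd iteration.

```r
# Hypothetical helper, not part of Numero: a plain Lloyd iteration.
lloyd_sketch <- function(x, k, metric = "euclid", cycles = 20) {
  x <- as.matrix(x)
  # Start from k randomly chosen data rows.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  labels <- integer(nrow(x))
  for (i in seq_len(cycles)) {
    # Distance from every row to every centroid (n x k matrix).
    if (metric == "pearson") {
      # 1 - Pearson correlation between row profiles and centroid profiles.
      d <- 1 - cor(t(x), t(centroids))
    } else {
      # Squared Euclidean distance.
      d <- sapply(seq_len(k), function(j) {
        rowSums(sweep(x, 2, centroids[j, ])^2)
      })
    }
    # Assignment step: each row takes the label of its nearest centroid.
    labels <- max.col(-d, ties.method = "first")
    # Update step: each centroid becomes the mean of its members.
    for (j in seq_len(k)) {
      members <- labels == j
      if (any(members)) {
        centroids[j, ] <- colMeans(x[members, , drop = FALSE])
      }
    }
  }
  list(centroids = centroids, labels = labels)
}

# Usage on synthetic data with two well-separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit <- lloyd_sketch(x, k = 2)
print(table(fit$labels))
```

A fixed number of cycles keeps the sketch short; a real implementation would typically stop once the training error no longer improves.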

If balance = 0.0, the algorithm is applied with no balancing; if balance = 1.0, all clusters are forced to be of equal size. Intermediate values are permitted.

References

Lloyd SP (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.

Examples

# NOT RUN {
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Prepare training data.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[,trvars]) 

# Unbalanced K-means clustering.
km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0)
print(table(km0$layout$BMC))
print(km0$centroids)

# Balanced K-means clustering.
km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0)
print(table(km1$layout$BMC))
print(km1$centroids)
# }
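The remaining documented arguments can be exercised in the same way. The following is a sketch only: the values `balance = 0.5` and `subsample = 100` are arbitrary choices for illustration, and it assumes the example dataset has more than 100 usable rows so that subsampling actually takes effect.

```r
library(Numero)

# Same data preparation as in the example above.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[, trvars])

# Partially balanced clustering (intermediate penalty, arbitrary value).
km2 <- nroKmeans(data = trdata, k = 5, balance = 0.5)
print(table(km2$layout$BMC))

# Correlation-based clustering on a random subset per training cycle.
# subsample = 100 is an assumed value, not a recommendation.
km3 <- nroKmeans(data = trdata, k = 5, subsample = 100, metric = "pearson")
print(km3$metric)
print(km3$history)
```

Comparing the cluster size tables for balance = 0.0, 0.5 and 1.0 is a quick way to see how strongly the penalty evens out the clusters.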