nroKmeans: K-means clustering

Description

K-means clustering for multi-dimensional data.

Usage

nroKmeans(data, k = 3, subsample = NULL, balance = 0,
          metric = "euclid", message = NULL)

Arguments

data

A data frame or a matrix.

k

Number of centroids.

subsample

Number of randomly selected rows used during a single training cycle.

balance

Penalty parameter for size difference between clusters.

metric

Distance metric in data space, either "euclid" or "pearson".

message

If positive, progress information is printed at the specified interval in seconds.

Value

A list with the following named elements: centroids is a matrix of the cluster centroids (the main results), layout contains the best-matching centroid label and the model residual for each usable data point, history is the chronological record of training errors, metric is the distance metric that was used, and subsample is the subsampling parameter that was used during training.

Details

The K centroids are determined by Lloyd's algorithm with Euclidean distances or by using 1 - Pearson correlation as the distance measure.
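The two-step iteration can be illustrated with a minimal base-R sketch. This is not Numero's implementation: dist_to_centroids, lloyd_sketch and max_iter are hypothetical names introduced here, the centroids are initialised from random rows, and the centroid update uses plain column means for both metrics, which is a simplification for the Pearson case. Rows with zero variance would produce NA correlations and are not handled.

```r
# Hypothetical helper (not part of Numero): distance from every row of x
# to every row of centers, returned as an n-by-k matrix.
dist_to_centroids <- function(x, centers, metric = c("euclid", "pearson")) {
  metric <- match.arg(metric)
  if (metric == "euclid") {
    # Euclidean distance between each row of x and each centroid.
    d <- sapply(seq_len(nrow(centers)), function(j) {
      sqrt(rowSums(sweep(x, 2, centers[j, ])^2))
    })
  } else {
    # 1 - Pearson correlation between row profiles.
    d <- 1 - cor(t(x), t(centers))
  }
  matrix(d, nrow = nrow(x), ncol = nrow(centers))
}

# Hypothetical sketch of Lloyd's algorithm: alternate between assigning
# rows to the nearest centroid and moving centroids to the member means.
lloyd_sketch <- function(x, k, metric = "euclid", max_iter = 100) {
  x <- as.matrix(x)
  # Assumption: initialise centroids from k distinct random rows.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  labels <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: label each row with its nearest centroid.
    d <- dist_to_centroids(x, centers, metric)
    new_labels <- max.col(-d, ties.method = "first")
    if (all(new_labels == labels)) break  # converged
    labels <- new_labels
    # Update step: move each centroid to the mean of its members.
    for (j in seq_len(k)) {
      members <- x[labels == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(centroids = centers, labels = labels)
}

# Usage on two synthetic groups of 50 points each.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
fit <- lloyd_sketch(x, k = 2)
print(table(fit$labels))
```

The sketch omits the balancing and subsampling options of nroKmeans.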

If subsample is less than the number of data rows, a random subset of the specified size is used for each training cycle. By default, subsample is set automatically depending on the size of the dataset.
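As a rough illustration of per-cycle subsampling, the base-R sketch below draws a fresh random subset of row indices for each training cycle. The helper cycle_rows is hypothetical and not part of Numero. It assumes sampling without replacement, and because the automatic default rule is not documented here, it simply uses all rows when subsample is NULL.

```r
# Hypothetical helper (not part of Numero): pick the rows used in one
# training cycle.
cycle_rows <- function(n, subsample = NULL) {
  # Assumption: NULL or a value that covers the data means "use every row".
  if (is.null(subsample) || subsample >= n) {
    return(seq_len(n))
  }
  # Random subset of the requested size, without replacement.
  sample.int(n, size = subsample)
}

# Usage: each cycle sees a different random subset of 200 out of 1000 rows.
set.seed(42)
n <- 1000
for (cycle in 1:3) {
  rows <- cycle_rows(n, subsample = 200)
  cat("cycle", cycle, "uses", length(rows), "rows\n")
}
```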

If balance = 0.0, the algorithm is applied with no balancing; if balance = 1.0, all the clusters are forced to be of equal size. Intermediate values are permitted. Note that if subsampling is applied, balancing may become less accurate.

References

Lloyd SP (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.

Examples

# NOT RUN {
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Prepare training data.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[,trvars]) 

# Unbalanced K-means clustering.
km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0)
print(table(km0$layout$BMC))
print(km0$centroids)

# Balanced K-means clustering.
km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0)
print(table(km1$layout$BMC))
print(km1$centroids)
# }