MiniBatchKmeans: A randomized dataset sub-sample algorithm that approximates the k-means
algorithm. See: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
Description
A randomized algorithm that approximates k-means by operating on random
sub-samples (mini batches) of the dataset.
See: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
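The general idea behind mini-batch k-means can be sketched in base R as follows. This is a minimal illustration in the spirit of the paper linked above, not the package's implementation: the function name `mini_batch_kmeans_sketch` is hypothetical, and the sketch assumes Forgy-style initialization, squared Euclidean distance, and a fixed number of iterations with no convergence check.

```r
# Illustrative mini-batch k-means sketch; NOT the package's implementation.
# Assumptions: Forgy-style start, squared Euclidean distance, fixed iterations.
mini_batch_kmeans_sketch <- function(x, k, batch.size = 10, iter.max = 100) {
  x <- as.matrix(x)
  # Pick k distinct rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  counts <- integer(k)  # number of points each center has absorbed so far

  # Index of the nearest center for every row of pts
  nearest <- function(pts, centers) {
    d <- sapply(seq_len(nrow(centers)), function(j) {
      rowSums(sweep(pts, 2, centers[j, ])^2)
    })
    d <- matrix(d, nrow = nrow(pts))  # keep a matrix even for one-row input
    max.col(-d, ties.method = "first")
  }

  for (it in seq_len(iter.max)) {
    idx <- sample(nrow(x), min(batch.size, nrow(x)))
    batch <- x[idx, , drop = FALSE]
    lab <- nearest(batch, centers)  # cache assignments before any update
    for (i in seq_len(nrow(batch))) {
      j <- lab[i]
      counts[j] <- counts[j] + 1L
      eta <- 1 / counts[j]  # per-center learning rate shrinks over time
      centers[j, ] <- (1 - eta) * centers[j, ] + eta * batch[i, ]
    }
  }

  cluster <- nearest(x, centers)
  list(cluster = cluster, centers = centers,
       size = tabulate(cluster, nbins = k), iter = iter.max)
}

set.seed(1)
res <- mini_batch_kmeans_sketch(iris[, 1:4], k = 3, batch.size = 5, iter.max = 50)
```

Because each center moves by a step of 1/count, early batches shift the centers substantially while later batches only refine them.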
Arguments
data
Data file name on disk (NUMA optimized) or in-memory data matrix
centers
Either (i) the number of centers (i.e., k), (ii) an in-memory data
matrix, or (iii) a 2-element list in which element 1 is a filename
for precomputed centers and element 2 is the number of centroids.
nrow
The number of samples in the dataset
ncol
The number of features in the dataset
batch.size
Size of the mini batches
iter.max
The maximum number of iterations of k-means to perform
nthread
The number of parallel threads to run
init
The type of initialization to use; one of c("kmeanspp", "random",
"forgy", "none")
tolerance
The convergence tolerance
dist.type
What dissimilarity metric to use
max.no.improvement
Controls early stopping based on the number of consecutive
mini batches that do not yield an improvement in the
smoothed inertia
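The early-stopping rule described for max.no.improvement can be illustrated with a small base R sketch. How the package smooths the inertia is not stated here, so this assumes an exponentially weighted average with a hypothetical smoothing factor `alpha`; the helper name `early_stop_batch` is likewise hypothetical.

```r
# Hypothetical helper: returns the index of the mini batch at which early
# stopping would trigger, or NA if it never triggers.
# Assumption: the inertia is smoothed with an exponentially weighted average.
early_stop_batch <- function(batch.inertia, max.no.improvement = 3, alpha = 0.3) {
  smoothed <- NA_real_
  best <- Inf
  stale <- 0L  # consecutive batches without improvement
  for (i in seq_along(batch.inertia)) {
    if (is.na(smoothed)) {
      smoothed <- batch.inertia[i]
    } else {
      smoothed <- (1 - alpha) * smoothed + alpha * batch.inertia[i]
    }
    if (smoothed < best) {
      best <- smoothed
      stale <- 0L
    } else {
      stale <- stale + 1L
    }
    if (stale >= max.no.improvement) {
      return(i)
    }
  }
  NA_integer_
}

# With alpha = 1 there is no smoothing, so the raw values are compared directly
stop.at <- early_stop_batch(c(10, 8, 7, 7.5, 7.6, 7.7, 7.8),
                            max.no.improvement = 3, alpha = 1)
never <- early_stop_batch(c(5, 4, 3, 2, 1))
```

Smoothing matters because the inertia of a single mini batch is noisy; comparing smoothed values avoids stopping on one unlucky batch.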
Value
A list containing the attributes of the output of kmeans.
cluster: A vector of integers (from 1:k) indicating the cluster to
which each point is allocated.
centers: A matrix of cluster centres.
size: The number of points in each cluster.
iter: The number of (outer) iterations.
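To get a feel for these fields without the package installed, base R's stats::kmeans can serve as a stand-in, on the assumption that "kmeans" above refers to it; its result carries the same four components along with several others. The variable name `fit` is only illustrative.

```r
# Stand-in using base R's stats::kmeans, NOT MiniBatchKmeans itself
set.seed(42)
fit <- kmeans(as.matrix(iris[, 1:4]), centers = 3)

head(fit$cluster)  # integer cluster label (1..k) for each row
dim(fit$centers)   # k rows, one column per feature
fit$size           # number of points assigned to each cluster
fit$iter           # number of (outer) iterations performed
```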
Examples
# NOT RUN {
iris.mat <- as.matrix(iris[, 1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
kms <- MiniBatchKmeans(iris.mat, k, batch.size=5)
# }