# KMconstrainedSparse

##### K-means over sparse data input with constraints on cluster weights

Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a sparse representation of the data, subject to cluster size (weight) constraints.
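As a point of reference, the distances selected by the `minkP` argument can be sketched in a few lines of base R. This is a hypothetical helper, not the package's implementation: it densifies the sparse observation first, and cosine dissimilarity is assumed here to mean 1 minus cosine similarity.

```r
# Hypothetical helper, not part of GMKMcharlie: distance between a dense
# centroid and a sparse observation (2-column data frame of integer
# indexes and numeric values), following the `minkP` semantics.
sparseDist = function(obs, centroid, minkP = 2)
{
  full = numeric(length(centroid))
  full[obs[[1L]]] = obs[[2L]]                  # densify the observation
  if (identical(minkP, "max"))                 # Chebyshev distance
    return(max(abs(full - centroid)))
  if (identical(minkP, "cosine"))              # assumed: 1 - cosine similarity
    return(1 - sum(full * centroid) /
             sqrt(sum(full ^ 2) * sum(centroid ^ 2)))
  sum(abs(full - centroid) ^ minkP) ^ (1 / minkP)  # Minkowski power p
}
```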

##### Usage

```
KMconstrainedSparse(
X,
d,
centroid,
Xw = rep(1, length(X)),
clusterWeightUB = rep(length(X) + 1, length(centroid)),
minkP = 2,
convergenceTail = 5L,
tailConvergedRelaErr = 1e-04,
maxIter = 100L,
maxCore = 7L,
paraSortInplaceMerge = FALSE,
verbose = TRUE
)
```

##### Arguments

- `X`
A list of size `N`, the number of observations. `X[[i]]` is a 2-column data frame. The 1st column is a sorted **integer vector** of the indexes of nonzero dimensions. Values in these dimensions are stored in the 2nd column as a **numeric vector**. Internally the algorithm sets a 32-bit *int* pointer to the beginning of the 1st column and a 64-bit *double* pointer to the beginning of the 2nd column, so it is critical that the input has the correct types.

- `d`
An integer, the dimensionality of `X`. `d` MUST be no less than the maximum index value in `X`.

- `centroid`
A list of size `K`, the number of clusters. `centroid[[i]]` can be in dense or sparse representation. If dense, a numeric vector of size `d`. If sparse, a 2-column data frame in the same sense as `X[[i]]`.

- `Xw`
A numeric vector of size `N`. `Xw[i]` is the weight on observation `X[[i]]`. Users should normalize `Xw` such that the elements sum up to `N`. Default is uniform weights for all observations.

- `clusterWeightUB`
An integer vector of size `K`. The upper bound of weight for each cluster. If `Xw` are all 1s, `clusterWeightUB` upper-bounds the cluster sizes.

- `minkP`
A numeric value or a character string. If numeric, `minkP` is the power `p` in the definition of Minkowski distance. If a character string, `"max"` implies Chebyshev distance and `"cosine"` implies cosine dissimilarity. Default 2.

- `convergenceTail`
An integer. The algorithm may end up in "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last `convergenceTail` iterations has a relative difference of less than `tailConvergedRelaErr` against the cost from the prior iteration, the program stops.

- `tailConvergedRelaErr`
A numeric value, explained under `convergenceTail`.

- `maxIter`
An integer. The maximal number of iterations. Default 100.

- `maxCore`
An integer. The maximal number of threads to invoke. Should be no more than the total number of logical processors on the machine. Default 7.

- `paraSortInplaceMerge`
A boolean value. `TRUE` lets the algorithm call `std::inplace_merge()` (`std` refers to the C++ STL namespace) instead of `std::merge()` for parallel-sorting the observation-centroid distances. In-place merge is slower but requires no extra memory.

- `verbose`
A boolean value. `TRUE` prints progress.
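To make the required input format concrete, here is a minimal sketch in plain base R of building one observation in the sparse representation. The column names `index` and `value` are illustrative choices, not required by the package; what matters is that column 1 is a sorted integer vector and column 2 a numeric vector.

```r
x = c(0, 0, 1.5, 0, -2.25)        # one dense observation, d = 5
nonz = which(x != 0)              # which() returns sorted integer indexes
obs = data.frame(index = nonz, value = x[nonz])
str(obs)  # column 1 is integer, column 2 is double, as required
```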

##### Details

See details for `KMconstrained()` and `KM()`.

##### Value

A list of size `K`, the number of clusters. Each element is a list of 3 vectors:

- `centroid`: a numeric vector of size `d`.

- `clusterMember`: an integer vector of the indexes of the observations grouped to `centroid`.

- a numeric vector of the same size as `clusterMember`. The `i`th element is the Minkowski distance or cosine dissimilarity from `centroid` to the `clusterMember[i]`th observation in `X`.

An empty `clusterMember` implies an empty cluster.
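Assuming the structure described above, cluster sizes can be tallied and checked against the bounds as follows. The result object here is a hand-built mock for illustration, not actual package output.

```r
# Mock a return value with the documented structure: a list of K = 3
# clusters, each a list of (centroid, cluster members, distances).
# The data are hypothetical.
rst = list(
  list(c(0.5, 0.5), c(1L, 3L), c(0.1, 0.2)),
  list(c(1.0, 0.0), 2L,        0.05),
  list(numeric(2),  integer(0), numeric(0))  # an empty cluster
)
clusterSizes = sapply(rst, function(cl) length(cl[[2L]]))
clusterSizes  # one entry per cluster; the 3rd cluster is empty
```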

##### Examples

```
N = 5000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.

# Create a data matrix, about 95% of which are zeros.
dat = matrix(unlist(lapply(1L : N, function(x)
{
  tmp = numeric(d)
  # Nonzero entries.
  Nnz = as.integer(max(1, d * runif(1, 0, 0.05)))
  tmp[sample(d, Nnz)] = runif(Nnz) + rnorm(Nnz)
  tmp
})), nrow = d); gc()

# Convert to sparse representation.
# GMKMcharlie::d2s() is equivalent.
sparsedat = apply(dat, 2, function(x)
{
  nonz = which(x != 0)
  list(nonz, x[nonz])
}); gc()

centroidInd = sample(length(sparsedat), K)

# Test speed using sparse representation.
sparseCentroid = sparsedat[centroidInd]

# Size upper bounds vary in [N / K * 1.5, N / K * 2].
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({sparseRst = GMKMcharlie::KMconstrainedSparse(
  X = sparsedat, d = d, centroid = sparseCentroid,
  clusterWeightUB = sizeConstraints,
  tailConvergedRelaErr = 1e-6,
  maxIter = 100, minkP = 2, maxCore = 2, verbose = TRUE)})
```

*Documentation reproduced from package GMKMcharlie, version 1.0.3, License: GPL-3*