clustMixType (version 0.1-16)

kproto: k prototypes clustering

Description

Computes k prototypes clustering for mixed type data.

Usage

kproto(x, ...)
## Default S3 method:
kproto(x, k, lambda = NULL, iter.max = 100, nstart = 1, ...)

Arguments

x
Data frame with both numerics and factors.
k
Either the number of clusters or a vector specifying indices of initial prototypes.
lambda
Parameter > 0 to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.
iter.max
Maximum number of iterations if the algorithm has not converged before.
nstart
If > 1, repeated computations with random initializations are run and the result with minimum tot.dist is returned.
...
Currently not used.
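
The restart logic described for nstart can be pictured with a minimal sketch. It uses stats::kmeans on purely numeric toy data as a stand-in, not kproto itself, and the names x_num and best are hypothetical; it only illustrates the idea of keeping the run with the smallest total distance.

```r
# Stand-in illustration with stats::kmeans (numeric data only), NOT kproto.
set.seed(1)
x_num <- rbind(matrix(rnorm(100, mean = -1.5), ncol = 2),
               matrix(rnorm(100, mean =  1.5), ncol = 2))

nstart <- 5
best <- NULL
for (i in seq_len(nstart)) {
  # each run starts from a different random set of initial centers
  res <- kmeans(x_num, centers = 2, iter.max = 100, nstart = 1)
  # keep the run with the smallest total within-cluster distance
  if (is.null(best) || res$tot.withinss < best$tot.withinss) best <- res
}
best$tot.withinss
```

A larger nstart costs proportionally more computation but makes it less likely that the returned result is a poor local optimum.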

Value

kmeans-like object of class kproto:
cluster
Vector of cluster memberships.
centers
Data frame of cluster prototypes.
lambda
Distance parameter lambda.
size
Vector of cluster sizes.
withinss
Vector of summed distances to the cluster prototype per cluster.
tot.withinss
Target function: sum of all distances to cluster prototype.
dists
Matrix with distances of observations to all cluster prototypes.
iter
Prespecified maximum number of iterations.
trace
List with two elements (vectors) tracing the iteration process: tot.dists and the number of moved observations in each iteration.
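
How the summary components relate to each other can be sketched in base R. The distance matrix below is made up for illustration, and the sketch assumes that each observation belongs to its nearest prototype; it is not taken from the package source.

```r
# Hypothetical distance matrix: 6 observations (rows) x 2 prototypes (columns)
dists <- matrix(c(0.2, 1.5,
                  0.4, 1.1,
                  1.3, 0.3,
                  0.9, 0.1,
                  0.5, 2.0,
                  1.7, 0.6), ncol = 2, byrow = TRUE)

# assumption: each observation is assigned to its nearest prototype
cluster <- apply(dists, 1, which.min)

# distance of each observation to its own prototype
own_dist <- dists[cbind(seq_len(nrow(dists)), cluster)]

cl_factor    <- factor(cluster, levels = seq_len(ncol(dists)))
size         <- as.vector(table(cl_factor))              # cluster sizes
withinss     <- as.vector(tapply(own_dist, cl_factor, sum))  # per-cluster sums
tot.withinss <- sum(withinss)                            # target function
```

Under this assumption, tot.withinss is simply the sum of the row-wise minima of dists.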

Details

Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns observations to clusters. Observations are assigned to the nearest prototype using $d(x,y) = d_{euclid}(x,y) + \lambda d_{simple\,matching}(x,y)$. Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998).
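
One assignment and update step can be sketched in base R as follows. The helpers mixed_dist and col_mode are hypothetical and not part of clustMixType, and the sketch assumes the squared Euclidean distance for the numeric part, as in Huang (1998).

```r
# Hypothetical helper: distance of every row of x to one prototype row.
mixed_dist <- function(x, proto, lambda) {
  d <- numeric(nrow(x))
  for (j in seq_along(x)) {
    if (is.numeric(x[[j]])) {
      # numeric variable: squared difference (assumption, see lead-in)
      d <- d + (x[[j]] - proto[[j]])^2
    } else {
      # factor: simple matching, 1 if the levels differ, weighted by lambda
      d <- d + lambda * (as.character(x[[j]]) != as.character(proto[[j]]))
    }
  }
  d
}

# Hypothetical helper: most frequent level of a factor
col_mode <- function(f) names(which.max(table(f)))

x <- data.frame(
  f = factor(c("A", "A", "B", "B", "A", "B")),
  v = c(-1.2, -0.8, 1.1, 0.9, -1.0, 1.3)
)
lambda <- 0.5

# initial prototypes: observations 1 and 3 (chosen for the sketch)
protos <- x[c(1, 3), ]

# assignment step: distance to every prototype, then nearest prototype
dists   <- sapply(seq_len(nrow(protos)),
                  function(k) mixed_dist(x, protos[k, ], lambda))
cluster <- apply(dists, 1, which.min)

# update step: means for numerics, modes for factors
for (k in seq_len(nrow(protos))) {
  members     <- x[cluster == k, ]
  protos$v[k] <- mean(members$v)
  protos$f[k] <- col_mode(members$f)
}
protos
```

In a full run these two steps would be repeated until no observation changes its cluster or iter.max is reached.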

References

Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.

Examples

library(clustMixType)

# generate toy data with factors and numerics
n   <- 100
prb <- 0.9
muk <- 1.5 
clusid <- rep(1:4, each = n)

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in the real world clusters are often not as clear cut
# by varying lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)