kproto: k prototypes clustering

Description

Computes k prototypes clustering for mixed type data.

Usage

kproto(x, ...)
## Default S3 method:
kproto(x, k, lambda = NULL, iter.max = 100, nstart = 1, ...)

Arguments

x

Data frame with both numeric variables and factors.

k

Either the number of clusters or a vector specifying indices of initial prototypes.

lambda

Parameter > 0 to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.

iter.max

Maximum number of iterations, used if the algorithm has not converged earlier.

nstart

If > 1, the computation is repeated with random initializations and the result with the minimum total distance (tot.withinss) is returned.

...

Currently not used.
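
The effect of nstart can be pictured as a best-of-several-restarts loop. The sketch below is only an illustration under assumptions: best_of is a hypothetical helper, and stats::kmeans on purely numeric data stands in for the actual fitting routine because it also returns a tot.withinss component.

```r
# Hypothetical helper: run `fit_fun` nstart times and keep the fit with
# the smallest tot.withinss.
best_of <- function(fit_fun, nstart) {
  best <- fit_fun()
  for (i in seq_len(nstart - 1)) {
    cand <- fit_fun()
    if (cand$tot.withinss < best$tot.withinss) best <- cand
  }
  best
}

set.seed(1)
num <- matrix(rnorm(200), ncol = 2)

# Stand-in fit (assumption): base R k-means with one random start per call.
fit <- best_of(function() kmeans(num, centers = 3, nstart = 1), nstart = 5)
fit$tot.withinss
```

With nstart = 1 the loop body never runs, so the single fit is returned unchanged.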

Value

cluster: Vector of cluster memberships.
centers: Data frame of cluster prototypes.
lambda: Distance parameter lambda.
size: Vector of cluster sizes.
withinss: Vector of summed distances to the cluster prototype per cluster.
tot.withinss: Target function: sum of all distances to the cluster prototypes.
dists: Matrix with distances of observations to all cluster prototypes.
iter: Prespecified maximum number of iterations.
trace: List with two elements (vectors) tracing the iteration process: the total distance (tot.dists) and the number of observations moved between clusters, in each iteration.
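
As a rough illustration of how these components relate, the following base R sketch derives cluster, size, withinss and tot.withinss from a distance matrix. The matrix values are made up, and the relationships shown are an assumption based on the descriptions above, not a copy of the package code.

```r
# Made-up distance matrix: 5 observations (rows) x 2 prototypes (columns).
dists <- matrix(c(0.2, 3.1,
                  0.5, 2.7,
                  2.9, 0.4,
                  3.3, 0.1,
                  0.3, 3.0), ncol = 2, byrow = TRUE)

# Each observation is assigned to its nearest prototype.
cluster <- apply(dists, 1, which.min)

# Cluster sizes (tabulate keeps empty clusters as zero counts).
size <- tabulate(cluster, nbins = ncol(dists))

# Distance of each observation to its own prototype, summed per cluster.
own_dist <- dists[cbind(seq_len(nrow(dists)), cluster)]
withinss <- as.numeric(tapply(own_dist, cluster, sum))

# Target function: total distance over all clusters.
tot.withinss <- sum(withinss)
```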

Details

Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns observations to clusters. Observations are assigned to the closest prototype using the distance $d(x,y) = d_{euclid}(x,y) + \lambda d_{simple\,matching}(x,y)$. Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998).
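
The combined distance and one reassignment step can be illustrated in a few lines of base R. This is a minimal sketch, not the package's implementation: the helpers mixed_dist and prototype are hypothetical, and the sketch assumes that the Euclidean part is the squared difference summed over numeric columns and that the matching part counts mismatching factor levels.

```r
# Hypothetical helper: distance of every row of `x` to one prototype row.
# Assumes squared Euclidean distance on numeric columns and a count of
# mismatching levels on factor columns, weighted by lambda.
mixed_dist <- function(x, proto, lambda) {
  is_num <- vapply(x, is.numeric, logical(1))
  d_num <- rep(0, nrow(x))
  for (v in names(x)[is_num]) {
    d_num <- d_num + (x[[v]] - proto[[v]])^2
  }
  d_cat <- rep(0, nrow(x))
  for (v in names(x)[!is_num]) {
    d_cat <- d_cat + (as.character(x[[v]]) != as.character(proto[[v]]))
  }
  d_num + lambda * d_cat
}

# Hypothetical helper: prototype of a cluster
# (means for numeric columns, modes for factors).
prototype <- function(x) {
  as.data.frame(lapply(x, function(col) {
    if (is.numeric(col)) mean(col) else names(which.max(table(col)))
  }), stringsAsFactors = FALSE)
}

toy <- data.frame(
  f = factor(c("A", "A", "B", "B")),
  z = c(0, 1, 4, 5)
)
protos <- list(prototype(toy[1:2, ]), prototype(toy[3:4, ]))

# One reassignment step: distance matrix (observations x prototypes),
# then the nearest prototype per observation.
dists <- sapply(protos, function(p) mixed_dist(toy, p, lambda = 2))
cluster <- apply(dists, 1, which.min)
```

A larger lambda makes a factor mismatch more costly relative to numeric differences, so the factors dominate the assignment; a small lambda lets the numeric variables dominate.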

References

Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.

Examples

library(clustMixType)  # provides kproto() and clprofiles()

# generate toy data with factors and numerics

n   <- 100
prb <- 0.9
muk <- 1.5 
clusid <- rep(1:4, each = n)

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real-world data, clusters are often not as clear-cut
# by varying lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)