clustMixType (version 0.3-14)

kproto: k-Prototypes Clustering

Description

Computes k-prototypes clustering for mixed-type data.

Usage

kproto(x, ...)

# S3 method for default
kproto(
  x,
  k,
  lambda = NULL,
  type = "standard",
  iter.max = 100,
  nstart = 1,
  na.rm = "yes",
  keep.data = TRUE,
  verbose = TRUE,
  init = NULL,
  p_nstart.m = 0.9,
  ...
)

Value

A kmeans-like object of class kproto with the following components:

cluster

Vector of cluster memberships.

centers

Data frame of cluster prototypes.

lambda

Distance parameter lambda.

size

Vector of cluster sizes.

withinss

Vector of within cluster distances for each cluster, i.e. summed distances of all observations belonging to a cluster to their respective prototype.

tot.withinss

Target function: sum of all observations' distances to their corresponding cluster prototype.

dists

Matrix with distances of observations to all cluster prototypes.

iter

Prespecified maximum number of iterations.

trace

List with two elements (vectors) tracing the iteration process: tot.dists (total distances) and moved (number of moved observations), each recorded over all iterations.

inits

Initial prototypes determined by specified initialization strategy, if init is either 'nbh.dens' or 'sel.cen'.

nstart.m

Only for init = "nstart.m": the determined number of randomly chosen sets.

data

If keep.data = TRUE, the original data is added to the output list.

type

Type argument of the function call.

stdization

Only returned for type = "gower": List of standardized ranks for ordinal variables and an additional element num_ranges with ranges of all numeric variables. Used by predict.kproto.
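
As a brief illustration of the components listed above, the following sketch fits a model to a small simulated data frame and inspects a few elements of the result. The data frame toy, the object name fit and the seed are made up for illustration; only component names documented in this section are accessed.

```r
library(clustMixType)

set.seed(42)  # illustrative seed, only to make the toy data reproducible
toy <- data.frame(
  f   = factor(sample(c("A", "B"), 60, replace = TRUE)),
  num = c(rnorm(30, mean = -2), rnorm(30, mean = 2))
)

fit <- kproto(toy, k = 2, verbose = FALSE)

fit$cluster       # cluster membership of each observation
fit$centers       # data frame of cluster prototypes
fit$size          # number of observations per cluster
fit$withinss      # summed distances to the prototype, per cluster
fit$tot.withinss  # value of the target function
head(fit$dists)   # distances of the first observations to all prototypes
```

Because keep.data = TRUE is the default, fit$data should also hold the original data here.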

Arguments

x

Data frame with both numerics and factors.

...

Currently not used.

k

Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of prototypes with the same columns as x.

lambda

Parameter > 0 to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables. A vector of variable-specific factors is also possible; its order must correspond to the order of the variables in the data. In this case each variable's distance is multiplied by its corresponding lambda value.

type

Character, to specify the distance for clustering. Either "standard" (cf. details below) or "gower". The latter calls kproto_gower.

iter.max

Maximum number of iterations if the algorithm does not converge earlier.

nstart

If > 1, the algorithm is run repeatedly with random initializations and the result with the minimum total within-cluster distance (tot.withinss) is returned.

na.rm

Character; either "yes" to strip NA values for complete case analysis, "no" to keep and ignore NA values, "imp.internal" to impute the NAs within the algorithm, or "imp.onestep" to apply the algorithm ignoring the NAs and impute them after the partition is determined.

keep.data

Logical; whether the original data should be included in the returned object.

verbose

Logical; whether additional information about the process should be printed. Caution: for verbose = FALSE, a reduction of the number of clusters during the iterations will not be reported.

init

Character, to specify the initialization strategy. Either "nbh.dens", "sel.cen" or "nstart.m". The default is NULL, which results in nstart repeated runs of the algorithm with random starting prototypes. Otherwise, nstart is not used. Argument k must be a number if a specific initialization strategy is chosen!

p_nstart.m

Numeric, probability (default 0.9) for init = "nstart.m", where the strategy assures that with a probability of p_nstart.m at least one of the m sets of initial prototypes contains objects of every cluster group (cf. Aschenbruck et al. (2023): Random-based Initialization for clustering mixed-type data with the k-Prototypes algorithm. In: Cladag 2023 Book of abstracts and short papers, ISBN: 9788891935632).
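
The following sketch shows how some of the arguments described above might be combined in practice. The data frame toy, the seed and the chosen lambda values are purely illustrative assumptions; the calls only use arguments and values documented in this section.

```r
library(clustMixType)

set.seed(1)  # illustrative seed for the simulated data
toy <- data.frame(
  f1 = factor(sample(c("A", "B"), 80, replace = TRUE)),
  f2 = factor(sample(c("x", "y", "z"), 80, replace = TRUE)),
  n1 = rnorm(80),
  n2 = rnorm(80)
)

# scalar lambda: one trade-off between numeric and categorical distances
fit_scalar <- kproto(toy, k = 3, lambda = 2, verbose = FALSE)

# vector lambda: one factor per variable, in the column order of toy
fit_vector <- kproto(toy, k = 3, lambda = c(2, 2, 1, 1), verbose = FALSE)

# several random initializations; the best run is returned
fit_multi <- kproto(toy, k = 3, nstart = 5, verbose = FALSE)

# keep observations with missing values and ignore the NAs
toy_na <- toy
toy_na$n1[c(3, 17)] <- NA
fit_na <- kproto(toy_na, k = 3, na.rm = "no", verbose = FALSE)
```

With na.rm = "yes" (the default), the two incomplete rows of toy_na would instead be dropped before clustering.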

Details

Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns observations to clusters. For type = "standard", clusters are assigned using \(d(x,y) = d_{euclid}(x,y) + \lambda d_{simple\,matching}(x,y)\). Cluster prototypes are computed as cluster means for numeric variables and modes for factors (cf. Huang, 1998). Ordered factor variables are treated as categorical variables. In case of na.rm = "no": for each observation, variables with missing values are ignored (i.e. only the remaining variables are considered for the distance computation). As a consequence, for observations with missing values the weighting of the variables may differ from the one specified by lambda. For these observations, distances to the prototypes will typically be smaller as they are based on fewer variables. For type = "gower" cf. kproto_gower.
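
To make the distance for type = "standard" concrete, here is a minimal base R sketch for a single observation and a single prototype. The helper proto_dist is hypothetical and not part of clustMixType. It assumes that the Euclidean part is the sum of squared differences over the numeric variables (as in Huang, 1998) and that the simple matching part counts the factors whose levels differ; the package's internal implementation may differ in detail.

```r
# Hypothetical helper for illustration only, not part of clustMixType.
# obs and proto are one-row data frames with identical columns.
proto_dist <- function(obs, proto, lambda) {
  is_num <- vapply(obs, is.numeric, logical(1))

  # numeric part: sum of squared differences (assumed form of d_euclid)
  d_num <- sum((unlist(obs[is_num]) - unlist(proto[is_num]))^2)

  # categorical part: number of factors with mismatching levels
  d_cat <- sum(vapply(
    names(obs)[!is_num],
    function(v) as.character(obs[[v]]) != as.character(proto[[v]]),
    logical(1)
  ))

  d_num + lambda * d_cat
}

obs   <- data.frame(x1 = factor("A"), x2 = factor("B"), x3 = 1.0, x4 = -0.5)
proto <- data.frame(x1 = factor("A"), x2 = factor("A"), x3 = 0.0, x4 = 0.5)

proto_dist(obs, proto, lambda = 1)   # 2 + 1 * 1  = 3
proto_dist(obs, proto, lambda = 25)  # 2 + 25 * 1 = 27
```

The two calls illustrate the role of lambda: the numeric contribution (2) stays fixed, while the single factor mismatch is weighted by 1 and by 25, respectively.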

References

  • Szepannek, G. (2018): clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal 10/2, 200-208, doi:10.32614/RJ-2018-048.

  • Aschenbruck, R., Szepannek, G., Wilhelm, A. (2022): Imputation Strategies for Clustering Mixed-Type Data with Missing Values, Journal of Classification, doi:10.1007/s00357-022-09422-y.

  • Huang, Z. (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.

Examples

library(clustMixType)

# generate toy data with factors and numerics

n   <- 100
prb <- 0.9
muk <- 1.5 
clusid <- rep(1:4, each = n)

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in the real world, clusters are often not as clear-cut
# by varying lambda, the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 0.1)
clprofiles(kpres, x)

kpres <- kproto(x, 2, lambda = 25)
clprofiles(kpres, x)
