ptbiserial_kproto: Validating k Prototypes Clustering: Ptbiserial index

Description

Calculates the Ptbiserial index for a given k-Prototypes clustering with k clusters, or computes the optimal number of clusters based on the Ptbiserial index for k-Prototypes clustering.

Usage

ptbiserial_kproto(object = NULL, data = NULL, k = NULL, s_d = NULL,
  ...)

Arguments

object

Object of class kproto resulting from a call with kproto(..., keep.data=TRUE)

data

Original data; only required if object == NULL.

k

Vector specifying the search range for the optimal number of clusters; if NULL, the range is set to 2:sqrt(n). Only required if object == NULL.

s_d

for internal purposes only

...

Further arguments passed to kproto, like:

nstart: If > 1, repeated computations of kproto with random initializations are performed.
lambda: Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.
verbose: Logical; whether information about the clustering procedure should be printed. Caution: If verbose=FALSE, a reduction of the number of clusters is not reported.

Value

For computing the optimal number of clusters based on the Ptbiserial index for k-Prototype clustering the output contains:

k_opt

optimal number of clusters

indices

calculated indices for $k=2,...,k_{max}$

For computing the Ptbiserial index-value for a given k-Prototype clustering the output contains:

index

calculated index-value

Details

$$Ptbiserial = \frac{(\bar{S}_b-\bar{S}_w) \cdot \left(\frac{N_w \cdot N_b}{N_t^2}\right)^{0.5}}{s_d}$$

$\bar{S}_w$ is the sum of within-cluster distances divided by the number of within-cluster distances and $\bar{S}_b$ is the sum of between-cluster distances divided by the number of between-cluster distances. $N_t$ is the total number of pairs of objects in the data, $N_w$ is the total number of pairs of objects belonging to the same cluster and $N_b$ is the total number of pairs of objects belonging to different clusters. $s_d$ is the standard deviation of all distances.

The maximum value of the index is used to indicate the optimal number of clusters.
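The formula can be traced by hand on a tiny example. The following is only a sketch in base R, not the package's implementation: the helper name ptbiserial_sketch is hypothetical, the toy data are purely numeric so that plain Euclidean distances from dist() stand in for the mixed-type distance used by k-Prototypes, and sd() (with its n-1 denominator) is assumed for $s_d$.

```r
# Hypothetical helper (not part of clustMixType): point-biserial index
# computed from a distance object and a vector of cluster labels.
ptbiserial_sketch <- function(d, cluster) {
  d     <- as.matrix(d)
  pairs <- upper.tri(d)                          # count each unordered pair once
  dists <- d[pairs]
  same  <- outer(cluster, cluster, "==")[pairs]  # TRUE for within-cluster pairs

  n_w <- sum(same)            # N_w: pairs within the same cluster
  n_b <- sum(!same)           # N_b: pairs across different clusters
  n_t <- n_w + n_b            # N_t: all pairs
  s_w <- mean(dists[same])    # mean within-cluster distance
  s_b <- mean(dists[!same])   # mean between-cluster distance
  s_d <- sd(dists)            # assumption: sample standard deviation (n - 1)

  (s_b - s_w) * sqrt(n_w * n_b / n_t^2) / s_d
}

# toy data: two well-separated groups on a single numeric variable
x_toy   <- c(0, 1, 10, 11)
cl_good <- c(1, 1, 2, 2)   # partition matching the two groups
cl_bad  <- c(1, 2, 1, 2)   # partition that splits both groups

idx_good <- ptbiserial_sketch(dist(x_toy), cl_good)
idx_bad  <- ptbiserial_sketch(dist(x_toy), cl_bad)
```

Under these assumptions, the partition that matches the two groups yields the larger index value, which is the behaviour the maximum-value rule relies on.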

References

Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014): NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, Vol 61, Issue 6.

Examples


# NOT RUN {
# kproto() and ptbiserial_kproto() are provided by the clustMixType package
library(clustMixType)

# generate toy data with factors and numerics

n   <- 10
prb <- 0.99
muk <- 2.5

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4, keep.data=TRUE)

# calculate index-value
Ptbiserial_value <- ptbiserial_kproto(object = kpres)

# calculate optimal number of clusters
k_opt <- ptbiserial_kproto(data = x, k = 3:5, nstart = 5, verbose = FALSE)

# }
