clustMixType (version 0.2-2)

ptbiserial_kproto: Validating k Prototypes Clustering: Ptbiserial index

Description

Calculates the Ptbiserial index for a k-prototypes clustering with k clusters, or computes the optimal number of clusters for k-prototypes clustering based on the Ptbiserial index.

Usage

ptbiserial_kproto(object = NULL, data = NULL, k = NULL, s_d = NULL,
  ...)

Arguments

object

Object of class kproto resulting from a call with kproto(..., keep.data=TRUE)

data

Original data; only required if object == NULL.

k

Vector specifying the search range for the optimal number of clusters; if NULL, the range is set to 2:sqrt(n). Only required if object == NULL.

s_d

for internal purposes only

...

Further arguments passed to kproto, like:

  • nstart: If > 1, kproto is run repeatedly with random initializations.

  • lambda: Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.

  • verbose: Logical; whether information about the clustering procedure should be printed. Caution: If verbose=FALSE, a reduction of the number of clusters is not reported.
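
As a small illustration of the default search range described for k, the following sketch shows what 2:sqrt(n) evaluates to in base R for a hypothetical sample size of n = 40. It only demonstrates how the `:` operator treats a non-integer upper bound; that the package builds its range in exactly this way is an assumption based on the argument description.

```r
# Hypothetical sample size; not taken from the package.
n <- 40

# `:` counts up from 2 in steps of 1 and stops before exceeding sqrt(n),
# so a non-integer upper bound is effectively rounded down.
k_default <- 2:sqrt(n)

print(k_default)
```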

Value

When computing the optimal number of clusters based on the Ptbiserial index for k-prototypes clustering, the output contains:

k_opt

optimal number of clusters

indices

calculated indices for \(k=2,...,k_{max}\)

When computing the Ptbiserial index value for a given k-prototypes clustering, the output contains:

index

calculated index-value

Details

$$Ptbiserial = \frac{(\bar{S}_b-\bar{S}_w) \cdot \left(\frac{N_w \cdot N_b}{N_t^2}\right)^{0.5}}{s_d}$$

\(\bar{S}_w\) is the sum of within-cluster distances divided by the number of within-cluster distances, and \(\bar{S}_b\) is the sum of between-cluster distances divided by the number of between-cluster distances. \(N_t\) is the total number of pairs of objects in the data, \(N_w\) is the total number of pairs of objects belonging to the same cluster, and \(N_b\) is the total number of pairs of objects belonging to different clusters. \(s_d\) is the standard deviation of all distances.

The maximum value of the index is used to indicate the optimal number of clusters.
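
The formula above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: ptbiserial_sketch is a hypothetical helper, it takes an arbitrary distance matrix rather than the mixed-type distance used by k-prototypes, and it assumes the sample standard deviation (sd) for \(s_d\).

```r
# Hypothetical helper (not part of clustMixType): Ptbiserial index from a
# distance object/matrix `d` and a vector of cluster labels `cl`.
ptbiserial_sketch <- function(d, cl) {
  d <- as.matrix(d)
  pairs <- upper.tri(d)                  # count each unordered pair once
  dist_vals <- d[pairs]
  same <- outer(cl, cl, "==")[pairs]     # TRUE for within-cluster pairs

  n_w <- as.numeric(sum(same))           # number of within-cluster pairs
  n_b <- as.numeric(sum(!same))          # number of between-cluster pairs
  n_t <- n_w + n_b                       # total number of pairs

  s_w <- mean(dist_vals[same])           # mean within-cluster distance
  s_b <- mean(dist_vals[!same])          # mean between-cluster distance
  s_d <- sd(dist_vals)                   # assumption: sample standard deviation

  (s_b - s_w) * sqrt(n_w * n_b / n_t^2) / s_d
}

# Toy data: two well-separated groups on a single numeric variable.
set.seed(1)
x <- c(rnorm(10, mean = -2.5), rnorm(10, mean = 2.5))
d <- dist(x)

good <- ptbiserial_sketch(d, rep(1:2, each = 10))   # labels follow the groups
bad  <- ptbiserial_sketch(d, rep(1:2, times = 10))  # labels ignore the groups
```

A partition that matches the group structure should yield the larger value, which is why the maximum of the index is taken as the indicator of the preferred number of clusters.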

See Also

Other cluster validation indices: dunn_kproto, gamma_kproto, gplus_kproto, mcclain_kproto, silhouette_kproto, tau_kproto

Examples

# NOT RUN {
library(clustMixType)

# generate toy data with factors and numerics

n   <- 10
prb <- 0.99
muk <- 2.5

x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)

x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)

x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))

x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4, keep.data=TRUE)

# calculate the index value for the given clustering
Ptbiserial_value <- ptbiserial_kproto(object = kpres)

# calculate the optimal number of clusters
k_opt <- ptbiserial_kproto(data = x, k = 3:5, nstart = 5, verbose = FALSE)

# }
