clustering.sc.dp (version 1.0)

findwithinss.sc.dp: Finding Optimal Withinss in Clustering Multidimensional Data with Sequential Constraint by Dynamic Programming

Description

Performs the main step of clustering multidimensional data with a sequential constraint, using a dynamic programming approach that guarantees optimality. It returns the minimum withinss for each number of clusters less than or equal to k, together with backtracking data that can be used to quickly find the optimal clustering for a specific cluster number. This function was created to support the case when the number of clusters is not known in advance.

Usage

findwithinss.sc.dp(x, k)

Arguments

x
a multi-dimensional array containing the input data to be clustered (in the example below, a matrix with one row per item, in sequence order)
k
the maximal number of clusters; output is generated for every cluster number from 1 to k

Value

A list with components:
twithinss
a vector of the total within-cluster sums of squares of the optimal clusterings, one entry for each number of clusters less than or equal to k.
backtrack
backtrack data used by backtracking.sc.dp.

Details

Method clustering.sc.dp is split into two methods (findwithinss.sc.dp and backtracking.sc.dp) in order to support the case when the number of clusters is not known in advance. Method findwithinss.sc.dp returns the minimal sum of squares of within-cluster distances (withinss) for each number of clusters less than or equal to k, and the backtrack data, which can be used to quickly determine the optimal clustering for a specific cluster number. The returned withinss are guaranteed to be optimal among the solutions in which each cluster consists only of consecutive items of the input sequence.
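The dynamic programming idea described above can be sketched in plain R. This is only an illustration under stated assumptions, not the package's implementation: the function name seq_withinss is hypothetical, rows of x are assumed to be the items in sequence order, and the sketch returns only the withinss values (no backtrack data). It uses the recurrence D[m, j] = min over i of D[m - 1, i - 1] + cost(i, j), where cost(i, j) is the sum of squared distances of rows i..j to their mean.

```r
# Sketch only: a plain-R dynamic program, NOT the package's own code.
# seq_withinss is a hypothetical name introduced for this illustration.
seq_withinss <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n)

  # Prefix sums with a leading zero row/entry, so that the sum over
  # rows i..j is a difference of two prefix values.
  s1 <- rbind(0, apply(x, 2, cumsum))   # column-wise running sums
  s2 <- c(0, cumsum(rowSums(x^2)))      # running sum of squared norms

  # Sum of squared distances of rows i..j to their mean.
  seg_cost <- function(i, j) {
    d <- s1[j + 1, ] - s1[i, ]
    max(0, (s2[j + 1] - s2[i]) - sum(d^2) / (j - i + 1))
  }

  # D[m, j]: minimal withinss when rows 1..j form m contiguous clusters.
  D <- matrix(Inf, nrow = k, ncol = n)
  for (j in 1:n) D[1, j] <- seg_cost(1, j)
  if (k >= 2) {
    for (m in 2:k) {
      for (j in m:n) {
        for (i in m:j) {  # i = first row of the last cluster
          cand <- D[m - 1, i - 1] + seg_cost(i, j)
          if (cand < D[m, j]) D[m, j] <- cand
        }
      }
    }
  }
  D[, n]  # minimal withinss for 1, ..., k clusters
}

# Tiny check: two tight groups in sequence.
toy <- matrix(c(0, 0, 10, 10), ncol = 1)
w <- seq_withinss(toy, 2)  # 100 for one cluster, 0 for two
```

The triple loop makes the sketch quadratic in the number of items for each cluster count, which is fine for illustration; the cost of each segment is obtained in constant time per dimension from the prefix sums.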

The outputs of the method can be used to select the proper number of clusters. One option is to analyse the withinss values directly. Another is to run findwithinss.sc.dp once, repeat the backtracking.sc.dp step for a range of candidate cluster numbers, and then evaluate the optimal solutions obtained for the different numbers of clusters. This requires much less time than repeating the whole clustering algorithm for each candidate.
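One possible way to analyse the withinss values is a relative-gain rule: pick the smallest cluster number after which adding one more cluster no longer reduces the withinss by much. The helper below is a hypothetical sketch, not part of clustering.sc.dp; the name pick_k, the min_gain parameter and its default of 0.05 are assumptions chosen for illustration, and the input vector in the usage line is made up rather than real output.

```r
# Hypothetical helper: not part of clustering.sc.dp.
# twithinss: vector of withinss for 1, ..., k clusters
#            (for example the twithinss component of the result).
# min_gain:  assumed cut-off, as a fraction of the one-cluster withinss.
pick_k <- function(twithinss, min_gain = 0.05) {
  k_max <- length(twithinss)
  if (k_max == 1 || twithinss[1] <= 0) return(1L)

  # gain[m] = reduction obtained by going from m to m + 1 clusters,
  # expressed as a fraction of the one-cluster withinss.
  gain <- -diff(twithinss) / twithinss[1]

  # First m whose next step gains less than the cut-off; if every
  # step still gains enough, fall back to the largest cluster number.
  small <- which(gain < min_gain)
  if (length(small) == 0) k_max else small[1]
}

# Made-up withinss values, for illustration only.
chosen <- pick_k(c(100, 40, 10, 8, 7))
```

The selected value can then be passed to backtracking.sc.dp in place of a threshold-based choice. Other criteria may suit a given data set better; this rule is only one option.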

See Also

clustering.sc.dp, backtracking.sc.dp

Examples

library(clustering.sc.dp)

# Example: clustering data generated from a random walk with small withinss
x <- matrix(NA_real_, nrow = 100, ncol = 2)
x[1, ] <- c(0, 0)
for (i in 2:100) {
  x[i, 1] <- x[i - 1, 1] + rnorm(1, 0, 0.1)
  x[i, 2] <- x[i - 1, 2] + rnorm(1, 0, 0.1)
}
k <- 10
r <- findwithinss.sc.dp(x, k)

# select the first cluster number where withinss drops below a threshold
# (k_th ends up equal to k if no smaller cluster number reaches it)
thres <- 5.0
k_th <- 1
while (k_th < k && r$twithinss[k_th] > thres) {
  k_th <- k_th + 1
}

# backtrack to recover the optimal clustering for k_th clusters
result <- backtracking.sc.dp(x, k_th, r$backtrack)
plot(x, type = 'b', col = result$cluster)
points(result$centers, pch = 24, bg = (1:k_th))