Estimate number of clusters by bootstrapping stability
Usage
k.select_ref(df, k_range = 2:7, n_ref = 5, B = 100, B_ref = 50, r = 5)
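A minimal sketch of a call, assuming k.select_ref is already available in the session (the package that provides it is not named on this page) and that the result is a list with elements profile and k, as the Value section suggests. The toy data and the seed are illustrative only.

```r
# Toy data: two well-separated Gaussian blobs in two dimensions (illustrative only).
set.seed(1)
df <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 6)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# The call is guarded so the sketch still runs where k.select_ref is not loaded.
if (exists("k.select_ref", mode = "function")) {
  res <- k.select_ref(df, k_range = 2:7, n_ref = 5, B = 100, B_ref = 50, r = 5)
  # Assumption: the result is a list exposing the two documented values.
  print(res$profile)
  print(res$k)
}
```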
Value
profile
vector of ( Smin_diff(k) - ( Smin_diff(k+1) + se(Smin_diff(k+1)) ) ) measures, provided for the researcher's inspection
k
estimated number of clusters
Arguments
df
data.frame of the input dataset
k_range
integer-valued vector of the candidate numbers of clusters k to be tested
n_ref
number of reference distributions to be generated
B
number of bootstrap resamples
B_ref
number of bootstrap resamples for the reference distributions
r
number of runs of k-means
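The n_ref argument sets how many reference datasets are generated, and Details describes each one as having the same range as the original dataset in each dimension. One way this could be sketched is shown here. The helper name make_reference is hypothetical, and drawing each column uniformly over its observed range is an assumption: this page states only that the ranges match.

```r
# Hypothetical helper (not part of the documented API): draw one reference
# dataset whose columns are uniform over the observed range of each column.
make_reference <- function(df) {
  ref <- lapply(df, function(col) {
    rng <- range(col)
    runif(length(col), min = rng[1], max = rng[2])
  })
  as.data.frame(ref)
}

# Illustrative data only.
set.seed(42)
df <- data.frame(x = rnorm(30), y = rnorm(30, mean = 5, sd = 2))

# Generate n_ref independent reference datasets as a list of data frames.
n_ref <- 5
refs <- replicate(n_ref, make_reference(df), simplify = FALSE)
```

Each element of refs has the same dimensions and column names as df, so it can be passed through the same clustering steps as the original data.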
Author
Tianmou Liu
Details
This function uses the out-of-bag scheme to estimate the number of clusters
in a dataset. The function calculates the Smin of the dataset and, at the same time, generates
a reference dataset with the same range as the original dataset in each dimension and calculates
the Smin_ref. The difference between Smin and Smin_ref at each k, Smin_diff(k), is taken into account together with the
standard deviation of the differences. The estimate is the k that maximizes ( Smin_diff(k) - ( Smin_diff(k+1) + se(Smin_diff(k+1)) ) ).
If Smin_diff(k) is less than 0.1 for all k in k_range, the estimate is k = 1.
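The selection rule can be sketched in a few lines of base R, given precomputed values of Smin_diff(k) and se(Smin_diff(k)). The helper select_k and its arguments are hypothetical and are not the package's own code. The sketch assumes that k+1 means the next entry of k_range and that the last k, having no successor, gets no profile entry.

```r
# Hypothetical helper (not the package's code): apply the selection rule to
# precomputed differences and their se values, one entry per k in k_range.
select_k <- function(k_range, smin_diff, se_diff, threshold = 0.1) {
  stopifnot(length(k_range) == length(smin_diff),
            length(k_range) == length(se_diff),
            length(k_range) >= 2)
  n <- length(k_range)
  # profile[i] = Smin_diff(k_i) - (Smin_diff(k_{i+1}) + se(Smin_diff(k_{i+1})))
  # Assumption: the last k has no successor, so it gets no profile entry.
  profile <- smin_diff[-n] - (smin_diff[-1] + se_diff[-1])
  names(profile) <- k_range[-n]
  # Fallback described in Details: every difference is below the threshold.
  if (all(smin_diff < threshold)) {
    return(list(profile = profile, k = 1))
  }
  list(profile = profile, k = k_range[which.max(profile)])
}

# Made-up numbers for illustration only.
res <- select_k(
  k_range   = 2:5,
  smin_diff = c(0.20, 0.45, 0.15, 0.10),
  se_diff   = c(0.02, 0.03, 0.02, 0.01)
)
```

With these made-up numbers the profile is largest at k = 3, because Smin_diff drops sharply from k = 3 to k = 4.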
References
Bootstrapping estimates of stability for clusters, observations and model selection.
Han Yu, Brian Chapman, Arianna DiFlorio, Ellen Eischen, David Gotz, Matthews Jacob and Rachael Hageman Blair.