h2o.gapStatistic: Compute Gap Statistic from H2O Dataset

Description

Compute the gap statistic of an H2O dataset. The gap statistic measures the goodness of fit of a clustering. For each number of clusters k, it compares $\log(W(k))$, the log of the pooled within-cluster sum of squares, with $E^*[\log(W(k))]$, its expectation under a reference distribution, which is estimated via bootstrapping.
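
For reference, the definition given by Tibshirani, Walther and Hastie (2001) can be written in the notation above as follows. This is the paper's formulation, not a statement of how H2O computes the result internally; the symbols $W_b^*(k)$, $\mathrm{sd}(k)$ and $s(k)$ are taken from the paper and do not appear elsewhere on this page.

$$
\mathrm{Gap}(k) = E^*[\log(W(k))] - \log(W(k)), \qquad
E^*[\log(W(k))] \approx \frac{1}{B} \sum_{b=1}^{B} \log(W_b^*(k))
$$

$$
s(k) = \mathrm{sd}(k) \sqrt{1 + \frac{1}{B}}, \qquad
\mathrm{sd}(k) = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \left( \log(W_b^*(k)) - \frac{1}{B} \sum_{b'=1}^{B} \log(W_{b'}^*(k)) \right)^2}
$$

Here $W_b^*(k)$ is the pooled within-cluster sum of squares obtained on the $b$-th reference sample. The paper then selects the smallest $k$ such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s(k+1)$.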

Usage

h2o.gapStatistic(data, cols = "", K = 10,
                  B = 10, boot_frac = 0.1, max_iter = 50, seed = 0)

Arguments

data

An H2OParsedData object.

cols

(Optional) A vector of column names or indices indicating the features to analyze. By default, all columns in the dataset are analyzed.

K

The maximum number of clusters to consider. Must be at least 2.

B

A positive integer indicating the number of Monte Carlo (bootstrap) samples for simulating the reference distribution.

boot_frac

Fraction of data size to replicate in each Monte Carlo simulation.

max_iter

Maximum number of iterations to run in each KMeans fit before stopping.

seed

(Optional) Random number seed for breaking ties between equal probabilities.

Value

A list containing the following components:

log_within_ss

Log of the pooled within-cluster sum of squares per value of k.

boot_within_ss

Monte Carlo bootstrap replicate averages of log_within_ss per value of k.

se_boot_within_ss

Standard error from the Monte Carlo simulated data for each iteration.

gap_stats

Gap statistics per value of k.

k_opt

Optimal number of clusters.
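
To make the relationship between these components concrete, the following is a minimal sketch in base R that follows the procedure in Tibshirani et al. (2001). It is not the H2O implementation: it uses `stats::kmeans` on a local matrix, draws reference data uniformly over each column's observed range (an assumption taken from the paper, which may differ from how `boot_frac` is used in H2O), and the helper names `log_wss` and `ref_sample` are hypothetical. The variable names simply mirror the components listed above.

```r
# Illustrative sketch with base R only; NOT the H2O implementation.
set.seed(1)
x <- scale(as.matrix(iris[, 1:4]))  # numeric features only
K <- 6   # maximum number of clusters to consider
B <- 10  # number of Monte Carlo reference samples

# Hypothetical helper: log of the pooled within-cluster sum of squares
log_wss <- function(m, k) {
  log(kmeans(m, centers = k, nstart = 5, iter.max = 50)$tot.withinss)
}

# Hypothetical helper: reference data drawn uniformly over each
# column's observed range (assumption from the paper)
ref_sample <- function(m) {
  apply(m, 2, function(col) runif(length(col), min(col), max(col)))
}

# log(W(k)) on the observed data, for k = 1..K
log_within_ss <- sapply(1:K, function(k) log_wss(x, k))

# B x K matrix of log(W*(k)) computed on the reference samples
boot <- t(replicate(B, {
  ref <- ref_sample(x)
  sapply(1:K, function(k) log_wss(ref, k))
}))

boot_within_ss <- colMeans(boot)
# Standard deviation with a 1/B denominator, inflated by sqrt(1 + 1/B)
sd_k <- sqrt(colMeans(sweep(boot, 2, boot_within_ss)^2))
se_boot_within_ss <- sd_k * sqrt(1 + 1 / B)
gap_stats <- boot_within_ss - log_within_ss

# Smallest k with Gap(k) >= Gap(k + 1) - s(k + 1); fall back to K
ok <- gap_stats[-K] >= gap_stats[-1] - se_boot_within_ss[-1]
k_opt <- if (any(ok)) which(ok)[1] else K
```

The selection rule in the last two lines is the one proposed in the paper; whether `k_opt` in the returned object is chosen by exactly this rule is not stated on this page.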

References

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Examples

# Currently still in beta, so don't automatically run example
  \dontrun{
    library(h2o)
    localH2O <- h2o.init()  # start or connect to a local H2O instance
    iris.hex <- as.h2o(localH2O, iris)  # upload the iris data to H2O
    gs <- h2o.gapStatistic(iris.hex, K = 10, B = 10)
    gs   # default show displays the number of KMeans runs and the optimal k
    summary(gs)  # gives all model information computed
    plot(gs)  # shows various plots
  }