h2o (version 2.4.3.11)

h2o.gapStatistic: Compute Gap Statistic from H2O Dataset

Description

Compute the gap statistic of an H2O dataset. The gap statistic measures the goodness of fit of a clustering algorithm. For each number of clusters k, it compares $\log(W(k))$, the log of the pooled within-cluster sum of squares, with $E^*[\log(W(k))]$, its expectation under a reference distribution, which is estimated by Monte Carlo (bootstrap) sampling.
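
Following Tibshirani et al. (2001), the comparison can be written as below, where $\mathrm{sd}_k$ is the standard deviation of the $B$ simulated values of $\log(W(k))$. That h2o.gapStatistic applies exactly this selection rule is an assumption here, not something this page states.

$$\mathrm{Gap}(k) = E^*[\log(W(k))] - \log(W(k)), \qquad s_k = \mathrm{sd}_k \sqrt{1 + 1/B}$$

$$\hat{k} = \min\{\, k : \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \,\}$$

A large gap at k means the observed within-cluster dispersion is well below what the reference distribution would produce for the same k.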

Usage

h2o.gapStatistic(data, cols = "", K.max = 10, B = 100, boot_frac = 0.33, seed = 0)

Arguments

data
An H2OParsedData object.
cols
(Optional) A vector of column names or indices indicating the features to analyze. By default, all columns in the dataset are analyzed.
K.max
The maximum number of clusters to consider. Must be at least 2.
B
A positive integer indicating the number of Monte Carlo (bootstrap) samples for simulating the reference distribution.
boot_frac
Fraction of data size to replicate in each Monte Carlo simulation.
seed
(Optional) Random number seed for breaking ties between equal probabilities.

Value

A list containing the following components:

  • log_within_ss: Log of the pooled within-cluster sum of squares per value of k.
  • boot_within_ss: Monte Carlo bootstrap replicate averages of log_within_ss per value of k.
  • se_boot_within_ss: Standard error from the Monte Carlo simulated data for each iteration.
  • gap_stats: Gap statistics per value of k.
  • k_opt: Optimal number of clusters.
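
How these components relate to one another can be sketched in base R, without H2O. This is a minimal illustration, not the H2O implementation. It assumes a uniform reference distribution over each column's range and the selection rule from Tibshirani et al. (2001), and it ignores boot_frac. The variable names simply mirror the components listed above.

```r
# Illustrative sketch only: base R k-means on the built-in iris data.
set.seed(42)
x <- scale(as.matrix(iris[, 1:4]))

# Log of the pooled within-cluster sum of squares for k clusters.
log_wk <- function(m, k) {
  log(kmeans(m, centers = k, nstart = 10, iter.max = 50)$tot.withinss)
}

# One reference data set: uniform draws over each column's observed range
# (an assumption; H2O's reference sampling may differ).
ref_sample <- function(m) {
  apply(m, 2, function(col) runif(length(col), min(col), max(col)))
}

K.max <- 6
B <- 20
ks <- seq_len(K.max)

# Observed log(W(k)) for each k.
log_within_ss <- sapply(ks, function(k) log_wk(x, k))

# K.max x B matrix: log(W(k)) for each k on each reference data set.
boot <- replicate(B, {
  r <- ref_sample(x)
  sapply(ks, function(k) log_wk(r, k))
})

boot_within_ss <- rowMeans(boot)
se_boot_within_ss <- apply(boot, 1, sd) * sqrt(1 + 1 / B)
gap_stats <- boot_within_ss - log_within_ss

# Smallest k whose gap is within one standard error of the next gap.
ok <- gap_stats[-K.max] >= gap_stats[-1] - se_boot_within_ss[-1]
k_opt <- if (any(ok)) which(ok)[1] else K.max
```

After running the sketch, gap_stats holds one value per k and k_opt is the selected number of clusters under the assumed rule.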

Details

IMPORTANT: This method is currently in beta. To use it, you must initialize H2O with the flag beta = TRUE in h2o.init.

References

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society B, 63, 411-423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

See Also

H2OParsedData, h2o.kmeans

Examples

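# This method is still in beta, so the example is not run automatically.
library(h2o)
# Start (or connect to) a local H2O instance with beta features enabled.
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, beta = TRUE)
# Import the iris data set bundled with the h2o package.
irisPath = system.file("extdata", "iris.csv", package = "h2o")
iris.hex = h2o.importFile(localH2O, path = irisPath)
# Compute the gap statistic for up to 10 clusters using 100 Monte Carlo samples.
h2o.gapStatistic(iris.hex, K.max = 10, B = 100)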
