h2o (version 3.10.3.6)

h2o.kmeans: Performs k-means clustering on an H2O dataset.

Description

Performs k-means clustering on an H2O dataset.

Usage

h2o.kmeans(training_frame, x, model_id = NULL, validation_frame = NULL,
  nfolds = 0, keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE, fold_assignment = c("AUTO",
  "Random", "Modulo", "Stratified"), fold_column = NULL,
  ignore_const_cols = TRUE, score_each_iteration = FALSE, k = 1,
  estimate_k = FALSE, user_points = NULL, max_iterations = 10,
  standardize = TRUE, seed = -1, init = c("Random", "PlusPlus",
  "Furthest", "User"), max_runtime_secs = 0,
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
  "Binary", "Eigen"))

Arguments

training_frame
Id of the training data frame (Not required, to allow initial validation of model parameters).
x
A vector containing the character names of the predictors in the model.
model_id
Destination id for this model; auto-generated if not specified.
validation_frame
Id of the validation data frame.
nfolds
Number of folds for N-fold cross-validation (0 to disable or >= 2). Defaults to 0.
keep_cross_validation_predictions
Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
keep_cross_validation_fold_assignment
Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
fold_assignment
Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
fold_column
Column with cross-validation fold index assignment per observation.
ignore_const_cols
Logical. Ignore constant columns. Defaults to TRUE.
score_each_iteration
Logical. Whether to score during each iteration of model training. Defaults to FALSE.
k
The max. number of clusters. If estimate_k is disabled, the model will find k centroids, otherwise it will find up to k centroids. Defaults to 1.
estimate_k
Logical. Whether to estimate the number of clusters (<=k) iteratively and deterministically. Defaults to FALSE.
user_points
User-specified initial cluster centers, one row per center (used with init = "User").
max_iterations
Maximum number of training iterations (if estimate_k is enabled, this limit applies to each inner Lloyd's iteration). Defaults to 10.
standardize
Logical. Standardize columns before computing distances. Defaults to TRUE.
seed
Seed for the random number generator (affects the stochastic parts of the algorithm, which may or may not be enabled by default). Defaults to -1 (time-based random number).
init
Initialization mode. Must be one of: "Random", "PlusPlus", "Furthest", "User". Defaults to Furthest.
max_runtime_secs
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
categorical_encoding
Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen". Defaults to AUTO.
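
As a rough illustration of what standardize refers to, the sketch below uses base R only and does not call H2O. It assumes that standardizing means centering each column to mean 0 and scaling it to unit standard deviation; the data, the column names, and the use of stats::kmeans are illustrative choices, not H2O's implementation.

```r
# Base-R illustration only; no H2O cluster is needed for this block.
set.seed(42)

# Two hypothetical features on very different scales.
df <- data.frame(
  age = rnorm(100, mean = 65, sd = 5),
  vol = rnorm(100, mean = 20, sd = 40)
)

# Center each column to mean 0 and scale it to standard deviation 1.
df_std <- scale(df)

# Per-column summaries after scaling.
col_means <- colMeans(df_std)
col_sds <- apply(df_std, 2, sd)

# Cluster raw and standardized data with stats::kmeans for comparison.
fit_raw <- kmeans(df, centers = 2, nstart = 5)
fit_std <- kmeans(df_std, centers = 2, nstart = 5)
```

Without scaling, the column with the larger spread (vol here) contributes most of the Euclidean distance, so the two fits can assign rows to clusters differently.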

Value

Returns an object of class H2OClusteringModel.

See Also

h2o.cluster_sizes, h2o.totss, h2o.num_iterations, h2o.betweenss, h2o.tot_withinss, h2o.withinss, h2o.centersSTD, h2o.centers
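
The accessor functions listed above can be applied to a fitted k-means model. The following is a minimal sketch that assumes a local H2O cluster can be started with h2o.init(); the object name fit, k = 3, and seed = 1234 are arbitrary illustrative choices, and the comments paraphrase the function names rather than quote their help pages.

```r
library(h2o)
h2o.init()

# Bundled prostate data shipped with the h2o package.
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)

# Fit a small model; k and seed are illustrative values.
fit <- h2o.kmeans(training_frame = prostate.hex, k = 3,
                  x = c("AGE", "RACE", "VOL", "GLEASON"), seed = 1234)

h2o.centers(fit)         # cluster centers
h2o.centersSTD(fit)      # cluster centers on the standardized scale
h2o.cluster_sizes(fit)   # number of rows in each cluster
h2o.num_iterations(fit)  # number of iterations run
h2o.withinss(fit)        # within-cluster sum of squares, per cluster
h2o.tot_withinss(fit)    # total within-cluster sum of squares
h2o.betweenss(fit)       # between-cluster sum of squares
h2o.totss(fit)           # total sum of squares
```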

Examples

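library(h2o)
# Start (or connect to) a local H2O cluster
h2o.init()
# Locate the prostate data set bundled with the package and upload it
prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)
# Fit k-means with 10 clusters on four predictor columns
h2o.kmeans(training_frame = prostate.hex, k = 10, x = c("AGE", "RACE", "VOL", "GLEASON"))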
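
A variant of this example with estimate_k enabled might look like the sketch below. Per the argument descriptions, k then acts as an upper bound on the number of clusters. The object names fit_est and sizes and the seed value are illustrative, and the block assumes a local H2O cluster is available.

```r
library(h2o)
h2o.init()

prosPath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.uploadFile(path = prosPath)

# With estimate_k = TRUE, the model finds up to k centroids instead of exactly k.
fit_est <- h2o.kmeans(training_frame = prostate.hex,
                      x = c("AGE", "RACE", "VOL", "GLEASON"),
                      k = 10, estimate_k = TRUE, seed = 1234)

# Count the clusters actually found; this should be at most k.
sizes <- h2o.cluster_sizes(fit_est)
length(sizes)
```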
