h2o4gpu.kmeans: K-means Clustering

Description

K-means Clustering

Usage

h2o4gpu.kmeans(n_clusters = 8L, init = "k-means++", n_init = 1L,
  max_iter = 300L, tol = 1e-04, precompute_distances = "auto",
  verbose = 0L, random_state = NULL, copy_x = TRUE, n_jobs = 1L,
  algorithm = "auto", gpu_id = 0L, n_gpus = -1L, do_checks = 1L,
  backend = "h2o4gpu")

Arguments

n_clusters

The number of clusters to form as well as the number of centroids to generate.

init

Method for initialization. The signature above defaults to 'k-means++'. 'k-means++': selects initial cluster centers for k-means clustering in a smart way to speed up convergence. Not supported yet - if chosen, scikit-learn's methods are used. 'random': chooses k observations (rows) at random from the data as the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. Not supported yet - if chosen, scikit-learn's methods are used.
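The 'random' strategy and the required shape of an explicit set of centers can be illustrated in base R. This is a minimal sketch, not the h2o4gpu implementation; the names `init_rows` and `init_centers` are hypothetical, and the built-in iris data is used only for illustration.

```r
# Base-R sketch of 'random' initialization: pick k rows as initial centroids.
set.seed(42)                      # fixed seed so the sketch is repeatable
x <- as.matrix(iris[, 1:4])       # 150 observations, 4 numeric features
n_clusters <- 3L

# Sample k distinct row indices and use those rows as the starting centers.
init_rows <- sample(nrow(x), n_clusters)
init_centers <- x[init_rows, , drop = FALSE]

# An explicit set of initial centers has shape (n_clusters, n_features).
dim(init_centers)                 # 3 4
```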

n_init

Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of n_init consecutive runs in terms of inertia. Not supported yet - always runs 1.
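To make "best output in terms of inertia" concrete, the sketch below repeats a clustering several times and keeps the run with the lowest total within-cluster sum of squares. It uses base R's stats::kmeans as a stand-in, since the text above states that h2o4gpu currently always runs once; the variable names are hypothetical.

```r
# Stand-in for n_init using stats::kmeans (not the h2o4gpu backend).
set.seed(1)
x <- as.matrix(iris[, 1:4])
n_init <- 5L

# Run k-means n_init times, each from a different random start.
runs <- lapply(seq_len(n_init), function(i) {
  kmeans(x, centers = 3L, nstart = 1L)
})

# Inertia of a run is its total within-cluster sum of squares.
inertia <- vapply(runs, function(fit) fit$tot.withinss, numeric(1))

# Keep the run with the lowest inertia.
best <- runs[[which.min(inertia)]]
best$tot.withinss
```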

max_iter

Maximum number of iterations of the algorithm.

tol

Relative tolerance to declare convergence.

precompute_distances

Precompute distances (faster but takes more memory). 'auto': do not precompute distances if n_samples * n_clusters > 12 million, which corresponds to about 100 MB of overhead per job using double precision. TRUE: always precompute distances. FALSE: never precompute distances. Not supported yet - always uses 'auto' if running the h2o4gpu version.
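The 100 MB figure follows from simple arithmetic: 12 million distances at 8 bytes per double is 96 MB. A small sketch of that calculation and of the 'auto' rule as described above; both helper functions are hypothetical and not part of h2o4gpu.

```r
# Megabytes needed for a dense n_samples x n_clusters matrix of doubles.
distance_matrix_mb <- function(n_samples, n_clusters, bytes_per_value = 8) {
  n_samples * n_clusters * bytes_per_value / 1e6
}

# The 'auto' rule as described: skip precomputation above 12 million entries.
would_precompute <- function(n_samples, n_clusters, threshold = 12e6) {
  n_samples * n_clusters <= threshold
}

distance_matrix_mb(1.5e6, 8)   # 12 million entries -> 96 MB
would_precompute(1.5e6, 8)     # TRUE: exactly at the threshold
would_precompute(2e6, 8)       # FALSE: 16 million entries
```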

verbose

Logger verbosity level.

random_state

Seed for the random number generator (RandomState). Must be convertible to a 32-bit unsigned integer.
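A 32-bit unsigned integer covers the range 0 to 2^32 - 1. The check below is one way to validate a seed before passing it on; `is_valid_seed` is a hypothetical helper, and treating NULL as "no fixed seed" is an assumption based on the default in the signature.

```r
# Hypothetical validator: is `seed` usable as a 32-bit unsigned integer?
is_valid_seed <- function(seed) {
  if (is.null(seed)) return(TRUE)   # assumption: NULL means no fixed seed
  is.numeric(seed) && length(seed) == 1L && !is.na(seed) &&
    seed == floor(seed) && seed >= 0 && seed <= 2^32 - 1
}

is_valid_seed(42L)    # TRUE
is_valid_seed(-1)     # FALSE: negative
is_valid_seed(2^32)   # FALSE: one past the largest 32-bit unsigned value
is_valid_seed(1.5)    # FALSE: not a whole number
```

Note that R's own integer type is signed and tops out at .Machine$integer.max (2^31 - 1), so larger seeds have to be written as doubles on the R side.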

copy_x

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is TRUE, then the original data is not modified. If FALSE, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Not supported yet - always uses TRUE if running h2o4gpu version.
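The "small numerical differences" come from floating-point rounding when the mean is subtracted and then added back. A minimal base-R illustration, with made-up values chosen to mix very different magnitudes:

```r
# Subtracting and re-adding the mean is not guaranteed to be bit-exact.
x <- c(0.1, 0.2, 0.3, 1e8)
m <- mean(x)
restored <- (x - m) + m

# Equal within floating-point tolerance ...
all.equal(restored, x)
# ... but the largest absolute difference need not be exactly zero.
max(abs(restored - x))
```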

n_jobs

The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Not supported yet - CPU backend not yet implemented.
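The rule for negative values can be written out as a small function. This is a sketch of the formula in the text, not code from h2o4gpu; `effective_jobs` is a hypothetical name, and clamping the result to at least 1 is an added assumption.

```r
# Number of CPUs implied by n_jobs, following n_cpus + 1 + n_jobs.
effective_jobs <- function(n_jobs, n_cpus) {
  stopifnot(n_jobs != 0, n_cpus >= 1)
  if (n_jobs < 0) {
    # Negative values count back from the total: -1 -> all, -2 -> all but one.
    max(n_cpus + 1 + n_jobs, 1)   # assumption: never drop below one job
  } else {
    n_jobs
  }
}

effective_jobs(-1, 8)   # 8: all CPUs
effective_jobs(-2, 8)   # 7: all CPUs but one
effective_jobs(1, 8)    # 1: no parallelism
```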

algorithm

K-means algorithm to use. The classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the triangle inequality, but currently doesn't support sparse data. "auto" chooses "elkan" for dense data and "full" for sparse data. Not supported yet - always uses full if running h2o4gpu version.
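The triangle-inequality shortcut behind "elkan" rests on one bound: if the distance between two centers c1 and c2 is at least twice the distance from a point x to c1, then x cannot be closer to c2 than to c1, so the distance to c2 never needs to be computed. A toy check of that bound with made-up points:

```r
# Euclidean distance between two numeric vectors.
euclid <- function(a, b) sqrt(sum((a - b)^2))

x  <- c(1, 0)   # a data point
c1 <- c(0, 0)   # its currently assigned center
c2 <- c(5, 0)   # a candidate center

d_x_c1  <- euclid(x, c1)    # 1
d_c1_c2 <- euclid(c1, c2)   # 5

# If d(c1, c2) >= 2 * d(x, c1), then d(x, c2) >= d(x, c1) is guaranteed.
can_skip <- d_c1_c2 >= 2 * d_x_c1
can_skip                    # TRUE: no need to compute d(x, c2)
```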

gpu_id

ID of the GPU on which the algorithm should run.

n_gpus

Number of GPUs on which the algorithm should run. A value below 0 uses all available GPUs on the machine. A value of 0 uses no GPUs and runs on the CPU.

do_checks

If set to 0, GPU error checks are not performed.

backend

Which backend to use. Options are 'auto', 'sklearn', 'h2o4gpu'. The backend actually used is saved as an attribute.