sparklyr (version 0.5.6)

ml_kmeans: Spark ML -- K-Means Clustering

Description

Perform k-means clustering on a Spark DataFrame.

Usage

ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x),
  compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options(), ...)

Arguments

x

An object coercible to a Spark DataFrame (typically, a tbl_spark).

centers

The number of cluster centers to compute.

iter.max

The maximum number of iterations to use.

features

The names of the features (terms) to use for the model fit.

compute.cost

Whether to compute the cost (the within-set sum of squared errors) for the fitted k-means model using Spark's computeCost; see the sketch after this argument list.

tolerance

The convergence tolerance for the iterative k-means algorithm.

ml.options

Optional arguments, used to affect the model generated. See ml_options for more details.

...

Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with dplyr's do.
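
For example, here is a minimal sketch of a call that exercises these arguments, assuming a local Spark connection and the iris dataset copied into Spark (copy_to translates the dots in the iris column names to underscores; the model$centers and model$cost fields read at the end are assumptions about how the fitted object stores its results):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy iris into Spark; column names such as Petal.Length become Petal_Length
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Cluster on two features; compute.cost = TRUE additionally records the
# within-set sum of squared errors via Spark's computeCost
model <- ml_kmeans(iris_tbl,
                   centers      = 3,
                   iter.max     = 100,
                   features     = c("Petal_Length", "Petal_Width"),
                   compute.cost = TRUE)

model$centers   # cluster centers (assumed field name)
model$cost      # within-set sum of squared errors (assumed field name)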

Value

An ml_model object of class kmeans, with overloaded print, fitted, and predict functions.
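
A short sketch of these overloaded functions in use, continuing from the model and iris_tbl objects fitted above (the exact shapes of the fitted() and predict() return values are assumptions):

# Print a summary of the fitted model, including the cluster centers
print(model)

# Cluster assignments for the training data
head(fitted(model))

# Score a Spark DataFrame; here the training table is simply reused
predict(model, iris_tbl)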

References

Bahmani et al., Scalable K-Means++, VLDB 2012

See Also

For information on how Spark k-means clustering is implemented, please see http://spark.apache.org/docs/latest/mllib-clustering.html#k-means.

Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_lda, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression