ml_kmeans
Spark ML -- K-Means Clustering
Perform k-means clustering on a Spark DataFrame.
Usage
ml_kmeans(x, centers, iter.max = 100, features = tbl_vars(x),
compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options(), ...)
Arguments
- x
An object coercible to a Spark DataFrame (typically, a tbl_spark).
- centers
The number of cluster centers to compute.
- iter.max
The maximum number of iterations to use.
- features
The names of the features (terms) to use for the model fit.
- compute.cost
Whether to compute the cost for the k-means model using Spark's computeCost.
- tolerance
The convergence tolerance for iterative algorithms.
- ml.options
Optional arguments used to affect the model generated. See ml_options for more details.
- ...
Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with do.
Value
An ml_model object of class kmeans, with overloaded print, fitted, and predict functions.
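Examples
A minimal usage sketch, assuming sparklyr and a local Spark installation are available. Note that copy_to normalizes the dots in the iris column names to underscores (e.g. Petal_Width); the feature columns used here are chosen purely for illustration.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy the built-in iris data frame to Spark
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# Fit k-means with 3 cluster centers on two features
model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3)

# Inspect the fitted cluster centers, then score the data
print(model)
predicted <- predict(model, iris_tbl)

spark_disconnect(sc)
```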
References
Bahmani et al., Scalable K-Means++, VLDB 2012
See Also
For information on how Spark k-means clustering is implemented, please see http://spark.apache.org/docs/latest/mllib-clustering.html#k-means.
Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_lda, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression