Fit a Latent Dirichlet Allocation (LDA) model to a Spark DataFrame.
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features),
  alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options(), ...)

x: An object coercible to a Spark DataFrame (typically, a
tbl_spark).
features: The names of the features (terms) to use for the model fit.
k: The number of topics to estimate.
alpha: Concentration parameter for the prior placed on documents' distributions over topics. This is a single value that is replicated to a vector of length k during fitting (the EM optimizer currently supports only symmetric priors, so all entries of the vector must be equal). For the Expectation-Maximization optimizer, values should be > 1.0. By default alpha = (50 / k) + 1: 50/k is a common default in LDA libraries, and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
beta: Concentration parameter for the prior placed on topics' distributions over terms. For the Expectation-Maximization optimizer, the value should be > 1.0. By default beta = 0.1 + 1: 0.1 gives a small amount of smoothing, and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
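To make these defaults concrete, the prior parameters implied by a given k can be computed directly (plain R, no Spark required; k = 10 is an arbitrary illustrative choice):

```r
k <- 10                 # number of topics
alpha <- (50 / k) + 1   # documents-over-topics prior: 50/10 + 1 = 6
beta  <- 0.1 + 1        # topics-over-terms prior: 1.1

# Both defaults exceed 1.0, as required by the EM optimizer
stopifnot(alpha > 1, beta > 1)
```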
ml.options: Optional arguments used to affect the model generated. See
ml_options for more details.
...: Optional arguments. The data argument can be used to
specify the data to be used when x is a formula; this allows calls
of the form ml_linear_regression(y ~ x, data = tbl), and is
especially useful in conjunction with do.
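A minimal usage sketch, assuming a local Spark installation and an in-memory data frame of per-document term counts; the dataset name doc_terms and the choice of k are illustrative, not part of the API:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is installed locally)
sc <- spark_connect(master = "local")

# 'doc_terms' is an illustrative R data frame with one numeric
# count column per term; copy it into Spark as a tbl_spark
doc_terms_tbl <- copy_to(sc, doc_terms, overwrite = TRUE)

# Fit an LDA model with 5 topics over all term columns
# (features defaults to every column of the input)
model <- ml_lda(doc_terms_tbl, k = 5)

spark_disconnect(sc)
```

Because features defaults to dplyr::tbl_vars(x), any non-term columns (such as a document identifier) should be dropped with select() before fitting.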
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
Other Spark ML routines: ml_als_factorization,
ml_decision_tree,
ml_generalized_linear_regression,
ml_gradient_boosted_trees,
ml_kmeans,
ml_linear_regression,
ml_logistic_regression,
ml_multilayer_perceptron,
ml_naive_bayes,
ml_one_vs_rest, ml_pca,
ml_random_forest,
ml_survival_regression