ml_lda
Spark ML -- Latent Dirichlet Allocation
Fit a Latent Dirichlet Allocation (LDA) model to a Spark DataFrame.
Usage
ml_lda(x, features = tbl_vars(x), k = length(features),
  alpha = (50 / k) + 1, beta = 0.1 + 1, ml.options = ml_options(), ...)
Arguments
- x
An object coercible to a Spark DataFrame (typically, a tbl_spark).

- features
The names of the features (terms) to use for the model fit.
- k
The number of topics to estimate.
- alpha
Concentration parameter for the prior placed on documents' distributions over topics. This is a single value that is replicated to a symmetric vector of length k during fitting (the EM optimizer currently supports only symmetric priors, so all entries must be equal). For the Expectation-Maximization optimizer, values should be > 1.0. The default is alpha = (50 / k) + 1, where 50/k is common in LDA libraries and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM. See the sketch after this list for how the defaults are computed.

- beta
Concentration parameter for the prior placed on topics' distributions over terms. For the Expectation-Maximization optimizer, the value should be > 1.0. The default is beta = 0.1 + 1, where 0.1 gives a small amount of smoothing and the +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.

- ml.options
Optional arguments used to affect the generated model. See ml_options for more details.

- ...
Optional arguments. The data argument can be used to specify the data to be used when x is a formula; this allows calls of the form ml_linear_regression(y ~ x, data = tbl), and is especially useful in conjunction with do.
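The snippet below is a minimal sketch of how the default priors are computed for a chosen number of topics. The lda_default_priors helper is illustrative only and is not part of the sparklyr API.

# Illustrative helper mirroring the documented defaults (not a sparklyr function)
lda_default_priors <- function(k) {
  list(
    alpha = (50 / k) + 1,  # symmetric document-topic prior; must be > 1 for the EM optimizer
    beta  = 0.1 + 1        # topic-term prior with light smoothing; must be > 1 for EM
  )
}
lda_default_priors(k = 4)
# For k = 4 this gives alpha = 13.5 and beta = 1.1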
Note
The topics' distributions over terms are called "beta" in the original LDA paper by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
For the terminology used in the LDA model, see the Spark LDA documentation.
References
Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003 (original LDA paper, journal version).
Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.
See Also
Other Spark ML routines: ml_als_factorization, ml_decision_tree, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_kmeans, ml_linear_regression, ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_one_vs_rest, ml_pca, ml_random_forest, ml_survival_regression
Examples
# NOT RUN {
library(janeaustenr)
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
austen_books <- austen_books()
books_tbl <- sdf_copy_to(sc, austen_books, overwrite = TRUE)
first_tbl <- books_tbl %>% filter(nchar(text) > 0) %>% head(100)
first_tbl %>%
  ft_tokenizer("text", "tokens") %>%
  ft_count_vectorizer("tokens", "features") %>%
  ml_lda("features", k = 4)
# }