Implements unsupervised Latent Dirichlet allocation (LDA). Users can run Sequential LDA by setting gamma > 0.
textmodel_lda(
  x,
  k = 10,
  max_iter = 2000,
  auto_iter = FALSE,
  alpha = 0.5,
  beta = 0.1,
  gamma = 0,
  adjust_alpha = 0,
  model = NULL,
  update_model = FALSE,
  batch_size = 1,
  verbose = quanteda_options("verbose")
)
Returns a list of model parameters:
the number of topics.
the number of iterations in Gibbs sampling.
the maximum number of iterations in Gibbs sampling.
the use of auto_iter.
the value of adjust_alpha.
the smoothing parameter for theta.
the smoothing parameter for phi.
the amount of adjustment for adjust_alpha.
the gamma parameter for Sequential LDA.
the distribution of words over topics.
the distribution of topics over documents.
the raw frequency count of words assigned to topics.
the original input of x.
the command used to execute the function.
the version of the seededlda package.
x: the dfm on which the model will be fit.
k: the number of topics.
max_iter: the maximum number of iterations in Gibbs sampling.
auto_iter: if TRUE, stops Gibbs sampling on convergence before reaching max_iter. See details.
alpha: the values to smooth the topic-document distribution.
beta: the values to smooth the topic-word distribution.
gamma: a parameter to determine the change of topics between sentences or paragraphs. When gamma > 0, Gibbs sampling of topics for the current document is affected by the previous document's topics (see the sketch after this list).
adjust_alpha: [experimental] if adjust_alpha > 0, automatically adjusts alpha by the size of the topics. The smallest value of the adjusted alpha will be alpha * (1 - adjust_alpha).
model: a fitted LDA model; if provided, textmodel_lda() inherits parameters from the existing model. See details.
update_model: if TRUE, updates the terms of model to recognize unseen words.
batch_size: splits the corpus into smaller batches (specified as a proportion) for distributed computing; it is disabled when a batch includes all the documents (batch_size = 1.0). See details.
verbose: logical; if TRUE, prints diagnostic information during fitting.
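When gamma > 0 the model behaves as Sequential LDA, so each document should be a short unit such as a sentence or paragraph. A minimal sketch, assuming the corpus is reshaped to sentences first (the object names and the value gamma = 0.5 are illustrative, not from the package documentation):
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 100)
# reshape so that consecutive "documents" are consecutive sentences
corp_sent <- corpus_reshape(corp, to = "sentences")
dfmt_sent <- dfm(tokens(corp_sent, remove_punct = TRUE)) %>%
  dfm_remove(stopwords("en"))
# gamma > 0: topic sampling for a sentence is affected by the previous sentence
lda_seq <- textmodel_lda(dfmt_sent, k = 6, gamma = 0.5, max_iter = 500)
terms(lda_seq)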
If auto_iter = TRUE, the iteration stops even before max_iter when delta <= 0. delta measures the change in the number of words whose topics are updated by the Gibbs sampler in every 100 iterations, as shown in the verbose message.
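For illustration, a minimal sketch of early stopping (the object names and settings below are illustrative, not from the package documentation):
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 200)
dfmt <- dfm(tokens(corp, remove_punct = TRUE)) %>%
  dfm_remove(stopwords("en"))
# with auto_iter = TRUE, sampling may stop before max_iter once delta <= 0;
# verbose = TRUE prints delta every 100 iterations
lda <- textmodel_lda(dfmt, k = 6, max_iter = 2000, auto_iter = TRUE, verbose = TRUE)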
If batch_size < 1.0, the corpus is partitioned into sub-corpora of ndoc(x) * batch_size documents for Gibbs sampling in sub-processes, with synchronization of parameters in every 10 iterations. Parallel processing is more efficient when batch_size is small (e.g. 0.01). The algorithm is the Approximate Distributed LDA proposed by Newman et al. (2009). Users can change the number of sub-processes used for the parallel computing via options(seededlda_threads).
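A sketch of the distributed setting, assuming dfmt from the sketch above (the thread count and batch size are illustrative):
options(seededlda_threads = 4)  # number of sub-processes for parallel sampling
# each batch holds ndoc(dfmt) * 0.01 documents, sampled in parallel sub-processes
lda_dist <- textmodel_lda(dfmt, k = 6, batch_size = 0.01, max_iter = 500)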
set.seed()
should be called immediately before textmodel_lda()
or
textmodel_seededlda()
to control random topic assignment. If the random
number seed is the same, the serial algorithm produces identical results;
the parallel algorithm produces non-identical results because it
classifies documents in different orders using multiple processors.
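A sketch of reproducible fitting with the serial algorithm, assuming dfmt from the sketches above (the seed value is illustrative):
set.seed(1234)
lda1 <- textmodel_lda(dfmt, k = 6, max_iter = 500)
set.seed(1234)
lda2 <- textmodel_lda(dfmt, k = 6, max_iter = 500)
# with the default batch_size = 1 (serial algorithm), the two fits should match
identical(topics(lda1), topics(lda2))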
To predict topics of new documents (i.e. out-of-sample), first, create a new LDA model from an existing LDA model passed to model in textmodel_lda(); second, apply topics() to the new model. The model argument takes objects created either by textmodel_lda() or textmodel_seededlda().
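A sketch of out-of-sample prediction (the split into old and new documents and the object names are illustrative):
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 500)
dfmt_all <- dfm(tokens(corp, remove_punct = TRUE)) %>%
  dfm_remove(stopwords("en"))
dfmt_old <- head(dfmt_all, 450)  # in-sample documents
dfmt_new <- tail(dfmt_all, 50)   # out-of-sample documents
lda_old <- textmodel_lda(dfmt_old, k = 6, max_iter = 500)
# the new model inherits parameters (including k) from lda_old
lda_new <- textmodel_lda(dfmt_new, model = lda_old, max_iter = 500)
topics(lda_new)  # topics of the new documents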
Newman, D., Asuncion, A., Smyth, P., & Welling, M. (2009). Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, 10, 1801–1828.
require(seededlda)
require(quanteda)
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
  dfm_remove(stopwords("en"), min_nchar = 2) %>%
  dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
lda <- textmodel_lda(dfmt, k = 6, max_iter = 500) # 6 topics
terms(lda)
topics(lda)