Implementing seeded (or guided) LDA models and transfer learning means that
topics cannot be initialized with a uniform-random start. This function prepares
the data and then calls a C++ function, create_lexicon, which runs a single
Gibbs iteration to populate the topic counts (and other objects) used during the
main Gibbs sampling run of fit_lda_c. If you aren't using seeding or transfer
learning, this function makes a random initialization by sampling from Dirichlet
distributions parameterized by the priors alpha and eta.
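For intuition, here is a minimal sketch in base R of that random-initialization
path. It is illustrative only: in the package the work happens in C++ inside
create_lexicon, and the helper name init_random_sketch is hypothetical.

# Sketch: draw each document's topic proportions from a Dirichlet prior,
# then sample a topic for every token instance. Hypothetical helper, not
# tidylda's API. 'dtm' is a dgCMatrix (documents x tokens), 'k' the number
# of topics, 'alpha' a length-k numeric prior.
library(Matrix)

init_random_sketch <- function(dtm, k, alpha) {
  Zd <- vector("list", nrow(dtm))
  for (j in seq_len(nrow(dtm))) {
    g <- rgamma(k, shape = alpha)      # Dirichlet draw via normalized gammas
    theta_j <- g / sum(g)
    n_j <- sum(dtm[j, ])               # total token instances in document j
    Zd[[j]] <- sample(0:(k - 1), size = n_j, replace = TRUE, prob = theta_j)
  }
  Zd                                   # zero-indexed topic assignments per doc
}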
initialize_topic_counts(
dtm,
k,
alpha,
eta,
beta_initial = NULL,
theta_initial = NULL,
freeze_topics = FALSE,
threads = 1,
...
)

Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are
used by fit_lda_c.
docs is a list with one element per document. Each element is an integer
vector of length sum(dtm[j, ]) for the j-th document, and its entries are the
zero-indexed columns of the dtm.
Zd is a list in the same format as docs, except that its integer values are
zero-indexed topics.
Cd is a matrix of integers counting how many times each topic has been
sampled in each document.
Cv is similar to Cd but counts how many times each topic has been sampled
for each token.
Ck is an integer vector counting how many times each topic has been sampled
overall.
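To make the relationship among these objects concrete, here is a sketch of how
the counts could be tallied from docs and Zd in plain R. The orientations of
Cd and Cv and the helper name make_counts are assumptions; the package
computes these in C++.

# Tally Cd, Cv, and Ck from zero-indexed 'docs' and 'Zd' (as described above).
# Orientations assumed: Cd is documents x topics, Cv is topics x tokens.
make_counts <- function(docs, Zd, k, n_vocab) {
  Cd <- matrix(0L, nrow = length(docs), ncol = k)  # topic counts per document
  Cv <- matrix(0L, nrow = k, ncol = n_vocab)       # topic counts per token
  Ck <- integer(k)                                 # overall topic counts
  for (j in seq_along(docs)) {
    for (i in seq_along(docs[[j]])) {
      v <- docs[[j]][i] + 1L  # zero-indexed token -> R's 1-based index
      t <- Zd[[j]][i] + 1L    # zero-indexed topic -> R's 1-based index
      Cd[j, t] <- Cd[j, t] + 1L
      Cv[t, v] <- Cv[t, v] + 1L
      Ck[t] <- Ck[t] + 1L
    }
  }
  list(Cd = Cd, Cv = Cv, Ck = Ck)
}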
Arguments:

dtm: a document term matrix or term co-occurrence matrix of class dgCMatrix.
k: the number of topics.
alpha: the numeric vector prior for topics over documents, as formatted by
format_alpha.
eta: the numeric matrix prior for tokens over topics, as formatted by
format_eta.
beta_initial: if specified, a numeric matrix giving the probability of tokens
in topics. Must be specified for predictions or updates, as called by
predict.tidylda or refit.tidylda respectively.
theta_initial: if specified, a numeric matrix giving the probability of
topics in documents. Must be specified for updates, as called by
refit.tidylda.
freeze_topics: if TRUE, does not update counts of tokens in topics. This is
TRUE for predictions.
threads: number of parallel threads; currently unused.
...: additional arguments, currently unused.
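A minimal usage sketch follows. The shapes passed for alpha and eta are
assumptions about what format_alpha and format_eta produce, and if the
function is not exported it may need the tidylda::: prefix.

library(Matrix)

set.seed(42)
# Toy document term matrix: 10 documents x 50 tokens, small positive counts
dtm <- rsparsematrix(10, 50, density = 0.2,
                     rand.x = function(n) rpois(n, 2) + 1)  # a dgCMatrix

k <- 3
counts <- initialize_topic_counts(
  dtm   = dtm,
  k     = k,
  alpha = rep(0.1, k),                # assumed shape of format_alpha output
  eta   = matrix(0.05, k, ncol(dtm))  # assumed shape of format_eta output
)

str(counts, max.level = 1)  # list of 5: docs, Zd, Cd, Cv, Ck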