Implementing seeded (or guided) LDA models and transfer learning means that
topics cannot be initialized with a uniform-random start. This function prepares
the data and then calls a C++ function, create_lexicon, which runs a single
Gibbs iteration to populate the topic counts (and other objects) used during the
main Gibbs sampling run of fit_lda_c. If you aren't using seeding or transfer
learning, this function makes a random initialization by sampling from Dirichlet
distributions parameterized by the priors alpha and eta.
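For intuition, here is a minimal sketch in base R of that random-initialization
path. It is illustrative only: in the package the work happens in C++ inside
create_lexicon, and the helper name init_random_sketch is hypothetical.

# Sketch: draw each document's topic proportions from a Dirichlet prior,
# then sample a topic for every token instance. Hypothetical helper, not
# tidylda's API. 'dtm' is a dgCMatrix (documents x tokens), 'k' the number
# of topics, 'alpha' a length-k numeric prior.
library(Matrix)

init_random_sketch <- function(dtm, k, alpha) {
  Zd <- vector("list", nrow(dtm))
  for (j in seq_len(nrow(dtm))) {
    g <- rgamma(k, shape = alpha)      # Dirichlet draw via normalized gammas
    theta_j <- g / sum(g)
    n_j <- sum(dtm[j, ])               # total token instances in document j
    Zd[[j]] <- sample(0:(k - 1), size = n_j, replace = TRUE, prob = theta_j)
  }
  Zd                                   # zero-indexed topic assignments per doc
}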
initialize_topic_counts(
dtm,
k,
alpha,
eta,
beta_initial = NULL,
theta_initial = NULL,
freeze_topics = FALSE,
threads = 1,
...
)

Returns a list with 5 elements: docs, Zd, Cd, Cv, and Ck. All of these are
used by fit_lda_c.
docs is a list with one element per document. Each element is an integer
vector of length sum(dtm[j, ]) for the j-th document, and its entries are the
zero-indexed columns of the dtm.
Zd is a list in the same format as docs, except that its integer values are
zero-indexed topics.
Cd is a matrix of integers counting how many times each topic has been
sampled in each document.
Cv is similar to Cd but counts how many times each topic has been sampled
for each token.
Ck is an integer vector counting how many times each topic has been sampled
overall.
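To make the relationship among these objects concrete, here is a sketch of how
the counts could be tallied from docs and Zd in plain R. The orientations of
Cd and Cv and the helper name make_counts are assumptions; the package
computes these in C++.

# Tally Cd, Cv, and Ck from zero-indexed 'docs' and 'Zd' (as described above).
# Orientations assumed: Cd is documents x topics, Cv is topics x tokens.
make_counts <- function(docs, Zd, k, n_vocab) {
  Cd <- matrix(0L, nrow = length(docs), ncol = k)  # topic counts per document
  Cv <- matrix(0L, nrow = k, ncol = n_vocab)       # topic counts per token
  Ck <- integer(k)                                 # overall topic counts
  for (j in seq_along(docs)) {
    for (i in seq_along(docs[[j]])) {
      v <- docs[[j]][i] + 1L  # zero-indexed token -> R's 1-based index
      t <- Zd[[j]][i] + 1L    # zero-indexed topic -> R's 1-based index
      Cd[j, t] <- Cd[j, t] + 1L
      Cv[t, v] <- Cv[t, v] + 1L
      Ck[t] <- Ck[t] + 1L
    }
  }
  list(Cd = Cd, Cv = Cv, Ck = Ck)
}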
Arguments:

dtm: a document term matrix or term co-occurrence matrix of class dgCMatrix.
k: the number of topics.
alpha: the numeric vector prior for topics over documents, as formatted by
format_alpha.
eta: the numeric matrix prior for tokens over topics, as formatted by
format_eta.
beta_initial: if specified, a numeric matrix giving the probability of tokens
in topics. Must be specified for predictions or updates, as called by
predict.tidylda or refit.tidylda respectively.
theta_initial: if specified, a numeric matrix giving the probability of
topics in documents. Must be specified for updates, as called by
refit.tidylda.
freeze_topics: if TRUE, does not update counts of tokens in topics. This is
TRUE for predictions.
threads: number of parallel threads; currently unused.
...: additional arguments, currently unused.
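A minimal usage sketch follows. The shapes passed for alpha and eta are
assumptions about what format_alpha and format_eta produce, and if the
function is not exported it may need the tidylda::: prefix.

library(Matrix)

set.seed(42)
# Toy document term matrix: 10 documents x 50 tokens, small positive counts
dtm <- rsparsematrix(10, 50, density = 0.2,
                     rand.x = function(n) rpois(n, 2) + 1)  # a dgCMatrix

k <- 3
counts <- initialize_topic_counts(
  dtm   = dtm,
  k     = k,
  alpha = rep(0.1, k),                # assumed shape of format_alpha output
  eta   = matrix(0.05, k, ncol(dtm))  # assumed shape of format_eta output
)

str(counts, max.level = 1)  # list of 5: docs, Zd, Cd, Cv, Ck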