Runs one pass of the Gibbs sampler and performs the remaining setup needed to initialize the sampler's objects.
Works in concert with initialize_topic_counts.
create_lexicon(Cd_in, Beta_in, dtm_in, alpha, freeze_topics)
Returns a list with five entries.
Docs is a list of vectors. Each element is a document, and the contents
are indices for tokens. Used as an iterator for the Gibbs sampler.
Zd is a list of vectors, similar to Docs. However, its contents are topic
assignments of each document/token pair. Used as an iterator for Gibbs
sampling.
Cd is a matrix counting the number of times each topic is sampled per
document.
Cv is a matrix counting the number of times each topic is sampled per token.
Ck is a vector counting the total number of times each topic is sampled overall.
Cd, Cv, and Ck are all derived from Zd (Cv additionally uses the token indices in Docs).
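The relationship between Zd and the three count structures can be sketched in plain C++. This is a minimal illustration, not the package's implementation: the struct TopicCounts, the function tally_counts, and the parameters Nk (number of topics) and Nv (vocabulary size) are hypothetical names, and the orientation chosen here (Cd as documents by topics, Cv as topics by tokens) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three count structures (illustration only).
struct TopicCounts {
  std::vector<std::vector<long>> Cd;  // assumed layout: documents x topics
  std::vector<std::vector<long>> Cv;  // assumed layout: topics x tokens
  std::vector<long> Ck;               // one total per topic
};

// Rebuild Cd, Cv, and Ck by walking Docs and Zd in lockstep.
// Docs[d][n] is the token index of the n-th token in document d;
// Zd[d][n] is the topic assigned to that same token.
TopicCounts tally_counts(const std::vector<std::vector<std::size_t>>& Docs,
                         const std::vector<std::vector<std::size_t>>& Zd,
                         std::size_t Nk, std::size_t Nv) {
  assert(Docs.size() == Zd.size());
  TopicCounts out;
  out.Cd.assign(Docs.size(), std::vector<long>(Nk, 0));
  out.Cv.assign(Nk, std::vector<long>(Nv, 0));
  out.Ck.assign(Nk, 0);
  for (std::size_t d = 0; d < Docs.size(); ++d) {
    assert(Docs[d].size() == Zd[d].size());
    for (std::size_t n = 0; n < Docs[d].size(); ++n) {
      const std::size_t v = Docs[d][n];  // token index
      const std::size_t k = Zd[d][n];    // topic assignment
      assert(v < Nv && k < Nk);
      ++out.Cd[d][k];
      ++out.Cv[k][v];
      ++out.Ck[k];
    }
  }
  return out;
}
```

Every token contributes exactly one increment to each structure, so the entries of Ck sum to the total number of tokens in the corpus.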
Cd_in: IntegerMatrix of topic counts in documents
Beta_in: NumericMatrix of word probabilities in topics
dtm_in: arma::sp_mat document term matrix
alpha: NumericVector prior for topics over documents
freeze_topics: bool; set to TRUE when making predictions
Arguments ending in _in are copied and their copies modified in
some way by this function. In the case of Cd_in and Beta_in,
the only modification is that they are converted from matrices to nested
std::vector for speed, reliability, and thread safety. dtm_in
is transposed for speed when looping over columns.
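As a rough illustration of the matrix-to-nested-vector conversion described above, the sketch below uses a flat column-major buffer as a stand-in for an R matrix (R stores matrices in column-major order). The function name to_nested and its parameters are hypothetical; the actual function operates on Rcpp types and may perform the copy differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a column-major flat buffer (stand-in for an R matrix) into a
// nested std::vector indexed as out[row][col]. The input is not modified.
std::vector<std::vector<double>> to_nested(const std::vector<double>& col_major,
                                           std::size_t nrow,
                                           std::size_t ncol) {
  assert(col_major.size() == nrow * ncol);
  std::vector<std::vector<double>> out(nrow, std::vector<double>(ncol, 0.0));
  for (std::size_t j = 0; j < ncol; ++j) {
    for (std::size_t i = 0; i < nrow; ++i) {
      // Element (i, j) of a column-major matrix lives at offset j * nrow + i.
      out[i][j] = col_major[j * nrow + i];
    }
  }
  return out;
}
```

Because the result is an independent copy, later modifications to it leave the caller's original matrix untouched.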