refit allows you to (a) update the probabilities (i.e. weights) of
a previously-fit model with new data or additional iterations and (b) optionally
use beta of a previously-fit LDA topic model as the eta prior
for the new model. This is tuned by setting beta_as_prior = FALSE or
beta_as_prior = TRUE respectively.
prior_weight tunes how strongly the base model is represented in the prior.
If prior_weight = 1, then the tokens from the base model's training data
have the same relative weight as tokens in new_data. In other words,
it is as if the new data were simply appended to the training data. If
prior_weight is less than 1, then tokens in new_data are given more weight.
If prior_weight is greater than 1, then the tokens from the base model's
training data are given more weight.
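The weighting above can be illustrated with a small numerical sketch. This is not tidylda's exact internal computation; the counts below are hypothetical, and the sketch only shows how prior_weight scales the base model's token counts relative to counts from new_data when the two are combined.

```python
import numpy as np

# Hypothetical token counts for one topic over 4 vocabulary tokens.
base_counts = np.array([40.0, 30.0, 20.0, 10.0])  # from the base model's training data
new_counts = np.array([10.0, 10.0, 10.0, 10.0])   # from new_data

def combined_counts(prior_weight):
    # prior_weight scales how strongly the base model's counts
    # are represented relative to counts from new_data.
    return prior_weight * base_counts + new_counts

# prior_weight = 1: base tokens and new tokens have equal relative weight,
# as if the two corpora were simply concatenated.
equal = combined_counts(1.0)

# prior_weight < 1: new_data dominates the combination.
new_dominates = combined_counts(0.5)

# prior_weight > 1: the base model dominates the combination.
base_dominates = combined_counts(2.0)
```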
If prior_weight is NA, then the new eta is equal to
eta from the old model, with new tokens folded in.
(For handling of new tokens, see below.) Effectively, this just controls
how the sampler initializes (described below), but does not give prior
weight to the base model.
Instead of initializing token-topic assignments in the same manner as for
new models (see tidylda), the update initializes in two
steps:
First, topic-document probabilities (i.e. theta) are obtained by a
call to predict.tidylda using method = "dot"
for the documents in new_data. Next, both beta and theta are
passed to an internal function, initialize_topic_counts,
which assigns topics to tokens in a manner approximately proportional to
the posteriors and executes a single Gibbs iteration.
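The assignment step can be sketched as follows. This is an illustration of assigning topics to tokens in proportion to the posterior, not the code inside initialize_topic_counts; the beta, theta, and document below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posteriors: 3 topics over a 5-token vocabulary.
beta = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                 [0.1, 0.5, 0.2, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.2, 0.5]])  # topic-token probabilities

# One document's topic probabilities, as would come from predict.
theta = np.array([0.6, 0.3, 0.1])

doc_tokens = [0, 1, 1, 4]  # token indices appearing in the document

# Assign each token a topic with probability proportional to
# theta[k] * beta[k, v], i.e. approximately proportional to the posterior.
assignments = []
for v in doc_tokens:
    p = theta * beta[:, v]
    p = p / p.sum()
    assignments.append(int(rng.choice(3, p=p)))
```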
refit handles the addition of new vocabulary by adding a flat prior
over new tokens. Specifically, each entry in the new prior is equal to the
10th percentile of eta from the old model. The resulting model will
have the total vocabulary of the old model plus any new vocabulary tokens.
In other words, after running refit.tidylda, ncol(beta) >= ncol(new_data),
where beta is from the new model and new_data is the additional data.
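The handling of new vocabulary can be sketched numerically. The eta matrix below is made up for the example; the sketch only shows the mechanics of filling new columns with the 10th percentile of the old eta and appending them.

```python
import numpy as np

# Hypothetical old eta: 2 topics over a 5-token vocabulary.
old_eta = np.array([[0.5, 1.0, 2.0, 4.0, 8.0],
                    [0.2, 0.4, 0.8, 1.6, 3.2]])

n_new_tokens = 3  # tokens in new_data not seen by the old model

# Flat prior over new tokens: each entry is the 10th percentile of old eta.
fill = np.quantile(old_eta, 0.10)
new_cols = np.full((old_eta.shape[0], n_new_tokens), fill)

# The new eta spans the old vocabulary plus the new tokens.
new_eta = np.hstack([old_eta, new_cols])
```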
You can add additional topics by setting the additional_k parameter
to an integer greater than zero. New entries to alpha have a flat
prior equal to the median value of alpha in the old model. (Note that
if alpha itself is a flat prior, i.e. scalar, then the new topics have
the same value for their prior.) New entries to eta take their shape
from the average of all previous topics in eta, scaled by
additional_eta_sum.
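A sketch of extending the priors for additional topics follows. The alpha and eta values are invented, and the scaling step reflects one plausible reading of "scaled by additional_eta_sum" (normalize the average shape, then multiply so each new row sums to additional_eta_sum), not tidylda's verified internals.

```python
import numpy as np

# Hypothetical priors from the old model: 3 topics over a 3-token vocabulary.
old_alpha = np.array([0.1, 0.3, 0.2])
old_eta = np.array([[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0]])

additional_k = 2          # number of new topics to add
additional_eta_sum = 6.0  # assumed scaling target for each new eta row

# New alpha entries: a flat prior at the median of the old alpha.
new_alpha = np.concatenate(
    [old_alpha, np.full(additional_k, np.median(old_alpha))]
)

# New eta rows: the average shape of the previous topics,
# normalized and scaled by additional_eta_sum.
shape = old_eta.mean(axis=0)
shape = shape / shape.sum()
new_rows = np.tile(shape * additional_eta_sum, (additional_k, 1))
new_eta = np.vstack([old_eta, new_rows])
```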