
Tools for making and evaluating heldout datasets.
make.heldout(documents, vocab, N = floor(0.1 * length(documents)),
             proportion = 0.5, seed = NULL)
documents: the documents to be modeled, in the stm sparse format (see stm for details; a short sketch of the format follows the arguments).
vocab: the vocabulary, a character vector of the words indexed in documents.
N: the number of documents to be partially held out (default: 10% of the documents).
proportion: the proportion of words within each selected document to be held out.
seed: the seed, set for replicability.
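The documents argument uses stm's sparse bag-of-words representation. Below is a minimal sketch of that format with an invented toy vocabulary (for illustration only; make.heldout is not called here):

# stm's document format: a list of integer matrices, one per document,
# with vocabulary indices in row 1 and the corresponding counts in row 2.
vocab <- c("economy", "election", "policy", "senate")
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),         # "economy" x2, "policy" x1
  rbind(c(2L, 3L, 4L), c(1L, 2L, 1L))  # "election" x1, "policy" x2, "senate" x1
)
# make.heldout() returns documents in this same format with some words
# removed, plus a record of what was held out for eval.heldout().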
These functions are used to create and evaluate heldout likelihood using the document completion method. The basic idea is to hold out some fraction of the words in a set of documents, train the model and use the document-level latent variables to evaluate the probability of the heldout portion. See the example for the basic workflow.
# NOT RUN {
library(stm)

# Prepare a 500-document subsample of the poliblog5k data shipped with stm
prep <- prepDocuments(poliblog5k.docs, poliblog5k.voc, poliblog5k.meta,
                      subsample = 500, lower.thresh = 20, upper.thresh = 200)

# Hold out words from a random subset of documents
heldout <- make.heldout(prep$documents, prep$vocab)
documents <- heldout$documents
vocab <- heldout$vocab
meta <- prep$meta

# Fit a small model on the documents with the heldout words removed
stm1 <- stm(documents, vocab, 5,
            prevalence = ~ rating + s(day),
            init.type = "Random",
            data = meta, max.em.its = 5)

# Evaluate the probability of the heldout words under the fitted model
eval.heldout(stm1, heldout$missing)
# }
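In practice the heldout likelihood is used to compare candidate models. Continuing the example above, here is a minimal sketch, assuming (as in current stm versions) that eval.heldout returns its mean per-token heldout log-likelihood in an expected.heldout component:

# Fit a second model with more topics and compare heldout performance
# (a higher, i.e. less negative, expected.heldout is better).
stm2 <- stm(documents, vocab, 10,
            prevalence = ~ rating + s(day),
            init.type = "Random",
            data = meta, max.em.its = 5)
eval.heldout(stm1, heldout$missing)$expected.heldout
eval.heldout(stm2, heldout$missing)$expected.heldout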