Usage
selectModel(documents, vocab, K,
prevalence, content, data=NULL,
max.em.its=100, verbose=TRUE, init.type="LDA",
emtol=1e-05, seed=NULL, runs=50, frexw=.7,
net.max.em.its=2, netverbose=FALSE, M=10, N=NULL,
to.disk=F, ...)
Arguments
documents
The documents to be modeled. Object must be a list with each
element corresponding to a document. Each document is represented
as an integer matrix with two rows, and columns equal to the number
of unique vocabulary words in the document.
vocab
Character vector specifying the words in the corpus in the order of
the vocab indices in documents. Each term in the vocabulary index must
appear at least once in the documents. See prepDocuments.
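To make the expected input format concrete, here is a minimal base-R sketch of a two-document corpus. The toy vocabulary and counts are invented for illustration. The row layout (first row holds the 1-based position in vocab, second row holds the term count) is an assumption of this sketch, based on how the stm document format is usually described.

```r
# Toy vocabulary (hypothetical); its order defines the indices used in documents.
vocab <- c("economy", "health", "tax", "vote")

# Each document is an integer matrix with two rows and one column per unique term.
# Row 1: 1-based index into vocab (assumed layout). Row 2: term count.
documents <- list(
  matrix(c(1L, 3L, 4L,
           2L, 1L, 1L), nrow = 2, byrow = TRUE),
  matrix(c(2L, 3L,
           1L, 4L), nrow = 2, byrow = TRUE)
)

# Every vocabulary index must appear in at least one document.
used <- sort(unique(unlist(lapply(documents, function(d) d[1, ]))))
all_terms_used <- identical(used, seq_along(vocab))
```

If all_terms_used is FALSE, the vocabulary contains entries that no document references, which is the situation prepDocuments is meant to clean up.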
K
A positive integer (of size 2 or greater) representing the desired
number of topics. Additional detail on choosing the number of topics
is given in the details.
prevalence
A formula object with no response variable or a matrix containing
topic prevalence covariates. Use s(), ns() or bs() to specify
smooth terms. See details for more information.
content
A formula containing a single variable, a factor variable or
something which can be coerced to a factor indicating the
category of the content variable for each document.
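As a small illustration of the two covariate arguments, the base-R sketch below builds one-sided formulas of the kind described above. The column names treatment and pid_rep are hypothetical and stand in for columns of the data argument. Creating the formula does not evaluate s(), so this runs without stm attached.

```r
# Hypothetical covariates; they would need to exist as columns in `data`.
prevalence <- ~ treatment + s(pid_rep)   # one-sided: no response variable
content    <- ~ treatment                # a single variable

# A one-sided formula has length 2: the `~` operator and the right-hand side.
length(prevalence)

# Variables that the formulas expect to find in `data`.
all.vars(prevalence)
all.vars(content)
```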
runs
Total number of STM runs used in the cast net stage. Approximately 15 percent of these runs will be used for running an STM until convergence.
data
Dataset which contains prevalence and content covariates.
init.type
The method of initialization. Must be either Latent Dirichlet
Allocation (LDA), Dirichlet Multinomial Regression Topic Model
(DMR), a random initialization, or a previous STM object.
seed
Seed for the random number generator. stm saves the seed it uses
on every run so that any result can be exactly reproduced. Setting
the seed here simply ensures that the sequence of
models will be exactly the same when respecified. Indi
max.em.its
The maximum number of EM iterations. If convergence has not
been met at this point, a message will be printed.
emtol
Convergence tolerance. EM stops when the relative change in
the approximate bound drops below this level. Defaults to
1e-05 (.001%).
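The stopping rule described for emtol can be sketched in a few lines of base R. The helper bound_converged is hypothetical and is not part of stm. It shows one plausible reading of "relative change in the approximate bound", and the package's internal check may differ in detail.

```r
# Hypothetical helper, not part of stm: TRUE when the relative change in the
# bound between two successive EM iterations falls below emtol.
bound_converged <- function(old_bound, new_bound, emtol = 1e-05) {
  rel_change <- abs((new_bound - old_bound) / old_bound)
  rel_change < emtol
}

bound_converged(-1000, -999)       # relative change of about 1e-3
bound_converged(-1000, -999.999)   # relative change of about 1e-6
```

With the default tolerance, the first call reports that EM should continue and the second reports convergence.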
verbose
A logical flag indicating whether information should be
printed to the screen.
frexw
Weight used to calculate exclusivity. Defaults to .7.
net.max.em.its
Maximum EM iterations used when casting the net. Defaults to 2.
netverbose
Whether verbose should be used when calculating net models.
M
Number of words used to calculate semantic coherence and
exclusivity. Defaults to 10.
N
Total number of models to retain in the end. Defaults to 20 percent of runs.
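The default for N can be worked through with a short base-R sketch. The helper default_n is hypothetical, and rounding to the nearest integer is an assumption of this sketch rather than something the text above states.

```r
# Hypothetical helper, not part of stm: resolve N when it is left as NULL.
# Rounding to the nearest integer is an assumption of this sketch.
default_n <- function(runs, N = NULL) {
  if (is.null(N)) round(0.2 * runs) else N
}

default_n(50)          # default of runs = 50 in the usage above
default_n(50, N = 5)   # an explicit N is used as given
```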
to.disk
Boolean. If TRUE, each model is saved to disk at the current directory in a separate RData file.
This is most useful if one needs to run multiSTM() on a large number of output models.
...
Additional options described in details of stm.
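Putting the arguments together, a call might look like the sketch below. It assumes the stm package and its gadarian example data are installed, along with tm and SnowballC for text preprocessing, and it skips quietly if they are not. The choices K = 3 and runs = 5 are kept small only to keep the run short. They are not recommendations.

```r
needed <- c("stm", "tm", "SnowballC")
if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  library(stm)

  # Build documents, vocab and metadata from the gadarian example data.
  processed <- textProcessor(documents = gadarian$open.ended.response,
                             metadata = gadarian)
  out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

  # Fix the seed so the same sequence of models is produced on a rerun.
  set.seed(2138)
  mod.out <- selectModel(out$documents, out$vocab, K = 3,
                         prevalence = ~ treatment + s(pid_rep),
                         data = out$meta, runs = 5)

  # Retained models are stored in runout; pick one for further analysis.
  selected <- mod.out$runout[[1]]
}
```

Because N is left as NULL here, the number of retained models follows the default described under N.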