prevalence, in the prior for topical content or both. See an overview of functions in the package here: stm-package

stm(documents, vocab, K,
    prevalence, content, data=NULL,
    init.type=c("LDA", "DMR","Random"), seed=NULL,
    max.em.its=100, emtol=1e-5,
    verbose=TRUE, reportevery=5, keepHistory=FALSE,
    LDAbeta=TRUE, interactions=TRUE,
    gamma.prior=c("Pooled", "L1"), sigma.prior=0,
    kappa.prior=c("Jeffreys", "L1"), control=list())

[...] prepDocuments for [...]
seed: [...] stm saves the seed it uses on every run so that any result can be exactly reproduced. When attempting to reproduce a result with that seed, it should be specified here.
keepHistory: [...] FALSE. Note that the model parameters are extremely memory intensive so use with care.
LDAbeta: [...] TRUE when there are no content covariates. When set to FALSE the model performs SAGE style topic updates (sparse deviations from a baseline).
interactions: [...] TRUE. This automatically includes interactions between content covariates and the latent topics. Setting it to FALSE reduces to a model with no interactive effects.
gamma.prior: [...] Pooled option uses Normal prior distributions with a topic-level pooled variance which is given a broad gamma hyperprior. The alternative L1 uses [...]
kappa.prior: [...] Jeffreys and uses a scale mixture of Normals with an improper Jeffreys prior. The option L1 uses glmnet to estimate with a penalty be[...]

[...]s. The response portion of the formula should be left blank. See the examples.
The topical content covariates are those which affect the way in which a topic is discussed. As currently implemented this must be a single variable which defines a discrete partition of the dataset (each document is in one and only one group). We may relax this in the future. While including more covariates in topical prevalence will rarely affect the speed of the model, including additional levels of the content covariate can make the model much slower to converge. This is because the content model operates in the much higher dimensional space of words in the dictionary (which tend to number in the thousands) rather than the space of topics.
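As a rough illustration of the single-variable restriction on content covariates, the sketch below fits a small model in which one two-level variable partitions the documents. It assumes a recent version of the stm package (with tm and SnowballC installed for textProcessor) and its bundled gadarian data. The choice of treatment as the content covariate, the object name mod.content, and the low max.em.its cap are illustrative assumptions, not recommendations from the package authors.

```r
library(stm)

# Prepare the Gadarian open-ended responses shipped with the package.
temp <- textProcessor(documents = gadarian$open.ended.response,
                      metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)

# 'treatment' takes two values, so every document falls in exactly one group.
# max.em.its is capped here only to keep the illustration quick.
set.seed(2138)
mod.content <- stm(out$documents, out$vocab, K = 3,
                   prevalence = ~ treatment + s(pid_rep),
                   content = ~ treatment,
                   data = out$meta, max.em.its = 25, verbose = FALSE)
```

Because the content model works over the whole vocabulary, a covariate with many levels would be expected to slow this fit considerably more than adding further prevalence terms.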
In addition to the default priors for prevalence and content, we also make use of the glmnet package to allow for penalties between the L1 and L2 norm. In these settings we estimate a regularization path and then select the optimal shrinkage parameter using a user-tuneable information criterion. By default, selecting the L1 option applies the L1 penalty and chooses the optimal shrinkage parameter using AIC. The defaults have been specifically tuned for the STM, but almost all the relevant arguments can be changed through the control structure below. Changing the gamma.enet and kappa.enet parameters allows the user to choose a mix between the L1 and L2 norms. When set to 1 (the default) this is the lasso penalty; when set to 0 it is the ridge penalty. Any value in between is a mixture called the elastic net.
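To make the mixing parameter concrete, here is a minimal base R sketch of the elastic net penalty in the parameterization used by glmnet. It assumes that gamma.enet and kappa.enet play the role of glmnet's alpha. The helper enet_penalty and the coefficient vector b are hypothetical and exist only for illustration; they are not part of stm.

```r
# Elastic net penalty as parameterized by glmnet:
#   lambda * ( alpha * sum(|b|) + (1 - alpha) / 2 * sum(b^2) )
# 'enet_penalty' is a hypothetical helper, not an stm or glmnet function.
enet_penalty <- function(b, alpha, lambda = 1) {
  lambda * (alpha * sum(abs(b)) + (1 - alpha) / 2 * sum(b^2))
}

b <- c(0.5, -2, 0)                      # illustrative coefficient vector
lasso <- enet_penalty(b, alpha = 1)     # pure L1: sum of absolute values
ridge <- enet_penalty(b, alpha = 0)     # pure L2: half the sum of squares
mixed <- enet_penalty(b, alpha = 0.5)   # equal mix of the two
```

In an stm call, a mix like this would be requested by selecting gamma.prior="L1" and passing, for example, control=list(gamma.enet=0.5).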
The control argument is a list with named components which can be used to specify numerous additional computational details. Valid components include:
[...]

See also:
prepDocuments
labelTopics
estimateEffect

#An example using the Gadarian data. From Raw text to fitted model.
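# Process the raw open-ended responses; the metadata is carried along with the documents.
temp <- textProcessor(documents=gadarian$open.ended.response, metadata=gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
# Drop infrequent terms and re-index; documents, vocab and metadata are returned together.
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
# Fix the seed, then fit a 3-topic model with prevalence covariates.
set.seed(02138)
mod.out <- stm(docs, vocab, 3, prevalence=~treatment + s(pid_rep), data=meta)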