Estimation of the Structural Topic Model using semi-collapsed variational EM. Covariates can be used in the prior for topic prevalence, in the prior for topical content, or both. See an overview of functions in the package here: stm-package

Usage
stm(documents, vocab, K, prevalence, content, data=NULL,
    init.type=c("LDA", "Random", "Spectral"), seed=NULL, max.em.its=500,
    emtol=1e-5, verbose=TRUE, reportevery=5, LDAbeta=TRUE, interactions=TRUE,
    ngroups=1, model=NULL, gamma.prior=c("Pooled", "L1"), sigma.prior=0,
    kappa.prior=c("L1", "Jeffreys"), control=list())

Arguments

documents: The documents to be modeled. Each document is represented as a two-row matrix of integers in which the first row indexes vocabulary entries and the second row gives the count of each entry in the document. This is similar to the format in the lda package except that (following R convention) the vocabulary is indexed from one. Corpora can be imported using the reader function and manipulated using prepDocuments. Raw texts can be ingested using textProcessor.
vocab: A character vector specifying the words in the corpus in the order of the vocab indices in documents. See prepDocuments for dropping unused items in the vocabulary.
init.type="Spectral" you can also set K=0 to use the algorithm of Lee and Mimno (2014) to set the number of topics (although unlike the standard spectral initialization this is not deterministic). Additional detail on choosing the number of topics below.
seed: Seed for the random number generator. stm saves the seed it uses on every run so that any result can be exactly reproduced. When attempting to reproduce a result with that seed, it should be specified here.
LDAbeta: a logical that defaults to TRUE when there are no content covariates. When set to FALSE the model performs SAGE style topic updates (sparse deviations from a baseline).
interactions: a logical that defaults to TRUE. This automatically includes interactions between content covariates and the latent topics. Setting it to FALSE reduces to a model with no interactive effects.
model: A prefit model object. By passing an stm object to this argument you can restart an existing model. See Details for more info.
gamma.prior: sets the prior estimation method for the prevalence covariate model. The default Pooled option uses Normal prior distributions with a topic-level pooled variance which is given a moderately regularizing half-Cauchy(1,1) prior. The alternative L1 uses glmnet to estimate a grouped penalty between L1 and L2. See Details below.
kappa.prior: sets the prior estimation approach for the content covariate coefficients. The default option is the L1 prior. The second option is Jeffreys, which is markedly less computationally efficient but is included for backwards compatibility. See Details for more information on computation.
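For example, to reproduce a fitted model exactly, the saved seed can be supplied through the seed argument (a minimal sketch, assuming docs, vocab, and meta as constructed in the Examples below):

# fix the seed so the run can be reproduced exactly
mod.out <- stm(docs, vocab, K=3, prevalence=~treatment + s(pid_rep),
               data=meta, seed=02138)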
Details

The most important user input in parametric topic models is the number of topics. There is no right answer to the appropriate number of topics. More topics will give more fine-grained representations of the data at the potential cost of being less precisely estimated. The number must be at least 2, which is equivalent to a unidimensional scaling model. For short corpora focused on very specific subject matter (such as survey experiments) 3-10 topics is a useful starting range. For small corpora (a few hundred to a few thousand documents) 5-50 topics is a good place to start. Beyond these rough guidelines it is application specific. Previous applications in political science with medium-sized corpora (10k to 100k documents) have found 60-100 topics to work well. For larger corpora 100 topics is a useful default size. Of course, your mileage may vary.
When init.type="Spectral" and K=0 the number of topics is set using the algorithm in Lee and Mimno (2014). See the vignette for details. We emphasize here, as we do there, that this does not estimate the "true" number of topics and does not necessarily have any particular statistical properties for consistently estimating the number of topics. It can however provide a useful starting point.
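A minimal sketch of this workflow (assuming docs, vocab, and meta as constructed in the Examples below); note that K=0 requires the spectral initialization:

# let the Lee and Mimno (2014) procedure choose the number of topics
mod.k0 <- stm(docs, vocab, K=0, init.type="Spectral",
              prevalence=~treatment + s(pid_rep), data=meta)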
The model for topical prevalence includes covariates which the analyst believes may influence the frequency with which a topic is discussed. This is specified as a formula which can contain smooth terms using splines or the function s. The response portion of the formula should be left blank. See the examples. These variables can include numeric and factor variables. While including variables of class Dates or other non-numeric, non-factor types will work in stm, it may not always work for downstream functions such as estimateEffect.
The topical content covariates are those which affect the way in which a topic is discussed. As currently implemented this must be a single variable which defines a discrete partition of the dataset (each document is in one and only one group). We may relax this in the future. While including more covariates in topical prevalence will rarely affect the speed of the model, including additional levels of the content covariates can make the model much slower to converge. This is because the model operates in the much higher dimensional space of words in the vocabulary (which tends to number in the thousands) rather than topics.
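A hypothetical sketch of a content covariate model; party here is an assumed factor variable in meta that places each document in exactly one group:

# content covariates must define a discrete partition of the documents
mod.content <- stm(docs, vocab, K=10, prevalence=~treatment,
                   content=~party, data=meta)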
In addition to the default priors for prevalence, we also make use of the glmnet package to allow for penalties between the L1 and L2 norm. In these settings we estimate a regularization path and then select the optimal shrinkage parameter using a user-tuneable information criterion. By default selecting the L1 option will apply the L1 penalty, selecting the optimal shrinkage parameter using AIC. The defaults have been specifically tuned for the STM but almost all the relevant arguments can be changed through the control structure below. Changing the gamma.enet parameter allows the user to choose a mix between the L1 and L2 norms. When set to 1 (the default) this is the lasso penalty; when set to 0 it is the ridge penalty. Any value in between is a mixture called the elastic net.
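For example, a sketch of an elastic net prevalence prior (the mixing value is illustrative):

# an even mix of the lasso (1) and ridge (0) penalties
mod.enet <- stm(docs, vocab, K=10, prevalence=~treatment + s(pid_rep),
                data=meta, gamma.prior="L1",
                control=list(gamma.enet=0.5))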
The default prior choice for content covariates is now the L1 option. This uses an approximation framework developed in Taddy (2013) called Distributed Multinomial Regression which utilizes a factorized Poisson approximation to the multinomial. See Roberts, Stewart and Airoldi (2014) for details on the implementation here. This is dramatically faster than previous versions. The old default setting, which uses a Jeffreys prior, is also available.
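To request the older prior, a sketch (party is again a hypothetical factor in meta):

# Jeffreys prior for content covariate coefficients; slower than L1
mod.jeffreys <- stm(docs, vocab, K=10, content=~party, data=meta,
                    kappa.prior="Jeffreys")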
The argument init.type allows the user to specify an initialization method. The default uses collapsed Gibbs sampling for the LDA model. The choice "Spectral" provides a deterministic initialization using the spectral algorithm given in Arora et al. (2013). See Roberts, Stewart and Tingley (2014) for details and a comparison of different approaches. Particularly when the number of documents is relatively large we highly recommend the Spectral algorithm, which often performs extremely well. Note that the random seed plays no role in the spectral initialization as it is completely deterministic (unless using the K=0 or random projection settings).
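A minimal sketch of the spectral initialization:

# deterministic initialization; the seed plays no role here
mod.spectral <- stm(docs, vocab, K=20, prevalence=~treatment + s(pid_rep),
                    data=meta, init.type="Spectral")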
Specifying an integer greater than 1 for the argument ngroups causes the corpus to be broken into the specified number of groups. Global updates are then computed after each group in turn. This approach, called memoized variational inference in Hughes and Sudderth (2013), can lead to more rapid convergence when the number of documents is large. Note that the memory requirements scale linearly with the number of groups, so this provides a tradeoff between memory efficiency and computational speed.
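A sketch of memoized variational inference (the number of groups is illustrative):

# break the corpus into 5 groups; global updates follow each group
mod.grouped <- stm(docs, vocab, K=20, prevalence=~treatment + s(pid_rep),
                   data=meta, ngroups=5)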
Models can now be restarted by passing an STM object to the argument model. This is particularly useful if you run a model to the maximum iterations and it terminates without converging. Note that all the standard arguments still need to be passed to the function (including any formulas, the number of topics, etc.). Be sure to change the max.em.its argument, or the model will simply complete one additional iteration and stop.
The control argument is a list with named components which can be used to specify numerous additional computational details. Valid components include:
tau.maxit: When the kappa prior is Jeffreys, estimation proceeds by iterating between the kappa vector corresponding to a particular topic and the associated variance tau before moving on to the next parameter vector; this controls the maximum number of iterations. It defaults to NULL, effectively enforcing convergence. When the mode is L1 this sets the maximum number of passes in the coordinate descent algorithm and defaults to 1e8.
tau.tol: Under Jeffreys this sets the convergence tolerance in the iteration between the kappa vector and variances tau and defaults to 1e-5. With L1 it defaults to 1e-6.
kappa.mstepmaxit: Under Jeffreys this controls the maximum number of passes through the sequence of kappa vectors. It defaults to 3. It has no role under L1; see the tau.maxit option instead.
kappa.msteptol: Under Jeffreys this controls the tolerance for convergence (measured by the L1 norm) for the entire M-step. It is set to 0.01 by default. This has no role under mode L1; see the tau.tol option instead.
fixedintercept: a logical indicating whether, in content covariate models, the intercept should be fixed to the background distribution (TRUE) or estimated (FALSE).
kappa.enet: corresponds to alpha in glmnet. The value must be between 0 and 1, where 1 is the lasso penalty (the default) and 0 is the ridge penalty. The closer the parameter is to zero, the less sparse the solution will tend to be.
gamma.enet: as kappa.enet but for the prevalence covariate model; controls the mix between the L1 and L2 norms, where 1 (the default) is the lasso penalty, 0 is the ridge penalty, and values in between give the elastic net.
nlambda: the number of shrinkage parameter values evaluated along the glmnet regularization path.
lambda.min.ratio: the smallest shrinkage parameter value evaluated, expressed as a fraction of the largest value on the glmnet regularization path.
ic.k: the penalty per parameter in the information criterion used to select the glmnet shrinkage parameter. When set to 2 (as by default) this results in AIC. When set to log(n) (where n is the total number of words in the corpus) this is equivalent to BIC. Larger numbers will express a preference for sparser (simpler) models.
nits: the number of iterations of collapsed Gibbs sampling used in the LDA initialization.
burnin: the number of burn-in iterations for the collapsed Gibbs sampling in the LDA initialization.
alpha: the document-topic prior hyperparameter used in the LDA initialization.
eta: the topic-word prior hyperparameter used in the LDA initialization.
contrast: a logical indicating whether a standard contrast coding should be used for the content covariates.
rp.s, rp.p, rp.d.group.size: tuning parameters for the experimental random projections spectral initialization (see SpectralRP below).
SpectralRP: a logical which, when TRUE, turns on the experimental random projections spectral initialization.
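As an illustration, control components are passed as a named list (the values below are arbitrary, and party is the hypothetical factor from above):

# total word count in the corpus, for a BIC-style penalty via ic.k
n <- sum(vapply(docs, function(d) sum(d[2, ]), numeric(1)))
mod.ctrl <- stm(docs, vocab, K=10, content=~party, data=meta,
                control=list(kappa.enet=0.5, ic.k=log(n)))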
References

Roberts, M., Stewart, B., and Airoldi, E. (2015). "A Model of Text for Experimentation in the Social Sciences."

Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural Topic Models for Open-Ended Survey Responses." American Journal of Political Science, 58(4), 1064-1082. http://goo.gl/0x0tHJ

Roberts, M., Stewart, B., and Tingley, D. (Forthcoming). "Navigating the Local Modes of Big Data: The Case of Topic Models." In Data Analytics in Social Science, Government, and Industry. New York: Cambridge University Press.
See Also: prepDocuments, labelTopics, estimateEffect
Examples

## Not run:
# An example using the Gadarian data: from raw text to fitted model.
temp <- textProcessor(documents=gadarian$open.ended.response, metadata=gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
mod.out <- stm(docs, vocab, 3, prevalence=~treatment + s(pid_rep), data=meta)

# An example of restarting a model
mod.out <- stm(docs, vocab, 3, prevalence=~treatment + s(pid_rep),
               data=meta, max.em.its=5)
mod.out2 <- stm(docs, vocab, 3, prevalence=~treatment + s(pid_rep),
                data=meta, model=mod.out, max.em.its=10)
## End(Not run)