
stm (version 0.6.1)

stm: Variational EM for the Structural Topic Model

Description

Estimation of the Structural Topic Model using semi-collapsed variational EM. The function takes a sparse representation of documents, an integer number of topics, and covariates, and returns fitted model parameters. Covariates can be used in the prior for topic prevalence, in the prior for topical content, or both. See an overview of functions in the package here: stm-package

Usage

stm(documents, vocab, K, 
    prevalence, content, data=NULL,
    init.type=c("LDA", "DMR","Random"), seed=NULL, 
    max.em.its=100, emtol=1e-5,
    verbose=TRUE, reportevery=5, keepHistory=FALSE,  
    LDAbeta=TRUE, interactions=TRUE,
    gamma.prior=c("Pooled", "L1"), sigma.prior=0,
    kappa.prior=c("Jeffreys", "L1"), control=list())

Arguments

documents
The documents to be modeled. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows, and columns equal to the number of unique vocabulary words in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears in the document. A toy example of this format is sketched after the argument list.
vocab
Character vector specifying the words in the corpus in the order of the vocab indices in documents. Each term in the vocabulary index must appear at least once in the documents. See prepDocuments for producing objects that meet this requirement.
K
A positive integer (2 or greater) representing the desired number of topics. See Details for guidance on choosing the number of topics.
prevalence
A formula object with no response variable or a matrix containing topic prevalence covariates. Use s, ns or bs to specify smooth terms; see Details for more information.
content
A formula containing a single variable: a factor, or something which can be coerced to a factor, indicating the category of the content covariate for each document.
data
An optional data frame containing the prevalence and/or content covariates. If unspecified, the variables are taken from the active environment.
init.type
The method of initialization. Must be either Latent Dirichlet Allocation (LDA), Dirichlet Multinomial Regression Topic Model (DMR), or a random initialization. If you want to replicate a previous result, see the argument seed.
seed
Seed for the random number generator. stm saves the seed it uses on every run so that any result can be exactly reproduced. When attempting to reproduce a result with that seed, it should be specified here.
max.em.its
The maximum number of EM iterations. If convergence has not been met at this point, a message will be printed.
emtol
Convergence tolerance. EM stops when the relative change in the approximate bound drops below this level. Defaults to 1e-5.
verbose
A logical flag indicating whether information should be printed to the screen. During the E-step (iteration over documents) a dot will print each time 1% of the documents are completed. At the end of each iteration the approximate bound will also be printed.
reportevery
An integer determining the intervals at which labels are printed to the screen during fitting. Defaults to every 5 iterations.
keepHistory
Logical indicating whether the history should be saved at each iteration. Defaults to FALSE. Note that the model parameters are extremely memory intensive so use with care.
LDAbeta
a logical that defaults to TRUE when there are no content covariates. When set to FALSE the model performs SAGE style topic updates (sparse deviations from a baseline).
interactions
a logical that defaults to TRUE. This automatically includes interactions between content covariates and the latent topics. Setting it to FALSE reduces to a model with no interactive effects.
gamma.prior
sets the prior estimation method for the prevalence covariate model. The default Pooled option uses Normal prior distributions with a topic-level pooled variance which is given a broad gamma hyperprior. The alternative L1 uses glmnet to apply an L1 penalty; see Details.
sigma.prior
a scalar between 0 and 1 which defaults to 0. This sets the strength of regularization towards a diagonalized covariance matrix. Setting the value above 0 can be useful if topics are becoming too highly correlated.
kappa.prior
sets the prior estimation for the content covariate coefficients. The default is Jeffreys and uses a scale mixture of Normals with an improper Jeffreys prior. The option L1 uses glmnet to estimate with a penalty between the L1 and L2 norms; see Details.
control
a list of additional parameters that control portions of the optimization. See Details.
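
As a toy illustration of the documents/vocab format described above, the sketch below builds a two-document corpus by hand (the words and counts are invented purely for demonstration):

vocab <- c("economy", "tax", "vote")   # indices in documents refer to positions here
documents <- list(
  rbind(c(1L, 2L),    # first row: 1-indexed vocabulary entries
        c(3L, 1L)),   # second row: counts, so doc 1 is "economy" x3, "tax" x1
  rbind(c(2L, 3L),
        c(2L, 1L))    # doc 2 is "tax" x2, "vote" x1
)
# A list like this, together with vocab, is what stm() expects as its first two arguments.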

Value

  • An object of class STM, with the following components:
  • mu: The corpus mean of topic prevalence and coefficients
  • sigma: Covariance matrix
  • beta: List containing the log of the word probabilities for each topic.
  • settings: The settings file. The seed object will always contain the seed, which can be fed as an argument to recover the model.
  • vocab: The vocabulary vector used.
  • convergence: List of convergence elements, including the value of the approximate bound on the marginal likelihood at each step.
  • theta: Number of documents by number of topics matrix of topic proportions.
  • eta: Matrix of means for the variational distribution of the multivariate normal latent variables used to calculate theta.
  • history: If keepHistory=TRUE, the history of model parameters at each step.
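
For orientation, a brief sketch of inspecting these components on a fitted model (mod.out as produced in the Examples below; component layout as described above):

dim(mod.out$theta)             # documents by topics; each row sums to 1
rowSums(mod.out$theta)[1:3]    # sanity check on the first three documents
str(mod.out$beta)              # list holding the log word probabilities per topic
mod.out$settings$seed          # the saved seed (assuming it is stored under $seed)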

Details

The main function for estimating a Structural Topic Model (STM). STM is an admixture model with covariates in both mixture components. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic, and each document is represented by the proportion of its words that come from each of the K topics. These proportions are found in the N (number of documents) by K (user-specified number of topics) theta matrix. Each of the K topics is represented as a distribution over words. The K-by-V (number of words in the vocabulary) matrix logbeta contains the natural log of the probability of seeing each word conditional on the topic.

The most important user input in parametric topic models is the number of topics. There is no right answer to the appropriate number of topics: more topics give a more fine-grained representation of the data at the potential cost of being less precisely estimated. The number must be at least 2, which is equivalent to a unidimensional scaling model. For short corpora focused on very specific subject matter (such as survey experiments), 3-5 topics is a useful starting range. For small corpora (a few hundred to a few thousand documents), 5-20 topics is a good place to start. Beyond these rough guidelines it is application specific. Previous applications in political science with medium-sized corpora (10k to 100k documents) have found 50-60 topics to work well. For larger corpora, 100 topics is a useful default size. Of course, your mileage may vary.

The model for topical prevalence includes covariates which the analyst believes may influence the frequency with which a topic is discussed. This is specified as a formula which can contain smooth terms using splines or the function s. The response portion of the formula should be left blank. See the examples.

The topical content covariates are those which affect the way in which a topic is discussed. As currently implemented this must be a single variable which defines a discrete partition of the dataset (each document is in one and only one group). We may relax this in the future. While including more covariates in topical prevalence will rarely affect the speed of the model, including additional levels of the content covariates can make the model much slower to converge. This is because the model operates in the much higher-dimensional space of words in the dictionary (which tend to number in the thousands) as opposed to topics.

In addition to the default priors for prevalence and content, we also make use of the glmnet package to allow for penalties between the L1 and L2 norm. In these settings we estimate a regularization path and then select the optimal shrinkage parameter using a user-tuneable information criterion. By default, selecting the L1 option will apply the L1 penalty and select the optimal shrinkage parameter using AIC. The defaults have been specifically tuned for the STM, but almost all of the relevant arguments can be changed through the control list. Changing the gamma.enet and kappa.enet parameters allows the user to choose a mix between the L1 and L2 norms. When set to 1 (the default) this is the lasso penalty; when set to 0 it is the ridge penalty. Any value in between is a mixture called the elastic net.

The control argument is a list with named components which can be used to specify numerous additional computational details.
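
To make the formula interfaces and the elastic-net control concrete, here is a hedged sketch of a call combining spline-smoothed prevalence, a content covariate, and a gamma.enet value passed through control (docs, vocab, and meta as prepared in the Examples below; treatment and pid_rep are covariates in the gadarian data):

mod.sketch <- stm(docs, vocab, K = 3,
                  prevalence = ~ treatment + s(pid_rep),  # no response; s() adds a smooth term
                  content = ~ treatment,                  # single variable defining a discrete partition
                  data = meta,
                  gamma.prior = "L1",                     # glmnet path, shrinkage chosen by AIC
                  control = list(gamma.enet = 0.5))       # 1 = lasso, 0 = ridge, between = elastic net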

References

Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation. http://goo.gl/uHkXAQ

Roberts, M., Stewart, B., and Airoldi, E. (2014) "Structural Topic Models."

Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., and Rand, D. (Forthcoming) "Structural topic models for open ended survey responses." American Journal of Political Science. http://goo.gl/0x0tHJ

See Also

prepDocuments, labelTopics, estimateEffect

Examples

# An example using the Gadarian data: from raw text to fitted model.
library(stm)
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)   # remove infrequent terms and re-index
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
mod.out <- stm(docs, vocab, 3, prevalence = ~ treatment + s(pid_rep), data = meta)
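
The fitted model can then be explored with the companion functions under See Also; a brief follow-on sketch (argument names assumed from those functions' own documentation):

labelTopics(mod.out)   # top words characterizing each of the 3 topics
prep <- estimateEffect(1:3 ~ treatment + s(pid_rep), mod.out, metadata = meta)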
