Estimation of the STS Model using variational EM.
The function takes a sparse representation of a document-term matrix, covariates
for each document, and an integer number of topics, and returns the fitted model
parameters. See an overview of functions in the package here:
sts-package
sts(
prevalence_sentiment,
initializationVar,
corpus,
K,
maxIter = 100,
convTol = 1e-05,
initialization = "anchor",
kappaEstimation = "adjusted",
verbose = TRUE,
parallelize = FALSE,
stmSeed = NULL
)
An object of class sts
Estimated prevalence and sentiment-discourse values for each document and topic
Estimated regression coefficients that determine prevalence and sentiment/discourse for each topic
Estimated kappa coefficients that determine sentiment-discourse and the topic-word distributions
Inverse of the covariance matrix for the alpha parameters
Covariance matrix for the alpha parameters
The ELBO at each iteration of the estimation algorithm
The baseline log-transformed occurrence rate of each word in the corpus
Time elapsed in seconds
Vocabulary vector used
Mean (fitted) values for alpha based on document-level variables * estimated Gamma for each document
A formula object with no response variable or a design matrix with the covariates. The variables must be contained in corpus$meta.
A formula with a single variable for use in the initialization of latent sentiment. This argument is usually the key experimental variable (e.g., a review rating or a binary indicator of experiment/control group).
The document-term matrix to be modeled, supplied as a sparse term-count
representation of a matrix with one row per document and one column per term.
The object must be a list, with each element corresponding to a document.
Each document is represented as an integer matrix with two rows and as many
columns as there are unique vocabulary words in the document. The first row
contains the 1-indexed vocabulary entry and the second row contains the number
of times that term appears. This is the same format used in the stm package.
A positive integer (of size 2 or greater) representing the desired number of topics.
A positive integer representing the maximum number of VEM iterations allowed.
Convergence tolerance for the variational EM estimation algorithm. The default value is 1e-05.
Character argument that allows the user to specify an initialization
method. The default choice, "anchor", initializes prevalence according to
anchor words and the key experimental covariate identified in argument
initializationVar. One can also use "stm", which uses a fitted STM model
(Roberts et al. 2014, 2016) to initialize coefficients related to prevalence
and sentiment-discourse.
A character input specifying how kappa should be estimated. "lasso" allows for
penalties on the L1 norm: a regularization path is estimated and the optimal
shrinkage parameter is then selected using AIC. "adjusted" (default) uses the
lasso penalty with an adjusted aggregated Poisson regression.
All options use an approximation framework developed in Taddy (2013) called
Distributed Multinomial Regression, which uses a factorized Poisson
approximation to the multinomial. See Chen and Mankad (2024) for the implementation used here.
A logical flag indicating whether information should be printed to the screen.
A logical flag indicating whether to parallelize the estimation using all but one of the CPU cores on your local machine.
A prefit STM model object used to initialize the STS model. Note that this is ignored unless initialization = "stm".
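The two-row document format described for the corpus argument can be sketched in base R. This is a minimal illustration on made-up data: the helper to_doc_matrix() and the toy covariate are hypothetical, and the list layout (documents, vocab, meta) is assumed to mirror what stm's prepDocuments() returns.

```r
# Toy corpus: three short, already tokenized documents (illustrative data only).
tokens <- list(
  c("good", "service", "good", "food"),
  c("bad", "service"),
  c("good", "food", "food", "food")
)

# Sorted vocabulary; a term's position in this vector is its 1-indexed id.
vocab <- sort(unique(unlist(tokens)))

# Hypothetical helper: convert one tokenized document to a 2-row integer matrix.
# Row 1 holds vocabulary indices, row 2 holds the count of each term.
to_doc_matrix <- function(doc_tokens, vocab) {
  counts <- table(factor(doc_tokens, levels = vocab))
  present <- which(counts > 0)  # only terms that occur in this document
  rbind(as.integer(present), as.integer(counts[present]))
}

documents <- lapply(tokens, to_doc_matrix, vocab = vocab)

# Document-level covariates, one row per document (hypothetical variable).
meta <- data.frame(treatment = c(1, 0, 1))

# Assumed layout, mirroring the object returned by stm::prepDocuments().
toy_corpus <- list(documents = documents, vocab = vocab, meta = meta)

# Inspect the first document: columns are (term id, count) pairs.
toy_corpus$documents[[1]]
```

Only terms that actually occur in a document get a column, which is what makes the representation sparse.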
This is the main function for estimating the Structural Topic and Sentiment-Discourse (STS) Model. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic and each document is represented by the proportion of its words that come from each of the topics. The document-specific covariates affect how much a topic is discussed (prevalence) and the way in which it is discussed (sentiment-discourse).
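The factorized Poisson approximation mentioned for kappaEstimation rests on a standard identity between independent Poisson counts and the multinomial. The notation below is introduced here for illustration and is not taken from the package: c_{dv} is the count of word v in document d, m_d is the document length, and \eta_{dv} is a linear predictor.

```latex
% Assumed notation: c_{dv} = count of word v in document d,
% m_d = \sum_v c_{dv}, \eta_{dv} = linear predictor, \mu_d = document effect.
\begin{align*}
c_{dv} &\overset{\text{ind.}}{\sim} \mathrm{Poisson}(\lambda_{dv}),
  \qquad \lambda_{dv} = \exp(\mu_d + \eta_{dv}), \\
(c_{d1}, \dots, c_{dV}) \mid m_d &\sim \mathrm{Multinomial}(m_d,\, q_d),
  \qquad q_{dv} = \frac{\lambda_{dv}}{\sum_{u=1}^{V} \lambda_{du}}
                = \frac{e^{\eta_{dv}}}{\sum_{u=1}^{V} e^{\eta_{du}}}.
\end{align*}
```

The document effect \mu_d cancels in q_{dv}. Replacing it with a plug-in value such as \log m_d, as in Taddy's framework, gives an approximate likelihood that factorizes into one Poisson regression per word, and these regressions can be fit independently.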
Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.
Roberts, M., Stewart, B., and Airoldi, E. (2016) "A model of text for experimentation in the social sciences." Journal of the American Statistical Association.
Chen, L. and Mankad, S. (2024) "A Structural Topic and Sentiment-Discourse Model for Text Analysis." Management Science.
estimateRegns
# An example using the Gadarian data from the stm package: from raw text to a
# fitted model using textProcessor(), which leverages the tm package.
library("tm")
library("stm")
library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response,
                      metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
# Recode treatment for initializing latent sentiment: treated = -1, otherwise 1.
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
# Low maximum iteration number, just for testing.
sts_estimate <- sts(~ treatment * pid_rep, ~ noTreatment, out,
                    K = 3, maxIter = 1, verbose = FALSE)