sts: Variational EM for the Structural Topic and Sentiment-Discourse (STS) Model

Description

Estimates the STS model using variational EM. The function takes a sparse representation of a document-term matrix, covariates for each document, and an integer number of topics, and returns the fitted model parameters. For an overview of the functions in the package, see sts-package.

Usage

sts(
  prevalence_sentiment,
  initializationVar,
  corpus,
  K,
  maxIter = 100,
  convTol = 1e-05,
  initialization = "anchor",
  kappaEstimation = "adjusted",
  verbose = TRUE,
  parallelize = FALSE,
  stmSeed = NULL
)

Value

An object of class sts

alpha: Estimated prevalence and sentiment-discourse values for each document and topic
gamma: Estimated regression coefficients that determine prevalence and sentiment/discourse for each topic
kappa: Estimated kappa coefficients that determine sentiment-discourse and the topic-word distributions
sigma_inv: Inverse of the covariance matrix for the alpha parameters
sigma: Covariance matrix for the alpha parameters
elbo: The ELBO at each iteration of the estimation algorithm
mv: The baseline log-transformed occurrence rate of each word in the corpus
runtime: Time elapsed in seconds
vocab: Vocabulary vector used
mu: Mean (fitted) values for alpha, computed for each document as the document-level variables multiplied by the estimated gamma
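
As a rough illustration of how mu relates to the covariates and gamma, the sketch below multiplies a design matrix by a coefficient matrix in base R. The toy metadata, the coefficient values, the number of alpha components, and the assumed orientation of gamma (one row per design-matrix column, one column per alpha component) are illustrative assumptions, not output from sts().

```r
# Toy metadata: 4 documents with two covariates (hypothetical values)
meta <- data.frame(treatment = c(0, 1, 0, 1),
                   pid_rep   = c(0.2, 0.5, 0.8, 0.1))

# Design matrix for a formula with no response variable
X <- model.matrix(~ treatment * pid_rep, data = meta)

# Hypothetical coefficient matrix: one row per column of X and
# one column per alpha component (3 components chosen arbitrarily)
set.seed(1)
gamma_toy <- matrix(rnorm(ncol(X) * 3), nrow = ncol(X), ncol = 3)

# Fitted means: one row per document, one column per alpha component
mu_toy <- X %*% gamma_toy
dim(mu_toy)
```

The same matrix product applies whatever the true dimensions of gamma are in a fitted object; check dim(fit$gamma) on a real fit before relying on the orientation assumed here.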

Arguments

prevalence_sentiment: A formula object with no response variable or a design matrix with the covariates. The variables must be contained in corpus$meta.
initializationVar: A formula with a single variable for use in the initialization of latent sentiment. This argument is usually the key experimental variable (e.g., a review rating or a binary indicator of experiment/control group).
corpus: The document-term matrix to be modeled, stored as a sparse term count matrix with one row per document and one column per term. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows and as many columns as there are unique vocabulary words in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is the same format used in the stm package.
K: A positive integer (of size 2 or greater) representing the desired number of topics.
maxIter: A positive integer representing the max number of VEM iterations allowed.
convTol: Convergence tolerance for the variational EM estimation algorithm. The default value is 1e-05.
initialization: Character argument that specifies the initialization method. The default, "anchor", initializes prevalence according to anchor words and the key experimental covariate identified in the initializationVar argument. The alternative, "stm", uses a fitted STM model (Roberts et al. 2014, 2016) to initialize the coefficients related to prevalence and sentiment-discourse.
kappaEstimation: A character input specifying how kappa should be estimated. "lasso" applies a penalty on the L1 norm: a regularization path is estimated and the optimal shrinkage parameter is then selected using AIC. "adjusted" (default) uses the lasso penalty with an adjusted aggregated Poisson regression. All options use an approximation framework developed in Taddy (2013), called Distributed Multinomial Regression, which uses a factorized Poisson approximation to the multinomial. See Chen and Mankad (2024) for details on the implementation.
verbose: A logical flag indicating whether information should be printed to the screen.
parallelize: A logical flag indicating whether to parallelize the estimation using all but one of the CPU cores on your local machine.
stmSeed: A prefit STM model object used to initialize the STS model. This argument is ignored unless initialization = "stm".
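
The two-row document format described for corpus can be sketched in base R as follows. The toy tokens and the helper to_sparse_doc() are hypothetical and only illustrate the layout of a single document matrix. In the Examples, the object passed as corpus is the output of prepDocuments(), which carries the vocabulary and metadata alongside such document matrices.

```r
# Toy tokenized corpus (hypothetical data)
docs_tokens <- list(c("good", "food", "good"),
                    c("bad", "service", "food", "bad", "bad"))

# Sorted vocabulary; a term's position is its 1-indexed vocabulary entry
vocab <- sort(unique(unlist(docs_tokens)))

# Hypothetical helper: convert one document to a two-row integer matrix
# (row 1 = vocabulary index, row 2 = count of that term in the document)
to_sparse_doc <- function(tokens, vocab) {
  counts <- table(factor(tokens, levels = vocab))
  keep <- which(counts > 0)
  rbind(as.integer(keep), as.integer(counts[keep]))
}

documents <- lapply(docs_tokens, to_sparse_doc, vocab = vocab)
documents[[2]]
```

Each matrix has one column per unique term in the document, so terms that do not occur in a document take up no space.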

Details

This is the main function for estimating the Structural Topic and Sentiment-Discourse (STS) Model. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic, and each document is represented by the proportion of its words that come from each of the topics. The document-specific covariates affect both how much a topic is discussed (prevalence) and the way in which it is discussed (sentiment-discourse).
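
To make the document representation concrete, the toy sketch below turns per-word topic assignments into topic proportions for one document. The assignments are made up for illustration; in the model itself they are latent and are inferred during estimation rather than observed.

```r
# Hypothetical topic assignments for the 8 words of one document (K = 3)
K <- 3
z <- c(1, 1, 2, 3, 1, 2, 1, 1)

# Share of the document's words assigned to each topic
theta <- tabulate(z, nbins = K) / length(z)
theta
```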

References

Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.

Roberts, M., Stewart, B., and Airoldi, E. (2016) "A model of text for experimentation in the social sciences." Journal of the American Statistical Association.

Chen, L. and Mankad, S. (2024) "A Structural Topic and Sentiment-Discourse Model for Text Analysis." Management Science.

Examples

# An example using the Gadarian data from the stm package: from raw text to a
# fitted model using textProcessor(), which leverages the tm package.
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response,
                      metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
# Initialization variable: -1 for treated documents, 1 otherwise
out$meta$noTreatment <- ifelse(out$meta$treatment == 1, -1, 1)
# Low maximum iteration count, just for testing
sts_estimate <- sts(~ treatment * pid_rep, ~ noTreatment, out, K = 3,
                    maxIter = 1, verbose = FALSE)
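
Once a model has been fitted, the returned object can be inspected through the fields listed under Value. The helper below is hypothetical (it is not part of the sts package) and is demonstrated on a mock list with made-up values so that it runs without fitting a model. It assumes that elbo is a numeric vector with one entry per iteration.

```r
# Hypothetical helper (not part of sts): summarize a fitted object
summarize_sts_fit <- function(fit) {
  list(
    iterations      = length(fit$elbo),
    final_elbo      = fit$elbo[length(fit$elbo)],
    vocab_size      = length(fit$vocab),
    runtime_seconds = fit$runtime
  )
}

# Mock object standing in for the output of sts(); all values are made up
mock_fit <- list(elbo = c(-1500.2, -1450.7, -1449.9),
                 vocab = c("bad", "food", "good"),
                 runtime = 2.5)
fit_summary <- summarize_sts_fit(mock_fit)
str(fit_summary)
```

With a real fit, the same call would be summarize_sts_fit(sts_estimate).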
