multiSTM: Analyze Stability of Local STM Mode

Description

This function performs a suite of tests aimed at assessing the global behavior of an STM model, which may have multiple modes. The function takes in a collection of differently initialized STM fitted objects and selects a reference model against which all others are benchmarked for stability. The function returns an output of S3 class 'MultimodDiagnostic', with associated plotting methods for quick inspection of the test results.

Usage

multiSTM(mod.out=NULL, ref.model=NULL, align.global=FALSE, mass.threshold=1, 
    reg.formula=NULL, metadata=NULL, reg.nsims=100, 
    reg.parameter.index=2, verbose=TRUE, from.disk=FALSE)

Arguments

mod.out

The output of a selectModel() run. This is a list of candidate model outputs, each of which takes the same form as the output of an STM model. Currently only works with models without content covariates.

ref.model

An integer referencing the element of the list in mod.out which contains the desired reference model. When set to the default value of NULL this chooses the model with the largest value of the approximate variational bound.

align.global

A boolean parameter specifying how to align the topics of two different STM fitted models. The alignment is performed by solving the linear sum assignment problem using the Hungarian algorithm. If align.global is set to TRUE, th

mass.threshold

A parameter specifying the portion of the probability mass of topics to be used for model analysis. The tail of the probability mass is disregarded accordingly. If mass.threshold is different from 1, both the full-mass and partial-mass analys

reg.formula

A formula for estimating a regression for each model in the ensemble, where the documents are the units, the outcome is the proportion of each document about a topic in an STM model, and the covariates are the document-level metadata. The formula should h

metadata

A data frame in which the predictor variables in reg.formula can be found. This argument must be included if reg.formula is specified.

reg.nsims

The number of simulated draws from the variational posterior for each call of estimateEffect(). Defaults to 100.

reg.parameter.index

If reg.formula is specified, the function analyzes the stability across runs of the regression coefficient for one particular predictor variable. This argument specifies which predictor variable is to be analyzed. A value of 1 corresponds to

verbose

If set to TRUE, the function will report progress.

from.disk

If set to TRUE, multiSTM() will load the input models from disk rather than from RAM. This option is particularly useful for dealing with large numbers of models, and is intended to be used in conjunction with the to.disk
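
To make the linear sum assignment problem mentioned under align.global more concrete, the base-R sketch below matches the topics of a toy candidate model to those of a toy reference model. It is not the package's implementation: the cost function (L1 distance between topic-word rows), the toy matrices, and the helper all_perms() are assumptions made for illustration. Brute-force enumeration of permutations stands in for the Hungarian algorithm, which is only reasonable here because K is tiny. The sketch does not show what changes between align.global=TRUE and align.global=FALSE.

```r
# Toy topic-word matrices (rows = topics, columns = words); values are invented.
ref  <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.8, 0.1),
              c(0.2, 0.1, 0.7))
cand <- ref[c(3, 1, 2), ]   # same topics, stored in a different order

# Assumed cost of pairing reference topic i with candidate topic j: L1 distance.
cost <- outer(seq_len(nrow(ref)), seq_len(nrow(cand)),
              Vectorize(function(i, j) sum(abs(ref[i, ] - cand[j, ]))))

# Hypothetical helper: list every permutation of x (feasible for small K only).
all_perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in all_perms(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

# Linear sum assignment: choose the one-to-one matching with minimum total cost.
perms <- all_perms(seq_len(nrow(ref)))
total_cost <- vapply(perms,
                     function(p) sum(cost[cbind(seq_along(p), p)]),
                     numeric(1))
global_match <- perms[[which.min(total_cost)]]

# global_match[i] is the candidate topic paired with reference topic i.
print(global_match)
```

For realistic values of K, a dedicated solver for the assignment problem would replace the enumeration step, since the number of permutations grows factorially.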

Value

An object of S3 class 'MultimodDiagnostic', consisting of a list with the following components:

N

The number of fitted models in the list of model outputs that was supplied to the function for the purpose of stability analysis.

K

The number of topics in the models.

glob.max

The index of the reference model in the list of model outputs (mod.out) that was supplied to the function. The reference model is selected as the one with the maximum bound value at convergence.

lb

A list of the maximum bound value at convergence for each of the fitted models in the list of model outputs. The list has length N.

lmat

A K-by-N matrix reporting the L1-distance of each topic from the corresponding one in the reference model. This is defined as:

$$L_{1}=\sum_{v}|\beta_{k,v}^{ref}-\beta_{k,v}^{cand}|$$

where the beta matrices are the topic-word matrices for the reference and the candidate model.

tmat

A K-by-N matrix reporting the number of "top documents" shared by the reference model and the candidate model. The "top documents" for a given topic are defined as the 10 documents in the reference corpus with highest topical frequency.

wmat

A K-by-N matrix reporting the number of "top words" shared by the reference model and the candidate model. The "top words" for a given topic are defined as the 10 highest-frequency words.

lmod

A vector of length N consisting of the row sums of the lmat matrix.

tmod

A vector of length N consisting of the row sums of the tmat matrix.

wmod

A vector of length N consisting of the row sums of the wmat matrix.

semcoh

Semantic coherence values for each topic within each model in the list of model outputs.

L1mat

A K-by-N matrix reporting the limited-mass L1-distance of each topic from the corresponding one in the reference model. Similar to lmat, but computed using only the top portion of the probability mass for each topic, as specified by the mass.threshold parameter. NULL if mass.threshold==1.

L1mod

A vector of length N consisting of the row means of the L1mat matrix.

mass.threshold

The mass threshold argument that was supplied to the function.

cov.effects

A list of length N containing the output of the run of estimateEffect() on each candidate model with the given regression formula. NULL if no regression formula is given.

var.matrix

A K-by-N matrix containing the estimated variance for each of the fitted regression parameters. NULL if no regression formula is given.

confidence.ratings

A vector of length N, where each entry specifies the proportion of regression coefficient estimates in a candidate model that fall within the 95% confidence interval for the corresponding estimate in the reference model.

align.global

The alignment control argument that was supplied to the function.

reg.formula

The regression formula that was supplied to the function.

reg.nsims

The reg.nsims argument that was supplied to the function.

reg.parameter.index

The reg.parameter.index argument that was supplied to the function.
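
As a rough illustration of two of the diagnostics listed above, the following base-R sketch computes a per-topic L1 distance (as in lmat) and a count of shared top words (as in wmat) for a single candidate model against a reference. The matrices, the helper shared_top(), and the choice of the top 2 words (instead of the 10 used by the function) are assumptions chosen to keep the toy example small. The topics are assumed to be aligned already, and the package's internal code may differ.

```r
# Toy topic-word matrices: K = 2 topics (rows) over V = 5 words (columns).
# All numbers are invented; each row sums to 1.
beta_ref  <- rbind(c(0.40, 0.30, 0.15, 0.10, 0.05),
                   c(0.05, 0.10, 0.15, 0.30, 0.40))
beta_cand <- rbind(c(0.35, 0.35, 0.10, 0.10, 0.10),
                   c(0.10, 0.05, 0.40, 0.15, 0.30))

# L1 distance per topic: sum over words v of |beta_ref[k, v] - beta_cand[k, v]|.
l1_by_topic <- rowSums(abs(beta_ref - beta_cand))

# Hypothetical helper: number of top-n words (by probability) that the
# reference and candidate versions of each topic have in common.
shared_top <- function(ref, cand, n) {
  vapply(seq_len(nrow(ref)), function(k) {
    top_ref  <- order(ref[k, ],  decreasing = TRUE)[seq_len(n)]
    top_cand <- order(cand[k, ], decreasing = TRUE)[seq_len(n)]
    length(intersect(top_ref, top_cand))
  }, integer(1))
}
shared_by_topic <- shared_top(beta_ref, beta_cand, n = 2)

print(l1_by_topic)
print(shared_by_topic)
```

In this toy case the first topic is nearly unchanged (small L1 distance, both top words shared), while the second topic has drifted (larger L1 distance, one top word shared).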

Details

The purpose of this function is to automate and generalize the stability analysis routines for topic models that are introduced in Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley: "Navigating the Local Modes of Big Data: The Case of Topic Models" (2014). For more detailed discussion regarding the background and motivation for multimodality analysis, please refer to the original article. See also the documentation for plot.MultimodDiagnostic for help with the plotting methods associated with this function.
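
One of the stability summaries described under Value, confidence.ratings, can be sketched in a few lines of base R. The estimates and standard errors below are invented, and the interval is built with a normal approximation, which is an assumption of this sketch and not necessarily how the function constructs its intervals.

```r
# Invented regression estimates for one predictor across K = 4 topics.
ref_est  <- c(0.10, -0.05, 0.20, 0.00)   # reference model coefficients
ref_se   <- c(0.02,  0.03, 0.05, 0.01)   # their standard errors
cand_est <- c(0.12, -0.15, 0.25, 0.03)   # one candidate model's coefficients

# 95% interval around each reference estimate (normal approximation assumed).
z <- qnorm(0.975)
lower <- ref_est - z * ref_se
upper <- ref_est + z * ref_se

# Share of candidate estimates that fall inside the reference intervals.
inside <- cand_est >= lower & cand_est <= upper
confidence_rating <- mean(inside)

print(inside)
print(confidence_rating)
```

A rating near 1 would suggest that the candidate run leads to similar substantive conclusions about the chosen predictor as the reference run; a rating near 0 would suggest the opposite.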

References

Roberts, M., Stewart, B., & Tingley, D. (Forthcoming). "Navigating the Local Modes of Big Data: The Case of Topic Models." In Data Analytics in Social Science, Government, and Industry. New York: Cambridge University Press.

See Also

plot.MultimodDiagnostic
selectModel
estimateEffect

Examples

library(stm)

# Example using Gadarian data
temp<-textProcessor(documents=gadarian$open.ended.response, 
                    metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <-out$meta
set.seed(02138)
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20)

out <- multiSTM(mod.out, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)
plot(out)

# Same example as above, but loading from disk
mod.out <- selectModel(docs, vocab, K=3, 
                       prevalence=~treatment + s(pid_rep), 
                       data=meta, runs=20, to.disk=TRUE)

out <- multiSTM(from.disk=TRUE, mass.threshold = .75, 
                reg.formula = ~ treatment,
                metadata = gadarian)

# One more example using Poliblog data
load(url("http://goo.gl/91KbfS"))
meta <- poliblogPrevFit$settings$covariates$X
out <- multiSTM(poliblogSelect, mass.threshold=.75, 
                reg.formula= ~ ratingLiberal,
                metadata=meta)

plot(out, 1:4)