manyTopics: Performs model selection across separate STMs that each assume a different number of topics.

Description

Works the same as selectModel, except that the user specifies a range of numbers of topics to fit, for example models with 5, 10, and 15 topics. For each number of topics, selectModel is run multiple times. The output is then passed through a function that picks a Pareto-dominant run of the model in terms of exclusivity and semantic coherence. If multiple runs are candidates (i.e., none weakly dominates the others), a single run is chosen at random from the set of undominated runs.
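The selection step described above can be sketched in base R. This is only an illustration under stated assumptions: the data frame `runs`, its made-up values, and the helper `is_dominated` are hypothetical, and the dominance rule shown is the standard Pareto definition with higher values treated as better on both criteria. It is not taken from the package source.

```r
# Hypothetical per-run summaries: one row per run, averaged over topics.
# The numbers are invented for illustration only.
runs <- data.frame(
  run         = 1:5,
  exclusivity = c(9.1, 9.4, 8.8, 9.4, 9.0),
  semcoh      = c(-80, -95, -85, -90, -78)
)

# Assumed rule: run i is dominated if some other run j is at least as good
# on both criteria and strictly better on at least one.
is_dominated <- function(i, d) {
  any(vapply(seq_len(nrow(d)), function(j) {
    j != i &&
      d$exclusivity[j] >= d$exclusivity[i] &&
      d$semcoh[j] >= d$semcoh[i] &&
      (d$exclusivity[j] > d$exclusivity[i] || d$semcoh[j] > d$semcoh[i])
  }, logical(1)))
}

dominated   <- vapply(seq_len(nrow(runs)), is_dominated, logical(1), d = runs)
undominated <- runs$run[!dominated]

# When several runs remain, pick one of them at random.
set.seed(1)
chosen <- undominated[sample.int(length(undominated), 1)]
```

In this toy data, run 2 loses to run 4 and run 3 loses to run 1, so the random draw is made among the remaining runs.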

Usage

manyTopics(documents, vocab, K, prevalence, content, data = NULL,
           max.em.its = 100, verbose = TRUE, init.type = "LDA",
           emtol = 1e-05, seed = NULL, runs = 50, frexw = .7,
           net.max.em.its = 2, netverbose = FALSE, M = 10, ...)

Arguments

documents

The documents to be modeled. Object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows, and columns equal to the number of unique vocabulary words in the document.
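A small hand-built example may make this format concrete. It is a sketch only: the toy vocabulary and documents are invented, and the row layout (first row holds 1-based vocabulary indices, second row holds term counts) is assumed to follow the usual stm convention. In practice these objects come from prepDocuments.

```r
# Toy vocabulary (hypothetical); normally produced by prepDocuments.
vocab <- c("economy", "immigration", "jobs", "worry")

# Each document: an integer matrix with two rows and one column per
# unique term in that document.
# Assumed layout: row 1 = 1-based index into vocab, row 2 = term count.
doc1 <- matrix(c(1L, 3L,        # indices: "economy", "jobs"
                 2L, 1L),       # counts
               nrow = 2, byrow = TRUE)
doc2 <- matrix(c(2L, 3L, 4L,    # indices: "immigration", "jobs", "worry"
                 1L, 1L, 2L),   # counts
               nrow = 2, byrow = TRUE)
documents <- list(doc1, doc2)

# Structural checks: integer storage and exactly two rows per document.
ok <- vapply(documents, function(d) is.integer(d) && nrow(d) == 2, logical(1))
stopifnot(all(ok))

# Every vocabulary entry should appear at least once across the documents.
used <- sort(unique(unlist(lapply(documents, function(d) d[1, ]))))
stopifnot(identical(used, seq_along(vocab)))
```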

vocab

Character vector specifying the words in the corpus in the order of the vocab indices in documents. Each term in the vocabulary index must appear at least once in the documents. See prepDocuments.

K

A vector of positive integers representing the desired number of topics for separate runs of selectModel.

prevalence

A formula object with no response variable or a matrix containing topic prevalence covariates. Use s(), ns() or bs() to specify smooth terms. See details for more information.

content

A formula containing a single variable, a factor variable, or something that can be coerced to a factor, indicating the category of the content variable for each document.

runs

Total number of STM runs used in the cast net stage. Approximately 15 percent of these runs will be used for running an STM until convergence.

data

Dataset which contains prevalence and content covariates.

init.type

The method of initialization. See stm.

seed

Seed for the random number generator. stm saves the seed it uses on every run so that any result can be exactly reproduced. When attempting to reproduce a result with that seed, it should be specified here.

max.em.its

The maximum number of EM iterations. If convergence has not been met at this point, a message will be printed.

emtol

Convergence tolerance.

verbose

A logical flag indicating whether information should be printed to the screen.

frexw

Weight used to calculate exclusivity.

net.max.em.its

Maximum EM iterations used when casting the net.

netverbose

Whether verbose should be used when calculating net models.

M

Number of words used to calculate semantic coherence and exclusivity. Defaults to 10.

...

Additional options described in details of stm.

Value

out

List of model outputs the user has to choose from. Each takes the same form as the output from an stm model.

semcoh

Semantic coherence values for each topic within each model selected for each number of topics.

exclusivity

Exclusivity values for each topic within each model selected. Only calculated for models without a content covariate.
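To show how these components might be used together, here is a sketch that summarizes a mock result. The object `storage` below is hand-built with invented values, and it assumes that semcoh and exclusivity are lists parallel to out, with one numeric vector of per-topic values for each requested K. A real object would come from a call to manyTopics.

```r
# Mock result for K = 3 and K = 4 (all values invented).
storage <- list(
  out         = list("model for K = 3", "model for K = 4"),  # stand-ins for stm fits
  semcoh      = list(c(-80, -92, -75), c(-85, -90, -70, -99)),
  exclusivity = list(c(9.2, 9.0, 9.5), c(9.4, 9.1, 9.6, 8.9))
)
K <- 3:4

# Average the per-topic values so the selected models can be compared across K.
summary_by_k <- data.frame(
  K           = K,
  mean_semcoh = vapply(storage$semcoh, mean, numeric(1)),
  mean_excl   = vapply(storage$exclusivity, mean, numeric(1))
)
print(summary_by_k)

# Extract the selected model for K = 4 by its position in the K vector.
fit_k4 <- storage$out[[which(K == 4)]]
```

Averaging over topics is only one possible summary; plotting the per-topic values for each K is another reasonable way to compare the selected models.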

Details

Does not work with models that have a content variable (at this point).

Examples

library(stm)

# Process the raw text, then prepare documents, vocabulary, and metadata
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

set.seed(02138)
storage <- manyTopics(docs, vocab, K = 3:4, prevalence = ~treatment + s(pid_rep),
                      data = meta, runs = 10)
# This chooses the output, a single run of STM that was selected,
# from the runs of the 3 topic model
fit3 <- storage$out[[1]]
# This chooses the output, a single run of STM that was selected,
# from the runs of the 4 topic model
fit4 <- storage$out[[2]]
# Please note that the way to extract a result for manyTopics is different from selectModel.
