semanticCoherence: Semantic Coherence

Description

Calculate semantic coherence (Mimno et al. 2011) for an STM model.

Usage

semanticCoherence(model, documents, M = 10)

Arguments

model

the STM object

documents

the STM formatted documents (see stm for format).

M

the number of top words to consider per topic

Value

a numeric vector containing semantic coherence for each topic

Details

Semantic coherence is a metric related to pointwise mutual information that was introduced in a paper by David Mimno, Hanna Wallach and colleagues (see references). The paper details a series of manual evaluations which show that their metric is a reasonable surrogate for human judgment. The core idea is that in a semantically coherent model, the words that are most probable under a topic should frequently co-occur within the same documents.
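To make the co-occurrence idea concrete, here is a minimal base R sketch of the score defined in Mimno et al. (2011): for a topic's top words v_1, ..., v_M, it sums log((D(v_m, v_l) + 1) / D(v_l)) over all pairs with l < m, where D(v) is the number of documents containing v and D(v, v') the number containing both. The function name `coherence_sketch`, the toy matrix `dtm`, and its words are hypothetical and for illustration only; this is not necessarily how semanticCoherence is implemented internally, and it works on a plain count matrix rather than STM formatted documents.

```r
# Illustrative sketch only (hypothetical helper, not part of stm).
# dtm:       documents-by-words count matrix with column names
# top_words: a topic's top M words, ordered from most to least probable
coherence_sketch <- function(dtm, top_words) {
  # 1 if the word occurs in the document at least once, 0 otherwise
  present <- (dtm[, top_words, drop = FALSE] > 0) * 1
  doc_freq <- colSums(present)    # D(v): documents containing v
  co_freq <- crossprod(present)   # D(v, v'): documents containing both words
  score <- 0
  # Sum over every pair (v_m, v_l) with l < m.
  # Assumes each top word occurs in at least one document (D(v_l) > 0).
  for (m in seq_along(top_words)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_freq[m, l] + 1) / doc_freq[l])
    }
  }
  score
}

# Toy corpus of four documents over three words (made-up data)
dtm <- matrix(c(1, 1, 0,
                2, 1, 0,
                0, 0, 3,
                1, 0, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("tax", "budget", "whale")))

together <- coherence_sketch(dtm, c("tax", "budget"))  # words that share documents
apart    <- coherence_sketch(dtm, c("tax", "whale"))   # words that rarely co-occur
```

In this toy corpus the pair that appears together in two documents scores higher than the pair that shares only one, which is the behavior the metric is designed to reward.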

One of our observations in Roberts et al. (2014) was that semantic coherence alone is relatively easy to achieve: a model with only a few topics, all dominated by the most common words, will score well. We therefore suggest that users also consider exclusivity, which provides a natural counterpoint.

This function is currently marked with the keyword internal because it does not have much error checking.

References

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.

Roberts, M., Stewart, B., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S., Albertson, B., et al. (2014). "Structural topic models for open ended survey responses." American Journal of Political Science, 58(4), 1064-1082. http://goo.gl/0x0tHJ

Examples

# NOT RUN {
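library(stm)

# Preprocess the raw text bundled with stm (the gadarian survey data)
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

# Prepare the documents and keep vocab and metadata aligned with them
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

set.seed(02138)
# Maximum EM iterations set very low so the example will run quickly.
# Run your models to convergence!
mod.out <- stm(docs, vocab, 3, prevalence = ~treatment + s(pid_rep), data = meta,
               max.em.its = 5)

# One coherence value per topic, using the default M = 10 top words
semanticCoherence(mod.out, docs)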
# }