topWords: Extract the most representative words from topics

Description

Extract the top words in each topic/sentiment from a sentopicmodel.

Usage

topWords(
  x,
  nWords = 10,
  method = c("frequency", "probability", "term-score", "FREX"),
  output = c("data.frame", "plot", "matrix"),
  subset,
  w = 0.5
)
plot_topWords(
  x,
  nWords = 10,
  method = c("frequency", "probability", "term-score", "FREX"),
  subset,
  w = 0.5
)

Value

The top words of the topic model. Depending on the output chosen, the result is a long-format data.frame, a ggplot2 object or a matrix.

Arguments

x: a sentopicmodel created by LDA(), JST() or rJST()
nWords: the number of top words to extract
method: specify if a re-ranking function should be applied before returning the top words. See Details for a description of each method.
output: determines the output format of the function: a long-format data.frame, a ggplot2 plot or a matrix
subset: a logical expression used to subset the results, as in subset(). Particularly useful to limit the number of observations shown in plot outputs. The logical expression uses topic and sentiment indices rather than their labels. It is possible to subset on both topic and sentiment by joining two expressions with the & operator.
w: only used when method = "FREX". Determines the weight assigned to the exclusivity score at the expense of the frequency score.

Author

Olivier Delmarcelle

Details

"frequency" ranks top words according to their frequency within a topic. This method also reports the overall frequency of each word. When returning a plot, the overall frequency is represented with a grey bar.

"probability" uses the estimated topic-word mixture $\phi$ to rank top words.

"term-score" implements the re-ranking method from Blei and Lafferty (2009). This method down-weights terms that have high probability in all topics using the following score: $$\text{term-score}_{k,v} = \phi_{k, v}\log\left(\frac{\phi_{k, v}}{\left(\prod^K_{j=1}\phi_{j,v}\right)^{\frac{1}{K}}}\right),$$ for topic $k$, vocabulary word $v$ and number of topics $K$.
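
As a rough illustration of the term-score formula, the following base R sketch applies it to a small made-up topic-word matrix. The matrix `phi`, its word labels and the helper `top_by()` are hypothetical and are not part of the package; this is not the package's internal implementation.

```r
# Toy topic-word matrix: K = 3 topics (rows), V = 4 words (columns).
# The values are invented for illustration; each row sums to 1.
phi <- matrix(
  c(0.40, 0.30, 0.20, 0.10,
    0.40, 0.10, 0.10, 0.40,
    0.40, 0.20, 0.30, 0.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("topic", 1:3),
                  c("rate", "inflation", "growth", "risk"))
)

# Geometric mean of each word's probability across the K topics,
# computed on the log scale.
geo_mean <- exp(colMeans(log(phi)))

# term-score[k, v] = phi[k, v] * log(phi[k, v] / geo_mean[v])
term_score <- phi * log(sweep(phi, 2, geo_mean, "/"))

# Hypothetical helper: names of the n highest-scoring words in each topic.
top_by <- function(m, n = 2) {
  t(apply(m, 1, function(row) names(sort(row, decreasing = TRUE))[seq_len(n)]))
}

top_by(phi)         # ranking by raw probability
top_by(term_score)  # ranking after the term-score adjustment
```

In this toy matrix the word "rate" has the same probability in every topic, so its term-score is zero everywhere and it drops out of the top of each list even though it is the most probable word.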

"FREX" implements the re-ranking method from Bischof and Airoldi (2012). This method uses the weight $w$ to balance topic exclusivity against topic-word probability using the following score: $$\text{FREX}_{k,v}=\left(\frac{w}{\text{ECDF}\left( \frac{\phi_{k,v}}{\sum_{j=1}^K\phi_{j,v}}\right)} + \frac{1-w}{\text{ECDF}\left(\phi_{k,v}\right)} \right)^{-1},$$ for topic $k$, vocabulary word $v$, number of topics $K$ and weight $w$, where $\text{ECDF}$ is the empirical cumulative distribution function.
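
The effect of the weight $w$ can be sketched in base R on a made-up topic-word matrix. Following Bischof and Airoldi (2012), the score below is a weighted harmonic mean of two ECDF values. The sketch assumes that each ECDF is taken within a topic, across the vocabulary; the package internals may differ. The names `phi`, `row_ecdf()` and `frex()` are hypothetical.

```r
# Toy topic-word matrix: K = 3 topics (rows), V = 4 words (columns).
# The values are invented for illustration; each row sums to 1.
phi <- matrix(
  c(0.40, 0.30, 0.20, 0.10,
    0.40, 0.10, 0.10, 0.40,
    0.40, 0.20, 0.30, 0.10),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("topic", 1:3),
                  c("rate", "inflation", "growth", "risk"))
)

# Hypothetical helper: ECDF value of every entry relative to the other
# entries of the same row (assumption: one ECDF per topic).
row_ecdf <- function(m) {
  out <- t(apply(m, 1, function(row) ecdf(row)(row)))
  dimnames(out) <- dimnames(m)
  out
}

# Hypothetical FREX sketch: weighted harmonic mean of the exclusivity ECDF
# and the frequency ECDF, with weight w on exclusivity.
frex <- function(phi, w = 0.5) {
  # Exclusivity: share of word v's total probability mass held by topic k.
  excl <- sweep(phi, 2, colSums(phi), "/")
  1 / (w / row_ecdf(excl) + (1 - w) / row_ecdf(phi))
}

frex(phi, w = 0.5)  # equal weight on exclusivity and frequency
frex(phi, w = 1)    # exclusivity only
frex(phi, w = 0)    # frequency only
```

Because every ECDF value lies in (0, 1], the resulting scores also lie in (0, 1]. With `w = 0` the score reduces to the within-topic ECDF of the word probabilities, so the ordering matches the "probability" method up to ties.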

References

Blei, DM. and Lafferty, JD. (2009). Topic models. In Text Mining, chapter 4, 101--124.

Bischof, JM. and Airoldi, EM. (2012). Summarizing Topical Content with Word Frequency and Exclusivity. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 9--16.

Examples

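library("sentopics")

# Create an LDA model from the bundled ECB tokens and run 10 iterations
model <- LDA(ECB_press_conferences_tokens)
model <- fit(model, 10)
topWords(model)                      # long-format data.frame (default)
topWords(model, output = "matrix")   # same top words as a matrix
topWords(model, method = "FREX")     # re-rank with FREX
plot_topWords(model)
plot_topWords(model, subset = topic %in% 1:2)  # plot topics 1 and 2 only

# Same workflow with a joint sentiment/topic model
jst <- JST(ECB_press_conferences_tokens)
jst <- fit(jst, 10)
plot_topWords(jst)
# Subset on both topic and sentiment indices
plot_topWords(jst, subset = topic %in% 1:2 & sentiment == 3)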