textplot_wordcloud: Plot features as a wordcloud

Description

Plot a dfm or quanteda.textstats::textstat_keyness object as a wordcloud, where the feature labels are plotted with their sizes proportional to their numerical values in the dfm. When comparison = TRUE, it plots comparison word clouds by document (or by target and reference categories in the case of a keyness object).

Usage

textplot_wordcloud(
  x,
  min_size = 0.5,
  max_size = 4,
  min_count = 3,
  max_words = 500,
  color = "darkblue",
  font = NULL,
  adjust = 0,
  rotation = 0.1,
  random_order = FALSE,
  random_color = FALSE,
  ordered_color = FALSE,
  labelcolor = "gray20",
  labelsize = 1.5,
  labeloffset = 0,
  fixed_aspect = TRUE,
  ...,
  comparison = FALSE
)

Arguments

x: a dfm or quanteda.textstats::textstat_keyness object
min_size: size of the smallest word
max_size: size of the largest word
min_count: words with frequency below min_count will not be plotted
max_words: maximum number of words to be plotted. The least frequent terms are dropped. When comparison = TRUE, the maximum is split evenly across categories.
color: colour of words from least to most frequent
font: font-family of words and labels. Use default font if NULL.
adjust: adjust sizes of words by a constant. Useful for non-English words for which R fails to obtain correct sizes.
rotation: proportion of words with 90 degree rotation
random_order: plot words in random order. If FALSE, they will be plotted in decreasing frequency.
random_color: if TRUE, choose colours randomly from those supplied in color. If FALSE, the colour is chosen based on frequency.
ordered_color: if TRUE, then colours are assigned to words in order.
labelcolor: colour of group labels. Only used when comparison = TRUE.
labelsize: size of group labels. Only used when comparison = TRUE.
labeloffset: position of group labels. Only used when comparison = TRUE.
fixed_aspect: logical; if TRUE, the aspect ratio is fixed. A variable aspect ratio is only supported if rotation = 0.
...: additional parameters. Only used to make it compatible with wordcloud.
comparison: logical; if TRUE, plot a wordcloud that compares documents in the same way as wordcloud::comparison.cloud(). If x is a quanteda.textstats::textstat_keyness object, then only the target category's key terms are plotted when comparison = FALSE, otherwise the top max_words / 2 terms are plotted from the target and reference categories.
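
As a rough illustration of how min_count and max_words interact, the sketch below reproduces the documented filtering by hand: features below min_count are removed first, then only the max_words most frequent features are kept. This is an approximation of the documented behaviour, not the function's internal code; the variable names (freq, kept) are illustrative only.

```r
library("quanteda")

# dfm of Obama's inaugural addresses, prepared as in the Examples section
dfmat <- data_corpus_inaugural |>
    corpus_subset(President == "Obama") |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm()

min_count <- 3    # same values as the function defaults
max_words <- 500

# total frequency of each feature, summed across documents
freq <- featfreq(dfmat)

# drop features below min_count, then keep the most frequent max_words
freq <- sort(freq[freq >= min_count], decreasing = TRUE)
kept <- head(freq, max_words)

length(kept)       # number of features that pass both filters
head(names(kept))  # the most frequent of them
```

Inspecting kept before plotting can help when choosing values for min_count and max_words on a new corpus.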

Author

Kohei Watanabe, building on code from Ian Fellows's wordcloud package.

Details

The default is to plot the word cloud of all features, summed across documents. To produce a word cloud for a specific document or set of documents, you need to slice out the document(s) from the dfm object.
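
One possible way to slice out documents is dfm_subset(), which filters on document variables. The sketch below assumes the built-in data_corpus_inaugural corpus and its President and Year variables; the year cut-off and the object names are chosen purely for illustration.

```r
library("quanteda")
library("quanteda.textplots")

# dfm of all inaugural addresses since 2001 (cut-off is illustrative)
dfmat_all <- data_corpus_inaugural |>
    corpus_subset(Year > 2000) |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm()

# keep only the documents of interest before plotting
dfmat_obama <- dfm_subset(dfmat_all, President == "Obama")

# the word cloud now sums features over the selected documents only
set.seed(10)
textplot_wordcloud(dfmat_obama, min_count = 3, max_words = 100)
```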

Comparison wordcloud plots may be plotted by setting comparison = TRUE, which plots a separate grouping for each document in the dfm. This means that you will need to slice out just a few documents from the dfm, or to create a dfm where the "documents" represent a subset or a grouping of documents by some document variable.
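
A dfm whose "documents" are groups can be built with dfm_group(). The following sketch groups recent inaugural addresses by the Party document variable, so that each party becomes one panel of the comparison cloud. The choice of Party, the year cut-off and the colours are assumptions made for illustration.

```r
library("quanteda")
library("quanteda.textplots")

# dfm of inaugural addresses since 2001 (cut-off is illustrative)
dfmat <- data_corpus_inaugural |>
    corpus_subset(Year > 2000) |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm()

# collapse the addresses into one "document" per party
dfmat_party <- dfm_group(dfmat, groups = dfmat$Party) |>
    dfm_trim(min_termfreq = 3)

# one grouping per party, one colour per group
set.seed(10)
textplot_wordcloud(dfmat_party, comparison = TRUE, max_words = 100,
                   color = c("blue", "red"))
```

Grouping first keeps the number of panels small, which tends to make the comparison plot easier to read than one panel per speech.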

Examples

# plot the features (without stopwords) from Obama's inaugural addresses
set.seed(10)
library("quanteda")
library("quanteda.textplots")  # provides textplot_wordcloud()
dfmat1 <- data_corpus_inaugural |>
    corpus_subset(President == "Obama") |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm() |>
    dfm_trim(min_termfreq = 3)

# basic wordcloud
textplot_wordcloud(dfmat1)

# plot in colours with some additional options
textplot_wordcloud(dfmat1, rotation = 0.25,
                   color = rev(RColorBrewer::brewer.pal(10, "RdBu")))

# other display options
col <- sapply(seq(0.1, 1, 0.1), function(x) adjustcolor("#1F78B4", x))
textplot_wordcloud(dfmat1, adjust = 0.5, random_order = FALSE,
                   color = col, rotation = 0)

# comparison plot of Obama v. Trump
dfmat2 <- data_corpus_inaugural |>
    corpus_subset(President %in% c("Obama", "Trump")) |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm()
dfmat2 <- dfm_group(dfmat2, dfmat2$President) |>
    dfm_trim(min_termfreq = 3)

textplot_wordcloud(dfmat2, comparison = TRUE, max_words = 100,
                   color = c("blue", "red"))

if (FALSE) {
# for keyness
tstat <- data_corpus_inaugural[c(1, 3)] |>
    tokens(remove_punct = TRUE) |>
    tokens_remove(stopwords("en")) |>
    dfm() |>
    quanteda.textstats::textstat_keyness()
textplot_wordcloud(tstat, min_count = 2)
textplot_wordcloud(tstat, min_count = 2, comparison = TRUE)
}
