textmodel_wordmap: A model for multinomial feature extraction and document classification

Description

Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.

Usage

textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 0.01,
  boolean = FALSE,
  drop_label = TRUE,
  entropy = c("none", "global", "local", "average"),
  residual = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

Value

Returns a fitted textmodel_wordmap object with the following elements:

model: a matrix that records the association between classes and features.
data: the original input of x.
feature: the feature set in x.
class: the class labels in y.
concatenator: the concatenator in x.
entropy: the scheme to compute entropy weights.
boolean: the use of the Boolean transformation of x.
call: the command used to execute the function.
version: the version of the wordmap package.

Arguments

x: a dfm or fcm created by quanteda::dfm() or quanteda::fcm().
y: a dfm or a sparse matrix that records the class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x.
label: if "max", uses only labels for the maximum value in each row of y.
smooth: the amount of smoothing in computing coefficients. When smooth = 0.01, 1% of the mean frequency of words in each class is added to smooth likelihood ratios.
boolean: if TRUE, only considers the presence or absence of features in each document to limit the impact of words repeated in a few documents.
drop_label: if TRUE, drops empty columns of y and ignores their labels.
entropy: the scheme to compute the entropy used to regularize likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". Local entropy is averaged if "average". See the details.
residual: if TRUE, a residual class is added to y. It is named "other" but can be changed via base::options(wordmap_residual_name).
verbose: if TRUE, shows progress of training.
...: additional arguments passed to internal functions.
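
The boolean and label = "max" options can be illustrated on small matrices. The following is a minimal base R sketch with made-up counts, not the package's internal code. In particular, how the package treats tied maxima is not stated here, so keeping all ties is an assumption of this sketch.

```r
# Toy document-feature counts (3 documents x 3 features); values are made up
x <- matrix(c(3, 0, 1,
              0, 2, 0,
              5, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("war", "trade", "health")))

# boolean = TRUE: keep only presence (1) or absence (0) of each feature
x_bool <- (x > 0) * 1

# Toy class-membership counts, similar to what a dictionary lookup might give
y <- matrix(c(2, 1,
              0, 0,
              1, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("security", "economy")))

# label = "max": keep only the largest value in each row of y
# (assumption: tied maxima are all kept; all-zero rows stay zero)
row_max <- apply(y, 1, max)
y_max <- y * (y == row_max & y > 0)
```

In this toy case doc1 keeps only its "security" label, doc3 keeps only "economy", and doc2 remains unlabelled.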

Details

Wordmap learns the association between words in x and classes in y based on likelihood ratios. Large likelihood ratios tend to concentrate in a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
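
The idea of a smoothed likelihood ratio can be sketched in base R as follows. This is only an illustration of the general technique with made-up counts. The helper smoothed_prob() is hypothetical, and the exact formula used inside the package (including the entropy weighting) may differ.

```r
# Toy inputs: x = document-feature counts, y = document-class membership (0/1)
x <- matrix(c(4, 0, 1,
              3, 1, 0,
              0, 5, 1,
              0, 4, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("army", "tariff", "nation")))
y <- matrix(c(1, 0,
              1, 0,
              0, 1,
              0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("security", "economy")))

smooth <- 0.01

# Feature frequencies inside each class and in all other classes
freq_in  <- t(y) %*% x        # classes x features
freq_out <- t(1 - y) %*% x

# Hypothetical helper: add a fraction of each row's mean frequency,
# then normalise each row to a probability distribution over features
smoothed_prob <- function(freq, smooth) {
  freq <- freq + smooth * rowMeans(freq)
  freq / rowSums(freq)
}

# Log likelihood ratio of each feature for each class
llr <- log(smoothed_prob(freq_in, smooth)) - log(smoothed_prob(freq_out, smooth))
round(llr, 2)
```

Positive values mark features that are more frequent inside a class than outside it. The smoothing term keeps the ratio finite for features that never occur in a class.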

A residual class is created internally by adding a new column to y. The column is given 1 if the other values in the same row are all zero (i.e. rowSums(y) == 0); otherwise 0. It is useful when users cannot create an exhaustive dictionary that covers all the categories.
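
The rule for the residual column can be reproduced on a toy matrix. This is a minimal base R sketch with made-up values. The package operates on a dfm or sparse matrix internally, so the dense matrix here is a simplification.

```r
# Toy class-membership matrix; doc2 matches no dictionary category
y <- matrix(c(1, 0,
              0, 0,
              0, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("security", "economy")))

# 1 where all other values in the row are zero, otherwise 0
other <- as.numeric(rowSums(y) == 0)

# Append the residual class as a new column named "other"
y_res <- cbind(y, other = other)
y_res
```

Only doc2 is assigned to the residual class, because it is the only row with no other label.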

References

Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Examples

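library(quanteda)
library(wordmap)  # provides textmodel_wordmap()

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017, to = "sentences")

# tokenize and remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))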

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)
