
topicmodels.etm (version 0.1.0)

ETM: Topic Modelling in Semantic Embedding Spaces

Description

ETM is a generative topic model combining traditional topic models (LDA) with word embeddings (word2vec).

  • It models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and the embedding of its assigned topic (see the sketch below).

  • The model is fitted using an amortized variational inference algorithm on top of libtorch.
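Concretely, the probability of a term under a topic is the softmax over the vocabulary of the inner product between the topic embedding and the word embedding. A minimal sketch in plain R, where alpha (topics x dim) and rho (vocab x dim) are hypothetical stand-ins for the model's topic and word embedding matrices:

logits <- alpha %*% t(rho)                    # topics x vocab inner products
beta   <- exp(logits) / rowSums(exp(logits))  # row-wise softmax: row k is the
                                              # distribution of topic k over the vocabulary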

Usage

ETM(
  k = 20,
  embeddings,
  dim = 800,
  activation = c("relu", "tanh", "softplus", "rrelu", "leakyrelu", "elu", "selu",
    "glu"),
  dropout = 0.5,
  vocab = rownames(embeddings)
)
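There are two common ways to construct the model, mirroring the Examples below: pass a matrix of pretrained word embeddings whose rownames are the vocabulary, or pass an integer so that word embeddings of that dimension are learned alongside the topics.

model <- ETM(k = 20, embeddings = embeddings)                # pretrained embeddings
model <- ETM(k = 20, embeddings = 15, vocab = colnames(dtm)) # learn 15-dim embeddings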

Value

an object of class ETM, which is a torch nn_module, containing among others:

  • num_topics: the number of topics

  • vocab: character vector with the terminology used in the model

  • vocab_size: the number of words in vocab

  • rho: the word embeddings

  • alphas: the topic embeddings
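
These fields can be inspected directly on the model object, for example (with model as created in the Examples below):

model$num_topics   # number of topics
model$vocab_size   # size of the vocabulary
head(model$vocab)  # first terms of the vocabulary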

Methods

fit(data, optimizer, epoch, batch_size, normalize = TRUE, clip = 0, lr_anneal_factor = 4, lr_anneal_nonmono = 10)

Fit the model on a document-term matrix by splitting the data into a 70/30 training/test set and updating the model weights.

Arguments

data

bag-of-words document-term matrix in dgCMatrix format (a sparse matrix from the Matrix package)

optimizer

object of class torch_Optimizer

epoch

integer with the number of iterations to train

batch_size

integer with the size of the batch

normalize

logical indicating whether to normalize the bag-of-words data

clip

number between 0 and 1 indicating the amount of gradient clipping; passed on to nn_utils_clip_grad_norm_

lr_anneal_factor

divide the learning rate by this factor when the loss on the test set has not decreased for at least lr_anneal_nonmono training iterations

lr_anneal_nonmono

number of iterations after which learning rate annealing is executed if the loss does not decrease
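
Putting the method and its arguments together, a typical call looks as follows (a sketch assuming torch is installed and model and dtm are set up as in the Examples below):

optimizer <- torch::optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
overview  <- model$fit(data = dtm, optimizer = optimizer, epoch = 20,
                       batch_size = 1000, normalize = TRUE)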

References

Adji B. Dieng, Francisco J. R. Ruiz and David M. Blei. Topic Modeling in Embedding Spaces. https://arxiv.org/pdf/1907.04907.pdf

Examples

library(torch)
library(topicmodels.etm)
library(word2vec)
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
##
## Toy example with pretrained embeddings
##

## a. build word2vec model
x          <- subset(brussels_reviews_anno, language %in% "nl")
x          <- paste.data.frame(x, term = "lemma", group = "doc_id") 
set.seed(4321)
w2v        <- word2vec(x = x$lemma, dim = 15, iter = 20, type = "cbow", min_count = 5)
embeddings <- as.matrix(w2v)

## b. build document term matrix on nouns + adjectives, align with the embedding terms
dtm <- subset(brussels_reviews_anno, language %in% "nl" & upos %in% c("NOUN", "ADJ"))
dtm <- document_term_frequencies(dtm, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_conform(dtm, columns = rownames(embeddings))
dtm <- dtm[dtm_rowsums(dtm) > 0, ]

## create and fit an embedding topic model - 8 topics, theta 100-dimensional
if (torch::torch_is_installed()) {

set.seed(4321)
torch_manual_seed(4321)
model       <- ETM(k = 8, dim = 100, embeddings = embeddings, dropout = 0.5)
optimizer   <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
overview    <- model$fit(data = dtm, optimizer = optimizer, epoch = 40, batch_size = 1000)
scores      <- predict(model, dtm, type = "topics")
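## scores: matrix with the topic probabilities of each document in dtm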

## evolution of the loss on the training data at the last batch of each epoch
lastbatch   <- subset(overview$loss, batch_is_last == TRUE)
plot(lastbatch$epoch, lastbatch$loss)
## evolution of the loss on the 30% test set
plot(overview$loss_test)

## show top words in each topic
terminology <- predict(model, type = "terms", top_n = 7)
terminology

##
## Toy example without pretrained word embeddings
##
set.seed(4321)
torch_manual_seed(4321)
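## embeddings = 15: instead of a pretrained matrix, learn 15-dimensional
## word embeddings alongside the topic embeddings during training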
model       <- ETM(k = 8, dim = 100, embeddings = 15, dropout = 0.5, vocab = colnames(dtm))
optimizer   <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
overview    <- model$fit(data = dtm, optimizer = optimizer, epoch = 40, batch_size = 1000)
terminology <- predict(model, type = "terms", top_n = 7)
terminology



}
