top2vec: Distributed Representations of Topics

Description

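Perform text clustering to find topics in a set of text documents. Documents and words are represented by semantic embeddings (doc2vec), the document embeddings are reduced in dimensionality with UMAP and clustered with HDBSCAN, so that each cluster of semantically similar documents forms a topic.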

Usage

top2vec(
  x,
  data = data.frame(doc_id = character(), text = character(), stringsAsFactors = FALSE),
  control.umap = list(n_neighbors = 15L, n_components = 5L, metric = "cosine"),
  control.dbscan = list(minPts = 100L),
  control.doc2vec = list(),
  umap = uwot::umap,
  trace = FALSE,
  ...
)

Value

An object of class top2vec, which is a list with the following elements:

embedding: a list of matrices with word and document embeddings
doc2vec: a doc2vec model
umap: a matrix of representations of the documents of x
dbscan: the result of the hdbscan clustering
data: a data.frame with columns doc_id and text
size: a vector of frequency statistics of topic occurrence
k: the number of clusters
control: a list of control arguments to doc2vec / umap / dbscan
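
The elements listed above can be read directly from the returned list. The following is a minimal sketch, assuming `model` is an object returned by top2vec; `describe_top2vec` is a hypothetical helper written for illustration and is not part of the doc2vec package.

```r
# Hypothetical helper (not part of doc2vec): collect a few headline
# facts from a fitted top2vec object, using the element names
# documented in the Value section.
describe_top2vec <- function(model) {
  stopifnot(inherits(model, "top2vec"))
  list(
    n_topics   = model$k,         # number of clusters
    topic_size = model$size,      # frequency statistics of topic occurrence
    umap_dim   = dim(model$umap)  # documents x reduced dimensions
  )
}
```

Calling `describe_top2vec(model)` on a fitted model returns a small list that is convenient to print before running the heavier `summary(model)`.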

Arguments

x: one of the following: (1) an object returned by paragraph2vec; (2) a data.frame with columns `doc_id` and `text` storing document ids and texts as character vectors; (3) a matrix with document embeddings to cluster; or (4) a list with elements docs and words, containing the document embeddings to cluster and the word embeddings used for deriving topic summaries
data: optionally, a data.frame with columns `doc_id` and `text` representing documents. This data is only stored so that the text of the documents most similar to a topic can be extracted. If it also contains a column `text_doc2vec`, that column is used to indicate the most relevant topic words by class-based tf-idf
control.umap: a list of arguments to pass on to umap for reducing the dimensionality of the embedding space
control.dbscan: a list of arguments to pass on to hdbscan for clustering the reduced embedding space
control.doc2vec: optionally, a list of arguments to pass on to paragraph2vec in case x is a data.frame instead of a doc2vec model trained by paragraph2vec
umap: function used to apply UMAP. Defaults to uwot::umap; uwot::tumap can be used as well
trace: logical indicating whether to print the progress of the algorithm
...: further arguments, currently not used
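
When `x` is given as a list, it needs the two elements docs and words described above. The sketch below builds such a list from placeholder matrices in base R. The random values, the embedding dimension of 4, and the use of rownames as document ids and words are assumptions for illustration only (they mirror what `as.matrix(d2v, which = "docs")` is used for in the Examples); real embeddings would come from a trained model.

```r
set.seed(42)

# Placeholder document embeddings: 6 documents, 4 dimensions (toy values).
docs <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("doc", 1:6), NULL))

# Placeholder word embeddings in the same 4-dimensional space (toy values).
words <- matrix(rnorm(3 * 4), nrow = 3,
                dimnames = list(c("budget", "health", "tax"), NULL))

# The list form of x: element names must be docs and words.
emb <- list(docs = docs, words = words)

# Sanity check: both matrices live in the same embedding space.
stopifnot(is.matrix(emb$docs), is.matrix(emb$words),
          ncol(emb$docs) == ncol(emb$words))
```

A list like `emb` would then be passed as the first argument of top2vec, with `data` supplying the matching `doc_id` and `text` columns. A toy input this small is only meant to show the expected shape, not to produce meaningful clusters.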

References

Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. https://arxiv.org/abs/2008.09470

Examples

# \donttest{
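if(require(doc2vec) && require(word2vec) && require(uwot) && require(dbscan) && require(udpipe)){
library(doc2vec)   # provides paragraph2vec, top2vec and the be_parliament_2020 data
library(word2vec)
library(uwot)
library(dbscan)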
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)
d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, 
                        lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, 
                        threads = 1)
# write.paragraph2vec(d2v, "d2v.bin")
# d2v    <- read.paragraph2vec("d2v.bin")
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 4), trace = TRUE)
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, 
                  trace = TRUE)
                                  
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs
library(udpipe)
info   <- summary(model, top_n = 7, type = "c-tfidf")
info$topwords

## Change the model: reduce doc2vec model to 2D
model  <- update(model, type = "umap", 
                 n_neighbors = 100, n_components = 2, metric = "cosine", umap = tumap, 
                 trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs

## Change the model: have minimum 200 points for the core elements in the hdbscan density
model  <- update(model, type = "hdbscan", minPts = 200, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs
} # End of main if statement running only if the required packages are installed
# }

##
## Example on a small sample 
##  with unrealistic hyperparameter settings especially regarding dim / iter / n_epochs
##  in order to have a basic example finishing < 5 secs
##
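if(require(doc2vec) && require(word2vec) && require(uwot) && require(dbscan)){
library(doc2vec)   # provides paragraph2vec, top2vec and the be_parliament_2020 data
library(uwot)
library(dbscan)
library(word2vec)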
data(be_parliament_2020, package = "doc2vec")
x        <- data.frame(doc_id = be_parliament_2020$doc_id,
                       text   = be_parliament_2020$text_nl,
                       stringsAsFactors = FALSE)
x        <- head(x, 1000)
x$text   <- txt_clean_word2vec(x$text)
x        <- subset(x, txt_count_words(text) < 1000)
d2v      <- paragraph2vec(x, type = "PV-DBOW", dim = 10, 
                          lr = 0.05, iter = 0,
                          window = 5, hs = TRUE, negative = 0,
                          sample = 0.00001, min_count = 5)
emb      <- list(docs  = as.matrix(d2v, which = "docs"),
                 words = as.matrix(d2v, which = "words"))
model    <- top2vec(emb, 
                    data = x,
                    control.dbscan = list(minPts = 50), 
                    control.umap = list(n_neighbors = 15, n_components = 2, 
                                        init = "spectral"), 
                    umap = tumap, trace = TRUE)
info     <- summary(model, top_n = 7)
print(info, top_n = c(5, 2))
} # End of main if statement running only if the required packages are installed
