doc2vec (version 0.2.2)

top2vec: Distributed Representations of Topics

Description

Cluster text documents using semantic embeddings of documents and words, in order to find topics: groups of documents which are semantically similar.

Usage

top2vec(
  x,
  data = data.frame(doc_id = character(), text = character(), stringsAsFactors = FALSE),
  control.umap = list(n_neighbors = 15L, n_components = 5L, metric = "cosine"),
  control.dbscan = list(minPts = 100L),
  control.doc2vec = list(),
  umap = uwot::umap,
  trace = FALSE,
  ...
)

Value

an object of class top2vec, which is a list with the following elements:

  • embedding: a list of matrices with word and document embeddings

  • doc2vec: a doc2vec model

  • umap: a matrix of representations of the documents of x

  • dbscan: the result of the hdbscan clustering

  • data: a data.frame with columns doc_id and text

  • size: a vector of frequency statistics of topic occurrence

  • k: the number of clusters

  • control: a list of control arguments to doc2vec / umap / dbscan

Arguments

x

one of the following:

  • an object returned by paragraph2vec

  • a data.frame with columns `doc_id` and `text` storing document ids and texts as character vectors

  • a matrix with document embeddings to cluster

  • a list with elements docs and words containing document embeddings to cluster and word embeddings for deriving topic summaries
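
As an illustration of the list form of x, the sketch below builds a list with elements docs and words from two plain numeric matrices. The random values, the dimensions and the row names are made up for illustration. It assumes, based on what as.matrix returns for a paragraph2vec model, that row names hold the document ids and the vocabulary terms and that both matrices have the same number of columns.

```r
# Hypothetical embeddings: 6 documents and 4 words in a 3-dimensional space
set.seed(42)
docs  <- matrix(rnorm(6 * 3), nrow = 6,
                dimnames = list(paste0("doc", 1:6), NULL))
words <- matrix(rnorm(4 * 3), nrow = 4,
                dimnames = list(c("budget", "health", "school", "tax"), NULL))

# The list form of x: document embeddings to cluster and
# word embeddings for deriving topic summaries
emb <- list(docs = docs, words = words)

# Basic sanity checks before handing emb to top2vec()
stopifnot(is.matrix(emb$docs), is.matrix(emb$words),
          ncol(emb$docs) == ncol(emb$words),
          !is.null(rownames(emb$docs)), !is.null(rownames(emb$words)))
str(emb)
```

With real data, the two matrices would typically come from as.matrix(d2v, which = "docs") and as.matrix(d2v, which = "words"), as in the Examples section.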

data

optionally, a data.frame with columns `doc_id` and `text` representing the documents. It is only stored so that the text of the documents most similar to a topic can be extracted. If it also contains a column `text_doc2vec`, that column is used to identify the most relevant topic words by class-based TF-IDF (c-tfidf)

control.umap

a list of arguments to pass on to umap for reducing the dimensionality of the embedding space

control.dbscan

a list of arguments to pass on to hdbscan for clustering the reduced embedding space
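
To make the role of control.umap and control.dbscan concrete, here is a minimal sketch of the reduce-then-cluster step on made-up data. It is not the package's internal code: forwarding the lists with do.call is an assumption about how such control lists are commonly used, and the simulated matrix emb and the value minPts = 20 are hypothetical. Only uwot::umap and dbscan::hdbscan are real functions.

```r
if (requireNamespace("uwot", quietly = TRUE) &&
    requireNamespace("dbscan", quietly = TRUE)) {
  set.seed(123)
  # Hypothetical document embeddings: 3 groups of 100 points in 20 dimensions
  centers <- matrix(rnorm(3 * 20, sd = 5), nrow = 3)
  emb     <- centers[rep(1:3, each = 100), ] +
             matrix(rnorm(300 * 20), nrow = 300)

  control.umap   <- list(n_neighbors = 15L, n_components = 5L, metric = "cosine")
  control.dbscan <- list(minPts = 20L)

  # Step 1: reduce the dimensionality, forwarding the control list as arguments
  reduced  <- do.call(uwot::umap, c(list(X = emb), control.umap))
  # Step 2: density-based clustering of the reduced space
  clusters <- do.call(dbscan::hdbscan, c(list(x = reduced), control.dbscan))

  print(dim(reduced))             # one row per document, n_components columns
  print(table(clusters$cluster))  # label 0 marks points treated as noise
}
```

A larger minPts generally yields fewer, broader clusters, which is why the Examples section varies it between 50 and 200.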

control.doc2vec

optionally, a list of arguments to pass on to paragraph2vec in case x is a data.frame instead of a doc2vec model trained by paragraph2vec

umap

function used to apply UMAP. Defaults to uwot::umap; uwot::tumap can be used as well

trace

logical indicating whether to print the progress of the algorithm

...

further arguments, currently not used

References

Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. https://arxiv.org/abs/2008.09470

See Also

paragraph2vec

Examples

# \donttest{
if(require(word2vec) && require(uwot) && require(dbscan) && require(udpipe)){
library(doc2vec)   # provides paragraph2vec, top2vec and be_parliament_2020
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)
d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, 
                        lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, 
                        threads = 1)
# write.paragraph2vec(d2v, "d2v.bin")
# d2v    <- read.paragraph2vec("d2v.bin")
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 4), trace = TRUE)
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, 
                  trace = TRUE)
                                  
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs
library(udpipe)
info   <- summary(model, top_n = 7, type = "c-tfidf")
info$topwords

## Change the model: reduce the doc2vec embedding to 2 dimensions with UMAP
model  <- update(model, type = "umap", 
                 n_neighbors = 100, n_components = 2, metric = "cosine", umap = tumap, 
                 trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs

## Change the model: require at least 200 points for the core elements in the hdbscan density estimate
model  <- update(model, type = "hdbscan", minPts = 200, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs
} # End of main if statement running only if the required packages are installed
# }

##
## Example on a small sample 
##  with unrealistic hyperparameter settings especially regarding dim / iter / n_epochs
##  in order to have a basic example finishing < 5 secs
##
if(require(word2vec) && require(uwot) && require(dbscan)){
library(doc2vec)   # provides paragraph2vec, top2vec and be_parliament_2020
library(uwot)
library(dbscan)
library(word2vec)
data(be_parliament_2020, package = "doc2vec")
x        <- data.frame(doc_id = be_parliament_2020$doc_id,
                       text   = be_parliament_2020$text_nl,
                       stringsAsFactors = FALSE)
x        <- head(x, 1000)
x$text   <- txt_clean_word2vec(x$text)
x        <- subset(x, txt_count_words(text) < 1000)
d2v      <- paragraph2vec(x, type = "PV-DBOW", dim = 10, 
                          lr = 0.05, iter = 0,
                          window = 5, hs = TRUE, negative = 0,
                          sample = 0.00001, min_count = 5)
emb      <- list(docs  = as.matrix(d2v, which = "docs"),
                 words = as.matrix(d2v, which = "words"))
model    <- top2vec(emb, 
                    data = x,
                    control.dbscan = list(minPts = 50), 
                    control.umap = list(n_neighbors = 15, n_components = 2, 
                                        init = "spectral"), 
                    umap = tumap, trace = TRUE)
info     <- summary(model, top_n = 7)
print(info, top_n = c(5, 2))
} # End of main if statement running only if the required packages are installed
