summary.top2vec

Get summary information of a top2vec model, namely the topic centers and the words most similar
to each topic.

Learn vector representations of sentences, paragraphs or documents by using the 'Paragraph Vector' algorithms,
namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') model.
The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014), available at <doi:10.48550/arXiv.1405.4053>.
The package also provides an implementation to cluster documents based on these embeddings using a technique called top2vec.
Top2vec finds clusters in text documents by combining techniques to embed documents and words and density-based clustering.
It does this by embedding documents in the semantic space as defined by the 'doc2vec' algorithm. Next it maps
these document embeddings to a lower-dimensional space using the 'Uniform Manifold Approximation and Projection' (UMAP) dimensionality reduction algorithm
and finds dense areas in that space using 'Hierarchical Density-Based Spatial Clustering of Applications with Noise' (HDBSCAN). These dense
areas are the topic clusters. Each cluster is represented by a topic vector, which is an aggregate of the
document embeddings of the documents belonging to that cluster. Words that lie close to the topic vector in the same semantic space
are representative of the topic.
More details can be found in the paper 'Top2Vec: Distributed Representations of Topics' by D. Angelov available at <doi:10.48550/arXiv.2008.09470>.

Jan Wijffels

doc2vec

Distributed Representations of Sentences, Documents and Topics

BNOSAC 

hiyijian 

summary.top2vec function

<dl><dt>object</dt>
<dd>an object of class <code>top2vec</code> as returned by <code>top2vec</code></dd>
<dt>type</dt>
<dd>a character string with the type of summary information to extract for the topwords: either <code>'similarity'</code> or <code>'c-tfidf'</code>.
The first extracts the words most similar to the topic based on semantic similarity; the second extracts
the words with the highest tf-idf score for each topic</dd>
<dt>top_n</dt>
<dd>integer giving the number of most similar words to return for each topic</dd>
<dt>data</dt>
<dd>a data.frame with columns <code>doc_id</code> and <code>text</code> representing documents.
For each topic, the function extracts the most similar documents.
If <code>type</code> is <code>'c-tfidf'</code>, it also gets the words with the highest tf-idf scores for each topic.</dd>
<dt>embedding_words</dt>
<dd>a matrix of word embeddings to limit the most similar words to. Defaults to 
the embedding of words from the <code>object</code></dd>
<dt>embedding_docs</dt>
<dd>a matrix of document embeddings to limit the most similar documents to. Defaults to
the embedding of documents from the <code>object</code></dd>
<dt>...</dt>
<dd>not used</dd></dl>
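
The two values of <code>type</code> rank words in different ways. The base R sketch below illustrates the idea on made-up data: it is not the package's implementation. The toy embeddings, the helper <code>cosine_to</code>, the use of cosine similarity, and the particular class-based tf-idf weighting are all assumptions chosen for illustration; the package may compute these quantities differently.

```r
# Toy word embeddings: one row per word, one column per dimension (made-up values).
embedding_words <- rbind(
  budget  = c(0.9, 0.1),
  tax     = c(0.8, 0.2),
  vaccine = c(0.1, 0.9),
  virus   = c(0.2, 0.8)
)
# A toy topic vector in the same space as the word embeddings.
topic_vector <- c(1, 0)
top_n <- 2

# Hypothetical helper: cosine similarity between each row of m and the vector v.
cosine_to <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# 'similarity' (sketch): rank words by their similarity to the topic vector.
similarity <- cosine_to(embedding_words, topic_vector)
names(similarity) <- rownames(embedding_words)
topwords_similarity <- names(sort(similarity, decreasing = TRUE))[seq_len(top_n)]

# 'c-tfidf' (sketch): pool the documents of each topic, count terms per topic,
# and weight the counts with one common class-based tf-idf formulation.
docs <- data.frame(
  topic = c(1, 1, 2),
  text  = c("budget tax tax", "tax budget deficit", "virus vaccine virus"),
  stringsAsFactors = FALSE
)
tokens <- strsplit(docs$text, " ", fixed = TRUE)
counts <- unclass(table(
  topic = rep(docs$topic, lengths(tokens)),
  term  = unlist(tokens)
))
tf     <- counts / rowSums(counts)                           # term frequency within each topic
idf    <- log(1 + mean(rowSums(counts)) / colSums(counts))   # terms that are rare overall weigh more
ctfidf <- sweep(tf, 2, idf, `*`)

# One column per topic, top_n rows with the highest-scoring terms.
topwords_ctfidf <- apply(ctfidf, 1, function(s) {
  names(sort(s, decreasing = TRUE))[seq_len(top_n)]
})

print(topwords_similarity)
print(topwords_ctfidf)
```

The similarity ranking only needs the embeddings, whereas the c-tf-idf ranking needs the raw text of the documents in each topic, which is why <code>data</code> is required when <code>type</code> is <code>'c-tfidf'</code>.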



Examples
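
A minimal usage sketch, assuming the doc2vec, word2vec, uwot and dbscan packages are installed. It uses the <code>be_parliament_2020</code> dataset shipped with doc2vec; the hyperparameter values are illustrative rather than recommended settings, and the element names <code>topwords</code> and <code>topdocuments</code> of the returned list are stated here as an assumption about the return value. Training can take a while, so the code only runs when all packages are available.

```r
# Only run the example when every required package is installed.
needed   <- c("doc2vec", "word2vec", "uwot", "dbscan")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))

if (have_all) {
  library(doc2vec)
  library(word2vec)
  library(uwot)
  library(dbscan)

  # Build the doc_id/text data.frame expected by paragraph2vec and top2vec.
  data(be_parliament_2020, package = "doc2vec")
  x <- data.frame(doc_id = be_parliament_2020$doc_id,
                  text   = be_parliament_2020$text_nl,
                  stringsAsFactors = FALSE)
  x$text <- txt_clean_word2vec(x$text)
  x      <- subset(x, txt_count_words(text) < 1000)

  # Embed the documents with PV-DBOW (illustrative settings).
  d2v <- paragraph2vec(x, type = "PV-DBOW", dim = 50,
                       lr = 0.05, iter = 10, window = 15,
                       hs = TRUE, negative = 0, sample = 0.00001,
                       min_count = 5, threads = 1)

  # Cluster the document embeddings into topics.
  model <- top2vec(d2v, data = x,
                   control.dbscan = list(minPts = 50),
                   control.umap   = list(n_neighbors = 15L, n_components = 4),
                   umap = tumap, trace = TRUE)

  # Words most similar to each topic, based on semantic similarity.
  info <- summary(model, type = "similarity", top_n = 7)
  print(info$topwords)

  # Words with the highest tf-idf score per topic; this needs the documents.
  info <- summary(model, type = "c-tfidf", top_n = 7, data = x)
  print(info$topwords)
  print(info$topdocuments)
}
```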