terms.data.frame

This extracts words that occur near one another, within a given window size, inside the same context.
The default setting provides the biterms used when fitting <code>BTM</code> with the default window parameter.

The Biterm Topic Model (BTM) finds topics in collections of short texts.
It is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, which are called biterms.
This is in contrast to traditional topic models such as Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis,
which are word-document co-occurrence topic models.
A biterm consists of two words co-occurring in the same short text window.
This context window can, for example, be a Twitter message, a short answer on a survey, a sentence of a text or a document identifier.
The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.
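
To make the notion of a biterm concrete, the following base R sketch lists the word pairs of one short text that fall within a window. The helper `window_biterms` is hypothetical and not part of the BTM package, and the boundary convention (word i is paired with the next `window - 1` words) is an assumption made for illustration; the package may treat the window edge differently.

```r
# Hypothetical helper, not part of the BTM package: list the word pairs
# (biterms) whose positions are fewer than `window` tokens apart.
window_biterms <- function(words, window = 15L) {
  n <- length(words)
  pairs <- list()
  for (i in seq_len(max(n - 1L, 0L))) {
    # Assumed boundary convention: pair word i with words i+1 .. i+window-1
    last <- min(i + window - 1L, n)
    if (last > i) {
      for (j in (i + 1L):last) {
        pairs[[length(pairs) + 1L]] <- c(words[i], words[j])
      }
    }
  }
  if (length(pairs) == 0L) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      stringsAsFactors = FALSE))
  }
  m <- do.call(rbind, pairs)
  data.frame(term1 = m[, 1], term2 = m[, 2], stringsAsFactors = FALSE)
}

tweet <- c("cheap", "flights", "to", "brussels")
short_window <- window_biterms(tweet, window = 2L)   # adjacent words only
full_window  <- window_biterms(tweet, window = 15L)  # every pair in the tweet
```

With a window larger than the text, every pair of words in the context becomes a biterm, which is why a generous window suits tweets or survey answers.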

Jan Wijffels

Biterm Topic Models for Short Text

BNOSAC 

Xiaohui Yan 

terms.data.frame function

<dl><dt>x</dt>
<dd>a tokenised data frame containing one row per token with 2 columns<ul>
<li>the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text)</li>
<li>the second column is of type character and contains the sequence of words occurring within the context identifier</li>
</ul></dd>
<dt>type</dt>
<dd>a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'.</dd>
<dt>window</dt>
<dd>integer with the window size for biterm extraction. Defaults to 15.</dd>
<dt>...</dt>
<dd>not used</dd></dl>
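
The arguments above can be exercised with a small sketch. It assumes the BTM package is installed and that the method is reached through the `terms` generic; the toy data frame and its column names are invented for illustration, and the structure of the returned objects is inspected with `str` and not asserted here.

```r
library(BTM)

# Toy tokenised input, one row per token (illustrative data):
# first column is the context identifier, second column the word
x <- data.frame(
  doc_id = c("tweet1", "tweet1", "tweet1", "tweet2", "tweet2"),
  token  = c("cheap", "flights", "brussels", "train", "delay"),
  stringsAsFactors = FALSE
)

# type = "tokens" is the documented default
tokens <- terms(x, type = "tokens")

# Request biterms explicitly, using the documented default window of 15
biterms <- terms(x, type = "biterms", window = 15)

str(tokens)
str(biterms)
```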

Arguments

Get the set of Biterms from a tokenised data frame — terms.data.frame

