add_multitoken_label

Given a multitoken category (e.g., named entity ids), this function finds the most frequently occurring string in that category and adds it as a label for the category.

Provides text analysis in R, focusing on the use of a tokenized text format. In this format, the positions of tokens are maintained, and each token can be annotated (e.g., part-of-speech tags, dependency relations).
Prominent features include advanced Lucene-like querying for specific tokens or contexts (e.g., documents, sentences),
similarity statistics for words and documents, exporting to DTM for compatibility with many text analysis packages,
and the possibility to reconstruct original text from tokens to facilitate interpretation.

Kasper Welbers

corpustools

Managing, Querying and Analyzing Tokenized Text

add_multitoken_label function

Arguments

<dl><dt>tc</dt>
<dd>a tcorpus object</dd>
<dt>colloc_id</dt>
<dd>the data column containing the unique id for multitoken tokens</dd>
<dt>feature</dt>
<dd>the name of the feature column</dd>
<dt>new_feature</dt>
<dd>the name of the new feature column</dd>
<dt>pref_subset</dt>
<dd>Optionally, a subset call that specifies which tokens have priority when finding the most frequently occurring string</dd></dl>
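The labeling idea described above can be illustrated with a minimal base R sketch. This is not the corpustools implementation: the helper `label_multitoken`, the toy data, and the column names `doc_id`, `token`, and `entity_id` are hypothetical. The sketch assumes that consecutive tokens in the same document with the same id form one occurrence, that an occurrence's string is its tokens joined by a space, and that ties are resolved alphabetically. The `pref_subset` option is left out.

```r
# Toy token table; the column names are assumptions made for this sketch.
tokens <- data.frame(
  doc_id    = c("d1", "d1", "d1", "d1", "d2", "d2", "d2", "d3", "d3"),
  token     = c("Barack", "Obama", "spoke", "today",
                "Mr.", "Obama", "left", "Barack", "Obama"),
  entity_id = c(1, 1, NA, NA, 1, 1, NA, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label every multitoken id with its most frequent string.
label_multitoken <- function(tokens, id_col, feature = "token",
                             new_feature = paste0(id_col, "_l")) {
  ids    <- tokens[[id_col]]
  docs   <- tokens$doc_id
  n      <- nrow(tokens)
  has_id <- !is.na(ids)

  # An occurrence starts where the id or the document differs from the previous row.
  prev_id  <- c(NA, ids[-n])
  prev_doc <- c(NA, docs[-n])
  starts <- has_id & (is.na(prev_id) | prev_id != ids |
                      is.na(prev_doc) | prev_doc != docs)
  occurrence <- cumsum(starts)

  # One string per occurrence, plus the id that each occurrence belongs to.
  occ_string <- vapply(split(tokens[[feature]][has_id], occurrence[has_id]),
                       paste, character(1), collapse = " ")
  occ_id <- ids[starts]

  # Most frequent string per id; table() sorts names, so ties resolve alphabetically.
  labels <- vapply(split(occ_string, occ_id),
                   function(s) names(which.max(table(s))),
                   character(1))

  # Write the label onto every token that carries an id; other tokens stay NA.
  tokens[[new_feature]] <- NA_character_
  tokens[[new_feature]][has_id] <- unname(labels[as.character(ids[has_id])])
  tokens
}

result <- label_multitoken(tokens, "entity_id")
print(result)
```

In this toy data the id 1 appears twice as "Barack Obama" and once as "Mr. Obama", so every token with that id receives the label "Barack Obama" in the new `entity_id_l` column.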

Choose and add multitoken strings based on multitoken categories — add_multitoken_label
