tokens_compound

an input <a rd-options="" href="/link/tokens?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::tokens">tokens</a> object

a character vector, list of character vectors,
<a rd-options="" href="/link/dictionary?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::dictionary">dictionary</a>, or <a rd-options="" href="/link/collocations?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::collocations">collocations</a> object. See <a rd-options="" href="/link/pattern?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::pattern">pattern</a> for
details.

pattern

the concatenation character that will connect the words
making up the multi-word sequences. The default <code>_</code> is recommended since
it will not be removed during normal cleaning and tokenization (while
nearly all other punctuation characters, at least those in the Unicode
punctuation class <code>[P]</code>, will be removed).

concatenator

the type of pattern matching: <code>"glob"</code> for "glob"-style
wildcard expressions; <code>"regex"</code> for regular expressions; or <code>"fixed"</code> for
exact matching. See <a rd-options="" href="/link/valuetype?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::valuetype">valuetype</a> for details.

valuetype

integer; a vector of length 1 or 2 that specifies the size of the
window of tokens adjacent to <code>pattern</code> that will be compounded with matches
to <code>pattern</code>. The window can be asymmetric if two elements are specified,
with the first giving the window size before <code>pattern</code> and the second the
window size after. If paddings (empty <code>""</code> tokens) are found, the window will
be shrunk to exclude them.

window
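
The effect of <code>window</code> can be illustrated with a minimal sketch. The toy text and the object names (<code>toks</code>, <code>sym</code>, <code>asym</code>) are assumptions made for illustration only; the comments describe the behavior stated in the argument description above.

```r
library(quanteda)

# Toy input chosen for illustration (not from the package documentation)
toks <- tokens("a b c d e")

# Symmetric window: one token on each side of the match is joined to it
sym <- tokens_compound(toks, pattern = "c", window = 1)
as.list(sym)[[1]]

# Asymmetric window: no tokens before the match, one token after it
asym <- tokens_compound(toks, pattern = "c", window = c(0, 1))
as.list(asym)[[1]]
```

With the symmetric window the match and both neighbors should form one compound; with <code>c(0, 1)</code> only the following token should be attached.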

logical; if <code>TRUE</code>, ignore case when matching a
<code>pattern</code> or <a rd-options="" href="/link/dictionary?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::dictionary">dictionary</a> values

case_insensitive

logical; if <code>TRUE</code>, join overlapping compounds into a single
compound; otherwise, form these separately. See examples.

join
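
A short sketch of how <code>join</code> treats overlapping matches. The sentence and the object names (<code>seqs</code>, <code>joined</code>, <code>separate</code>) are illustrative assumptions; <code>phrase()</code> is used so that each whitespace-separated string is treated as a multi-token sequence.

```r
library(quanteda)

toks <- tokens("we like high quality sound")

# Two patterns that overlap on the token "quality"
seqs <- phrase(c("high quality", "quality sound"))

# join = TRUE: overlapping matches are merged into one compound
joined <- tokens_compound(toks, pattern = seqs, join = TRUE)
as.list(joined)[[1]]

# join = FALSE: each match is compounded separately
separate <- tokens_compound(toks, pattern = seqs, join = FALSE)
as.list(separate)[[1]]
```

With <code>join = FALSE</code> the shared token appears in both compounds, so the document gains a token relative to the joined version.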

Replace multi-token sequences with a multi-word, or "compound" token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with <code>concatenator</code> (by default, the "<code>_</code>" character) to form a
single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a <a rd-options="" href="/link/dfm?package=quanteda&version=2.1.2" data-mini-rdoc="quanteda::dfm">dfm</a>.
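
The description above can be sketched as follows. The example sentence and the object names (<code>toks</code>, <code>comp</code>) are assumptions for illustration; wrapping the patterns in <code>phrase()</code> marks each whitespace-separated string as a sequence of tokens.

```r
library(quanteda)

txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)

# Compound the two multi-word expressions using the default concatenator "_"
comp <- tokens_compound(toks,
                        pattern = phrase(c("United Kingdom", "European Union")))
as.list(comp)[[1]]

# The compounds are then counted as single features in a dfm
# (dfm() lowercases features by default)
featnames(dfm(comp))
```

A list of character vectors, such as <code>list(c("United", "Kingdom"), c("European", "Union"))</code>, is an alternative way to specify the same sequences.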

A fast, flexible, and comprehensive framework for
quantitative text analysis in R. Provides functionality for corpus management,
creating and manipulating tokens and ngrams, exploring keywords in context,
forming and manipulating sparse matrices
of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and
distances, applying content dictionaries, applying supervised and unsupervised machine learning,
visually representing text and text analyses, and more.

Kenneth Benoit

quanteda

Quantitative Analysis of Textual Data

Kohei Watanabe

Haiyan Wang

Paul Nulty

Adam Obeng

Stefan Müller

Akitaka Matsuo

Jiong Wei Lua

Jouni Kuha

William Lowe

Christian Müller

Lori Young

Stuart Soroka

Ian Fellows

European Research Council 

tokens_compound: Convert token sequences into compound tokens
