Tokenize-AsWeka

ngram_asweka

An n-gram tokenizer with identical output to the <code>NGramTokenizer</code>
function from the RWeka package.

Tokenization

An n-gram is a sequence of n "words" taken, in order, from a
body of text. This package is a collection of utilities for creating,
displaying, summarizing, and "babbling" n-grams. The
tokenization and babbling are handled by efficient C
code, which can also be built as a standalone library.
The babbler is a simple Markov chain. The package also includes
a vignette with complete example workflows and information about
the utilities it offers.
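
To make the definition of an n-gram concrete, the following is a minimal base-R sketch that slides a window of n words over a string. The helper `simple_ngrams` is hypothetical and is not part of the ngram package; it also simplifies `sep` to a single literal separator string, whereas the package treats `sep` as a set of characters. The final call to `ngram_asweka` uses only the arguments documented on this page and is guarded so the sketch still runs when the package is not installed; its output ordering is not assumed here.

```r
# Base-R sketch of the n-gram definition: slide a window of n words over the text.
# `simple_ngrams` is a hypothetical helper, not part of the ngram package.
simple_ngrams <- function(str, n, sep = " ") {
  # Assumption: `sep` is one literal string (the package accepts a set of characters).
  words <- strsplit(str, sep, fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by repeated separators
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

bigrams <- simple_ngrams("a b c d", n = 2)
print(bigrams)

# The packaged tokenizer, called with the arguments documented on this page.
# Guarded so the sketch still runs when the ngram package is not installed.
if (requireNamespace("ngram", quietly = TRUE)) {
  print(ngram::ngram_asweka("a b c d", min = 2, max = 2, sep = " "))
}
```

With `min` and `max` both set to 2, the packaged tokenizer is restricted to bigrams, which makes it easy to compare against the sketch by eye.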

Drew Schmidt

ngram

Fast n-Gram 'Tokenization'

Christian Heckendorf

Tokenize-AsWeka function

<dl><dt>str</dt>
<dd>The input text.</dd>
<dt>min, max</dt>
<dd>The minimum and maximum 'n' as in 'n-gram'.</dd>
<dt>sep</dt>
<dd>A set of separator characters for the "words". See Details for
information about how this works; it behaves a little differently
from the <code>sep</code> arguments of other R functions.</dd></dl>

Arguments

Weka-like n-gram Tokenization — Tokenize-AsWeka

