ngram (version 3.2.3)

Tokenize-AsWeka: Weka-like n-gram Tokenization

Description

An n-gram tokenizer whose output is identical to that of the NGramTokenizer function from the RWeka package.

Usage

ngram_asweka(str, min = 2, max = 2, sep = " ")

Value

A vector of n-grams listed in blocks of decreasing n (the max-grams first, the min-grams last), in order of appearance within each block. The output matches that of RWeka's n-gram tokenizer.
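The ordering described above can be illustrated with a small base-R sketch. The helper weka_order_sketch() is hypothetical and is not part of the ngram package; it assumes words are separated by single spaces and only mimics the documented ordering, not the package's implementation.

```r
# Hypothetical helper (not part of ngram): builds n-grams in blocks of
# decreasing n, in order of appearance within each block.
weka_order_sketch <- function(str, min = 2, max = 2) {
  # Assumption: words are separated by single spaces.
  words <- strsplit(str, " ", fixed = TRUE)[[1]]
  out <- character(0)
  for (n in seq(max, min, by = -1)) {
    if (n > length(words)) next  # no n-grams of this size are possible
    starts <- seq_len(length(words) - n + 1)
    block <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, block)
  }
  out
}

res <- weka_order_sketch("A B A C A B B", min = 2, max = 4)
print(res)
```

For this input the sketch yields four 4-grams, then five 3-grams, then six 2-grams, which is the block layout the Value section describes.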

Arguments

str

The input text.

min, max

The minimum and maximum 'n' as in 'n-gram'.

sep

A set of separator characters for the "words". See Details for information about how this works; it works a little differently from the sep arguments of other R functions.
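As a rough illustration of the sep argument, the sketch below passes several separator characters at once. It assumes, as the phrase "a set of separator characters" suggests, that each character in sep acts as an individual delimiter rather than the whole string acting as one; the input string is made up for the example.

```r
library(ngram)

# Assumption: a comma, a semicolon, and a space are each treated as a
# word separator, so the input below contains the four words A, B, A, C.
str <- "A,B;A C"
grams <- ngram_asweka(str, min = 2, max = 2, sep = ",; ")
print(grams)
```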

Details

This n-gram tokenizer takes input and returns output in much the same way as the tokenizer in RWeka. Unlike the tokenizer ngram(), its return value is not an object of a special class holding external pointers; it is an ordinary vector, and can therefore be serialized via save() or saveRDS().
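Because the result is a plain vector, it should survive a round trip through saveRDS() and readRDS(). The following is a minimal sketch; the temporary file path is only for illustration.

```r
library(ngram)

grams <- ngram_asweka("A B A C A B B", min = 2, max = 3)

# Write the vector to a temporary file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(grams, path)
restored <- readRDS(path)

# The restored object is expected to be identical to the original.
print(identical(grams, restored))

unlink(path)  # clean up the temporary file
```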

See Also

ngram

Examples

library(ngram)

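# Input text: seven single-letter "words" separated by spaces
str <- "A B A C A B B"

# All 4-grams, then all 3-grams, then all 2-grams
ngram_asweka(str, min = 2, max = 4)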
