R interfaces to Weka tokenizers.
AlphabeticTokenizer(x, control = NULL)
NGramTokenizer(x, control = NULL)
WordTokenizer(x, control = NULL)
- x: a character vector with strings to be tokenized.
- control: an object of class Weka_control, or a character vector of control options, or NULL (default). Available options can be obtained online using the Weka Option Wizard (WOW).
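As a minimal sketch of the two non-NULL forms of the control argument, the following assumes that the RWeka package and a working Java installation are available. The `-min` and `-max` options are those of Weka's NGramTokenizer class; the input string is made up for illustration, and the two calls are expected (not guaranteed here) to behave the same.

```r
library(RWeka)

x <- "the quick brown fox"  # illustrative input, not taken from the package

# Control options given as a Weka_control object.
bigrams_a <- NGramTokenizer(x, control = Weka_control(min = 2, max = 2))

# The same options given as a character vector of Weka command-line options
# (assumed to be equivalent to the Weka_control form above).
bigrams_b <- NGramTokenizer(x, control = c("-min", "2", "-max", "2"))

# Ask the Weka Option Wizard which options the tokenizer accepts.
WOW("NGramTokenizer")
```

Passing NULL (the default) leaves all options at the values chosen by Weka.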
AlphabeticTokenizer is an alphabetic string tokenizer: tokens are
formed only from contiguous alphabetic sequences.
NGramTokenizer splits strings into n-grams, with n ranging between
given minimal and maximal numbers of grams.
WordTokenizer is a simple word tokenizer.
- A character vector with the tokenized strings.
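The following sketch runs the three tokenizers on one input so that their results can be compared side by side. It assumes that RWeka and Java are installed; the sample string and the variable names are hypothetical, and the `min`/`max` settings are just one possible choice.

```r
library(RWeka)

x <- "Version 2 of the parser, released in 2023."  # illustrative input

# Tokens built only from contiguous alphabetic sequences.
alpha_tokens <- AlphabeticTokenizer(x)

# Simple word tokens, using Weka's default delimiters.
word_tokens <- WordTokenizer(x)

# Unigrams and bigrams (n from 1 to 2).
ngram_tokens <- NGramTokenizer(x, control = Weka_control(min = 1, max = 2))

# Each result is a character vector of tokens.
print(alpha_tokens)
print(word_tokens)
print(ngram_tokens)
```

Printing the three vectors next to each other is a quick way to see how digits and punctuation are treated by each tokenizer before choosing one.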