toaster (version 0.5.5)

nGram: Tokenize (or split) text and emit multi-grams.

Description

Tokenize (or split) text and emit multi-grams.

Usage

nGram(n, ignoreCase = FALSE, delimiter = "[ \\t\\b\\f\\r]+", punctuation = NULL, overlapping = TRUE, reset = NULL, sep = " ", minLength = 1)

Arguments

n
length, in words, of each n-gram
ignoreCase
logical: if FALSE, n-gram matching is case sensitive; if TRUE, case is ignored during matching.
delimiter
character or string that divides one word from the next. You can use a regular expression as the delimiter value.
punctuation
a regular expression that specifies the punctuation characters the parser removes before it evaluates the input text.
overlapping
logical: if TRUE, overlapping n-grams are allowed.
reset
a regular expression listing one or more punctuation characters or strings, any of which the nGram parser will recognize as the end of a sentence of text. The end of each sentence resets the search for n-grams, meaning that nGram discards any partial n-grams and proceeds to the next sentence to search for the next n-gram. In other words, no n-gram can span two sentences.
sep
a character string to separate multiple text columns.
minLength
minimum length of words in an n-gram. N-grams that contain words shorter than this limit are omitted. The current implementation is incomplete: it filters out only n-grams in which every word is below the minimum length, i.e. n-grams whose total length is below n*minLength + (n-1).
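
The minLength threshold can be illustrated with a small base R sketch. This is not toaster's code: `keep_by_min_length` is a hypothetical helper that assumes the words of each n-gram are joined by single spaces and applies only the total-length rule n*minLength + (n-1) stated above.

```r
# Hypothetical helper (not part of toaster): keeps n-grams whose total
# character length reaches the documented threshold.
keep_by_min_length <- function(ngrams, n, minLength) {
  # n words of at least minLength characters plus (n - 1) separating spaces
  threshold <- n * minLength + (n - 1)
  ngrams[nchar(ngrams) >= threshold]
}

bigrams <- c("a b", "to be", "an apple", "big data")

# Threshold is 2 * 3 + 1 = 7, so "a b" (3) and "to be" (5) are dropped.
keep_by_min_length(bigrams, n = 2, minLength = 3)
# [1] "an apple" "big data"
```

Note that "an apple" survives even though "an" is shorter than 3 characters, which is the kind of gap the description refers to when it calls the filter incomplete.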

Value

pluggable n-gram parser
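
Examples

The returned parser is evaluated in the database, so it cannot be run on its own here. As a rough illustration of how the overlapping and reset arguments are described above, the following base R sketch emulates them locally. `sketch_ngrams` is a hypothetical function, not toaster's implementation; it assumes a single input string and whitespace as the delimiter.

```r
# Hypothetical local emulation of the documented n-gram semantics.
sketch_ngrams <- function(text, n, overlapping = TRUE, reset = NULL,
                          delimiter = "[[:space:]]+", ignoreCase = FALSE) {
  if (ignoreCase) text <- tolower(text)
  # Split into sentences first so that no n-gram spans two sentences.
  sentences <- if (is.null(reset)) text else strsplit(text, reset)[[1]]
  # Overlapping n-grams advance one word at a time, otherwise n words.
  step <- if (overlapping) 1 else n
  out <- character(0)
  for (s in sentences) {
    words <- strsplit(trimws(s), delimiter)[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next  # too short to form a complete n-gram
    for (i in seq(1, length(words) - n + 1, by = step)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

txt <- "the quick brown fox. it ran away"

# Overlapping bigrams, with "." treated as a sentence end.
sketch_ngrams(txt, n = 2, overlapping = TRUE, reset = "[.?!]")
# [1] "the quick"   "quick brown" "brown fox"   "it ran"      "ran away"

# Non-overlapping bigrams: the trailing partial n-gram "away" is discarded.
sketch_ngrams(txt, n = 2, overlapping = FALSE, reset = "[.?!]")
# [1] "the quick" "brown fox" "it ran"
```

In both calls no bigram such as "fox it" appears, because the sentence boundary resets the search.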