ngram (version 3.0.4)

Tokenize: n-gram Tokenization

Description

The ngram() function is the main workhorse of this package. It takes an input string and converts it into the internal n-gram representation.

Usage

ngram(str, n = 2, sep = " ")

# S4 method for character
ngram(str, n = 2, sep = " ")

Arguments

str

The input text.

n

The 'n' as in 'n-gram'.

sep

A set of separator characters for the "words". See Details for how this works; it behaves a little differently from the sep arguments of other R functions.
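To make the role of n concrete, here is a minimal base R sketch of how a sliding window of n words yields n-grams. The helper name word_ngrams() is hypothetical and is not part of the ngram package; it only illustrates the concept and says nothing about the package's internal representation.

```r
# Illustration only: word_ngrams() is a hypothetical helper, not an ngram API.
word_ngrams <- function(words, n = 2) {
  # Fewer than n words means no n-gram can be formed
  if (length(words) < n) return(character(0))
  # One window starts at each position that leaves room for n words
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

words <- c("A", "B", "A", "C", "A", "B", "B")
bigrams  <- word_ngrams(words, n = 2)  # 6 overlapping windows of 2 words
trigrams <- word_ngrams(words, n = 3)  # 5 overlapping windows of 3 words
print(bigrams)
print(trigrams)
```

A sequence of k words produces k - n + 1 overlapping n-grams, so larger values of n give fewer, longer n-grams.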

Value

An ngram class object.

Details

On evaluation, a copy of the input string is produced and stored as an external pointer. This is necessary because the internal list representation only points to the first character of each word in the input string. If you (or R's garbage collector) were to delete the input string, those pointers would no longer refer to valid memory.

The sep parameter splits at any of the characters in the string. So sep=", " splits at a comma or a space.
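The "any of the characters" rule can be mimicked in base R. The sketch below is an illustration only, not the package's implementation: split_any() is a hypothetical helper, and it assumes that runs of adjacent separators produce no empty words, which the text above does not specify.

```r
# Illustration only: split_any() is a hypothetical helper, not an ngram API.
split_any <- function(text, sep = " ") {
  # Treat sep as a set of single characters, not as one multi-character string
  seps   <- strsplit(sep, "", fixed = TRUE)[[1]]
  chars  <- strsplit(text, "", fixed = TRUE)[[1]]
  is_sep <- chars %in% seps
  # Every separator bumps the group counter, so each word gets its own group
  group  <- cumsum(is_sep)
  # Glue the non-separator characters of each group back into a word
  words  <- tapply(chars[!is_sep], group[!is_sep], paste, collapse = "")
  unname(as.vector(words))
}

str <- "A,B,A,C A B B"
by_space <- split_any(str, sep = " ")   # splits at spaces only
by_comma <- split_any(str, sep = ",")   # splits at commas only
by_both  <- split_any(str, sep = ", ")  # splits at a comma or a space
print(by_space)
print(by_comma)
print(by_both)
```

Note that with sep = ", " the string is never required to contain the two-character sequence ", "; either character on its own ends a word.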

See Also

ngram-class, getters, phrasetable, babble

Examples

# NOT RUN {
library(ngram)

str <- "A B A C A B B"
ngram(str, n=2)

str <- "A,B,A,C A B B"
### Split at a space
print(ngram(str), output="full")
### Split at a comma
print(ngram(str, sep=","), output="full")
### Split at a space or a comma
print(ngram(str, sep=", "), output="full")

# }
