ngram (version 3.0.4)

ngram-class: Class ngram

Description

An n-gram is an ordered sequence of n "words" taken from a body of "text". The terms "words" and "text" can be interpreted literally or more loosely.

Slots

str_ptr

A pointer to a copy of the original input string.

strlen

The length of the string.

n

The eponymous 'n' as in 'n-gram'.

ngl_ptr

A pointer to the processed list of n-grams.

ngsize

The length of the n-gram list, i.e., the number of unique n-grams in the input string.

sl_ptr

A pointer to the list of words from the input string.

Details

For example, consider the sequence "A B A C A B B". If we examine the 2-grams (or bigrams) of this sequence, they are

A B, B A, A C, C A, A B, B B

or without repetition:

A B, B A, A C, C A, B B

That is, we take the input string and group the "words" 2 at a time (because n=2). The number of unique n-grams is not obviously related to the number of words. Counting repetition, however, the number of n-grams is exactly

nwords - n + 1
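
The grouping described above can be sketched in base R. This is only an illustration of the sliding-window idea, not the package's internal implementation; the variable names (input, words, all_ngrams, unique_ngrams) are made up for this example, and splitting on a single space is an assumption about how the "words" are delimited.

```r
# Illustration only: base R, no ngram package functions are used here.
input <- "A B A C A B B"
n <- 2

# Assumption: "words" are separated by single spaces.
words <- strsplit(input, " ", fixed = TRUE)[[1]]
nwords <- length(words)

# Slide a window of width n across the words and paste each window together.
starts <- seq_len(nwords - n + 1)
all_ngrams <- vapply(
  starts,
  function(i) paste(words[i:(i + n - 1)], collapse = " "),
  character(1)
)

# Drop repeated n-grams, keeping the order of first appearance.
unique_ngrams <- unique(all_ngrams)

print(all_ngrams)     # every 2-gram, repetitions included
print(unique_ngrams)  # 2-grams without repetition

# Counting repetition, the number of n-grams is nwords - n + 1.
stopifnot(length(all_ngrams) == nwords - n + 1)
```

For this input, nwords is 7 and n is 2, so the count with repetition is 7 - 2 + 1 = 6, matching the first list above.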

Bounds ignoring repetition are highly dependent on the input. A correct but useless bound is

#ngrams = nwords - (#repeats - 1) - (n - 1)

An ngram object is an S4 class container that stores some basic summary information (e.g., n) and several external pointers. For information on how to construct an ngram object, see ngram.
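
As a rough sketch of how the summary slots might be inspected, the snippet below builds an ngram object from the example string and reads the n and ngsize slots documented above. It assumes the ngram package is installed and that the constructor accepts the string and n as shown; the guard simply skips the example otherwise. The pointer slots (str_ptr, ngl_ptr, sl_ptr) are external pointers and are not meant to be read directly from R.

```r
# Sketch under assumptions: requires the ngram package; skipped if absent.
if (requireNamespace("ngram", quietly = TRUE)) {
  input <- "A B A C A B B"

  # Construct the S4 object for bigrams (n = 2).
  ng <- ngram::ngram(input, n = 2)

  # Summary slots described in the Slots section.
  print(ng@n)       # the 'n' in 'n-gram'
  print(ng@ngsize)  # number of unique n-grams in the input

  # The unique n-grams themselves, retrieved via the package accessor.
  print(ngram::get.ngrams(ng))
}
```

Using the accessor rather than the pointer slots keeps the example independent of how the n-gram list is stored internally.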

See Also

Tokenize