Create a set of n-grams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skip-grams. Both the n-gram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.
tokens_ngrams(
x,
n = 2L,
skip = 0L,
concatenator = concat(x),
verbose = quanteda_options("verbose")
)
char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
tokens_skipgrams(
x,
n,
skip,
concatenator = concat(x),
verbose = quanteda_options("verbose")
)
a tokens object consisting of a list of character vectors of n-grams, one list element per text, or a character vector if called on a simple character vector
a tokens object, or a character vector, or a list of characters
integer vector specifying the number of elements to be concatenated
in each n-gram. Each element of this vector will define an n in the
n-gram(s) that are produced.
integer vector specifying the adjacency skip size for tokens
forming the n-grams, default is 0 for only immediately neighbouring words.
For skipgrams, skip can be a vector of integers, as the "classic" approach
to forming skip-grams is to allow every number of skips up to a maximum.
For example, skip = 0:4 produces results that include 4 skips, 3 skips,
2 skips, 1 skip, and 0 skips (where 0 skips are typical n-grams formed from
adjacent words). See Guthrie et al. (2006).
character for combining words; the default is _ (underscore)
if TRUE, print the number of tokens and documents before and
after the function is applied. The number of tokens does not include paddings.
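To illustrate how a vector-valued n yields several n-gram lengths in one pass, here is a minimal base R sketch. The helper `ngrams_of_size()` is hypothetical and written only for this illustration; it is not part of quanteda and makes no claim about how the C++ implementation works or orders its output.

```r
# Hypothetical helper, not part of quanteda: contiguous n-grams of one size
# built from a plain character vector of tokens.
ngrams_of_size <- function(toks, size, concatenator = "_") {
  if (length(toks) < size) return(character(0))
  # Each valid start position yields exactly one n-gram.
  starts <- seq_len(length(toks) - size + 1L)
  vapply(
    starts,
    function(i) paste(toks[i:(i + size - 1L)], collapse = concatenator),
    character(1)
  )
}

toks <- c("a", "b", "c", "d", "e")
# One pass per element of n; the results are pooled into a single vector.
pooled <- unlist(lapply(2:3, function(size) ngrams_of_size(toks, size)))
print(pooled)
```

With five tokens this gives four bigrams followed by three trigrams, which is the kind of pooled result a call with n = 2:3 is meant to produce.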
Normally, these functions will be called through tokens(x, ngrams = , ...),
but they are provided in case a user wants to perform lower-level n-gram
construction on tokenized texts.
tokens_skipgrams() is a wrapper to tokens_ngrams() that requires
arguments to be supplied for both n and skip. For k-skip skip-grams, set
skip to 0:k, following the definition of skip-grams in Guthrie et al. (2006).
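The effect of passing a range of skips can be sketched in base R. The helper `sketch_skipgrams()` below is hypothetical and exists only for this illustration; it assumes a single value of n (at least 2) and lets each gap between consecutive tokens take any of the supplied skip values. It is not quanteda's implementation, and the order of its output is not meant to match tokens_skipgrams().

```r
# Hypothetical helper, not part of quanteda: skip-grams for one value of n,
# built from a plain character vector of tokens.
sketch_skipgrams <- function(toks, n = 2L, skip = 0L, concatenator = "_") {
  stopifnot(length(n) == 1L, n >= 2L)
  # Every combination of skip values across the (n - 1) gaps between tokens.
  gaps <- as.matrix(expand.grid(rep(list(skip), n - 1L)))
  out <- character(0)
  for (start in seq_along(toks)) {
    for (g in seq_len(nrow(gaps))) {
      # Positions of the tokens that make up this candidate skip-gram.
      idx <- start + c(0L, cumsum(gaps[g, ] + 1L))
      # Keep it only if it does not run past the end of the text.
      if (max(idx) <= length(toks)) {
        out <- c(out, paste(toks[idx], collapse = concatenator))
      }
    }
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
adjacent <- sketch_skipgrams(toks, n = 2, skip = 0, concatenator = " ")
one_skip <- sketch_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
extra <- setdiff(one_skip, adjacent)
print(adjacent)
print(extra)
```

Here `adjacent` holds the ordinary bigrams, while `extra` holds only the pairs that skip one token, so skip = 0:1 returns the union of both sets.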
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006.
"A Closer Look at Skip-Gram Modelling." https://aclanthology.org/L06-1210/
# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
tokens_ngrams(toks, n = c(2,4), concatenator = " ")
tokens_ngrams(toks, n = c(2,4), skip = 1, concatenator = " ")
# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")