
Create a set of ngrams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skipgrams. Both the ngram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.
tokens_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
tokens_skipgrams(x, n, skip, concatenator = "_")
x: a tokens object, a character vector, or a list of characters
n: integer vector specifying the number of elements to be concatenated in each ngram. Each element of this vector defines an n for the n-grams that are produced.
skip: integer vector specifying the adjacency skip size for tokens forming the ngrams; the default is 0, for only immediately neighbouring words. For skipgrams, skip can be a vector of integers, as the "classic" approach to forming skip-grams allows k or fewer skips when constructing the n-gram. Thus skip = 0:4 produces results that include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (where 0 skips are typical n-grams formed from adjacent words). See Guthrie et al (2006).
concatenator: character for combining words; the default is the _ (underscore) character
a tokens object consisting of a list of character vectors of ngrams, one list element per text, or a character vector if called on a simple character vector
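To make the effect of a vector-valued n concrete, here is a minimal base R sketch of contiguous n-gram formation. The helper name simple_ngrams is hypothetical and is not part of quanteda; it only mirrors the documented behaviour of n and concatenator for a single character vector, and says nothing about how the C++ implementation works or the order in which it returns results.

```r
# Hypothetical helper (not part of quanteda): contiguous n-grams in base R.
simple_ngrams <- function(tokens, n = 2L, concatenator = "_") {
  out <- character(0)
  for (size in n) {
    if (size > length(tokens)) next  # too few tokens for this n-gram size
    starts <- seq_len(length(tokens) - size + 1L)
    grams <- vapply(
      starts,
      # join `size` adjacent tokens starting at position i
      function(i) paste(tokens[i:(i + size - 1L)], collapse = concatenator),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# One pass over the tokens for both bigrams and trigrams
ngrams <- simple_ngrams(c("a", "b", "c", "d"), n = 2:3)
print(ngrams)
```

Each element of n contributes its own set of n-grams to the result, which is why n = 2:3 yields bigrams and trigrams together.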
Normally, these functions will be called through tokens(x, ngrams = , ...), but they are provided in case a user wants to perform lower-level ngram construction on tokenized texts.
tokens_skipgrams is a wrapper to tokens_ngrams that requires arguments to be supplied for both n and skip. For k-skip skipgrams, set skip to 0:k, to conform to the definition of skip-grams found in Guthrie et al (2006).
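As a rough illustration of what a vector of skips means, the base R sketch below builds skip-bigrams only (n = 2). The helper skip_bigrams is hypothetical and not part of quanteda; it assumes that a skip of s pairs each token with the token s + 1 positions later, and it does not attempt to reproduce quanteda's behaviour for n greater than 2 or its output ordering.

```r
# Hypothetical helper (not part of quanteda): skip-bigrams in base R.
skip_bigrams <- function(tokens, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (i in seq_along(tokens)) {
    for (s in skip) {
      j <- i + s + 1L  # partner position after skipping s tokens
      if (j <= length(tokens)) {
        out <- c(out, paste(tokens[i], tokens[j], sep = concatenator))
      }
    }
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")

# skip = 0:2 collects bigrams with 0, 1, and 2 skipped tokens
res <- skip_bigrams(toks, skip = 0:2, concatenator = " ")
print(res)
```

With skip = 0:2 the result contains adjacent pairs such as "insurgents killed" alongside pairs such as "insurgents ongoing", but not "insurgents fighting", which would need three skipped tokens.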
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. "A Closer Look at Skip-Gram Modelling."
library(quanteda)

# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
tokens_ngrams(toks, n = c(2, 4), concatenator = " ")
tokens_ngrams(toks, n = c(2, 4), skip = 1, concatenator = " ")
# on character
char_ngrams(letters[1:3], n = 1:3)
# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")