
Replace multi-token sequences with a multi-word, or "compound" token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with concatenator (by default, the "_" character) to form a
single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a dfm.
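As a minimal sketch of the dfm use case mentioned above (assuming the quanteda package is installed and that dfm() keeps its default of lowercasing features), compounding before building the matrix makes each phrase a single feature:

```r
library(quanteda)

txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)

# Compound the two phrases, then build a document-feature matrix.
toks_comp <- tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
dfmat <- dfm(toks_comp)

# Each compound now appears as one feature rather than two.
featnames(dfmat)
```

Without the compounding step, "united" and "kingdom" would be counted as separate features.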
tokens_compound(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  concatenator = concat(x),
  window = 0L,
  case_insensitive = TRUE,
  join = TRUE,
  apply_if = NULL
)
Value: a tokens object in which the token sequences matching pattern
have been replaced by new compounded "tokens" joined by the concatenator.
x: an input tokens object.
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
concatenator: character; the concatenation character that will connect the tokens making up a multi-token sequence.
window: integer; a vector of length 1 or 2 that specifies the size of the window of tokens adjacent to pattern that will be compounded with matches to pattern. The window can be asymmetric if two elements are specified, with the first giving the window size before pattern and the second the window size after. If paddings (empty "" tokens) are found, the window will be shrunk to exclude them.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values.
join: logical; if TRUE, join overlapping compounds into a single compound; otherwise, form these separately. See examples.
apply_if: logical vector of length ndoc(x); documents are modified only when the corresponding values are TRUE; others are left unchanged.
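The concatenator, apply_if, and window arguments described above could be combined as in the following sketch. The two-document input and the object names toks_cond and toks_win are made up for illustration, and the code assumes a quanteda version that supports apply_if:

```r
library(quanteda)

toks <- tokens(c(d1 = "The European Union met today.",
                 d2 = "The European Union voted."),
               remove_punct = TRUE)

# apply_if: compound only in d1; concatenator: join with "+" rather than "_".
toks_cond <- tokens_compound(toks, phrase("European Union"),
                             concatenator = "+",
                             apply_if = c(TRUE, FALSE))
as.list(toks_cond)

# window = c(1, 0): also absorb one token before each match of "Union".
toks_win <- tokens_compound(toks, "Union", window = c(1, 0))
as.list(toks_win)
```

In the first call only d1 is changed, because the second element of apply_if is FALSE. In the second call the window pulls in the token before "Union", so no phrase() pattern is needed.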
txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)
# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))
# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))
# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))
# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
                                      key2 = "European Union")))
# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")
# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
tokens_compound(toks, pattern = compounds, join = FALSE)
# use window to form ngrams
tokens_remove(toks, pattern = stopwords("en")) |>
  tokens_compound(pattern = "leav*", join = FALSE, window = c(0, 3))