concatenator (by default, the "_" character) to form a single "token". This ensures that the sequences will be processed subsequently as single tokens, for instance in constructing a dfm.
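To make the idea concrete, here is a minimal base-R sketch of what compounding does to a single token vector. The helper `compound_sketch()` is hypothetical and written only for illustration; it is not part of quanteda and does not reproduce the package's implementation. It assumes case-insensitive, exact matching of one sequence.

```r
# Hypothetical helper (not part of quanteda): join every occurrence of the
# token sequence `seq` found in the character vector `toks`.
compound_sketch <- function(toks, seq, concatenator = "_") {
  stopifnot(length(seq) >= 1)
  n <- length(seq)
  out <- character(0)
  i <- 1
  while (i <= length(toks)) {
    last <- i + n - 1
    if (last <= length(toks) &&
        identical(tolower(toks[i:last]), tolower(seq))) {
      # the sequence starts at position i: emit a single compound token
      out <- c(out, paste(toks[i:last], collapse = concatenator))
      i <- last + 1
    } else {
      # no match at position i: keep the token unchanged
      out <- c(out, toks[i])
      i <- i + 1
    }
  }
  out
}

toks <- c("an", "income", "tax", "and", "inheritance", "taxes")
compound_sketch(toks, c("income", "tax"))
# [1] "an"          "income_tax"  "and"         "inheritance" "taxes"
```

Because "income_tax" is now one element of the vector, any later counting step treats it as one feature rather than two.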
tokens_compound(x, sequences, concatenator = "_", valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, join = FALSE)
concatenator: "_" is highly recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed).
valuetype: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: if TRUE, ignore case when matching
join: if TRUE, join overlapped compounds
Value: the tokens in x, in which matching sequences have been replaced by compound "tokens" joined by the concatenator
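The three matching styles can be approximated in base R as follows. This is only a sketch of the general idea, assuming case-insensitive matching; it uses `utils::glob2rx()` and `grepl()` and is not how quanteda implements matching internally. The vector `types` is made up for the illustration.

```r
# Made-up token types for illustration
types <- c("tax", "taxes", "taxation", "syntax")

# "glob": a wildcard pattern, translated here to a regular expression
glob_hits <- grepl(utils::glob2rx("tax*"), types, ignore.case = TRUE)

# "regex": the pattern is used as a regular expression; without anchors
# it matches anywhere inside the token
regex_hits <- grepl("tax", types, ignore.case = TRUE)

# "fixed": the whole token must equal the pattern
fixed_hits <- tolower(types) == "tax"

data.frame(types, glob_hits, regex_hits, fixed_hits)
```

In this sketch the glob pattern "tax*" selects the first three types, the unanchored regular expression also selects "syntax", and the fixed comparison selects only "tax".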
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and inheritance taxes.")
mytoks <- tokens(mytexts, removePunct = TRUE)
# for lists of sequence elements
myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax"))
(cw <- tokens_compound(mytoks, myseqs))
dfm(cw)
# when used as a dictionary for dfm creation
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw2 <- tokens_compound(mytoks, mydict))
# to pick up "taxes" in the second text, set valuetype = "regex"
(cw3 <- tokens_compound(mytoks, mydict, valuetype = "regex"))
# dictionaries w/glob matches
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
toks <- tokens(c(txt1 = "I liked this, when we can use bad words, in awful text.",
                 txt2 = "Some damn good stuff, like the text, she likes that too."))
tokens_compound(toks, myDict)
# with collocations
collocs <- collocations("capital gains taxes are worse than inheritance taxes", size = 2:3)
toks <- tokens("The new law included capital gains taxes and inheritance taxes.")
tokens_compound(toks, collocs)