If exclusive = FALSE, then the behaviour is to apply a "thesaurus", where each value match is replaced by the dictionary key, converted to capitals if capkeys = TRUE (so that the replacements are easily distinguished from features that were terms found originally in the document).
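The difference between exclusive and thesaurus-style lookup can be sketched in base R on a plain vector of feature names. This is only an illustration: lookup_features is a hypothetical helper, not part of quanteda. It matches whole values case-insensitively, without wildcards, and takes the first matching key, whereas dfm_lookup works on a dfm and combines counts.

```r
# Hypothetical helper (not a quanteda function): replace features by dictionary keys.
# Assumes exact, case-insensitive matching and first-match-wins.
lookup_features <- function(features, dict, exclusive = TRUE, capkeys = !exclusive) {
  keys <- if (capkeys) toupper(names(dict)) else names(dict)
  out <- features
  matched <- rep(FALSE, length(features))
  for (i in seq_along(dict)) {
    hit <- tolower(features) %in% tolower(dict[[i]])
    out[hit & !matched] <- keys[i]   # replace the matched value by its key
    matched <- matched | hit
  }
  # exclusive: keep only matched features; otherwise leave the rest untouched
  if (exclusive) out[matched] else out
}

features <- c("christmas", "ruined", "opposition", "tax", "plan")
dict <- list(christmas = c("Christmas", "Santa", "holiday"),
             opposition = c("Opposition", "reject"))

thesaurus_result <- lookup_features(features, dict, exclusive = FALSE)
exclusive_result <- lookup_features(features, dict, exclusive = TRUE)
```

With exclusive = FALSE the unmatched features ("ruined", "tax", "plan") survive and the replaced ones appear as capitalised keys; with exclusive = TRUE only the keys remain.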
dfm_lookup(x, dictionary, levels = 1:5, exclusive = TRUE, valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, capkeys = !exclusive, verbose = TRUE)
exclusive: if TRUE, remove all features not in dictionary; otherwise, replace values in dictionary with keys while leaving other features unaffected
valuetype: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: if TRUE, ignore case when matching dictionary values
capkeys: if TRUE, convert dictionary keys to uppercase to distinguish them from other features
verbose: if TRUE, print status messages
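The three valuetype options read the same dictionary value in different ways. The base R sketch below shows this for the value "tax*" on a made-up vector of feature names. It assumes glob patterns behave like the anchored regular expressions produced by utils::glob2rx and that regex patterns are matched unanchored; quanteda's internal matching may differ in detail.

```r
# Made-up feature names for illustration (all lower case)
features <- c("christmas", "opposition", "tax", "taxation", "united_states", "sweden")
pattern <- "tax*"

# "glob": * is a wildcard, so the pattern means "starts with tax"
glob_hits <- features[grepl(glob2rx(pattern), features)]

# "regex": the pattern means "ta" followed by zero or more "x", anywhere in the feature
regex_hits <- features[grepl(pattern, features)]

# "fixed": the pattern must equal the whole feature, character for character
fixed_hits <- features[features == pattern]
```

Here glob_hits holds "tax" and "taxation", regex_hits additionally picks up "united_states" (through the "ta" in "states"), and fixed_hits is empty because no feature is literally "tax*".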
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             remove = stopwords("english"), verbose = FALSE)
myDfm
# glob format
dfm_lookup(myDfm, myDict, valuetype = "glob")
dfm_lookup(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*",
# because as a regex "tax*" means "ta" followed by zero or more "x" (as in "states")
dfm_lookup(myDfm, myDict, valuetype = "glob")
dfm_lookup(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
dfm_lookup(myDfm, myDict, valuetype = "fixed")
dfm_lookup(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)