
Substitute token types based on vectorized one-to-one matching. Since this function is created for lemmatization or user-defined stemming, it supports substitution of multi-word features by multi-word features, but substitution is fastest when pattern and replacement are character vectors and valuetype = "fixed", as the function only substitutes types of tokens. Please use tokens_lookup() with exclusive = FALSE to replace dictionary values; a short sketch of this appears at the end of the examples below.
tokens_replace(
x,
pattern,
replacement,
valuetype = "glob",
case_insensitive = TRUE,
verbose = quanteda_options("verbose")
)
x: tokens object whose token elements will be replaced
pattern: a character vector or list of character vectors. See pattern for more details.
replacement: a character vector or (if pattern is a list) list of character vectors of the same length as pattern
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
verbose: print status messages if TRUE
See also: tokens_lookup()
# NOT RUN {
toks1 <- tokens(data_corpus_inaugural, remove_punct = TRUE)
# lemmatization
taxwords <- c("tax", "taxing", "taxed", "taxes", "taxation")
lemma <- rep("TAX", length(taxwords))
toks2 <- tokens_replace(toks1, taxwords, lemma, valuetype = "fixed")
kwic(toks2, "TAX") %>%
    tail(10)
# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))
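# glob matching (an illustrative sketch, not part of the original
# examples): with the default valuetype = "glob", a single wildcard
# pattern can replace every matching type at once
toks_glob <- tokens("taxes taxing taxation")
tokens_replace(toks_glob, "tax*", "TAX", valuetype = "glob")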
# multi-multi substitution
toks4 <- tokens_replace(toks1, phrase(c("Supreme Court")),
                        phrase(c("Supreme Court of the United States")))
kwic(toks4, phrase(c("Supreme Court of the United States")))
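# dictionary replacement via tokens_lookup(), as suggested in the
# description above (an illustrative sketch; the toy dictionary below
# is not part of the original page)
dict <- dictionary(list(TAX = c("tax", "taxing", "taxed", "taxes", "taxation")))
toks5 <- tokens_lookup(toks1, dict, exclusive = FALSE)
kwic(toks5, "TAX") %>%
    tail(10)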
# }