These functions select or discard tokens from a tokens object. For convenience, the functions tokens_remove and tokens_keep are defined as shortcuts for tokens_select(x, pattern, selection = "remove") and tokens_select(x, pattern, selection = "keep"), respectively. The most common use of tokens_remove will be to eliminate stop words from a text or text-based object, while the most common use of tokens_select will be to select only those tokens that match a pattern from a list of regular expressions or a dictionary. startpos and endpos determine the positions of the tokens searched for pattern, and the areas affected are expanded by window.
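The shortcut relationship described above can be checked with a minimal sketch. It assumes the quanteda package is installed; the tokens object, the document name d1 and the variable names are made up for illustration.

```r
library(quanteda)

# Hypothetical toy input: one document with four tokens
toks <- as.tokens(list(d1 = c("a", "b", "c", "d")))

# The shortcut and the explicit call should select the same tokens
short_form <- tokens_remove(toks, c("b", "d"))
long_form <- tokens_select(toks, c("b", "d"), selection = "remove")

# Compare the underlying token lists
identical(as.list(short_form), as.list(long_form))
```

Comparing through as.list() keeps the check focused on the tokens themselves rather than on object metadata.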
tokens_select(
  x,
  pattern,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  padding = FALSE,
  window = 0,
  min_nchar = NULL,
  max_nchar = NULL,
  startpos = 1L,
  endpos = -1L,
  verbose = quanteda_options("verbose")
)

tokens_remove(x, ...)

tokens_keep(x, ...)
x: tokens object whose token elements will be removed or kept

pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection: whether to "keep" or "remove" the tokens matching pattern

valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
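To illustrate how the three matching types differ, here is a small sketch on a made-up tokens object (assuming quanteda is installed; the tokens and variable names are illustrative only). The token "r.n" is included so that the dot is treated as a wildcard character under "regex" but as a literal under "fixed".

```r
library(quanteda)

# Hypothetical toy input with a token that contains a literal dot
toks <- as.tokens(list(d1 = c("run", "runs", "running", "r.n", "ran")))

# glob: "*" stands for any run of characters
glob_hits <- tokens_keep(toks, "run*", valuetype = "glob")

# regex: "." matches any single character; anchors force a whole-token match
regex_hits <- tokens_keep(toks, "^r.n$", valuetype = "regex")

# fixed: the pattern is compared literally, so only "r.n" itself matches
fixed_hits <- tokens_keep(toks, "r.n", valuetype = "fixed")

as.list(glob_hits)$d1
as.list(regex_hits)$d1
as.list(fixed_hits)$d1
```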
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values

padding: if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.
window: integer of length 1 or 2; the size of the window of tokens adjacent to pattern that will be selected. The window is symmetric unless a vector of two elements is supplied, in which case the first element will be the token length of the window before pattern, and the second will be the token length of the window after pattern. The default is 0, meaning that only the pattern matched token(s) are selected, with no adjacent terms.

Terms from overlapping windows are never double-counted, but simply returned in the pattern match. This is because tokens_select never redefines the document units; for this, see kwic().
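The point about overlapping windows can be seen in a short sketch (assuming quanteda is installed; the toy tokens and variable names are illustrative only). Two adjacent matches each get a window of one token on either side, and the shared tokens appear once in the result.

```r
library(quanteda)

# Hypothetical toy input: one document with five tokens
toks <- as.tokens(list(d1 = c("a", "b", "c", "d", "e")))

# "b" and "c" are adjacent, so their one-token windows overlap
kept <- tokens_keep(toks, c("b", "c"), window = 1)

# Each token is returned once, in its original order
as.list(kept)$d1
```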
min_nchar, max_nchar: optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL for no limits. These are applied after (and hence, in addition to) any selection based on pattern matches.
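As a sketch of length-based selection, the example below keeps only tokens between two and five characters long. It assumes quanteda is installed and uses a made-up tokens object; the glob pattern "*" is used here simply so that every token matches and the length limits do all the filtering.

```r
library(quanteda)

# Hypothetical toy input with tokens of varying length
toks <- as.tokens(list(d1 = c("a", "to", "the", "token", "tokenization")))

# "*" matches every token; the nchar limits then narrow the selection
mid_length <- tokens_select(toks, "*", selection = "keep",
                            min_nchar = 2, max_nchar = 5)

as.list(mid_length)$d1
```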
startpos, endpos: integer; the positions of tokens in documents where pattern matching starts and ends, where 1 is the first token in a document. For negative indexes, counting starts at the ending token of the document, so that -1 denotes the last token in the document, -2 the second to last, etc. When the length of the vector is equal to ndoc, each document uses the position given by its corresponding element. Otherwise, only the first element in the vector is used.
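A brief sketch of positional matching follows, assuming quanteda is installed and using a made-up document. Tokens outside the startpos to endpos range are not searched, so they are dropped when keeping and left in place when removing.

```r
library(quanteda)

# Hypothetical toy input: one document with five tokens
toks <- as.tokens(list(d1 = c("a", "b", "a", "b", "a")))

# Keep everything from the second token to the second-to-last token
middle <- tokens_select(toks, "*", selection = "keep",
                        startpos = 2L, endpos = -2L)

# Remove "a" only from position 2 onward; the first "a" is not searched
tail_removed <- tokens_remove(toks, "a", startpos = 2L)

as.list(middle)$d1
as.list(tail_removed)$d1
```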
verbose: if TRUE, print messages about how many tokens were selected or removed

...: additional arguments passed by tokens_remove and tokens_keep to tokens_select. Cannot include selection.
Value: a tokens object with tokens selected or removed based on their match to pattern
# NOT RUN {
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)
# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)
# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))
# tokens_remove example: remove stopwords
# named character vector, so the documents are labelled wash1 and wash2
txt <- c(wash1 = "Fellow citizens, I am again called upon by the voice of my
                  country to execute the functions of its Chief Magistrate.",
         wash2 = "When the occasion proper for it shall arrive, I shall
                  endeavor to express the high sense I entertain of this
                  distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
# tokens_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
# }