
Last chance! 50% off unlimited learning
Sale ends in
These function select or discard tokens from a tokens object. For
convenience, the functions tokens_remove
and tokens_keep
are defined as
shortcuts for tokens_select(x, pattern, selection = "remove")
and
tokens_select(x, pattern, selection = "keep")
, respectively. The most
common usage for tokens_remove
will be to eliminate stop words from a text
or text-based object, while the most common use of tokens_select
will be to
select tokens with only positive pattern matches from a list of regular
expressions, including a dictionary. startpos
and endpos
determine the
positions of tokens searched for pattern
and areas affected are expanded by
window
.
tokens_select(
x,
pattern,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
padding = FALSE,
window = 0,
min_nchar = NULL,
max_nchar = NULL,
startpos = 1L,
endpos = -1L,
apply_if = NULL,
verbose = quanteda_options("verbose")
)tokens_remove(x, ...)
tokens_keep(x, ...)
a tokens object with tokens selected or removed based on their
match to pattern
tokens object whose token elements will be removed or kept
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
whether to "keep"
or "remove"
the tokens matching
pattern
the type of pattern matching: "glob"
for "glob"-style
wildcard expressions; "regex"
for regular expressions; or "fixed"
for
exact matching. See valuetype for details.
logical; if TRUE
, ignore case when matching a
pattern
or dictionary values
if TRUE
, leave an empty string where the removed tokens
previously existed. This is useful if a positional match is needed between
the pre- and post-selected tokens, for instance if a window of adjacency
needs to be computed.
integer of length 1 or 2; the size of the window of tokens
adjacent to pattern
that will be selected. The window is symmetric unless
a vector of two elements is supplied, in which case the first element will
be the token length of the window before pattern
, and the second will be
the token length of the window after pattern
. The default is 0
, meaning
that only the pattern matched token(s) are selected, with no adjacent
terms.
Terms from overlapping windows are never double-counted, but simply
returned in the pattern match. This is because tokens_select
never
redefines the document units; for this, see kwic()
.
optional numerics specifying the minimum and
maximum length in characters for tokens to be removed or kept; defaults are
NULL
for no limits. These are applied after (and hence, in addition to)
any selection based on pattern matches.
integer; position of tokens in documents where pattern
matching starts and ends, where 1 is the first token in a document. For
negative indexes, counting starts at the ending token of the document, so
that -1 denotes the last token in the document, -2 the second to last, etc.
When the length of the vector is equal to ndoc
, tokens in corresponding
positions will be selected; when it is less than ndoc
, values are
repeated to make them equal in length.
logical vector of length ndoc(x)
; documents are modified
only when corresponding values are TRUE
, others are left unchanged.
if TRUE
print messages about how many tokens were selected
or removed
additional arguments passed by tokens_remove
and
tokens_keep
to tokens_select
. Cannot include
selection
.
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)
# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)
# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))
# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my
country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall
endeavor to express the high sense I entertain of this
distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
# token_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
Run the code above in your browser using DataLab