Segment text into tokens, each of which is an instance of a particular ‘type’.
token_filter(map_case = TRUE, map_compat = TRUE,
map_quote = TRUE, remove_ignorable = TRUE,
stemmer = NULL, stem_except = drop,
combine = NULL,
drop_letter = FALSE, drop_mark = FALSE,
drop_number = FALSE, drop_punct = FALSE,
drop_symbol = FALSE, drop_other = FALSE,
drop = NULL, drop_except = NULL)

tokens(x, filter = token_filter())
x: object to be tokenized.

filter: filter specifying the transformation from text to token sequence, a list or token filter object.

map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for NFKC and NFKD normal forms.

map_quote: a logical value indicating whether to replace Unicode quote characters like single quotes, double quotes, and apostrophes with an ASCII single quote (').

remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

stemmer: a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".

stem_except: a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.

combine: a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.

drop_letter: a logical value indicating whether to replace "letter" tokens (cased letters, kana, ideographic, letter-like numeric characters, and other letters) with NA.

drop_mark: a logical value indicating whether to replace "mark" tokens (subscripts, superscripts, modifier letters, modifier symbols, and other marks) with NA.

drop_number: a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.

drop_punct: a logical value indicating whether to replace "punct" tokens (punctuation) with NA.

drop_symbol: a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, and other symbols) with NA.

drop_other: a logical value indicating whether to replace "other" tokens (types that do not fall into any other categories) with NA.

drop: a character vector of types to replace with NA, or NULL.

drop_except: a character vector of types to exempt from the drop rules specified by the drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, drop_other, and drop arguments, or NULL.
A list of the same length as x, with the same names. Each list
item is a character vector with the tokens for the corresponding
element of x.
Combining words

The combine property of a token_filter enables
transformations that combine two or more words into a single token. For
example, specifying combine = "new york" will
cause consecutive instances of the words new and york
to get replaced by a single token, new york.
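As a sketch of this behavior (assuming this package is attached; the comments describe what the rules above imply rather than captured output):

```r
# Combine consecutive instances of "new" and "york" into one token.
f <- token_filter(combine = "new york")
tokens("New York is in New York State.", filter = f)
# With case mapping on (the default), "New" and "York" normalize to
# "new" and "york", and each consecutive pair is merged into the
# single token "new york".
```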
tokens splits texts into token sequences. Each token is an instance
of a particular type. This operation proceeds in a series
of stages, controlled by the filter argument:
First, we segment the text into words using the boundaries
defined by
Unicode
Standard Annex #29, Section 4. We categorize each word as
"letter", "mark", "number",
"punct", "symbol", or "other" according
to the first character in the word. For words with two or
more characters that start with extenders like underscore
(_), we use the second character in the word to
categorize it, treating a second extender as a letter.
Next, we normalize the remaining words by applying the
character mappings indicated by the map_case,
map_compat, map_quote, and remove_ignorable properties.
At the end of the second stage, we have segmented
the text into a sequence of normalized words, in Unicode composed
normal form (NFC, or if map_compat is TRUE,
NFKC).
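For instance, a minimal sketch of the normalization stage, assuming the default filter (which enables all four mappings):

```r
# Default filter: case-fold, apply compatibility mappings, normalize
# quotes, and remove default-ignorable characters.
tokens("Don’t YELL", filter = token_filter())
# The curly apostrophe in "Don’t" is replaced by the ASCII quote (')
# and the uppercase letters are mapped to lowercase.
```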
In the third stage, if the stemmer property is
non-NULL, we apply the indicated stemming algorithm to
each word that does not match one of the elements of the
stem_except character vector.
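A hedged sketch of the stemming stage (the stemmed forms follow the Snowball English rules):

```r
# Stem with the Snowball English algorithm, exempting "flies".
f <- token_filter(stemmer = "english", stem_except = "flies")
tokens("The runner was running; the flies flew.", filter = f)
# "running" is stemmed (to "run" under the Snowball English rules),
# while "flies" is left unchanged because it appears in stem_except.
```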
Next, if the combine property is non-NULL,
we scan the word sequence from left to right, searching for
the longest possible match in the combine list. If
a match exists, we replace the word sequence with a single token
for that type; otherwise, we create a single-word token. See the
‘Combining words’ section below for more details. After
this stage, the sequence elements are ‘tokens’, not
‘words’.
If any of drop_letter, drop_mark,
drop_number, drop_punct, drop_symbol,
or drop_other are TRUE, we replace tokens in the
corresponding categories with NA.
(For multi-word types created by the combine step,
we take the category of the first word in the phrase.)
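As a sketch of category-based dropping (assuming this package is attached):

```r
# Drop punctuation and number tokens, keeping everything else.
f <- token_filter(drop_punct = TRUE, drop_number = TRUE)
tokens("In 1969, Apollo 11 landed.", filter = f)
# The "number" tokens ("1969", "11") and "punct" tokens (",", ".")
# are replaced with NA; the letter tokens survive.
```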
Then, if the drop property is non-NULL, we replace
tokens that match elements of this character vector with
NA. We can add exceptions to the drop rules by specifying
a non-NULL value for the drop_except property: if
drop_except is a character vector, we restore
tokens that match its elements to their values prior to
dropping.
When filter = NULL, we treat all logical properties as
FALSE and all other properties as NULL.
tokens("The quick ('brown') fox can't jump 32.3 feet, right?")
# don't normalize:
tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)
# drop common function words ('stop' words):
tokens("Able was I ere I saw Elba.",
token_filter(drop = stopwords("english")))
# drop numbers, with some exceptions:
tokens("0, 1, 2, 3, 4, 5",
token_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))