Segment text into tokens, each of which is an instance of a particular ‘type’.
token_filter(map_case = TRUE, map_compat = TRUE,
             map_quote = TRUE, remove_ignorable = TRUE,
             stemmer = NULL, stem_except = drop,
             combine = NULL,
             drop_letter = FALSE, drop_mark = FALSE,
             drop_number = FALSE, drop_punct = FALSE,
             drop_symbol = FALSE, drop_other = FALSE,
             drop = NULL, drop_except = NULL)

tokens(x, filter = token_filter())
Arguments

x: object to be tokenized.

filter: filter specifying the transformation from text to token sequence; a list or token filter object.

map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for the NFKC and NFKD normal forms.

map_quote: a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe with an ASCII single quote (').

remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
stemmer: a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".
stem_except: a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.
combine: a character vector of multi-word phrases to combine, or NULL; see 'Combining words'.
drop_letter: a logical value indicating whether to replace "letter" tokens (cased letters, kana, ideographic, letter-like numeric characters, and other letters) with NA.
drop_mark: a logical value indicating whether to replace "mark" tokens (subscripts, superscripts, modifier letters, modifier symbols, and other marks) with NA.

drop_number: a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.

drop_punct: a logical value indicating whether to replace "punct" tokens (punctuation) with NA.

drop_symbol: a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, and other symbols) with NA.

drop_other: a logical value indicating whether to replace "other" tokens (types that do not fall into any other category) with NA.
drop: a character vector of types to replace with NA, or NULL.
drop_except: a character vector of types to exempt from the drop rules specified by the drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, drop_other, and drop arguments, or NULL.
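To illustrate how these arguments fit together, here is a minimal sketch using the interface documented above (the commented output is an assumption based on the rules described in 'Details', not verified output):

f <- token_filter(stemmer = "english", drop_punct = TRUE,
                  drop = stopwords("english"))
tokens("The dogs are running!", filter = f)
# "The" and "are" are stop words and "!" is punctuation, so they become
# NA; the Snowball English stemmer typically maps "dogs" to "dog" and
# "running" to "run", giving something like: NA "dog" NA "run" NA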
Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
Combining words

The combine property of a token_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words "new" and "york" to get replaced by a single token, "new york".
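A short sketch of this behavior (the commented output assumes the default case mapping and follows from the description above):

f <- token_filter(combine = "new york")
tokens("I saw New York City", filter = f)
# "New" and "York" normalize to lowercase and then merge into a single
# token, giving something like: "i" "saw" "new york" "city"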
Details

tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:

First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "letter", "mark", "number", "punct", "symbol", or "other" according to the first character in the word. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.
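As a sketch of the extender rule (the commented output is an assumption derived from the rule just described, not verified output):

tokens("_1 __ abc", token_filter(drop_number = TRUE))
# "_1" takes its category from its second character, so it is a number
# word and gets dropped; "__" counts as a letter word:
# expected: NA "__" "abc"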
Next, we normalize the remaining words by applying the character mappings indicated by the map_case, map_compat, map_quote, and remove_ignorable properties. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC, or, if map_compat is TRUE, NFKC).
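For instance, with the default filter, curly quotes map to the ASCII single quote and uppercase letters map to lowercase (a sketch; the commented result follows from the mappings described above):

tokens("Don\u2019t \u201CStop\u201D")
# the curly apostrophe and double quotes become ' and the words are
# lowercased, giving something like: "don't" "'" "stop" "'"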
In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector.
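A sketch of stemming with an exception list (the commented results assume the usual behavior of the Snowball English stemmer):

f <- token_filter(stemmer = "english", stem_except = "jumping")
tokens("running jumping swimming", filter = f)
# "running" and "swimming" typically stem to "run" and "swim", while
# "jumping" is exempt and stays as-is: "run" "jumping" "swim"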
Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that type; otherwise, we create a single-word token. See the 'Combining words' section above for more details. After this stage, the sequence elements are 'tokens', not 'words'.
If any of drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, or drop_other are TRUE, we replace tokens in the corresponding categories with NA. (For multi-word types created by the combine step, we take the category of the first word in the phrase.)
Then, if the drop property is non-NULL, we replace tokens that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore tokens that match elements of that vector to their values prior to dropping.
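A sketch combining a category drop with a drop list and an exception (the commented output follows from the rules above):

f <- token_filter(drop_punct = TRUE, drop = "fox", drop_except = "!")
tokens("The fox ran!", filter = f)
# "fox" is dropped by the drop list, and "!" would be dropped as
# punctuation but is restored by drop_except: "the" NA "ran" "!"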
When filter = NULL, we treat all logical properties as FALSE and all other properties as NULL.
Examples

tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

# don't normalize:
tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)

# drop common function words ('stop' words):
tokens("Able was I ere I saw Elba.",
       token_filter(drop = stopwords("english")))

# drop numbers, with some exceptions:
tokens("0, 1, 2, 3, 4, 5",
       token_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))