Segment text into tokens, each of which is an instance of a particular ‘term’ (formally, a type).
Usage
text_filter(map_case = TRUE, map_compat = TRUE,
            map_dash = TRUE, map_quote = TRUE,
            remove_control = TRUE, remove_ignorable = TRUE,
            remove_space = TRUE, ignore_empty = TRUE,
            stemmer = NULL, stem_except = drop, combine = NULL,
            drop_symbol = FALSE, drop_number = FALSE,
            drop_letter = FALSE, drop_kana = FALSE,
            drop_ideo = FALSE, drop = NULL, drop_except = select,
            select = NULL)

tokens(x, filter = text_filter())
Arguments
x: object to be tokenized.
filter: filter specifying the transformation from text to token sequence, a list or text_filter object.
map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for NFKC and NFKD normal forms.
map_dash: a logical value indicating whether to replace Unicode dash characters like em dash and en dash with an ASCII dash (-).
map_quote: a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').
remove_control: a logical value indicating whether to remove non-white-space control characters (from the C0 and C1 character classes, and the delete character).
remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
remove_space: a logical value indicating whether to remove white-space characters like space and new line.
ignore_empty: a logical value indicating whether to ignore tokens which, after applying all other normalizations, are empty (containing no characters). A token can become empty if, for example, it starts as white-space.
stemmer: a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".
stem_except: a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.
combine: a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.
drop_symbol: a logical value indicating whether to replace "symbol" terms (punctuation, emoji, and other words that are not classified as "number", "letter", "kana", or "ideo") with NA.
drop_number: a logical value indicating whether to replace "number" terms (starting with numerals) with NA.
drop_letter: a logical value indicating whether to replace "letter" terms (starting with letters excluding kana and ideographic characters) with NA.
drop_kana: a logical value indicating whether to replace "kana" terms (starting with kana characters) with NA.
drop_ideo: a logical value indicating whether to replace "ideo" terms (starting with ideographic characters) with NA.
drop: a character vector of terms to replace with NA, or NULL.
drop_except: a character vector of terms to exempt from the drop rules specified by the drop_symbol, drop_number, drop_letter, drop_kana, drop_ideo, and drop arguments, or NULL. If left unspecified, drop_except is set equal to the select argument.
select: a character vector of terms to keep, or NULL; if non-NULL, tokens that are not on this list get replaced with NA.
Value
A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
Combining words
The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
Details
tokens splits texts into token sequences. Each token is an instance of a particular term (formally, type). This operation proceeds in a series of stages, controlled by the filter argument:
First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "number", "letter", "kana", "ideo", or "symbol" according to whether the first character is a numeral, letter, kana, ideographic, or other character, respectively. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.
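The categorization rule can be approximated in base R. This is a rough sketch only, not the package's implementation: word_category is a hypothetical helper, it treats underscore as the only extender, and it relies on PCRE Unicode property classes (\p{N}, \p{L}, \p{Hiragana}, \p{Katakana}, \p{Han}) as stand-ins for the categories described above.

```r
# Hypothetical helper (not part of the documented API): classify a word
# by its first character, looking past one leading underscore.
word_category <- function(word) {
  chars <- strsplit(word, "")[[1]]
  key <- chars[1]
  if (length(chars) >= 2 && key == "_") {
    # Use the second character; a second extender counts as a letter.
    key <- if (chars[2] == "_") "a" else chars[2]
  }
  if (grepl("^\\p{N}", key, perl = TRUE)) {
    "number"
  } else if (grepl("^[\\p{Hiragana}\\p{Katakana}]", key, perl = TRUE)) {
    "kana"
  } else if (grepl("^\\p{Han}", key, perl = TRUE)) {
    "ideo"
  } else if (grepl("^\\p{L}", key, perl = TRUE)) {
    "letter"
  } else {
    "symbol"
  }
}

vapply(c("32.3", "fox", "_tmp", "__", "?"), word_category, character(1))
```

Checking kana and ideographs before the general letter class matters here, because \p{L} also matches those characters.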
Next, we normalize the words by applying the character mappings indicated by the map_case, map_compat, map_dash, map_quote, remove_control, remove_ignorable, and remove_space properties. If, after normalization, a word is empty (for example, if it started out as all white-space and remove_space is TRUE), and if ignore_empty is TRUE, we delete the word from the sequence. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC, or if map_compat is TRUE, NFKC).
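A simplified picture of this stage, written in base R, is given here. It is a sketch under several assumptions: normalize_words is a hypothetical helper, tolower stands in for full Unicode case mapping, only a few dash and curly single quote characters are mapped, and the compatibility mappings, control and ignorable character removal, and NFC/NFKC normalization are omitted entirely.

```r
# Hypothetical helper (not part of the documented API): apply a subset
# of the character mappings, then discard words that became empty.
normalize_words <- function(words, map_case = TRUE, map_dash = TRUE,
                            map_quote = TRUE, remove_space = TRUE,
                            ignore_empty = TRUE) {
  if (map_case) words <- tolower(words)                  # approximate case mapping
  if (map_dash) words <- gsub("[\u2013\u2014]", "-", words)   # en and em dash
  if (map_quote) words <- gsub("[\u2018\u2019]", "'", words)  # curly single quotes
  if (remove_space) words <- gsub("[[:space:]]+", "", words)
  if (ignore_empty) words <- words[nzchar(words)]        # delete empty words
  words
}

normalize_words(c("The", " ", "QUICK", " ", "can\u2019t"))
```

With ignore_empty = FALSE, the white-space words would remain in the sequence as empty strings rather than being deleted.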
In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector.
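The exemption logic can be illustrated as follows. Both stem_words and toy_stemmer are hypothetical names introduced for this sketch; toy_stemmer merely strips a trailing "s" and is not one of the Snowball algorithms.

```r
# Hypothetical helper: stem every word except those listed in stem_except.
stem_words <- function(words, stem_fn, stem_except = NULL) {
  exempt <- words %in% stem_except
  words[!exempt] <- stem_fn(words[!exempt])
  words
}

# Toy stand-in for a real stemming algorithm (illustration only).
toy_stemmer <- function(w) sub("s$", "", w)

stem_words(c("cats", "runs", "mass"), toy_stemmer, stem_except = "mass")
```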
Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that term; otherwise, we create a single-word token. See the ‘Combining words’ section for more details. After this stage, the sequence elements are ‘tokens’, not ‘words’.
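The left-to-right, longest-match scan might look like the following in base R. This is a minimal sketch, not the package's implementation: combine_words is a hypothetical helper, and it assumes each phrase in combine is a string of words separated by single spaces.

```r
# Hypothetical helper: greedily merge the longest matching phrase at each
# position; words that start no match become single-word tokens.
combine_words <- function(words, combine = NULL) {
  if (is.null(combine)) return(words)
  phrases <- strsplit(combine, " ", fixed = TRUE)
  # Try longer phrases first so that the longest match wins.
  phrases <- phrases[order(lengths(phrases), decreasing = TRUE)]
  tokens <- character(0)
  i <- 1
  while (i <= length(words)) {
    matched <- 1
    for (p in phrases) {
      n <- length(p)
      if (i + n - 1 <= length(words) &&
          identical(words[i:(i + n - 1)], p)) {
        matched <- n
        break
      }
    }
    tokens <- c(tokens, paste(words[i:(i + matched - 1)], collapse = " "))
    i <- i + matched
  }
  tokens
}

combine_words(c("i", "love", "new", "york", "city"),
              c("new york", "new york city"))
```

In this call both phrases match at the word new, and the scan prefers the three-word phrase because it is the longer one.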
If any of drop_symbol, drop_number, drop_letter, drop_kana, or drop_ideo are TRUE, we replace the terms in the corresponding categories with NA. (For multi-word terms, we take the category of the first word in the phrase.) Then, if the drop property is non-NULL, we replace terms that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore terms that match elements of the vector to their values prior to dropping.
Finally, if select is non-NULL, we replace terms that do not match elements of this character vector with NA.
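The order in which the drop, drop_except, and select rules interact can be sketched in base R. The helper apply_drop_rules is hypothetical, and of the category rules it handles only drop_number, using a crude leading-digit test in place of the real categorization.

```r
# Hypothetical helper: drop terms, restore the exceptions, then apply select.
apply_drop_rules <- function(terms, drop_number = FALSE, drop = NULL,
                             drop_except = select, select = NULL) {
  out <- terms
  if (drop_number) out[grepl("^[0-9]", terms)] <- NA  # crude "number" test
  if (!is.null(drop)) out[terms %in% drop] <- NA
  if (!is.null(drop_except)) {
    # Restore exempted terms to their values prior to dropping.
    keep <- terms %in% drop_except
    out[keep] <- terms[keep]
  }
  if (!is.null(select)) out[!(terms %in% select)] <- NA
  out
}

apply_drop_rules(c("0", "1", "2", "3", "4", "5"),
                 drop_number = TRUE, drop_except = c("0", "2", "4"))
```

As in the documented signature, drop_except defaults to select, so a term listed in select survives the drop rules unless drop_except is set explicitly.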
When filter = NULL, we treat all logical properties as FALSE and all other properties as NULL.
Examples
tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

# don't normalize:
tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)

# drop common function words ('stop' words):
tokens("Able was I ere I saw Elba.",
       text_filter(drop = stopwords("english")))

# drop numbers, with some exceptions:
tokens("0, 1, 2, 3, 4, 5",
       text_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))