Segment text into tokens, each of which is an instance of a particular ‘term’ (formally, a type).
Usage

text_filter(map_case = TRUE, map_compat = TRUE,
    map_dash = TRUE, map_quote = TRUE,
    remove_control = TRUE, remove_ignorable = TRUE,
    remove_space = TRUE, ignore_empty = TRUE,
    stemmer = NULL, stem_except = drop, combine = NULL,
    drop_symbol = FALSE, drop_number = FALSE,
    drop_letter = FALSE, drop_kana = FALSE,
    drop_ideo = FALSE, drop = NULL, drop_except = select,
    select = NULL)

tokens(x, filter = text_filter())
Arguments

x: object to be tokenized.

filter: filter specifying the transformation from text to token sequence; a list or a text_filter object.

map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for the NFKC and NFKD normal forms.

map_dash: a logical value indicating whether to replace Unicode dash characters like em dash and en dash with an ASCII dash (-).

map_quote: a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe with an ASCII single quote (').

remove_control: a logical value indicating whether to remove non-white-space control characters (from the C0 and C1 character classes, and the delete character).

remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

remove_space: a logical value indicating whether to remove white-space characters like space and newline.

ignore_empty: a logical value indicating whether to ignore tokens which, after applying all other normalizations, are empty (containing no characters). A token can become empty if, for example, it starts as white-space.

stemmer: a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".

stem_except: a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.

combine: a character vector of multi-word phrases to combine, or NULL; see 'Combining words'.

drop_symbol: a logical value indicating whether to replace "symbol" terms (punctuation, emoji, and other words that are not classified as "number", "letter", "kana", or "ideo") with NA.

drop_number: a logical value indicating whether to replace "number" terms (starting with numerals) with NA.

drop_letter: a logical value indicating whether to replace "letter" terms (starting with letters, excluding kana and ideographic characters) with NA.

drop_kana: a logical value indicating whether to replace "kana" terms (starting with kana characters) with NA.

drop_ideo: a logical value indicating whether to replace "ideo" terms (starting with ideographic characters) with NA.

drop: a character vector of terms to replace with NA, or NULL.

drop_except: a character vector of terms to exempt from the drop rules specified by the drop_symbol, drop_number, drop_letter, drop_kana, drop_ideo, and drop arguments, or NULL. If left unspecified, drop_except is set equal to the select argument.

select: a character vector of terms to keep, or NULL; if non-NULL, tokens that are not on this list get replaced with NA.
Value

A list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.
Combining words

The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.
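As a small sketch of this behavior (assuming the tokens() and text_filter() interface documented above):

```r
# Combine consecutive instances of "new" and "york" into one token.
# The filter also lowercases the input (map_case = TRUE by default),
# so "New York" matches the combine phrase "new york".
f <- tokens("I visited New York City.",
            text_filter(combine = "new york"))
# the result should contain a single "new york" token followed by "city"
```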
Details

tokens splits texts into token sequences. Each token is an instance of a particular term (formally, a type). This operation proceeds in a series of stages, controlled by the filter argument:
First, we segment the text into words using the boundaries
defined by
Unicode
Standard Annex #29, Section 4. We categorize each word as
"number", "letter", "kana", "ideo", or
"symbol" according to whether the first character is a
numeral, letter, kana, ideographic, or other character,
respectively. For words with two or more characters that start
with extenders like underscore (_), we use the second
character in the word to categorize it, treating a second
extender as a letter.
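The word categories are not reported directly, but they drive the drop_* filters described below, which makes them easy to observe. A brief illustration (assuming the interface above; the exact segmentation follows the UAX #29 word-boundary rules):

```r
# Punctuation runs like "..." and "!" are categorized as "symbol";
# dropping that category replaces each such token with NA
tokens("well... okay!", text_filter(drop_symbol = TRUE))
# "well" and "okay" survive; the punctuation tokens should be NA
```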
Next, we normalize the words by applying the
character mappings indicated by the map_case,
map_compat, map_dash, map_quote,
remove_control, remove_ignorable, and
remove_space properties. If, after normalization, a
word is empty (for example, if it started out as all white-space
and remove_space is TRUE), and if
ignore_empty is TRUE, we delete the word from
the sequence. At the end of the second stage, we have segmented
the text into a sequence of normalized words, in Unicode composed
normal form (NFC, or if map_compat is TRUE,
NFKC).
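The normalization properties can be toggled individually. For example (a sketch, assuming the interface above):

```r
# Normalize case and quotes, but keep everything else as-is.
# "\u201c" and "\u201d" are curly double quotes.
f <- text_filter(map_case = TRUE, map_quote = TRUE)
tokens("\u201cCurly\u201d Quotes", f)
# the curly quotes should be replaced by ASCII single quotes (')
# and "Quotes" lowercased to "quotes"
```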
In the third stage, if the stemmer property is
non-NULL, we apply the indicated stemming algorithm to
each word that does not match one of the elements of the
stem_except character vector.
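For example, to stem English words while protecting a term of interest (a sketch, assuming the interface above):

```r
# Apply the Snowball English stemmer, but exempt "running" from stemming
f <- text_filter(stemmer = "english", stem_except = "running")
tokens("running runner runs", f)
# "runs" should be reduced to its stem, while "running" is left unchanged
```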
Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that term; otherwise, we create a single-word token. See the 'Combining words' section for more details. After this stage, the sequence elements are 'tokens', not 'words'.
If any of drop_symbol, drop_number,
drop_letter, drop_kana, or drop_ideo
are TRUE, we replace the terms in the
corresponding categories by NA. (For multi-word terms,
we take the category of the first word in the phrase.)
Then, if the drop property is non-NULL, we replace terms that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore terms that match elements of this vector to their values prior to dropping.
Finally, if select is non-NULL, we replace
terms that do not match elements of this character vector
with NA.
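The select stage acts as a whitelist, complementing the drop rules. A sketch (assuming the interface above):

```r
# Keep only the terms of interest; every other token becomes NA
f <- text_filter(select = c("fox", "jump"))
tokens("The quick brown fox can jump.", f)
# all tokens other than "fox" and "jump" should be NA
```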
When filter = NULL, we treat all logical properties as
FALSE and all other properties as NULL.
Examples

tokens("The quick ('brown') fox can't jump 32.3 feet, right?")

# don't normalize:
tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)

# drop common function words ('stop' words):
tokens("Able was I ere I saw Elba.",
       text_filter(drop = stopwords("english")))

# drop numbers, with some exceptions:
tokens("0, 1, 2, 3, 4, 5",
       text_filter(drop_number = TRUE, drop_except = c("0", "2", "4")))