Segment text into tokens, each of which is an instance of a particular ‘type’.
text_tokens(x, filter = text_filter(x))
text_ntoken(x, filter = text_filter(x))
text_length(x, filter = text_filter(x))
x: object to be tokenized.
filter: filter specifying the transformation from text to token sequence.
text_tokens returns a list of the same length as x, with the same names. Each list item is a character vector with the tokens for the corresponding element of x.

text_ntoken returns a numeric vector the same length as x, with each element giving the number of non-dropped tokens in the corresponding text. text_length is similar, but includes dropped tokens in the result.
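For instance, a minimal sketch of the three return shapes (exact tokens depend on the default filter):

# list of character vectors, one per input element, names preserved
text_tokens(c(first = "One two.", second = "Three"))
# numeric vector counting non-dropped tokens
text_ntoken(c(first = "One two.", second = "Three"))
# numeric vector counting dropped tokens as well
text_length(c(first = "One two.", second = "Three"))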
We use the stemming algorithms provided by the Snowball library. These algorithms are also available in the SnowballC R package. Unlike that package, we provide the ability to exempt certain words from stemming, using the stem_except argument; see the examples below. If you do not specify the stem_except argument, then we set its value equal to the drop argument.
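For example, a sketch of the exemption (assuming the Snowball "english" stemmer reduces "running" to "run"):

# stem everything:
text_tokens("running trees", text_filter(stemmer = "english"))
# exempt "trees" from stemming:
text_tokens("running trees",
    text_filter(stemmer = "english", stem_except = "trees"))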
We also exempt from stemming any case that would turn internal punctuation, like the full stop in "u.s", into boundary punctuation, like the full stop at the end of "u."; otherwise, in examples like this, the stemming procedure would turn single-word tokens into multi-word tokens (compare text_tokens("u.s") with text_tokens("u.")). For English, this likely only affects words ending in ".s".
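You can run the comparison mentioned above directly:

text_tokens("u.s")  # internal full stop; a single word per the segmentation rules
text_tokens("u.")   # boundary full stop; may split off, depending on the combine list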
The combine property of a text_filter enables transformations that combine two or more words into a single token. For example, specifying combine = "new york" will cause consecutive instances of the words new and york to get replaced by a single token, new york.

By default, we set combine = abbreviations("english"), so that abbreviations like "Ms." get treated as single tokens; with combine = NULL, trailing punctuation gets split off, and "Ms." gets tokenized into two tokens, "Ms" and ".".
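A sketch of the first behavior (with the default case mapping, matching happens on the normalized words):

text_tokens("New York", text_filter(combine = "new york"))
# expected: a single token, "new york"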
text_tokens splits texts into token sequences. Each token is an instance of a particular type. This operation proceeds in a series of stages, controlled by the filter argument:
First, we segment the text into words using the boundaries defined by Unicode Standard Annex #29, Section 4. We categorize each word as "letter", "number", "punct", or "symbol" according to the first character in the word. For words with two or more characters that start with extenders like underscore (_), we use the second character in the word to categorize it, treating a second extender as a letter.
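These categories drive the category-based dropping described below; for example:

# "," and "!" fall in the "punct" category:
text_tokens("one, two!", text_filter(drop_punct = TRUE))
# the punctuation tokens come back as NA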
Next, we normalize the remaining words by applying the character mappings indicated by the map_case, map_quote, and remove_ignorable properties. At the end of the second stage, we have segmented the text into a sequence of normalized words, in Unicode composed normal form (NFC).
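For example, the case mapping applied in this stage can be disabled (the examples below do the same for map_quote as well):

text_tokens("The Fox")                                 # case-folded by default
text_tokens("The Fox", text_filter(map_case = FALSE))  # capitals preserved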
In the third stage, if the stemmer property is non-NULL, we apply the indicated stemming algorithm to each word that does not match one of the elements of the stem_except character vector. See the ‘Stemming’ section below for more information.
Next, if the combine property is non-NULL, we scan the word sequence from left to right, searching for the longest possible match in the combine list. If a match exists, we replace the word sequence with a single token for that type; otherwise, we create a single-word token. See the ‘Combining words’ section below for more details. After this stage, the sequence elements are ‘tokens’, not ‘words’.
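The longest-match rule means a longer phrase wins over a shorter prefix; a sketch:

f <- text_filter(combine = c("new york", "new york city"))
text_tokens("New York City", f)
# expected: one token for "new york city", not "new york" plus "city"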
If any of drop_letter, drop_number, drop_punct, or drop_symbol are TRUE, then we replace tokens in the corresponding categories with NA. (For multi-word types created by the combine step, we take the category of the first word in the phrase.) Then, if the drop property is non-NULL, we replace tokens that match elements of this character vector with NA. We can add exceptions to the drop rules by specifying a non-NULL value for the drop_except property: if drop_except is a character vector, then we restore tokens that match elements of this vector to their values prior to dropping.
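A sketch of dropping with an exception (see also the examples below):

text_tokens("0, 1, 2",
    text_filter(drop_number = TRUE, drop_except = "2"))
# "0" and "1" become NA; "2" is restored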
When filter = NULL, we treat all logical properties as FALSE and all other properties as NA or NULL.
Terms specified by the stem_except, combine, drop, and drop_except properties need to be stemmed (unless stemmer is NULL), but they do not need to be normalized. We normalize the argument values in the manner specified by map_case, map_quote, and remove_ignorable. Thus, for example, if map_case = TRUE, then a token filter with combine = "Mx." produces the same results as a token filter with combine = "mx.".
See also: text_split, text_types, abbreviations, stopwords, term_matrix.
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?")
# count non-dropped tokens:
text_ntoken("The quick ('brown') fox can't jump 32.3 feet, right?")
# count dropped and non-dropped tokens:
text_length("The quick ('brown') fox can't jump 32.3 feet, right?")
# don't change case or quotes:
f <- text_filter(map_case = FALSE, map_quote = FALSE)
text_tokens("The quick ('brown') fox can't jump 32.3 feet, right?", f)
# drop common function words ('stop' words):
text_tokens("Able was I ere I saw Elba.",
text_filter(drop = stopwords("english")))
# drop numbers, with some exceptions:
text_tokens("0, 1, 2, 3, 4, 5",
text_filter(drop_number = TRUE,
drop_except = c("0", "2", "4")))
# apply stemming...
text_tokens("Mary is running", text_filter(stemmer = "english"))
# ...except for certain words
text_tokens("Mary is running",
text_filter(stemmer = "english", stem_except = "mary"))
# combine abbreviations by default
text_tokens("Ms. Jones")
# disable default combinations
text_tokens("Ms. Jones", text_filter(combine = NULL))
# add new combinations
text_tokens("Ms. Jones is from New York City, New York.",
text_filter(combine = c(abbreviations("english"),
"new york", "new york city")))