text_filter(fold_case = TRUE, fold_dash = TRUE,
            fold_quote = TRUE, map_compatible = TRUE,
            remove_control = TRUE, remove_ignorable = TRUE,
            remove_whitespace = TRUE, drop_empty = TRUE,
            stemmer = NULL)

tokens(x, filter = text_filter())
Set stemmer to NULL to disable stemming. The stemming
algorithms are provided by Snowball
(http://snowballstem.org/algorithms/);
the following stemming algorithms are available:
arabic, danish, dutch, english, finnish, french,
german, hungarian, italian, norwegian, porter, portuguese,
romanian, russian, spanish, swedish, tamil, and turkish.
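As a hedged illustration of the stemmer argument, the sketch below tokenizes one sentence with stemming disabled and again with the english stemmer. It assumes that tokens() and text_filter() come from the corpus package, at a version matching the signatures shown on this page. The library() call and the sample text are assumptions, not part of the original documentation.

```r
# Assumption: tokens() and text_filter() are provided by the 'corpus' package,
# at a version matching the signatures documented on this page.
library(corpus)

# Hypothetical sample input: a named character vector with one element.
x <- c(a = "The runners were running quickly.")

# Default filter: stemmer = NULL, so no stemming is applied.
plain <- tokens(x, filter = text_filter())

# Same filter settings, but with the "english" stemmer enabled.
stemmed <- tokens(x, filter = text_filter(stemmer = "english"))

print(plain)
print(stemmed)
```

Comparing the two printed lists should show which tokens the stemmer reduced. The exact stems depend on the algorithm chosen.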
A list corresponding to x, with the same names. Each list
item is a character vector with the tokens for the corresponding
element of x.

tokens
splits text at the word boundaries defined by
http://unicode.org/reports/tr29/#Word_Boundaries,
normalizes the text to Unicode NFC normal form, and then applies
a series of further transformations to the resulting tokens
as specified by the filter argument. To skip the additional
transformation step, specify filter = NULL.

See also: sentences.

tokens("The quick ('brown') fox can't jump 32.3 feet, right?")
# don't normalize:
tokens("The quick ('brown') fox can't jump 32.3 feet, right?", NULL)
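The filter can also be adjusted one option at a time. The following is a minimal sketch, assuming tokens() and text_filter() come from the corpus package at a version matching the signatures on this page. The input vector and the variable names keep_case and toks are hypothetical. It turns off case folding while leaving the other transformations at their defaults, then inspects the shape of the result.

```r
# Assumption: tokens() and text_filter() are provided by the 'corpus' package,
# at a version matching the signatures documented on this page.
library(corpus)

# Hypothetical named input; the result is documented to keep these names.
x <- c(first = "The QUICK fox.", second = "Can't jump 32.3 feet!")

# Keep the original case; every other option stays at its default (TRUE).
keep_case <- text_filter(fold_case = FALSE)
toks <- tokens(x, filter = keep_case)

# Each list item is a character vector of tokens for one element of x.
str(toks)
```

Passing a modified text_filter() is the middle ground between the default filter and filter = NULL, which skips the additional transformations entirely.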