Regexp_Tokenizer() creates regexp span tokenizers which use the given
pattern and ... arguments to match tokens or separators between tokens
via gregexpr(), and then transform the results of this into character
spans of the tokens found. The given description is currently kept as
an attribute.

whitespace_tokenizer()
tokenizes by treating any sequence of
whitespace characters as a separator.
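The match-separators-then-invert step can be sketched outside R as well. The following Python sketch (illustrative only; the package itself works in R via gregexpr(), and the helper name whitespace_span_tokenize is made up here) matches every run of whitespace and takes the character spans between matches as the token spans:

```python
import re

def whitespace_span_tokenize(text):
    """Compute (start, end) character spans of tokens, treating any
    run of whitespace as a separator: match the separators, then
    invert the matches into the spans of the text between them."""
    spans = []
    pos = 0
    for m in re.finditer(r"\s+", text):
        if m.start() > pos:          # text before this separator is a token
            spans.append((pos, m.start()))
        pos = m.end()
    if pos < len(text):              # trailing token after the last separator
        spans.append((pos, len(text)))
    return spans

s = "First sentence.  Second sentence."
spans = whitespace_span_tokenize(s)
tokens = [s[a:b] for a, b in spans]
# spans  -> [(0, 5), (6, 15), (17, 23), (24, 33)]
# tokens -> ['First', 'sentence.', 'Second', 'sentence.']
```

Keeping spans rather than the token strings themselves is what lets the tokens be mapped back to positions in the original text.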

blankline_tokenizer() tokenizes by treating any sequence of blank
lines as a separator.
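The same inversion logic works for any separator pattern, which is the generality Regexp_Tokenizer() provides. A Python sketch (the function name span_tokenize and the blank-line regex are assumptions for illustration; the package's actual pattern is not shown here) parameterized on the separator pattern:

```python
import re

def span_tokenize(text, sep_pattern):
    """Character spans of the stretches of text between matches of
    sep_pattern (separator matches inverted into token spans)."""
    spans, pos = [], 0
    for m in re.finditer(sep_pattern, text):
        if m.start() > pos:
            spans.append((pos, m.start()))
        pos = m.end()
    if pos < len(text):
        spans.append((pos, len(text)))
    return spans

text = "Para one.\n\nPara two.\nStill para two.\n\n\nPara three."
# Assumed blank-line separator: a newline followed by one or more
# (possibly whitespace-only) empty lines.
spans = span_tokenize(text, r"\n(?:[ \t]*\n)+")
paras = [text[a:b] for a, b in spans]
# paras -> ['Para one.', 'Para two.\nStill para two.', 'Para three.']
```

Note that the single newline inside the second paragraph is not a separator; only runs of blank lines split the text.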

wordpunct_tokenizer() tokenizes by matching sequences of alphabetic
characters and sequences of (non-whitespace) non-alphabetic
characters.
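Unlike the two tokenizers above, this one matches the tokens directly rather than the separators. A Python sketch of that idea (the helper name is illustrative, and the ASCII-only character class [A-Za-z] is a simplifying assumption):

```python
import re

def wordpunct_span_tokenize(text):
    """Token spans found by matching tokens directly: runs of
    alphabetic characters, or runs of non-whitespace
    non-alphabetic characters (punctuation, digits, symbols)."""
    return [(m.start(), m.end())
            for m in re.finditer(r"[A-Za-z]+|[^A-Za-z\s]+", text)]

s = "Hello, world!"
tokens = [s[a:b] for a, b in wordpunct_span_tokenize(s)]
# tokens -> ['Hello', ',', 'world', '!']
```

Because tokens are matched directly, punctuation attached to a word comes out as its own token, which whitespace-separator tokenization would not do.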