Regexp_Tokenizer() creates regexp tokenizers which use the
given pattern and ... arguments to match tokens or
separators between tokens via gregexpr(), and then
transform the match results into character spans of the
tokens found. The given description is currently kept as an
attribute.

whitespace_tokenizer() tokenizes by treating any sequence of
whitespace characters as a separator.
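The span-based behavior described above can be sketched in Python (a rough analogue, not the package's own R code; the function name whitespace_spans is illustrative):

```python
import re

def whitespace_spans(text):
    # Treat any run of whitespace as a separator and return the
    # (start, end) character spans of the remaining tokens,
    # mirroring the span-based output described above.
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

print(whitespace_spans("ab  cd"))
```

Here the spans index into the original string, so the tokens themselves can be recovered by slicing.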
blankline_tokenizer() tokenizes by treating any sequence of
blank lines as a separator.
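A Python analogue of splitting on blank-line runs might look as follows (the exact separator pattern, allowing spaces or tabs on otherwise blank lines, is an assumption):

```python
import re

def blankline_chunks(text):
    # Treat any run of blank lines (possibly containing spaces or
    # tabs) as a separator between chunks; drop empty chunks.
    return [c for c in re.split(r"\n[ \t]*(?:\n[ \t]*)+", text) if c]

print(blankline_chunks("para one\nstill one\n\npara two\n\n\npara three"))
```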
wordpunct_tokenizer() tokenizes by matching sequences of
alphabetic characters and sequences of (non-whitespace) non-alphabetic
characters.
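The alternation just described (alphabetic runs, or runs that are neither alphabetic nor whitespace) can be sketched in Python; the ASCII character class is a simplifying assumption, as the R tokenizer would use locale-aware classes such as [[:alpha:]]:

```python
import re

# Match a run of alphabetic characters, or a run of characters
# that are neither alphabetic nor whitespace (punctuation, digits).
WORDPUNCT = re.compile(r"[A-Za-z]+|[^A-Za-z\s]+")

def wordpunct_tokens(text):
    return WORDPUNCT.findall(text)

print(wordpunct_tokens("Hello, world! It's 2024."))
```

Note how "It's" splits into three tokens (alphabetic, apostrophe, alphabetic), while "2024." stays together as one non-alphabetic run.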