Regexp_Tokenizer() creates regexp span tokenizers that use the
  given pattern and ... arguments to match tokens or
  separators between tokens via gregexpr(), and then
  transform these matches into character spans of the tokens
  found.
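Since the matches carry start positions and lengths, turning them into
character spans is mechanical. A minimal Python sketch of the
token-matching mode, using re.finditer() in place of gregexpr() (the
function name is illustrative, not part of the package's API; note that
Python spans are 0-based and half-open, whereas R's are 1-based):

```python
import re

def regexp_span_tokenize(s, pattern):
    # Return the (start, end) character span of every token that the
    # pattern matches directly in s.
    return [m.span() for m in re.finditer(pattern, s)]

spans = regexp_span_tokenize("Hello world", r"\S+")
# spans -> [(0, 5), (6, 11)]
```

In the separator-matching mode, the spans of the tokens would instead be
the gaps between consecutive matches.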
whitespace_tokenizer() tokenizes by treating any sequence of
  whitespace characters as a separator.
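Equivalent behavior can be sketched in Python by splitting on runs of
whitespace (again, the function name is illustrative):

```python
import re

def whitespace_tokenize(s):
    # Any run of whitespace acts as a separator; drop the empty
    # strings produced by leading or trailing whitespace.
    return [t for t in re.split(r"\s+", s) if t]

whitespace_tokenize("a\tb  c\n")  # -> ['a', 'b', 'c']
```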
blankline_tokenizer() tokenizes by treating any sequence of
  blank lines as a separator.
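A Python sketch of the same idea, treating a blank line as a line
containing only whitespace (the exact regexp is an assumption, not the
package's internal pattern):

```python
import re

def blankline_tokenize(s):
    # One or more consecutive blank (whitespace-only) lines separate
    # the chunks; discard chunks that are themselves empty.
    chunks = re.split(r"\n[ \t]*(?:\n[ \t]*)+", s)
    return [c for c in chunks if c.strip()]

blankline_tokenize("para one\n\npara two")  # -> ['para one', 'para two']
```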
wordpunct_tokenizer() tokenizes by matching sequences of
  alphabetic characters and sequences of (non-whitespace) non-alphabetic
  characters.
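This alternation can be sketched in Python with a two-branch pattern;
the ASCII character class is a simplifying assumption (the actual
tokenizer's notion of "alphabetic" may be locale- or Unicode-aware):

```python
import re

def wordpunct_tokenize(s):
    # A token is either a run of alphabetic characters or a run of
    # non-whitespace, non-alphabetic characters (punctuation, digits).
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]+", s)

wordpunct_tokenize("Hello, world!")  # -> ['Hello', ',', 'world', '!']
```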