This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+
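As a rough illustration of what this pattern matches, the sketch below applies the same regex with Python's standard `re` module. It is not the package's implementation, and the helper name `whitespace_split` is hypothetical. It assumes Unicode-aware `\w`, which is Python's default for `str` patterns.

```python
import re

# Same pattern as quoted above: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_split(text):
    """Return (piece, (start, end)) pairs, one per regex match in text.

    Hypothetical helper for illustration only; whitespace itself is
    never matched, so it is dropped from the output.
    """
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = whitespace_split("Hello, world!! It's 2024.")
print([piece for piece, _ in pieces])
# → ['Hello', ',', 'world', '!!', 'It', "'", 's', '2024', '.']
```

Note that consecutive punctuation such as `!!` stays together as one piece, while an apostrophe inside a word splits it into three pieces, because `'` is not a word character.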
Super class: tok::tok_pre_tokenizer -> tok_pre_tokenizer_whitespace
Other pre_tokenizer: pre_tokenizer, pre_tokenizer_byte_level