
Tokenizers using regular expressions to match either tokens or separators between tokens.
Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)
pattern: a character string giving the regular expression to use for matching.
invert: a logical indicating whether to match separators between tokens.
...: further arguments to be passed to gregexpr().
meta: a named or empty list of tokenizer metadata tag-value pairs.
Regexp_Tokenizer() returns the created regexp span tokenizer.
blankline_tokenizer(), whitespace_tokenizer() and wordpunct_tokenizer() return the spans of the tokens found in s.
Regexp_Tokenizer() creates regexp span tokenizers which use the given pattern and ... arguments to match tokens or separators between tokens via gregexpr(), and then transform the match results into character spans of the tokens found.
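For illustration, a minimal sketch of defining custom tokenizers with Regexp_Tokenizer(); the patterns and example text here are invented for this sketch and are not part of the package:

# Match runs of digits as tokens (invert = FALSE, the default).
library("NLP")
digit_tokenizer <- Regexp_Tokenizer("[[:digit:]]+")
s <- String("Call 555 0199 before 5 pm.")
digit_tokenizer(s)        # spans of the digit runs
s[digit_tokenizer(s)]     # the matched tokens themselves

# With invert = TRUE the pattern describes the separators instead,
# here commas followed by optional whitespace.
csv_tokenizer <- Regexp_Tokenizer(",[[:space:]]*", invert = TRUE)
csv_tokenizer(String("red, green, blue"))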
whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.
blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.
wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
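As a rough sketch, the three predefined tokenizers behave like Regexp_Tokenizer() calls along the following lines; the patterns are illustrative approximations of the documented behaviour, not necessarily the exact expressions used internally by the package:

library("NLP")

# Separator-based: any run of whitespace separates tokens.
my_whitespace_tokenizer <- Regexp_Tokenizer("[[:space:]]+", invert = TRUE)

# Separator-based: a run of blank lines separates tokens (approximate pattern).
my_blankline_tokenizer <- Regexp_Tokenizer("\n[[:blank:]]*\n", invert = TRUE)

# Match-based: runs of alphabetic characters, or runs of
# non-whitespace non-alphabetic characters.
my_wordpunct_tokenizer <- Regexp_Tokenizer("[[:alpha:]]+|[^[:alpha:][:space:]]+")

s <- String("Hello, world!  Bye.")
s[my_whitespace_tokenizer(s)]   # "Hello," "world!" "Bye."
s[my_wordpunct_tokenizer(s)]    # "Hello" "," "world" "!" "Bye" "."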
See Span_Tokenizer() for general information on span tokenizer objects.
library("NLP")

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

## Spans of whitespace-separated tokens.
spans <- whitespace_tokenizer(s)
spans
s[spans]

## Spans of word and punctuation tokens.
spans <- wordpunct_tokenizer(s)
spans
s[spans]