
NLP (version 0.1-6)

tokenizers: Regexp tokenizers

Description

Tokenizers using regular expressions to match either tokens or separators between tokens.

Usage

Regexp_Tokenizer(pattern, description = NULL, invert = FALSE, ...)
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)

Arguments

pattern
a character string giving the regular expression to use for matching.
description
a character string describing the tokenizer, or NULL (default).
invert
a logical indicating whether to match separators between tokens.
...
further arguments to be passed to gregexpr().
s
a String object, or something coercible to a String using as.String() (e.g., a character string with appropriate encoding information).

Value

  • Regexp_Tokenizer() returns the created regexp tokenizer.

  • blankline_tokenizer(), whitespace_tokenizer(), and wordpunct_tokenizer() return the spans of the tokens found in s.

Details

Regexp_Tokenizer() creates regexp tokenizers that use the given pattern and ... arguments to match tokens, or the separators between tokens, via gregexpr(), and then transform the match results into the character spans of the tokens found. The given description is currently kept as an attribute.
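For example, a minimal sketch of a custom tokenizer created this way; the pattern, tokenizer name, and sample text below are illustrative only, not part of the package:

## Hypothetical tokenizer matching runs of digits.
number_tokenizer <- Regexp_Tokenizer("[[:digit:]]+",
                                     description = "matches runs of digits")
x <- as.String("call 555 0199 now")
number_tokenizer(x)      ## spans of the digit runs
x[number_tokenizer(x)]   ## the tokens themselves: "555" "0199"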

whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.
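Conceptually, such a separator-based tokenizer can be obtained with invert = TRUE. A sketch (the pattern here is assumed for illustration and need not be the package's internal definition):

## Hypothetical whitespace tokenizer: match the separators, return
## the spans between them.
my_ws_tokenizer <- Regexp_Tokenizer("[[:space:]]+", invert = TRUE)
y <- as.String("a b  c")
my_ws_tokenizer(y)      ## spans of "a", "b", "c"
y[my_ws_tokenizer(y)]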

blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.
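A sketch with a two-paragraph toy text (illustrative only): each paragraph comes back as one span.

p <- as.String("First paragraph.\n\nSecond paragraph.")
blankline_tokenizer(p)      ## one span per paragraph
p[blankline_tokenizer(p)]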

wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.

Examples

library("NLP")

## A simple text.
s <- String("First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5

spans <- whitespace_tokenizer(s)
spans
s[spans]

spans <- wordpunct_tokenizer(s)
spans
s[spans]
