
Tokenizers using regular expressions to match either tokens or separators between tokens.
Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)
pattern: a character string giving the regular expression to use for matching.
invert: a logical indicating whether to match separators between tokens.
...: further arguments to be passed to gregexpr().
meta: a named or empty list of tokenizer metadata tag-value pairs.
Regexp_Tokenizer() returns the created regexp span tokenizer.
blankline_tokenizer(), whitespace_tokenizer() and wordpunct_tokenizer() return the spans of the tokens found in s.
Regexp_Tokenizer() creates regexp span tokenizers which use the given pattern and ... arguments to match tokens or separators between tokens via gregexpr(), and then transform the match results into character spans of the tokens found.
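For illustration, a minimal sketch of defining custom tokenizers with Regexp_Tokenizer(); the patterns and example text here are invented for this sketch and are not part of the package:

# Match runs of digits as tokens (invert = FALSE, the default).
library("NLP")
digit_tokenizer <- Regexp_Tokenizer("[[:digit:]]+")
s <- String("Call 555 0199 before 5 pm.")
digit_tokenizer(s)        # spans of the digit runs
s[digit_tokenizer(s)]     # the matched tokens themselves

# With invert = TRUE the pattern describes the separators instead,
# here commas followed by optional whitespace.
csv_tokenizer <- Regexp_Tokenizer(",[[:space:]]*", invert = TRUE)
csv_tokenizer(String("red, green, blue"))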
whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.
blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.
wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
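As a rough sketch, the three predefined tokenizers behave like Regexp_Tokenizer() calls along the following lines; the patterns are illustrative approximations of the documented behaviour, not necessarily the exact expressions used internally by the package:

library("NLP")

# Separator-based: any run of whitespace separates tokens.
my_whitespace_tokenizer <- Regexp_Tokenizer("[[:space:]]+", invert = TRUE)

# Separator-based: a run of blank lines separates tokens (approximate pattern).
my_blankline_tokenizer <- Regexp_Tokenizer("\n[[:blank:]]*\n", invert = TRUE)

# Match-based: runs of alphabetic characters, or runs of
# non-whitespace non-alphabetic characters.
my_wordpunct_tokenizer <- Regexp_Tokenizer("[[:alpha:]]+|[^[:alpha:][:space:]]+")

s <- String("Hello, world!  Bye.")
s[my_whitespace_tokenizer(s)]   # "Hello," "world!" "Bye."
s[my_wordpunct_tokenizer(s)]    # "Hello" "," "world" "!" "Bye" "."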
See Span_Tokenizer() for general information on span tokenizer objects.
library("NLP")

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**

## Spans of whitespace-separated tokens.
spans <- whitespace_tokenizer(s)
spans
s[spans]

## Spans of word and punctuation tokens.
spans <- wordpunct_tokenizer(s)
spans
s[spans]