tokenizers
From text2vec v0.3.0
by Dmitriy Selivanov
Tokenization functions, which perform string splitting. These are simple wrappers around functionality from the stringi and stringr packages.
Usage
word_tokenizer(string)
regexp_tokenizer(string, pattern)
Arguments
- string: character vector
- pattern: character pattern. Can also be one of the stringr modifiers (see the example below).
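The pattern argument accepts plain regular expressions as well as stringr modifiers such as fixed(), regex(), coll(), and boundary(). A brief illustration (the input string here is made up; this assumes, as Details explains, that the pattern is forwarded to str_split):

# split on runs of commas and/or whitespace
regexp_tokenizer("One, two, THREE four", pattern = stringr::regex("[,\\s]+"))
# [[1]]
# [1] "One"   "two"   "THREE" "four"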
Details
Uses str_split under the hood (which is built on top of stringi::stri_split). These functions are essentially wrappers around str_split, which is consistent, flexible, and robust. See str_split and modifiers for details.
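To make the wrapper relationship concrete, here is a minimal sketch of what regexp_tokenizer could look like under this design; it is an illustrative assumption, not the package's actual source:

library(stringr)

# Illustrative sketch only: forward string and pattern to str_split(),
# which already returns a list of character vectors (one per input element).
regexp_tokenizer_sketch <- function(string, pattern) {
  str_split(string, pattern = pattern)
}

regexp_tokenizer_sketch(c("first second", "bla, bla, blaa"), pattern = "\\s+")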
Value
A list of character vectors. Each element of the list contains a vector of tokens.
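For instance, a two-element input vector produces a two-element list (the output shown as comments reflects the expected word splits):

word_tokenizer(c("first second", "third"))
# [[1]]
# [1] "first"  "second"
#
# [[2]]
# [1] "third"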
Examples
doc <- c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general: splits by a fixed single whitespace symbol
regexp_tokenizer(doc, pattern = stringr::fixed(" "))