text2vec (version 0.4.0)

tokenizers: Simple tokenization functions that perform string splitting

Description

Simple wrappers around base regular expressions. For much faster and more feature-rich tokenizers, see the tokenizers package: https://cran.r-project.org/package=tokenizers. Also see the str_split_* functions in the stringi and stringr packages. The reason for not including these packages in text2vec's dependencies is our desire to keep the number of dependencies as small as possible.

Usage

word_tokenizer(strings, ...)

regexp_tokenizer(strings, pattern, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, ...)

Arguments

strings

character vector

...

other parameters passed on to the base strsplit function, which is used under the hood.

pattern

character pattern to split the strings by.
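Because these helpers forward `...` to strsplit, extra strsplit arguments such as fixed or perl can be passed straight through. Below is a minimal sketch of that assumed forwarding; `my_regexp_tokenizer` is a hypothetical stand-in for illustration, not the actual text2vec source.

```r
# Hypothetical re-implementation illustrating the assumed forwarding;
# the real regexp_tokenizer may differ in its details.
my_regexp_tokenizer <- function(strings, pattern, ...) {
  strsplit(strings, pattern, ...)
}

# Arguments in `...` reach strsplit, e.g. fixed = TRUE treats the
# pattern as a literal string rather than a regular expression:
my_regexp_tokenizer(c("a.b", "axb"), ".", fixed = TRUE)
```

Without fixed = TRUE, the pattern "." would be interpreted as a regular expression matching every character, and the result would contain only empty tokens.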

Value

list of character vectors. Each element of the list contains a vector of tokens.
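To illustrate the return shape, here is a base-R analogue: strsplit itself produces the same list-of-character-vectors structure that these tokenizers return.

```r
# One list element per input string, each element a character
# vector of that string's tokens.
toks <- strsplit(c("one two", "three"), " ", fixed = TRUE)
length(toks)   # 2: one element per input string
toks[[1]]      # tokens of the first string: "one" "two"
```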

Examples

doc = c("first  second", "bla, bla, blaa")
# split into words
word_tokenizer(doc)
# faster but far less general: split on a fixed single whitespace character
# (note that consecutive spaces produce empty tokens)
regexp_tokenizer(doc, " ", fixed = TRUE)