text2vec (version 0.3.0)

tokenizers: Tokenization functions, which perform string splitting

Description

Simple wrappers around functionality from the stringi and stringr packages.

Usage

word_tokenizer(string)
regexp_tokenizer(string, pattern)

Arguments

string
character vector to be tokenized
pattern
character string containing the splitting pattern. Can also be one of the stringr pattern modifiers (e.g. fixed(), regex()).

Value

list of character vectors. Each element of the list contains a vector of tokens.

Details

Uses str_split under the hood (which is built on top of stringi::stri_split). These functions are thin wrappers around str_split, which is consistent, flexible, and robust. See str_split and modifiers for details.
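As a sketch of the behavior described above (not the package's exact implementation), the two tokenizers behave roughly like direct calls to stringr::str_split, with word_tokenizer splitting on word boundaries:

```r
library(stringr)

doc <- c("first  second", "bla, bla, blaa")

# word_tokenizer(doc) behaves roughly like splitting on word boundaries,
# which also discards punctuation and runs of whitespace:
str_split(doc, boundary("word"))

# regexp_tokenizer(doc, pattern) forwards its pattern to str_split;
# a fixed single-space pattern keeps empty tokens for repeated spaces:
str_split(doc, fixed(" "))
```

Because boundary("word") drops punctuation while fixed(" ") does not, the two calls above produce different tokens for the same input.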

Examples

doc <- c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster but far less general: splits on a fixed single whitespace character
regexp_tokenizer(doc, pattern = stringr::fixed(" "))