tokenizers

Tokenization functions that perform string splitting. These are simple wrappers around functionality from the stringi and stringr packages.

Usage
word_tokenizer(string)
regexp_tokenizer(string, pattern)
Arguments
string
character vector
pattern
character pattern to split on. Can also be wrapped in one of the stringr modifiers (for example fixed() or regex()).
Details

Uses str_split under the hood (which is built on top of stringi::stri_split). These functions are thin wrappers around str_split, which is consistent, flexible and robust. See str_split and modifiers for details.
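A minimal sketch of how such wrappers could be written on top of stringr::str_split is shown below; the word-boundary pattern is an assumption for illustration and is not necessarily the pattern text2vec uses internally:

# illustrative re-implementation, not the actual package code
my_word_tokenizer <- function(string) {
  # split each document on word boundaries (assumed behaviour)
  stringr::str_split(string, pattern = stringr::boundary("word"))
}
my_regexp_tokenizer <- function(string, pattern) {
  # split each document on the user-supplied pattern or modifier
  stringr::str_split(string, pattern = pattern)
}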

Value

A list of character vectors. Each element of the list contains a vector of tokens.
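For example, a two-document input yields a two-element list; the tokens shown in the comments below are illustrative and assume word-level splitting:

tokens <- word_tokenizer(c("first doc", "second doc"))
length(tokens)   # 2, one element per input document
tokens[[1]]      # character vector of tokens for the first document, e.g. c("first", "doc")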

Aliases
  • regexp_tokenizer
  • tokenizers
  • word_tokenizer
Examples
library(text2vec)
doc <- c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general: splits on a single fixed whitespace character
regexp_tokenizer(doc, pattern = stringr::fixed(" "))
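Assuming word-level splitting, the first call should return list(c("first", "second"), c("bla", "bla", "blaa")); the fixed-whitespace split keeps punctuation attached to the tokens and, because of the double space in the first document, may also produce an empty token.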
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
