tokenizers
Simple tokenization functions for string splitting
A few simple tokenization functions. For a more comprehensive set of tokenizers see the tokenizers package: https://cran.r-project.org/package=tokenizers. Also check stringi::stri_split_*.
Usage
word_tokenizer(strings, ...)

char_tokenizer(strings, ...)
space_tokenizer(strings, sep = " ", xptr = FALSE, ...)
postag_lemma_tokenizer(strings, udpipe_model, tagger = "default",
tokenizer = "tokenizer", pos_keep = character(0),
pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ",
"AUX", "X", "INTJ"))
Arguments
- strings: character vector.
- ...: other parameters (usually not used; see the source code for details).
- sep: character, nchar(sep) = 1 - split strings by this character.
- xptr: logical - tokenize at the C++ level; can speed tokenization up by 15-50%.
- udpipe_model: udpipe model, can be loaded with ?udpipe::udpipe_load_model.
- tagger: "default" - tagger parameter, as per the ?udpipe::udpipe_annotate docs.
- tokenizer: "tokenizer" - tokenizer parameter, as per the ?udpipe::udpipe_annotate docs.
- pos_keep: character(0) - which tokens to keep; character(0) means keep all of them.
- pos_remove: c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ", "AUX", "X", "INTJ") - which tokens to remove; character(0) means do not remove any.
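As a minimal sketch of postag_lemma_tokenizer (assumes the udpipe package is installed and a model has already been downloaded; the model file name below is illustrative):

# download a udpipe model once, e.g. udpipe::udpipe_download_model("english"),
# then load it from the resulting file
m = udpipe::udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")
# lemmatized tokens; tags in the default pos_remove (punctuation, determiners, ...) are dropped
postag_lemma_tokenizer("The quick brown fox jumps over the lazy dog.", udpipe_model = m)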
Value
A list of character vectors. Each element of the list contains a vector of tokens.
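For instance (a sketch; the output is shown informally as comments):

word_tokenizer(c("first second", "third"))
# a list of length 2:
# [[1]] "first"  "second"
# [[2]] "third"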
Examples
doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - performs a split by a fixed single whitespace symbol
space_tokenizer(doc, " ")
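# a sketch of the other tokenizers (the note on the xptr result is an assumption)
# split each string into single characters
char_tokenizer(doc)
# same whitespace split, but tokenized at the C++ level; the result is intended
# for downstream text2vec functions rather than for direct inspection
space_tokenizer(doc, " ", xptr = TRUE)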