itoken
Iterators over input objects
This function creates iterators over input objects, which are then consumed by the functions that build vocabularies, corpora, and DTM or TCM matrices. Such an iterator is usually passed to the following functions: create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm. See them for details.
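For orientation, a minimal sketch of the typical pipeline, assuming the text2vec package is attached and its bundled movie_review data set is used:

library(text2vec)
data("movie_review")
# iterate over 100 raw documents, lowercasing and tokenizing on the fly
it <- itoken(movie_review$review[1:100], tolower, word_tokenizer)
v <- create_vocabulary(it)

Since the iterator is consumed as it is read, it is usually recreated before each new pass over the data.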
Usage
itoken(iterable, ...)

## S3 method for class 'list'
itoken(iterable, chunks_number = 10, progessbar = interactive(), ids = NULL, ...)

## S3 method for class 'character'
itoken(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), chunks_number = 10, progessbar = interactive(), ids = NULL, ...)

## S3 method for class 'ifiles'
itoken(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), progessbar = interactive(), ...)

## S3 method for class 'ilines'
itoken(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), ...)
Arguments
- iterable: an object from which to generate an iterator
- ...: arguments passed to other methods (not used at the moment)
- chunks_number: integer, the number of pieces the input object should be divided into
- progessbar: logical, whether to show a progress bar
- ids: vector of document ids. If ids is not provided, names(iterable) will be used. If names(iterable) == NULL, incremental ids will be assigned
- preprocess_function: function which takes a chunk of character vectors and does all pre-processing. Usually preprocess_function should return a character vector of preprocessed/cleaned documents. See the "Details" section and the sketch below
- tokenizer: function which takes a character vector from preprocess_function, splits it into tokens, and returns a list of character vectors. If you need to perform stemming, call the stemmer inside the tokenizer. See the "Examples" section
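To illustrate these arguments together, a brief sketch; clean() is a hypothetical helper, not part of the package:

# hypothetical pre-processing helper: lowercase and strip punctuation
clean <- function(x) gsub("[[:punct:]]+", " ", tolower(x))
it <- itoken(movie_review$review[1:50],
             preprocess_function = clean,
             tokenizer = word_tokenizer,
             chunks_number = 5,
             progessbar = FALSE,
             ids = movie_review$id[1:50])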
Details
S3 methods for creating an itoken iterator, by input type (a brief sketch follows this list):
- list: all elements of the input list should be character vectors containing tokens
- character: raw text source; the user must provide a tokenizer function
- ifiles: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)
- idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)
- ilines: from lines; the user must provide functions to tokenize
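A minimal sketch of the list and ifiles methods, assuming the text2vec sample data; the file paths and the reader_function argument name in the ifiles example are assumptions based on this version of the package:

data("movie_review")
# list method: elements are already-tokenized character vectors
tokens <- word_tokenizer(tolower(movie_review$review[1:10]))
it <- itoken(tokens, chunks_number = 2)
# ifiles method: the reader goes to ifiles(), the tokenizer to itoken()
# it <- itoken(ifiles(c("doc1.txt", "doc2.txt"), reader_function = readLines),
#              preprocess_function = tolower, tokenizer = word_tokenizer)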
See Also
ifiles, idir, ilines, create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm
Examples
data("movie_review")
txt <- movie_review$review[1:100]
ids <- movie_review$id[1:100]
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10)
it <- itoken(txt, tolower, word_tokenizer, chunks_number = 10, ids = ids)
# Example of a stemming tokenizer (requires the SnowballC package and the magrittr pipe)
# stem_tokenizer <- function(x) {
#   word_tokenizer(x) %>% lapply(SnowballC::wordStem, language = "en")
# }
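If SnowballC is installed, the stemming tokenizer above plugs into itoken like any other tokenizer (a usage sketch):
# it <- itoken(txt, tolower, stem_tokenizer, chunks_number = 10, ids = ids)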