text2vec (version 0.4.0)

itoken: Iterators over input objects


This function creates iterators over input objects for building vocabularies, corpora, and DTM and TCM matrices. The iterator is usually passed to the following functions: create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm. See them for details.
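As a rough illustration of how the iterator feeds the functions listed above, the sketch below builds a vocabulary and a document-term matrix from a handful of toy documents. The documents and the `make_it` helper are hypothetical, and the text2vec calls are guarded so the snippet only runs them when the package is installed. A fresh iterator is created for each pass so the sketch does not depend on whether iterators can be reused.

```r
# Hypothetical toy corpus: 12 short documents with names used as ids.
docs <- rep(c("The cat sat on the mat",
              "The dog chased the cat",
              "A bird sang"), times = 4)
names(docs) <- paste0("doc", seq_along(docs))

if (requireNamespace("text2vec", quietly = TRUE)) {
  library(text2vec)

  # Hypothetical helper: returns a new iterator over the same documents.
  make_it <- function() {
    itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
           progressbar = FALSE)
  }

  vocab <- create_vocabulary(make_it())   # pass 1: collect terms
  vectorizer <- vocab_vectorizer(vocab)   # map tokens to vocabulary columns
  dtm <- create_dtm(make_it(), vectorizer) # pass 2: build the DTM
  print(dim(dtm))
}
```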


Usage

itoken(iterable, ...)

# S3 method for list
itoken(iterable, chunks_number = 10, progressbar = interactive(), ids = NULL, ...)

# S3 method for character
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, chunks_number = 10, progressbar = interactive(), ids = NULL, ...)

# S3 method for iterator
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, progressbar = interactive(), ...)



Arguments

iterable: an object from which to generate an iterator.

...: arguments passed to other methods (not used at the moment).

chunks_number: integer, the number of pieces into which the object should be divided.

progressbar: logical, indicates whether to show a progress bar.

ids: vector of document ids. If ids is not provided, names(iterable) will be used. If names(iterable) is NULL, incremental ids will be assigned.

preprocessor: function which takes a chunk of documents as a character vector and does all pre-processing. Usually preprocessor should return a character vector of preprocessed/cleaned documents. See "Details" section.

tokenizer: function which takes a character vector from preprocessor, splits it into tokens, and returns a list of character vectors. If you need to perform stemming, call the stemmer inside the tokenizer. See the examples section.
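The contract between preprocessor and tokenizer described above can be sketched in base R. The two helpers below, `clean_text` and `simple_tokenizer`, are hypothetical names invented for illustration, not part of text2vec: the first maps a character vector to a cleaned character vector of the same length, the second maps that vector to a list with one character vector of tokens per document.

```r
# Hypothetical preprocessor: lower-case the text and replace every run of
# characters other than a-z and space with a single space.
clean_text <- function(x) {
  x <- tolower(x)
  gsub("[^a-z ]+", " ", x)
}

# Hypothetical tokenizer: trim the ends, then split on runs of whitespace.
# Returns a list with one character vector of tokens per input document.
simple_tokenizer <- function(x) {
  strsplit(trimws(x), "\\s+")
}

docs <- c(first = "Hello, World!", second = "Iterators over  input objects.")

cleaned <- clean_text(docs)           # character vector, same length as docs
tokens <- simple_tokenizer(cleaned)   # list of character vectors
str(tokens)
```

Functions shaped like these could then be supplied as `preprocessor = clean_text` and `tokenizer = simple_tokenizer` in the character method shown under the usage signatures.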


Details

S3 methods create an itoken iterator from the following input types:

  • list: all elements of the input list should be character vectors containing tokens

  • character: raw text source; the user must provide a tokenizer function

  • ifiles: from files; the user must provide a function to read in the files (to ifiles) and a function to tokenize them (to itoken)

  • idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)
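For the list method, the input is expected to be tokenized already. A minimal sketch, assuming toy documents invented here for illustration: the tokens are produced with base R, checked against the "every element is a character vector" requirement, and then handed to itoken only if text2vec is installed. Because no ids are passed, the names of the list would serve as document ids.

```r
# Hypothetical pre-tokenized input: 12 toy documents split on single spaces.
raw <- rep(c("a b c", "d e", "f"), times = 4)
tokens <- strsplit(raw, " ", fixed = TRUE)
names(tokens) <- paste0("doc", seq_along(tokens))

# The list method requires every element to be a character vector of tokens.
all_character <- all(vapply(tokens, is.character, logical(1)))
stopifnot(all_character)

if (requireNamespace("text2vec", quietly = TRUE)) {
  # No preprocessor or tokenizer is needed for already-tokenized input.
  it <- text2vec::itoken(tokens, progressbar = FALSE)
}
```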

See Also

ifiles, idir, create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm


Examples
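library(text2vec)
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
# Iterator with default (incremental) document ids
it = itoken(txt, tolower, word_tokenizer, chunks_number = 10)
# Iterator with explicit document ids
it = itoken(txt, tolower, word_tokenizer, chunks_number = 10, ids = ids)
# Example of stemming tokenizer (requires the SnowballC package)
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }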
