text2vec (version 0.3.0)

itoken: Iterators over input objects

Description

This function creates iterators over input objects, which are then consumed by the functions that build vocabularies, corpora, or DTM and TCM matrices. This iterator is usually used in the following functions: create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm. See them for details.

Usage

itoken(iterable, ...)
"itoken"(iterable, chunks_number = 10, progessbar = interactive(), ids = NULL, ...)
"itoken"(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), chunks_number = 10, progessbar = interactive(), ids = NULL, ...)
"itoken"(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), progessbar = interactive(), ...)
"itoken"(iterable, preprocess_function = identity, tokenizer = function(x) strsplit(x, " ", TRUE), ...)

Arguments

iterable
an object from which to generate an iterator
...
arguments passed to other methods (not used at the moment)
chunks_number
integer, the number of chunks the input object should be split into.
progessbar
logical, whether to display a progress bar.
ids
vector of document ids. If ids is not provided, names(iterable) will be used. If names(iterable) is NULL, incremental ids will be assigned.
preprocess_function
function which takes a chunk of character vectors and performs all pre-processing. Usually preprocess_function should return a character vector of preprocessed/cleaned documents. See the "Details" section.
tokenizer
function which takes a character vector from preprocess_function, splits it into tokens, and returns a list of character vectors. If you need to perform stemming, call the stemmer inside the tokenizer. See the "Examples" section.

Details

S3 methods for creating an itoken iterator from different input sources:
  • list: all elements of the input list should be character vectors containing tokens
  • character: a raw text source; the user must provide a tokenizer function
  • ifiles: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)
  • idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)
  • ilines: from lines; the user must provide a function to tokenize them
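As a minimal sketch of the first two methods above (assuming the text2vec 0.3.0 API documented here, and the sample documents shown are invented for illustration), an itoken iterator can be built either from pre-tokenized lists or from a raw character vector:

```r
library(text2vec)

# list method: each element is already a character vector of tokens
tokens <- list(d1 = c("hello", "world"),
               d2 = c("another", "tiny", "document"))
it1 <- itoken(tokens)

# character method: supply a preprocessing function and a tokenizer
docs <- c("Hello World!", "Another tiny document.")
it2 <- itoken(docs, preprocess_function = tolower,
              tokenizer = word_tokenizer, chunks_number = 2)

# the iterator is then consumed by downstream functions, e.g.:
v <- create_vocabulary(it2)
```

Note that an iterator is consumed by iteration, so a fresh itoken object should be created for each pass over the data (e.g. once for create_vocabulary and again for create_dtm).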

See Also

ifiles, idir, ilines, create_vocabulary, create_corpus, create_dtm, vectorizers, create_tcm

Examples

library(text2vec)
data("movie_review")
txt <- movie_review$review[1:100]
ids <- movie_review$id[1:100]
it <- itoken(txt, preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10)
it <- itoken(txt, preprocess_function = tolower,
             tokenizer = word_tokenizer, chunks_number = 10, ids = ids)
# Example of a stemming tokenizer
stem_tokenizer <- function(x) {
  lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
}
