itoken

Iterators (and parallel iterators) over input objects

This family of functions creates iterators over input objects in order to build vocabularies, or DTM and TCM matrices. These iterators are usually consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See their documentation for details.
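
A minimal sketch of the usual workflow (assuming the text2vec package and its bundled movie_review dataset are available): the iterator is consumed once to build a vocabulary and again to build a DTM.

library(text2vec)
data("movie_review")
# iterator over lowercased, word-tokenized reviews
it = itoken(movie_review$review, preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))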

Usage
itoken(iterable, ...)

# S3 method for character
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, progressbar = interactive(), ids = NULL, ...)

# S3 method for list
itoken(iterable, n_chunks = 10, progressbar = interactive(),
  ids = names(iterable), ...)

# S3 method for iterator
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  progressbar = interactive(), ...)

itoken_parallel(iterable, ...)

# S3 method for character
itoken_parallel(iterable, preprocessor = identity,
  tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)

# S3 method for iterator
itoken_parallel(iterable, preprocessor = identity,
  tokenizer = space_tokenizer, n_chunks = 1L, ...)

# S3 method for list
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)

Arguments
iterable

an object from which to generate an iterator

...

arguments passed to other methods

preprocessor

function which takes a chunk of character vectors and does all pre-processing. Usually the preprocessor should return a character vector of preprocessed/cleaned documents. See the "Details" section.

tokenizer

function which takes a character vector from the preprocessor, splits it into tokens and returns a list of character vectors. If you need to perform stemming, call the stemmer inside the tokenizer. See the "Examples" section and the sketch following this list.

n_chunks

integer, the number of pieces into which the object should be divided. Each chunk is then processed independently (and, in the case of itoken_parallel, in parallel if a parallel backend is registered). There is usually a trade-off: a larger number of chunks means a lower memory footprint but slower execution, while a smaller number of chunks means a larger memory footprint but faster execution (in both cases assuming the user-supplied preprocessor and tokenizer functions are efficiently vectorized).

progressbar

logical, indicates whether to show a progress bar.

ids

vector of document ids. If ids is not provided, names(iterable) will be used. If names(iterable) is NULL, incremental ids will be assigned.
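
For illustration, a custom preprocessor/tokenizer pair might look like the following sketch (prep_fun and tok_fun are hypothetical names; word_tokenizer comes from text2vec):

library(text2vec)
# hypothetical helper: lowercase and replace non-alphanumeric characters with spaces
prep_fun = function(x) gsub("[^[:alnum:][:space:]]", " ", tolower(x))
tok_fun = word_tokenizer
it = itoken(c(d1 = "First document!", d2 = "And a SECOND one..."),
            preprocessor = prep_fun, tokenizer = tok_fun, progressbar = FALSE)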

Details

S3 methods for creating an itoken iterator, depending on the type of input:

  • list: all elements of the input list should be character vectors containing tokens

  • character: raw text source; the user must provide a tokenizer function

  • ifiles: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken); see the sketch after this list

  • idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)

  • ifiles_parallel: from files in parallel
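
As a sketch of the file-based variant (the file paths here are hypothetical placeholders; readLines is the reader and word_tokenizer the tokenizer):

library(text2vec)
files = c("corpus/doc1.txt", "corpus/doc2.txt")  # hypothetical paths to plain-text files
it_files = itoken(ifiles(files, reader = readLines),
                  tokenizer = word_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it_files)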

See Also

ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm

Aliases
  • itoken
  • itoken.character
  • itoken.list
  • itoken.iterator
  • itoken_parallel
  • itoken_parallel.character
  • itoken_parallel.iterator
  • itoken_parallel.list
Examples
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of a stemming tokenizer:
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2^16), type = 'dgTMatrix'))
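
For itoken_parallel to actually run in parallel, a parallel backend has to be registered first. A sketch assuming the doParallel package (the worker count of 2 is arbitrary):

library(text2vec)
library(doParallel)
registerDoParallel(2)  # register a foreach backend with 2 workers
it = itoken_parallel(movie_review$review[1:100], tolower, word_tokenizer, n_chunks = 4)
dtm = create_dtm(it, hash_vectorizer(2^16))
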
Documentation reproduced from package text2vec, version 0.6, License: GPL (>= 2) | file LICENSE
