text2vec (version 0.3.0)

create_tcm: Term-co-occurrence matrix construction

Description

This is a high-level function for constructing a term-co-occurrence matrix. If a parallel backend is registered, it will construct the TCM in multiple threads.
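For example, a minimal sketch of registering a parallel backend before calling create_tcm (doParallel is just one of several foreach backends; the worker count of 2 is a placeholder):

# register a foreach-compatible parallel backend, then call create_tcm()
# on a list of itoken iterators (see the Examples below)
library(doParallel)
registerDoParallel(2)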

Usage

create_tcm(itoken_src, vectorizer, ...)

# S3 method for itoken
create_tcm(itoken_src, vectorizer, ...)

# S3 method for list
create_tcm(itoken_src, vectorizer, verbose = FALSE, ...)

Arguments

itoken_src
list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.
vectorizer
vectorizer function. See vectorizers.
...
arguments to foreach function which is used to iterate over itoken_src.
verbose
logical. Print status messages.

Value

A sparse term-co-occurrence matrix of class dgCMatrix.
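The result is a sparse matrix from the Matrix package, so the usual sparse-matrix tools apply. A short sketch, assuming tcm was built as in the Examples below:

class(tcm)     # "dgCMatrix"
dim(tcm)       # number of terms x number of terms
tcm[1:5, 1:5]  # inspect a small corner of the co-occurrence counts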

Details

For parallel construction, the user should split the data and provide a list of itoken iterators, as sketched below. Each element of itoken_src will be handled in a separate thread, and the results will be combined at the end of processing.
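A compact sketch of this split-then-iterate pattern (split_into and word_tokenizer are text2vec helpers; N_WORKERS and the vectorizer are placeholders, set up as in the Examples below):

N_WORKERS <- 2  # placeholder: number of registered workers
splits <- split_into(movie_review$review, N_WORKERS)
# one itoken iterator per chunk; each chunk is processed in its own thread
itoken_src <- lapply(splits, itoken, tolower, word_tokenizer)
tcm <- create_tcm(itoken_src, vectorizer)  # vectorizer built as in the Examples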

See Also

itoken

Examples

## Not run: 
# data("movie_review")
# 
# # single thread
# 
# tokens <- movie_review$review %>% tolower %>% word_tokenizer
# it <- itoken(tokens)
# v <- create_vocabulary(it)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# # `it` was consumed by create_vocabulary(), so build a fresh iterator
# tcm <- create_tcm(itoken(tokens), vectorizer)
# 
# # parallel version
# 
# # set to number of cores on your machine
# N_WORKERS <- 1
# splits <- split_into(movie_review$review, N_WORKERS)
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# v <- create_vocabulary(jobs)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# # the iterators were consumed when building the vocabulary, so recreate them
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# doParallel::registerDoParallel(N_WORKERS)
# tcm <- create_tcm(jobs, vectorizer)
# ## End(Not run)
