create_tcm
From text2vec v0.3.0
by Dmitriy Selivanov
Term-co-occurrence matrix construction
This is a high-level function for constructing a term-co-occurrence matrix. If a parallel backend is registered, it will construct the TCM in multiple threads.
Usage
create_tcm(itoken_src, vectorizer, ...)
## S3 method for class 'itoken'
create_tcm(itoken_src, vectorizer, ...)
## S3 method for class 'list'
create_tcm(itoken_src, vectorizer, verbose = FALSE, ...)
Arguments
- itoken_src
list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.
- vectorizer
function vectorizer function. See vectorizers.
- ...
arguments to foreach function which is used to iterate over itoken_src.
- verbose
logical print status messages
Details
To build the TCM in parallel, the user must split the data and provide
a list of itoken iterators. Each element of itoken_src will be handled
in a separate thread, and the results will be combined at the end of
processing.
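The split-and-combine idea can be sketched in base R without text2vec. This is only an illustration under stated assumptions: `count_cooccurrences` is a hypothetical helper, not a text2vec function; it uses plain unweighted counts in a forward window, which may differ from the weighting text2vec applies; and the chunks are processed sequentially with `lapply` rather than in separate threads.

```r
# Hypothetical helper (not part of text2vec): counts how often each pair of
# vocabulary terms appears within `window` positions of each other, looking
# forward only, for one chunk of tokenized documents.
count_cooccurrences <- function(docs, vocab, window = 3L) {
  m <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  for (tokens in docs) {
    idx <- match(tokens, vocab)
    idx <- idx[!is.na(idx)]          # drop out-of-vocabulary tokens
    n <- length(idx)
    if (n < 2L) next
    for (i in seq_len(n - 1L)) {
      for (j in (i + 1L):min(i + window, n)) {
        m[idx[i], idx[j]] <- m[idx[i], idx[j]] + 1
      }
    }
  }
  m
}

# Toy corpus: four already tokenized documents.
docs <- list(c("a", "b", "c"), c("b", "c", "d"), c("a", "c"), c("d", "a", "b"))
vocab <- sort(unique(unlist(docs)))

# Split the documents into two chunks, count each chunk independently,
# then combine the partial matrices by summing them.
chunks <- split(docs, rep(1:2, each = 2))
partial <- lapply(chunks, count_cooccurrences, vocab = vocab)
combined <- Reduce(`+`, partial)

# Counting the whole corpus in one pass gives the same matrix.
whole <- count_cooccurrences(docs, vocab)
```

Because each document contributes to the counts independently of the others, the per-chunk matrices can be computed in any order and summed, which is what makes one job per list element workable.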
Value
dgCMatrix
TCM matrix
See Also
Examples
## Not run:
# library(text2vec)
# library(magrittr)  # provides %>%
# data("movie_review")
#
# # single thread
#
# tokens <- movie_review$review %>% tolower %>% word_tokenizer
# it <- itoken(tokens)
# v <- create_vocabulary(it)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# # pass a fresh iterator; `it` was already used to build the vocabulary
# tcm <- create_tcm(itoken(tokens), vectorizer)
#
# # parallel version
#
# # set to number of cores on your machine
# N_WORKERS <- 1
# splits <- split_into(movie_review$review, N_WORKERS)
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# v <- create_vocabulary(jobs)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# doParallel::registerDoParallel(N_WORKERS)
# tcm <- create_tcm(jobs, vectorizer)
# ## End(Not run)