create_tcm


Term-co-occurrence matrix construction

This is a high-level function for constructing a term-co-occurrence matrix. If a parallel backend is registered, it will construct the TCM in multiple threads.

Usage
create_tcm(itoken_src, vectorizer, ...)
"create_tcm"(itoken_src, vectorizer, ...)
"create_tcm"(itoken_src, vectorizer, verbose = FALSE, ...)
Arguments
itoken_src
list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.
vectorizer
a vectorizer function. See vectorizers.
...
arguments to foreach function which is used to iterate over itoken_src.
verbose
logical. Print status messages.
Details

The user should keep in mind that to process data in parallel, he or she should split the data and provide a list of itoken iterators. Each element of itoken_src will be handled in a separate thread, and the results will be combined at the end of processing.
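The splitting step described above can be sketched as follows. This is a minimal, hedged illustration: `docs` is a hypothetical character vector standing in for real input, and the snippet assumes the text2vec and doParallel packages are installed.

```r
library(text2vec)
library(doParallel)

# hypothetical input: a small character vector of documents
docs <- c("first document text", "second document text",
          "third document text", "fourth document text")

# register a parallel backend with one worker per chunk
N_WORKERS <- 2
registerDoParallel(N_WORKERS)

# split the documents into one chunk per worker
splits <- split_into(docs, N_WORKERS)

# build one itoken iterator per chunk; create_tcm() will process
# each element of this list in a separate thread
jobs <- lapply(splits, itoken, tolower, word_tokenizer)
```

Passing `jobs` (a list of iterators) rather than a single iterator is what triggers the parallel code path in `create_tcm`.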

Value

A dgCMatrix term-co-occurrence matrix.

See Also

itoken

Aliases
  • create_tcm
  • create_tcm.itoken
  • create_tcm.list
Examples
## Not run: 
# library(text2vec)
# library(magrittr) # for %>%
# data("movie_review")
# 
# # single thread
# 
# tokens <- movie_review$review %>% tolower %>% word_tokenizer
# it <- itoken(tokens)
# v <- create_vocabulary(it)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# # the iterator was consumed by create_vocabulary(), so create a fresh one
# tcm <- create_tcm(itoken(tokens), vectorizer)
# 
# # parallel version
# 
# # change 1 to the number of cores on your machine
# N_WORKERS <- 1
# splits <- split_into(movie_review$review, N_WORKERS)
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# v <- create_vocabulary(jobs)
# vectorizer <- vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# # the iterators were consumed by create_vocabulary(), so recreate them
# jobs <- lapply(splits, itoken, tolower, word_tokenizer)
# doParallel::registerDoParallel(N_WORKERS)
# tcm <- create_tcm(jobs, vectorizer)
# ## End(Not run)
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
