create_tcm: Term-co-occurence matrix construction

Description

This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.

Usage

create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
  ...)
# S3 method for itoken
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
  ...)
# S3 method for itoken_parallel
create_tcm(it, vectorizer,
  skip_grams_window = 5L, skip_grams_window_context = c("symmetric",
  "right", "left"), weights = 1/seq_len(skip_grams_window),
  binary_cooccurence = FALSE, ...)

Arguments

list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.

vectorizer

function vectorizer function. See vectorizers.

skip_grams_window

integer window for term-co-occurence matrix construction. skip_grams_window should be > 0 if you plan to use vectorizer in create_tcm function. Value of 0L means to not construct the TCM.

skip_grams_window_context

one of c("symmetric", "right", "left") - which context words to use when count co-occurence statistics.

weights

weights for context/distant words during co-occurence statistics calculation. By default we are setting weight = 1 / distance_from_current_word. Should have length equal to skip_grams_window.

binary_cooccurence

FALSE by default. If set to TRUE then function only counts first appearence of the context word and remaining occurrence are ignored. Useful when creating TCM for evaluation of coherence of topic models. "symmetric" by default - take into account skip_grams_window left and right.

...

placeholder for additional arguments (not used at the moment). it.

Value

dgTMatrix TCM matrix

Details

If a parallel backend is registered, it will construct the TCM in multiple threads. The user should keep in mind that he/she should split data and provide a list of itoken iterators. Each element of it will be handled in a separate thread combined at the end of processing.

Examples

Run this code

# NOT RUN {
data("movie_review")

# single thread

tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# parallel version

# set to number of cores on your machine
it = token_parallel(movie_review$review[1:N], tolower, word_tokenizer, movie_review$id[1:N])
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'dgTMatrix')
tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
# }