create_tcm

0th

Percentile

Term-co-occurence matrix construction

This is a function for constructing a term-co-occurrence matrix(TCM). TCM matrix usually used with GloVe word embedding model.

Usage
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
  ...)

# S3 method for itoken create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)

# S3 method for itoken_parallel create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)

Arguments
it

list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.

vectorizer

function vectorizer function. See vectorizers.

skip_grams_window

integer window for term-co-occurence matrix construction. skip_grams_window should be > 0 if you plan to use vectorizer in create_tcm function. Value of 0L means to not construct the TCM.

skip_grams_window_context

one of c("symmetric", "right", "left") - which context words to use when count co-occurence statistics.

weights

weights for context/distant words during co-occurence statistics calculation. By default we are setting weight = 1 / distance_from_current_word. Should have length equal to skip_grams_window.

binary_cooccurence

FALSE by default. If set to TRUE then function only counts first appearence of the context word and remaining occurrence are ignored. Useful when creating TCM for evaluation of coherence of topic models. "symmetric" by default - take into account skip_grams_window left and right.

...

placeholder for additional arguments (not used at the moment). it.

Details

If a parallel backend is registered, it will construct the TCM in multiple threads. The user should keep in mind that he/she should split data and provide a list of itoken iterators. Each element of it will be handled in a separate thread combined at the end of processing.

Value

dgTMatrix TCM matrix

See Also

itoken create_dtm

Aliases
  • create_tcm
  • create_tcm.itoken
  • create_tcm.itoken_parallel
Examples
# NOT RUN {
data("movie_review")

# single thread

tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# parallel version

# set to number of cores on your machine
it = token_parallel(movie_review$review[1:N], tolower, word_tokenizer, movie_review$id[1:N])
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'dgTMatrix')
tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
# }
Documentation reproduced from package text2vec, version 0.6, License: GPL (>= 2) | file LICENSE

Community examples

Looks like there are no examples yet.