text2vec (version 0.4.0)

create_tcm: Term-co-occurrence matrix construction

Description

This function constructs a term-co-occurrence matrix (TCM). A TCM is typically used as input to word embedding models such as GloVe.

Usage

create_tcm(it, vectorizer, ...)

# S3 method for itoken create_tcm(it, vectorizer, ...)

# S3 method for list create_tcm(it, vectorizer, verbose = FALSE, work_dir = tempdir(), ...)

Arguments

it

iterator over tokens from itoken, or a list of such iterators. Each element yields a list of tokens, that is, tokenized and normalized strings.

vectorizer

vectorizer function. See vectorizers.

...

arguments passed to the foreach function, which is used to iterate over it.

verbose

logical. Whether to print status messages.

work_dir

working directory for intermediate results

Value

TCM as a dgTMatrix (sparse matrix in triplet form).
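The returned object can be inspected with the Matrix package (bundled with R). A minimal sketch, assuming a hand-built toy matrix rather than actual output of create_tcm():

```r
library(Matrix)

# build a small triplet-form sparse matrix of the same class (dgTMatrix)
# that create_tcm() returns; the dimnames hold the vocabulary terms
m = as(sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(2, 1, 4),
                    dims = c(3, 3),
                    dimnames = list(c("a", "b", "c"), c("a", "b", "c"))),
       "TsparseMatrix")
print(class(m))     # triplet-form sparse matrix class
print(m["a", "b"])  # co-occurrence weight for the pair ("a", "b")
```

Indexing by term names, as above, is a convenient way to look up individual co-occurrence counts.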

Details

If a parallel backend is registered, the TCM will be constructed in multiple threads. In that case the user should split the data and provide a list of itoken iterators. Each element of it is handled in a separate thread, and the partial results are combined at the end of processing.
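The splitting step can be sketched in base R (text2vec's split_into() helper, used in the Examples below, does the equivalent job; the itoken and create_tcm calls are shown commented because they require the text2vec package):

```r
docs = c("first doc", "second doc", "third doc", "fourth doc", "fifth doc")
n_workers = 2

# assign each document to one of n_workers roughly equal chunks
chunks = split(docs, cut(seq_along(docs), n_workers, labels = FALSE))

# one itoken iterator per chunk; create_tcm() then handles each element
# of the list in its own worker and combines the partial TCMs
# jobs = lapply(chunks, itoken, tolower, word_tokenizer)
# tcm  = create_tcm(jobs, vectorizer)
print(lengths(chunks))
```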

See Also

itoken, create_dtm

Examples

# NOT RUN {
library(text2vec)
data("movie_review")

# single thread

tokens = movie_review$review %>% tolower %>% word_tokenizer
it = itoken(tokens)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# the first iterator was consumed by create_vocabulary(), so pass a fresh one
tcm = create_tcm(itoken(tokens), vectorizer)

# parallel version

# set to number of cores on your machine
N_WORKERS = 1
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 3L)
# the iterators above were consumed by create_vocabulary(), so re-create them
jobs = lapply(splits, itoken, tolower, word_tokenizer)
doParallel::registerDoParallel(N_WORKERS)
tcm = create_tcm(jobs, vectorizer)
# }
