create_dtm


Document-term matrix construction

This is a high-level function for creating a document-term matrix. If a parallel backend is registered, it will construct the DTM in multiple threads.

Usage
create_dtm(itoken_src, vectorizer, type = c("dgCMatrix", "dgTMatrix", "lda_c"), ...)

## S3 method for class 'itoken'
create_dtm(itoken_src, vectorizer, type = c("dgCMatrix", "dgTMatrix", "lda_c"), ...)

## S3 method for class 'list'
create_dtm(itoken_src, vectorizer, type = c("dgCMatrix", "dgTMatrix", "lda_c"), verbose = FALSE, ...)
Arguments
itoken_src
iterator over tokens provided by itoken, or a list of such iterators (one per worker). Each element is a list of tokens, that is, tokenized and normalized strings.
vectorizer
function; see vectorizers.
type
character, one of c("dgCMatrix", "dgTMatrix", "lda_c"). "lda_c" is Blei's lda-c format (a list of 2 * doc_terms_size); see https://www.cs.princeton.edu/~blei/lda-c/readme.txt
...
arguments passed to foreach, which is used to iterate over itoken_src.
verbose
logical; print status messages during processing.
Details

Keep in mind that you must split the data yourself and provide a list of itoken iterators: each element of itoken_src is handled in a separate thread, and the results are combined at the end of processing.
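As a minimal sketch of this workflow (assuming the movie_review dataset bundled with text2vec and the split_into helper from this version of the package):

```r
library(text2vec)
library(doParallel)

# register a parallel backend with one worker per data split
N_WORKERS <- 2
registerDoParallel(N_WORKERS)

# split the documents manually, then wrap each split in its own itoken iterator
data("movie_review")
splits <- split_into(movie_review$review, N_WORKERS)
jobs <- lapply(splits, itoken, preprocess_function = tolower,
               tokenizer = word_tokenizer, chunks_number = 1)

# each iterator is processed in a separate thread; partial DTMs are combined
dtm <- create_dtm(jobs, hash_vectorizer())
```

A hash vectorizer is used here because it needs no shared vocabulary across workers; a vocab_vectorizer works as well if the vocabulary is built beforehand.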

Value

A document-term matrix

See Also

itoken, vectorizers, create_corpus, get_dtm

Aliases
  • create_dtm
  • create_dtm.itoken
  • create_dtm.list
Examples
## Not run: 
# data("movie_review")
# N <- 1000
# it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
#              tokenizer = word_tokenizer)
# v <- create_vocabulary(it)
# # remove very common and uncommon words
# pruned_vocab <- prune_vocabulary(v, term_count_min = 10,
#                                  doc_proportion_max = 0.5,
#                                  doc_proportion_min = 0.001)
# vectorizer <- vocab_vectorizer(pruned_vocab)
# # iterators are consumed once, so create a fresh one for the DTM
# it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
#              tokenizer = word_tokenizer)
# dtm <- create_dtm(it, vectorizer)
# # get tf-idf matrix from bag-of-words matrix
# dtm_tfidf <- transformer_tfidf(dtm)
# 
# ## Example of parallel mode
# # set to number of cores on your machine
# N_WORKERS <- 1
# doParallel::registerDoParallel(N_WORKERS)
# splits <- split_into(movie_review$review, N_WORKERS)
# jobs <- lapply(splits, itoken, tolower, word_tokenizer, chunks_number = 1)
# vectorizer <- hash_vectorizer()
# dtm <- create_dtm(jobs, vectorizer, type = 'dgTMatrix')
## End(Not run)
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
