get_dtm

Extract document-term matrix

This function extracts a document-term matrix from a Corpus object.

Usage
get_dtm(corpus, type = c("dgCMatrix", "dgTMatrix", "lda_c"))
Arguments
corpus
HashCorpus or VocabCorpus object. See create_corpus for details.
type
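character, the output format: one of "dgCMatrix", "dgTMatrix" or "lda_c". The first two are sparse matrix classes from the Matrix package (compressed column-oriented and triplet form, respectively). "lda_c" is Blei's lda-c format, returned as a list of 2 x doc_terms_size matrices, one per document; see https://www.cs.princeton.edu/~blei/lda-c/readme.txt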
Aliases
  • get_dtm
Examples
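library(text2vec)
data("movie_review")

# tokenize the first N reviews
N <- 1000
tokens <- movie_review$review[1:N] %>% tolower %>% word_tokenizer
it <- itoken(tokens)
v <- create_vocabulary(it)

# remove very common and uncommon words
pruned_vocab <- prune_vocabulary(v, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001,
 max_number_of_terms = 10000)

# build the vectorizer from the pruned vocabulary, not the full one
vectorizer <- vocab_vectorizer(pruned_vocab)
# recreate the iterator before building the corpus
it <- itoken(tokens)
corpus <- create_corpus(it, vectorizer)
# type defaults to "dgCMatrix"
dtm <- get_dtm(corpus)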
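
The "lda_c" layout described under Arguments can be hard to picture from the one-line description, so here is a minimal base R sketch of that shape. It does not call text2vec: the toy count matrix and the helper to_lda_c_like are hypothetical, and the use of zero-based term indices in the first row is an assumption rather than something this page states.

```r
# Hypothetical toy counts: 2 documents x 4 terms (not text2vec output).
counts <- matrix(c(2L, 0L, 1L, 0L,
                   0L, 3L, 0L, 1L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "c", "d")))

# Hypothetical helper: builds one 2-row integer matrix per document.
# Row 1 holds term indices (zero-based here, which is an assumption);
# row 2 holds how often each of those terms occurs in the document.
to_lda_c_like <- function(m) {
  lapply(seq_len(nrow(m)), function(i) {
    nz <- which(m[i, ] > 0L)  # columns with a non-zero count
    rbind(as.integer(nz - 1L), as.integer(m[i, nz]))
  })
}

docs <- to_lda_c_like(counts)
dim(docs[[1]])  # 2 rows, one column per distinct term in the document
```

Each list element has as many columns as the document has distinct terms, which is what the "2 * doc_terms_size" wording refers to.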
Documentation reproduced from package text2vec, version 0.3.0, License: MIT + file LICENSE
