This is a high-level function for creating a document-term matrix.
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix"), ...)
# S3 method for itoken
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix"), ...)
# S3 method for list
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix"), ...)
# S3 method for itoken_parallel
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix"), ...)
it: itoken iterator or list of itoken iterators.
vectorizer: function; a vectorizer function, see vectorizers.
type: character, one of c("dgCMatrix", "dgTMatrix").
...: arguments to the foreach function, which is used to iterate over it.
A document-term matrix
If a parallel backend is registered and the first argument is a list of itoken iterators, the function constructs the DTM in multiple threads.
The function does not split the data for you: split it yourself and pass a list of itoken iterators. Each element of it is handled in a separate thread, and the results are combined at the end of processing.
# NOT RUN {
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
                                doc_proportion_max = 0.5, doc_proportion_min = 0.001)
# build the vectorizer from the pruned vocabulary so that pruning takes effect
vectorizer = vocab_vectorizer(pruned_vocab)
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
dtm_tfidf = transformer_tfidf(dtm)
## Example of parallel mode
# set to number of cores on your machine
N_WORKERS = 1
if(require(doParallel)) registerDoParallel(N_WORKERS)
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer, n_chunks = 1)
vectorizer = hash_vectorizer()
dtm = create_dtm(jobs, vectorizer, type = 'dgTMatrix')
# }