text2vec (version 0.6.4)

create_dtm: Document-term matrix construction

Description

This is a high-level function for creating a document-term matrix.

Usage

create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix",
  "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)

# S3 method for itoken
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix",
  "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)

# S3 method for itoken_parallel
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix",
  "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)

Value

A document-term matrix

Arguments

it

itoken iterator or list of itoken iterators.

vectorizer

function, the vectorizer function; see vectorizers.

type

character, one of c("CsparseMatrix", "TsparseMatrix").

...

placeholder for additional arguments (not used at the moment).
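As a minimal sketch of how the arguments fit together, the snippet below builds a DTM from a two-document toy corpus and requests a triplet-format result through type. The toy corpus and the variable names (docs, dtm_t) are made up for illustration, and the snippet assumes that the iterator can be reused for both the vocabulary pass and the DTM pass.

```r
library(text2vec)
library(methods)

# Hypothetical two-document corpus, for illustration only
docs = c("the cat sat", "the dog sat down")

# Iterator over tokenized documents (progress bar disabled for quiet output)
it = itoken(docs, tokenizer = word_tokenizer, progressbar = FALSE)

# Vocabulary-based vectorizer built from the same iterator
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)

# Request a triplet-format sparse matrix instead of the default
dtm_t = create_dtm(it, vectorizer, type = "TsparseMatrix")

class(dtm_t)
dim(dtm_t)
```

Rows correspond to documents and columns to vocabulary terms; type only changes the sparse storage format, not the counts.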

Details

If a parallel backend is registered and the first argument is a list of itoken iterators, the function constructs the DTM in multiple threads. The user is responsible for splitting the data and providing the list of itoken iterators. Each element of it is handled in a separate thread, and the results are combined at the end of processing.
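The split-then-combine idea can be sketched by hand. The snippet below is a sequential illustration only, not the package's internal implementation: it splits the documents into chunks, builds one DTM per chunk, and stacks the pieces with rbind. The chunk count, hash size, and variable names (chunk_id, dtm_parts) are assumptions chosen for the example. A hash vectorizer is used because its column space is fixed in advance, so every chunk yields the same columns.

```r
library(text2vec)
library(Matrix)

data("movie_review")
docs = movie_review$review[1:200]
ids  = movie_review$id[1:200]

# Assign each document to one of 4 chunks of 50 documents (illustrative split)
chunk_id = rep(1:4, each = 50)

# Hash vectorizer: the number of columns is fixed by hash_size
vectorizer = hash_vectorizer(hash_size = 2^12)

# Build one DTM per chunk; with a parallel backend each chunk
# could be processed by a separate worker
dtm_parts = lapply(split(seq_along(docs), chunk_id), function(idx) {
  it = itoken(docs[idx], preprocessor = tolower, tokenizer = word_tokenizer,
              ids = ids[idx], progressbar = FALSE)
  create_dtm(it, vectorizer)
})

# Combine the per-chunk matrices row-wise into a single DTM
dtm = Reduce(rbind, dtm_parts)
dim(dtm)
```

With a vocabulary-based vectorizer the same approach works as long as every chunk uses the same vocabulary, since that is what keeps the columns aligned across chunks.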

See Also

itoken, vectorizers

Examples

if (FALSE) {
library(text2vec)
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocessor = tolower,
            tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# build the vectorizer from the pruned vocabulary
vectorizer = vocab_vectorizer(pruned_vocab)
it = itoken(movie_review$review[1:N], preprocessor = tolower,
            tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)

## Example of parallel mode
it = itoken_parallel(movie_review$review[1:N], preprocessor = tolower,
                     tokenizer = word_tokenizer, ids = movie_review$id[1:N])
vectorizer = hash_vectorizer()
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
}
