text2vec (version 0.4.0)

transform_tf: Scale a document-term matrix

Description

This set of functions scales a document-term matrix.

transform_tf: scale a DTM by one of two methods. If norm = "l1", then dtm_tf = (count of a particular word in the document) / (total number of words in the document). If norm = "l2", then dtm_tf = (count of a particular word in the document) ^ 2 / (total number of words in the document) ^ 2.

transform_binary: scale a DTM so that a cell is 1 if the word appears in the document; otherwise it is 0.

transform_tfidf: scale a DTM so that dtm_idf = log(count of a particular word in a document) / (number of documents where the term appears + 1)
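
The "l1" and binary scalings described above can be reproduced by hand on a toy matrix. The following is a minimal sketch that uses only the Matrix package, not text2vec itself; the matrix `counts` and the variable names are made up for illustration, and the package's internal implementation may differ.

```r
library(Matrix)

# Toy document-term matrix (hypothetical data): 2 documents x 3 terms.
counts = Matrix(c(2, 0, 2,
                  0, 3, 1), nrow = 2, byrow = TRUE, sparse = TRUE)

# "l1" scaling: divide each count by the total number of words in its document.
# Left-multiplying by a diagonal matrix rescales the rows and keeps sparsity.
tf_l1 = Diagonal(x = 1 / rowSums(counts)) %*% counts

# Binary representation: 1 where a term occurs in the document, 0 elsewhere.
binary = (counts > 0) * 1

print(tf_l1)
print(binary)
```

After "l1" scaling every non-empty row sums to 1, which makes documents of different lengths comparable.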

Usage

transform_tf(dtm, sublinear_tf = FALSE, norm = c("l1", "l2", "none"))

transform_tfidf(dtm, idf = NULL, sublinear_tf = FALSE, norm = c("l1", "l2"))

transform_binary(dtm)

Arguments

dtm

a document-term matrix of class dgCMatrix or dgTMatrix.

sublinear_tf

logical, FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).

norm

character. Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.

idf

ddiMatrix, a diagonal matrix for IDF scaling. See get_idf. If not provided, the IDF scaling matrix will be calculated from the matrix passed to dtm.
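
To make the sublinear_tf and idf arguments more concrete, here is a rough sketch in plain Matrix code. It assumes that sublinear scaling touches only the non-zero counts and that IDF scaling amounts to multiplying each term column by its weight; the weights in `idf_weights` are invented for illustration and are not what get_idf would return.

```r
library(Matrix)

# Toy document-term matrix (hypothetical data): 3 documents x 3 terms.
counts = Matrix(c(1, 0, 4,
                  0, 2, 1,
                  3, 0, 1), nrow = 3, byrow = TRUE, sparse = TRUE)

# Sublinear scaling: replace every non-zero count with 1 + log(count).
# Only the stored non-zero values (slot x) change, so zeros stay zero.
sublinear = counts
sublinear@x = 1 + log(sublinear@x)

# Illustrative per-term weights (made up, NOT computed by get_idf).
idf_weights = c(0.5, 1.0, 0.1)

# A diagonal matrix built this way has class ddiMatrix.
idf = Diagonal(x = idf_weights)

# Right-multiplying by the diagonal matrix rescales each term column.
weighted = counts %*% idf

print(sublinear)
print(weighted)
```

Passing a precomputed diagonal matrix is useful when the same term weights need to be applied to a matrix other than the one they were derived from.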

Functions

  • transform_tfidf: Scale a document-term matrix via TF-IDF

  • transform_binary: Transform a document-term matrix into binary representation

See Also

get_idf, get_tf

Examples

# NOT RUN {
library(text2vec)
data(movie_review)

txt = movie_review[["review"]][1:1000]
it = itoken(txt, tolower, word_tokenizer)
vocab = vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(vocab,
 term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001,
 max_number_of_terms = 20000)

it = itoken(txt, tolower, word_tokenizer)
dtm = create_dtm(it, pruned_vocab)

dtm_filtered = dtm %>%
 # functionality overlaps with prune_vocabulary(),
 # but still can be useful in some cases
 # filter out very common and very uncommon terms
 transform_filter_commons( c(0.001, 0.975) )

# simple term-frequency transformation
transformed_tf = dtm %>%
 transform_tf

# tf-idf transformation
idf = get_idf(dtm)
transformed_tfidf = transform_tfidf(dtm, idf)
# }