text2vec (version 0.3.0)

transform_tf: Scale a document-term matrix

Description

This set of functions scales a document-term matrix.

transform_tf: scale a DTM by one of two methods. If norm = "l1", then dtm_tf = (count of a particular word in the document) / (total number of words in the document). If norm = "l2", then dtm_tf = (count of a particular word in the document) ^ 2 / (total number of words in the document) ^ 2.
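
As a rough illustration of the "l1" case, the base-R sketch below applies the same arithmetic to a small dense matrix. The toy data and the names counts and tf_l1 are made up for this example; this is not how transform_tf is implemented, it only mirrors the formula stated above.

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns).
# A plain dense matrix is used for illustration only; transform_tf
# itself expects a sparse dgCMatrix or dgTMatrix.
counts <- matrix(c(2, 1, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("good", "bad", "film")))

# "l1" scaling: divide each count by the total number of words
# in its document (the row sum).
tf_l1 <- counts / rowSums(counts)

print(tf_l1)
```

After this scaling, every row of tf_l1 sums to 1, so documents of different lengths become comparable.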

transform_binary: scale a DTM so that a cell is 1 if the word appears in the document and 0 otherwise.
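
The binary representation can be sketched in base R as follows. The matrix and the names counts and binary are hypothetical, and the snippet only demonstrates the rule described above on a dense matrix rather than reproducing transform_binary itself.

```r
# Toy counts: 2 documents (rows) x 3 terms (columns).
counts <- matrix(c(2, 0, 1,
                   0, 3, 0),
                 nrow = 2, byrow = TRUE)

# A cell becomes 1 when the term occurs at least once, 0 otherwise.
# Multiplying the logical matrix by 1 converts TRUE/FALSE to 1/0.
binary <- (counts > 0) * 1

print(binary)
```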

transform_tfidf: scale a DTM so that dtm_idf = log(count of a particular word in a document) / (number of documents where the term appears + 1)

Usage

transform_tf(dtm, sublinear_tf = FALSE, norm = c("l1", "l2"))
transform_tfidf(dtm, idf = NULL, sublinear_tf = FALSE, norm = c("l1", "l2"))
transform_binary(dtm)

Arguments

dtm
a document-term matrix of class dgCMatrix or dgTMatrix.
sublinear_tf
logical, FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).
norm
character. Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.
idf
ddiMatrix, a diagonal matrix for IDF scaling. See get_idf. If not provided, the IDF scaling matrix is calculated from the matrix passed to dtm.
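
To give a feel for what sublinear_tf does, the sketch below applies 1 + log(TF) to a few non-zero counts in base R. The vector names are made up for illustration, and the example deliberately uses only non-zero counts; how the package treats zero cells is not covered here.

```r
# Hypothetical raw term counts for four terms in one document.
tf <- c(1, 2, 10, 100)

# Sublinear scaling: replace each term frequency with 1 + log(TF).
# A count of 1 stays at 1, while large counts are strongly dampened.
tf_sublinear <- 1 + log(tf)

print(round(tf_sublinear, 3))
```

The effect is that a term occurring 100 times no longer weighs 100 times as much as a term occurring once.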

Functions

  • transform_tfidf: Scale a document-term matrix via TF-IDF
  • transform_binary: Transform a document-term matrix into binary representation

See Also

get_idf, get_tf

Examples

## Not run: 
# data(movie_review)
# 
# txt <- movie_review[['review']][1:1000]
# it <- itoken(txt, tolower, word_tokenizer)
# vocab <- vocabulary(it)
# # remove very common and uncommon words
# pruned_vocab <- prune_vocabulary(vocab,
#  term_count_min = 10,
#  doc_proportion_max = 0.8, doc_proportion_min = 0.001,
#  max_number_of_terms = 20000)
# 
# it <- itoken(txt, tolower, word_tokenizer)
# dtm <- create_dtm(it, pruned_vocab)
# 
# dtm_filtered <- dtm %>%
#  # functionality overlaps with prune_vocabulary(),
#  # but still can be useful in some cases
#  # filter out very common and very uncommon terms
#  transform_filter_commons( c(0.001, 0.975) )
# 
# # simple term-frequency transformation
# transformed_tf <- dtm %>%
#  transform_tf
# 
# # tf-idf transformation
# idf <- get_idf(dtm)
# transformed_tfidf <- transform_tfidf(dtm, idf)
# ## End(Not run)
