transform_tf
From text2vec v0.3.0
by Dmitriy Selivanov
Scale a document-term matrix
This set of functions scales a document-term matrix.
transform_tf: scale a DTM by one of two methods. If norm = "l1", then dtm_tf = (count of a particular word in the document) / (total number of words in the document). If norm = "l2", then dtm_tf = (count of a particular word in the document) ^ 2 / (total number of words in the document) ^ 2.
transform_binary: scale a DTM so that a cell is 1 if a word appears in the document; otherwise it is 0.
transform_tfidf: scale a DTM so that dtm_idf = log(count of a particular word in a document) / (number of documents where the term appears + 1)
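The "l1" and binary scalings described above can be illustrated on a small dense matrix. This is a minimal sketch in base R, not the package's implementation: toy_dtm, tf_l1 and tf_binary are hypothetical names made up for the illustration, and a real DTM would be a sparse matrix.

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns).
# `toy_dtm` is a made-up example, not output of text2vec.
toy_dtm <- matrix(c(2, 1, 1,
                    0, 3, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# "l1" as described: each count divided by the total number of words
# in its document (the row sum). Recycling runs down the columns, so
# element [i, j] is divided by rowSums(toy_dtm)[i].
tf_l1 <- toy_dtm / rowSums(toy_dtm)

# Binary: 1 where a word appears in the document, 0 otherwise.
tf_binary <- (toy_dtm > 0) * 1

print(tf_l1)
print(tf_binary)
```

After the "l1" scaling every row sums to 1, which makes documents of different lengths comparable.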
Usage
transform_tf(dtm, sublinear_tf = FALSE, norm = c("l1", "l2"))
transform_tfidf(dtm, idf = NULL, sublinear_tf = FALSE, norm = c("l1", "l2"))
transform_binary(dtm)
Arguments
dtm
a document-term matrix of class dgCMatrix or dgTMatrix.
sublinear_tf
logical, FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).
norm
character. Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.
idf
ddiMatrix, a diagonal matrix for IDF scaling. See get_idf. If not provided, the IDF scaling matrix will be calculated from the matrix passed to dtm.
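The sublinear scaling named in the sublinear_tf argument can be sketched in base R. This assumes the replacement 1 + log(TF) is applied only to non-zero counts, which is the natural reading for a sparse matrix where zeros are not stored; that detail is an assumption, not something stated on this page. The vector tf and the other names are hypothetical.

```r
# Raw term counts for one document; 0 means the term is absent.
tf <- c(0, 1, 2, 10)

# Replace each non-zero count with 1 + log(count); leave zeros as zeros
# (assumed behaviour, see the note above).
nonzero <- tf > 0
tf_sublinear <- tf
tf_sublinear[nonzero] <- 1 + log(tf[nonzero])

print(tf_sublinear)
```

A count of 1 stays at 1, while larger counts grow only logarithmically, so a term repeated many times in a document does not dominate the scaled vector.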
Functions

transform_tfidf: Scale a document-term matrix via TF-IDF
transform_binary: Transform a document-term matrix into binary representation
See Also
Examples
## Not run:
# data(movie_review)
#
# txt <- movie_review[['review']][1:1000]
# it <- itoken(txt, tolower, word_tokenizer)
# vocab <- vocabulary(it)
# # remove very common and uncommon words
# pruned_vocab = prune_vocabulary(vocab,
#   term_count_min = 10,
#   doc_proportion_max = 0.8, doc_proportion_min = 0.001,
#   max_number_of_terms = 20000)
#
# it <- itoken(txt, tolower, word_tokenizer)
# dtm <- create_dtm(it, pruned_vocab)
#
# dtm_filtered <- dtm %>%
#   # functionality overlaps with prune_vocabulary(),
#   # but still can be useful in some cases
#   # filter out very common and very uncommon terms
#   transform_filter_commons( c(0.001, 0.975) )
#
# # simple term-frequency transformation
# transformed_tf <- dtm %>%
#   transform_tf
#
# # tf-idf transformation
# idf <- get_idf(dtm)
# transformed_tfidf <- transform_tfidf(dtm, idf)
# ## End(Not run)