text2vec (version 0.2.0)

tf_transformer: Scales Document-Term matrix

Description

tf_transformer scales each document vector by the number of terms in the corresponding document.

tf = (Number of times word appears in document) / (Number of words in document) or, in the case of the 'l2' norm, tf = (Number of times word appears in document) ^ 2 / (Number of words in document) ^ 2
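
As a rough illustration of the 'l1' formula above, the sketch below reproduces the scaling by hand on a toy matrix. It uses a dense base R matrix in place of a dgCMatrix, the counts and names are invented, and it does not call tf_transformer itself.

```r
# Toy document-term counts: 2 documents (rows) x 3 terms (columns).
# All values and names are hypothetical, for illustration only.
counts <- matrix(c(2, 1, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("good", "bad", "movie")))

# 'l1' tf: divide every row by the number of words in that document.
# rowSums(counts) has one entry per row, so it recycles down each column.
tf_l1 <- counts / rowSums(counts)

print(tf_l1)
print(rowSums(tf_l1))  # each document vector now sums to 1
```

Here doc1 has four words in total, so the count of 2 for "good" becomes a term frequency of 0.5.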

binary_transformer stores 1 if the document contains the term and 0 otherwise.

tfidf_transformer applies TF-IDF scaling, where

idf = log((Number of documents in the corpus) / (Number of documents where the term appears + 1))
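
To make the idf formula concrete, here is a minimal by-hand sketch on a toy matrix. It reads the formula as log(N / (df + 1)), which is an assumption about the intended grouping, and it uses base R only. The matrix and variable names are hypothetical, and this is not the package's implementation.

```r
# Toy counts: 3 documents x 3 terms (values invented for illustration).
counts <- matrix(c(2, 1, 1,
                   0, 3, 1,
                   0, 0, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("good", "bad", "movie")))

n_docs   <- nrow(counts)             # number of documents in the corpus
doc_freq <- colSums(counts > 0)      # documents in which each term appears
idf      <- log(n_docs / (doc_freq + 1))

# 'l1' term frequencies, then scale each column by its idf weight.
tf    <- counts / rowSums(counts)
tfidf <- sweep(tf, 2, idf, `*`)

print(idf)
print(tfidf)
```

Under this reading, a term that appears in every document ("movie" here) gets a slightly negative weight, log(3 / 4), because of the + 1 in the denominator.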

Usage

tf_transformer(dtm, sublinear_tf = FALSE, norm = c("l1", "l2"))

tfidf_transformer(dtm, idf = NULL, sublinear_tf = FALSE, norm = c("l1", "l2"))

binary_transformer(dtm)

Arguments

dtm
dgCMatrix - Document-Term matrix
sublinear_tf
logical, FALSE by default. Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
norm
character - Norm used to normalize term vectors. 'l1' by default, i.e. scale by number of words in document.
idf
ddiMatrix - Diagonal matrix for idf-scaling. See dtm_get_idf. If not provided (NULL), idf will be calculated from the current data.
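
The argument types above can be sketched with the Matrix package alone. The example below builds a small dgCMatrix, applies the 1 + log(tf) replacement, and constructs a ddiMatrix of idf weights by hand. It assumes the sublinear scaling touches only the stored non-zero counts (the page does not say how zeros are handled) and reads the idf formula as log(N / (df + 1)). The object names are hypothetical, and no text2vec function is called.

```r
library(Matrix)

# A 2 x 3 sparse document-term matrix (toy values); sparseMatrix()
# with numeric x returns a dgCMatrix.
dtm_toy <- sparseMatrix(i = c(1, 1, 2, 2),
                        j = c(1, 2, 2, 3),
                        x = c(1, 4, 2, 8),
                        dims = c(2, 3))

# Sublinear scaling: replace each stored count with 1 + log(count).
# Assumption: zeros stay zero, so only the @x slot is modified.
dtm_sublinear <- dtm_toy
dtm_sublinear@x <- 1 + log(dtm_sublinear@x)

# A hand-built diagonal idf matrix of the kind the idf argument expects.
doc_freq <- colSums(dtm_toy > 0)
idf_diag <- Diagonal(x = log(nrow(dtm_toy) / (doc_freq + 1)))

# Multiplying on the right scales each term (column) by its idf weight.
scaled <- dtm_sublinear %*% idf_diag

print(class(dtm_toy))
print(class(idf_diag))
print(scaled)
```

Keeping idf as a diagonal matrix means that weights computed on one collection can be reused on another document-term matrix with the same columns by a single matrix product.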

Functions

  • tfidf_transformer: Transform Document-Term matrix via TF-IDF scaling
  • binary_transformer: Transform Document-Term matrix into binary format

See Also

dtm_get_idf

Examples

library(text2vec)
data(movie_review)

txt <- movie_review[['review']][1:1000]
it <- itoken(txt, tolower, word_tokenizer)
vocab <- vocabulary(it)
# remove very common and very uncommon words
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)

it <- itoken(txt, tolower, word_tokenizer)
corpus <- create_vocab_corpus(it, pruned_vocab)
dtm <- get_dtm(corpus, type = 'dgCMatrix' )

dtm_filtered <- dtm %>%
 # filter out very common and very uncommon terms
 filter_commons_transformer( c(0.001, 0.975) )

# simple term-frequency transformation
transformed_tf <- dtm %>%
 tf_transformer()

# tf-idf transformation
idf <- dtm_get_idf(dtm)
transformed_tfidf <- dtm %>%
 tfidf_transformer(idf)