tf_transformer: Scales Document-Term matrix

Description

tf_transformer scales each document vector by the number of terms in the corresponding document.

tf = (Number of times word appears in document) / (Number of words in document) or, in the case of the 'l2' norm, tf = (Number of times word appears in document) ^ 2 / (Number of words in document) ^ 2
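As an illustration of the formulas above, here is a minimal base-R sketch on a toy count matrix. It does not call tf_transformer itself; the matrix `counts` and the names `tf_l1` and `tf_l2` are made up for the example, and a dense base matrix stands in for the dgCMatrix the function expects.

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns).
counts <- matrix(c(2, 1, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

# 'l1' scaling: divide each row by the number of words in that document.
# Recycling is column-major, so element [i, j] is divided by rowSums(counts)[i].
tf_l1 <- counts / rowSums(counts)

# 'l2' variant exactly as written in the formula above:
# squared counts over squared document length.
tf_l2 <- counts^2 / rowSums(counts)^2

print(tf_l1)
print(tf_l2)
```

With 'l1' scaling every row of `tf_l1` sums to 1, so documents of different lengths become directly comparable.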

binary_transformer stores 1 if the document contains the term and 0 otherwise.

tfidf_transformer scales term frequencies by inverse document frequency:

idf = log( (Number of documents in the corpus) / (Number of documents where the term appears + 1) )
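The idf formula and its use as a per-term scaling can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `counts`, `doc_freq`, `idf` and `tfidf` are names invented for the example, and a dense matrix is used in place of a sparse one.

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
counts <- matrix(c(2, 1, 0,
                   0, 3, 0,
                   1, 1, 0,
                   0, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4), c("cat", "dog", "fish")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)          # number of documents containing each term
idf      <- log(n_docs / (doc_freq + 1)) # formula from the text above

# 'l1' term frequencies, then scale every column by its idf weight.
tf    <- counts / rowSums(counts)
tfidf <- sweep(tf, 2, idf, `*`)

print(idf)
print(tfidf)
```

Scaling the columns with `sweep` gives the same result as the matrix product `tf %*% diag(idf)`, which is why the idf weights can be passed around as a diagonal matrix. Note that with the `+ 1` in the denominator a term that occurs in every document (here "dog") gets a negative weight.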

Usage

tf_transformer(dtm, sublinear_tf = FALSE, norm = c("l1", "l2"))
tfidf_transformer(dtm, idf = NULL, sublinear_tf = FALSE, norm = c("l1",
  "l2"))
binary_transformer(dtm)

Arguments

dtm

dgCMatrix - Document-Term matrix

sublinear_tf

logical, FALSE by default. Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

norm

character - Norm used to normalize term vectors. 'l1' by default, i.e. scale by the number of words in the document.

idf

ddiMatrix - Diagonal matrix for idf-scaling. See dtm_get_idf. If not provided (NULL), idf will be calculated from the current data.
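To make the sublinear_tf argument concrete, the following base-R sketch applies the 1 + log(tf) replacement to a toy count matrix. Two assumptions are made here that the text does not confirm: the transform is applied to raw counts, and only nonzero entries are transformed so that absent terms stay at 0 (log(0) would otherwise give -Inf). The names `counts`, `nonzero` and `sublinear` are hypothetical.

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns).
counts <- matrix(c(1, 4, 0,
                   0, 2, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

# Assumption: transform only the nonzero entries, leaving zeros untouched.
sublinear <- counts
nonzero <- counts > 0
sublinear[nonzero] <- 1 + log(counts[nonzero])

print(sublinear)
```

A count of 1 maps to 1, while larger counts grow only logarithmically, which dampens the influence of terms repeated many times in one document.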

Functions

tfidf_transformer: Transform Document-Term matrix via TF-IDF scaling
binary_transformer: Transform Document-Term matrix into binary format

Examples

library(text2vec)
library(magrittr) # provides the %>% pipe
data(movie_review)

txt <- movie_review[['review']][1:1000]
it <- itoken(txt, tolower, word_tokenizer)
vocab <- vocabulary(it)
# remove very common and uncommon words
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10,
 doc_proportion_max = 0.8, doc_proportion_min = 0.001, max_number_of_terms = 20000)

it <- itoken(txt, tolower, word_tokenizer)
corpus <- create_vocab_corpus(it, pruned_vocab)
dtm <- get_dtm(corpus, type = 'dgCMatrix' )

dtm_filtered <- dtm %>%
 # filter out very common and very uncommon terms
 filter_commons_transformer( c(0.001, 0.975) )

# simple term-frequency transformation
transformed_tf <- dtm %>%
 tf_transformer

# tf-idf transformation
idf <- dtm_get_idf(dtm)
transformed_tfidf <- dtm %>%
 tfidf_transformer(idf)
