
Creates a TfIdf (term frequency inverse document frequency) model.
The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))
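As a rough sketch, the smoothed IDF above can be computed by hand in base R on a toy document-term matrix (the matrix and variable names below are illustrative only and are not part of the text2vec API):
dtm_toy = matrix(c(1, 0, 2,
                   0, 1, 1,
                   3, 0, 0), nrow = 3, byrow = TRUE)  # 3 documents x 3 terms
n_docs = nrow(dtm_toy)
df = colSums(dtm_toy > 0)      # number of documents in which each term appears
idf = log(n_docs / (df + 1))   # smoothed IDF; the +1 prevents division by zero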
TfIdf is an R6Class object. For usage details see the Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates a tf-idf model.
$fit_transform(x)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the tf-idf weights learned from the training data.
tfidf
A TfIdf object.
x
An input term co-occurrence matrix, preferably in dgCMatrix format.
smooth_idf
TRUE by default. Smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. This prevents division by zero.
c("l1", "l2", "none")
Type of normalization to apply to term vectors.
"l1"
by default, i.e., scale by the number of words in the document.
sublinear_tf
FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).
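To make the norm and sublinear_tf options concrete, here is a hand-rolled sketch on a single vector of term counts (illustrative only; it does not call or reproduce text2vec internals):
tf_counts = c(the = 4, cat = 1, sat = 2)                     # raw term counts for one document
tf_sublinear = ifelse(tf_counts > 0, 1 + log(tf_counts), 0)  # sublinear_tf = TRUE
tf_l1 = tf_counts / sum(tf_counts)                           # norm = "l1": scale by document length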
Term Frequency Inverse Document Frequency
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
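As a follow-up sketch, the fitted weights can be reused on held-out reviews via $transform(); the names tokens_new and dtm_new below are illustrative, not part of the original example:
tokens_new = word_tokenizer(tolower(movie_review$review[101:110]))
dtm_new = create_dtm(itoken(tokens_new), hash_vectorizer())
dtm_new_tfidf = model_tfidf$transform(dtm_new)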