
Creates a TfIdf (term frequency inverse document frequency) model.
The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))
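As a rough sketch, the smoothed IDF above can be computed by hand in base R on a toy document-term matrix (the matrix and variable names below are illustrative only and are not part of the text2vec API):
dtm_toy = matrix(c(1, 0, 2,
                   0, 1, 1,
                   3, 0, 0), nrow = 3, byrow = TRUE)  # 3 documents x 3 terms
n_docs = nrow(dtm_toy)
df = colSums(dtm_toy > 0)      # number of documents in which each term appears
idf = log(n_docs / (df + 1))   # smoothed IDF; the +1 prevents division by zero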
TfIdf is an R6Class object. For usage details see the Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates a tf-idf model.
$fit_transform(x)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the tf-idf weights learned from the training data.
tfidf
A TfIdf object.
x
An input term co-occurrence matrix, preferably in dgCMatrix format.
smooth_idf
TRUE by default. Smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. This prevents division by zero.
c("l1", "l2", "none")
Type of normalization to apply to term vectors.
"l1"
by default, i.e., scale by the number of words in the document.
sublinear_tf
FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).
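To make the norm and sublinear_tf options concrete, here is a hand-rolled sketch on a single vector of term counts (illustrative only; it does not call or reproduce text2vec internals):
tf_counts = c(the = 4, cat = 1, sat = 2)                     # raw term counts for one document
tf_sublinear = ifelse(tf_counts > 0, 1 + log(tf_counts), 0)  # sublinear_tf = TRUE
tf_l1 = tf_counts / sum(tf_counts)                           # norm = "l1": scale by document length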
Term Frequency Inverse Document Frequency
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
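As a follow-up sketch, the fitted weights can be reused on held-out reviews via $transform(); the names tokens_new and dtm_new below are illustrative, not part of the original example:
tokens_new = word_tokenizer(tolower(movie_review$review[101:110]))
dtm_new = create_dtm(itoken(tokens_new), hash_vectorizer())
dtm_new_tfidf = model_tfidf$transform(dtm_new)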