text2vec (version 0.6)

TfIdf: Term Frequency Inverse Document Frequency (tf-idf) model

Description

Creates a Term Frequency Inverse Document Frequency (tf-idf) model.

"smooth" IDF (the default) is defined as:
idf = log(1 + (# documents in the corpus) / (# documents where the term appears))

"non-smooth" IDF is defined as:
idf = log((# documents in the corpus) / (# documents where the term appears))
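
As a check on the two definitions, the IDF variants can be computed by hand (a minimal sketch; the corpus counts here are invented for illustration):

```r
# Suppose the corpus has 3 documents and the term appears in 2 of them
n_docs <- 3
df <- 2

idf_smooth <- log(1 + n_docs / df)  # "smooth" IDF: log(1 + 3/2)
idf_plain  <- log(n_docs / df)      # "non-smooth" IDF: log(3/2)
```

The smooth variant is always strictly positive, even for a term that appears in every document, which is the usual reason it is the default.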

Usage

TfIdf

Format

R6Class object.

Usage

For usage details see Methods, Arguments and Examples sections.

tfidf = TfIdf$new(smooth_idf = TRUE, norm = c('l1', 'l2', 'none'), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)

Methods

$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)

Creates a new tf-idf model.

$fit_transform(x)

Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.

$transform(x)

Transforms new data x using the IDF weights learned from the training data.

Arguments

tfidf

A TfIdf object

x

An input document-term matrix, preferably in "dgCMatrix" format.

smooth_idf

TRUE by default. Smooth IDF weights by adding one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen.

norm

c("l1", "l2", "none") Type of normalization to apply to term vectors. "l1" by default, i.e., each document's term counts are scaled by the total number of words in that document.

sublinear_tf

FALSE by default. If TRUE, apply sublinear term-frequency scaling, i.e., replace each term frequency TF with 1 + log(TF).
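
The norm and sublinear_tf options can be illustrated on a single row of raw term counts (a sketch; the counts are invented for illustration):

```r
# Raw term frequencies for one hypothetical document
counts <- c(the = 5, cat = 2, sat = 1)

tf_l1  <- counts / sum(counts)  # norm = "l1": scale by the document's word count
tf_sub <- 1 + log(counts)       # sublinear_tf = TRUE: replace TF with 1 + log(TF)
```

With "l1" normalization the entries of each document row sum to one before the IDF factor is applied, so long documents do not dominate purely by length.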

Details

Term Frequency Inverse Document Frequency

Examples

library(text2vec)
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
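
The fitted model can then weight held-out documents with the same learned IDF values (a sketch continuing the example above; the document indices 101:110 are illustrative):

```r
# Tokenize and vectorize new documents exactly as for the training data,
# then reuse the fitted model so IDF comes from the training corpus only
tokens_new = word_tokenizer(tolower(movie_review$review[101:110]))
dtm_new = create_dtm(itoken(tokens_new), hash_vectorizer())
dtm_new_tfidf = model_tfidf$transform(dtm_new)
```

Using $transform (rather than $fit_transform) on new data avoids leaking document frequencies from the test set into the weighting.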
