quanteda (version 1.3.4)

dfm_tfidf: Weight a dfm by tf-idf

Description

Weight a dfm by term frequency-inverse document frequency (tf-idf), with control over the term frequency scheme, the document frequency scheme, and the logarithm base. Uses fully sparse methods for efficiency.

Usage

dfm_tfidf(x, scheme_tf = "count", scheme_df = "inverse", base = 10, ...)

Arguments

x

object for which idf or tf-idf will be computed (a document-feature matrix)

scheme_tf

scheme for dfm_weight; defaults to "count"

scheme_df

scheme for docfreq; defaults to "inverse". Other options to docfreq can be passed through the ellipsis (...).

base

the base for the logarithms in the tf and docfreq calls; default is 10

...

additional arguments passed to docfreq.

Details

dfm_tfidf computes term frequency-inverse document frequency weighting. By default the term frequency is the raw feature count rather than the normalized term frequency (the relative frequency of the feature within its document); set scheme_tf = "prop" to use the normalized form.
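As a rough guide to what the weighting computes, the formula below sketches the default case. It assumes scheme_df = "inverse" with all other docfreq options left at their defaults; the symbols are introduced here for illustration and are not quanteda argument names.

```latex
\[
\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \times \log_{b}\!\left(\frac{N}{\mathrm{df}_{j}}\right)
\]
% tf_{ij}: weight of feature j in document i; the raw count when
%          scheme_tf = "count", or the count divided by the total
%          feature count of document i when scheme_tf = "prop"
% N:       number of documents in the dfm
% df_{j}:  number of documents in which feature j occurs
% b:       the logarithm base (the `base` argument, 10 by default)
```

Under this formula, a feature that occurs in every document has df equal to N, so its weight is zero regardless of how often it occurs.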

References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

See Also

dfm_weight, docfreq

Examples

# NOT RUN {
mydfm <- as.dfm(data_dfm_lbgexample)
head(mydfm[, 5:10])
head(dfm_tfidf(mydfm)[, 5:10])
docfreq(mydfm)[5:15]
head(dfm_weight(mydfm)[, 5:10])

# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
wiki_dfm <- 
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample", 
                                        "another", "example"))) %>%
    as.dfm()
wiki_dfm
docfreq(wiki_dfm)
dfm_tfidf(wiki_dfm, scheme_tf = "prop") %>% round(digits = 2)

# }
# NOT RUN {
# comparison with tm
if (requireNamespace("tm")) {
    # requireNamespace() loads tm without attaching it, so qualify the call
    convert(wiki_dfm, to = "tm") %>% tm::weightTfIdf() %>% as.matrix()
    # same as:
    dfm_tfidf(wiki_dfm, base = 2, scheme_tf = "prop")
}
# }
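
The Wikipedia example can also be checked by hand in base R, which makes the roles of scheme_tf and base explicit. This is a minimal sketch, assuming the default inverse document frequency log(N / df) with no smoothing; tfidf_by_hand is a hypothetical helper written for illustration and is not part of quanteda.

```r
# Counts from the Wikipedia worked example (rows = documents, columns = features).
counts <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 0, 0, 2, 3),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(docs = c("document1", "document2"),
                                 features = c("this", "is", "a", "sample",
                                              "another", "example")))

# Hypothetical helper (not a quanteda function): tf-idf on a dense count matrix.
# Assumes every feature occurs in at least one document.
tfidf_by_hand <- function(m, scheme_tf = c("count", "prop"), base = 10) {
    scheme_tf <- match.arg(scheme_tf)
    # document frequency: number of documents in which each feature occurs
    doc_freq <- colSums(m > 0)
    # inverse document frequency: log of (number of documents / document frequency)
    idf <- log(nrow(m) / doc_freq, base = base)
    # term frequency: raw counts, or counts divided by each document's total
    tf <- if (scheme_tf == "prop") m / rowSums(m) else m
    # multiply each feature column by that feature's idf
    sweep(tf, 2, idf, `*`)
}

round(tfidf_by_hand(counts, scheme_tf = "prop"), 2)
```

For document2, the weight of "example" is (3/7) × log10(2), roughly 0.13, which agrees with the figure in the Wikipedia example. Features present in both documents ("this", "is") receive a weight of zero.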