weight: Weight the feature frequencies in a dfm by various methods

Description

Returns a document-feature matrix with the feature frequencies weighted according to one of several common methods.

Usage

weight(x, ...)
## S3 method for class 'dfm':
weight(x, type = c("frequency", "relFreq", "relMaxFreq",
  "logFreq", "tfidf"), smooth = 0, normalize = TRUE, verbose = TRUE, ...)
tf(x)
tfidf(x)
smoother(x, smooth)
weighting(object)
## S3 method for class 'dfm':
weighting(object)

Arguments

x

document-feature matrix created by dfm

...

not currently used

type

The weighting function to apply to the dfm. One of:

normTf - Length normalization: dividing the frequency of the feature by the length of the document
logTf - The natural log of the term frequency
tf-idf - Term frequency * inverse document frequency
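
The frequency-based weightings can be sketched in base R on a small count matrix. This is only an illustration under stated assumptions: the toy matrix `counts` is hypothetical, it is a plain matrix rather than a dfm, and the package's own handling of details such as zero counts under the log weighting may differ.

```r
# Toy count matrix: rows are documents, columns are features.
# (Hypothetical data for illustration; not a quanteda dfm object.)
counts <- matrix(c(2, 1, 1,
                   4, 0, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# Length normalization: each count divided by its document's total count.
# Dividing a matrix by a vector of length nrow() recycles down the columns,
# so every entry in row i is divided by rowSums(counts)[i].
rel_freq <- counts / rowSums(counts)

# Relative to the most frequent feature in each document.
rel_max_freq <- counts / apply(counts, 1, max)

# Natural log of the counts. Zero counts become -Inf in this naive version
# (assumption: the package may treat zeros differently).
log_freq <- log(counts)

print(rel_freq)
print(rel_max_freq)
print(log_freq)
```

After length normalization every row sums to 1, which makes documents of different lengths comparable.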

smooth

amount of additive smoothing to apply to the document-feature matrix prior to weighting; the default in the usage above is 0, meaning no smoothing.
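
As a rough illustration of additive smoothing, the following base R sketch adds a constant to every cell of a toy count matrix before computing relative frequencies. The matrix and the value `0.5` are hypothetical, and this is not the package's internal implementation.

```r
# Toy count matrix with several zero cells (hypothetical data).
counts <- matrix(c(2, 0, 1,
                   0, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# Illustrative smoothing amount; 0 would leave the counts unchanged.
smooth <- 0.5

# Additive smoothing: add the same constant to every cell.
smoothed <- counts + smooth

# Relative frequencies computed on the smoothed counts contain no zeros.
rel_freq <- smoothed / rowSums(smoothed)

print(smoothed)
print(rel_freq)
```

Smoothing before weighting avoids zero entries, which matters for weightings that take logarithms or form ratios.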

normalize

if TRUE (default) then normalize the dfm by relative term frequency prior to computing tfidf
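
The effect of normalize can be sketched with a small base R function. Everything here is an assumption made for illustration: the helper `tfidf_sketch` is hypothetical, the toy matrix is invented, and the inverse document frequency is taken as the base-10 log of N/df following Manning et al., which may not match the package's exact formula.

```r
# Toy count matrix: three documents, three features (hypothetical data).
counts <- matrix(c(2, 1, 1,
                   4, 0, 4,
                   1, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c")))

# Hypothetical helper, not part of the package.
tfidf_sketch <- function(m, normalize = TRUE) {
  # Term-frequency part: relative frequencies if normalize is TRUE,
  # otherwise the values of m as given.
  tf <- if (normalize) m / rowSums(m) else m
  # Document frequency: number of documents in which each feature occurs.
  df <- colSums(m > 0)
  # Inverse document frequency (assumption: log10 of N / df).
  idf <- log10(nrow(m) / df)
  # Multiply every column by the idf of its feature.
  sweep(tf, 2, idf, `*`)
}

w_norm <- tfidf_sketch(counts)                    # normalized term frequencies
w_raw  <- tfidf_sketch(counts, normalize = FALSE) # raw counts

print(w_norm)
print(w_raw)
```

Features that occur in every document (here `a` and `c`) receive an idf of zero, so only `b` keeps a non-zero weight. Passing normalize = FALSE is what allows an already-weighted matrix, such as a log-weighted one, to be multiplied by idf without being rescaled first.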

verbose

if TRUE output status messages

object

the dfm object for accessing the weighting setting

Value

The dfm with weighted values
weighting returns a character object describing the type of weighting applied to the dfm.

Details

tf is a shortcut for weight(x, "relFreq")

tfidf is a shortcut for weight(x, "tfidf")

smoother is a shortcut for weight(x, "frequency", smooth)

weighting queries (but cannot set) the weighting applied to the dfm.

References

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.

Examples

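# build a dfm from the inaugural addresses delivered after 1980
dtm <- dfm(subset(inaugCorpus, Year>1980), verbose=FALSE)
# by-hand computation of frequency relative to each document's maximum;
# note that apply() over rows returns the result with features in rows
x <- apply(dtm, 1, function(tf) tf/max(tf))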
topfeatures(dtm)
normDtm <- weight(dtm)
topfeatures(normDtm)
maxTfDtm <- weight(dtm, type="relMaxFreq")
topfeatures(maxTfDtm)
logTfDtm <- weight(dtm, type="logFreq")
topfeatures(logTfDtm)
tfidfDtm <- weight(dtm, type="tfidf")
topfeatures(tfidfDtm)

# combine these methods for more complex weightings, e.g. as in Section 6.4 of
# Introduction to Information Retrieval
logTfDtm <- weight(dtm, type="logFreq")
wfidfDtm <- weight(logTfDtm, type="tfidf", normalize=FALSE)
testdfm <- dfm(inaugTexts[1:5], verbose=FALSE)
print(testdfm[, 1:5])
for (w in c("frequency", "relFreq", "relMaxFreq", "logFreq", "tfidf")) {
    testw <- weight(testdfm, w)
    cat("\nweight test for:", w, "; class:", class(testw), "\n")
    print(testw[, 1:5])
}