frequentTerms: List most frequent terms of a corpus

Description

List terms with the highest number of occurrences in the document-term matrix of a corpus, possibly grouped by the levels of a variable.

Usage

frequentTerms(dtm, variable = NULL, n = 25)

Arguments

dtm

a document-term matrix.

variable

a vector whose length is the number of rows of dtm, or NULL to report most frequent terms by document; use NA to report most frequent terms in the whole corpus.

n

the number of terms to report for each level.
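
As an illustration of what grouping by variable amounts to, here is a minimal base R sketch. It does not call frequentTerms and is not taken from the package's source; the count matrix, the object names (doc_counts, level_counts, top_A, top_global) and the use of rowsum() are assumptions made for the example.

```r
# Hypothetical document-term counts: 3 documents (rows), 3 terms (columns).
doc_counts <- matrix(c(5, 1, 4,
                       3, 1, 6,
                       2, 8, 10),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2", "doc3"),
                                     c("oil", "gold", "market")))

# One value per document (row), as the variable argument expects.
variable <- c("A", "A", "B")

# Sum the rows of the documents sharing a level: one row of counts per level.
level_counts <- rowsum(doc_counts, group = variable)

# The n most frequent terms in level "A".
n <- 2
top_A <- head(sort(level_counts["A", ], decreasing = TRUE), n)

# Treating the whole corpus as a single group (the variable = NA case).
top_global <- head(sort(colSums(doc_counts), decreasing = TRUE), n)
```

With variable = NULL, each document would play the role of its own level, so no aggregation step would be needed.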

Value

If variable = NA, one matrix with columns Global and Global % (see below). Otherwise, a list of matrices, one for each level of the variable, with seven columns:

% Term/Level

the term's occurrences as a percentage of all term occurrences in the level.

% Level/Term

the percentage of the term's occurrences that appear in the level (rather than in other levels).

Global %

the term's occurrences as a percentage of all term occurrences in the corpus.

Level

the number of occurrences of the term in the level ("internal").

Global

the number of occurrences of the term in the corpus.

t value

the quantile of a normal distribution corresponding to the probability Prob.

Prob.

the probability of observing such an extreme (high or low) number of occurrences of the term in the level, under a hypergeometric distribution.

Details

The probability is that of observing such extreme frequencies of the considered term in the level, under a hypergeometric distribution based on the term's global frequency in the corpus and on the number of occurrences of all terms in the document or variable level considered. Whether the association is positive or negative can be read from the sign of the t value, or by comparing the value of the % Term/Level column with that of the Global % column.
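
The computation described above can be sketched in base R with phyper() and qnorm(). This is an illustrative reconstruction under stated assumptions, not the package's actual code: the toy counts, the variable names, and the exact tail convention (upper tail when the term is over-represented, lower tail otherwise) are assumptions.

```r
# Hypothetical counts per level (rows) and term (columns).
counts <- matrix(c(8, 2, 10,
                   2, 8, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("oil", "gold", "market")))

level <- "A"
term <- "oil"

k <- counts[level, term]         # occurrences of the term in the level ("Level")
K <- sum(counts[, term])         # occurrences of the term in the corpus ("Global")
n_level <- sum(counts[level, ])  # occurrences of all terms in the level
N <- sum(counts)                 # occurrences of all terms in the corpus

pct_term_level <- 100 * k / n_level  # "% Term/Level"
pct_level_term <- 100 * k / K        # "% Level/Term"
pct_global <- 100 * K / N            # "Global %"

# Assumed convention: test the upper tail if the term is over-represented
# in the level, the lower tail otherwise.
over <- pct_term_level > pct_global
prob <- if (over) {
  phyper(k - 1, K, N - K, n_level, lower.tail = FALSE)  # P(X >= k)
} else {
  phyper(k, K, N - K, n_level)                          # P(X <= k)
}

# Normal quantile matching the probability, signed by the direction
# of the association.
t_value <- if (over) qnorm(prob, lower.tail = FALSE) else qnorm(prob)
```

Here the term accounts for 40% of the occurrences in level A against 25% in the whole corpus, so the association is positive and the sketch yields a positive t_value.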
