characteristic_docs: characteristic_docs

Description

Print the documents that are most characteristic of each level of a variable, i.e. those with the smallest Chi-squared distance to the average vocabulary of the documents belonging to that level.

Usage

characteristic_docs(corpus, dtm, variable, ndocs = 10, nterms = 25, p = 0.1)

Arguments

corpus

A Corpus object.

dtm

A DocumentTermMatrix object corresponding to corpus.

variable

A vector of values giving the groups (levels) for which the most characteristic documents should be reported.

ndocs

The number of (most characteristic) documents to print.

nterms

The number of terms to highlight in documents.

p

The maximum p-value up to which specific terms should be highlighted.

Value

A list with one Corpus object for each level (invisibly).

Details

Occurrences of the nterms most specific terms for each level are highlighted. If stemming or other transformations have been applied to original words using combine_terms, all original words which have been transformed to the specified terms are highlighted.
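
The ranking criterion named in the Description can be illustrated in base R. This is only a minimal sketch under stated assumptions, not the package's actual implementation: it assumes the usual correspondence-analysis form of the Chi-squared distance (squared differences between term proportions, weighted by the inverse of each term's overall share). The toy matrix `m`, the grouping factor and the helper `chisq_to_level()` are hypothetical names introduced for illustration.

```r
# Toy document-term counts: 5 documents x 3 terms (hypothetical data).
m <- matrix(c(6, 2, 0,
              2, 2, 0,
              4, 2, 0,
              0, 1, 4,
              1, 1, 3),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5), c("oil", "price", "bank")))

# Hypothetical grouping variable, one value per document.
variable <- factor(c("a", "a", "a", "b", "b"))

# Chi-squared distance of each document of `level` to the average
# vocabulary of that level (assumed formula, see the text above).
chisq_to_level <- function(m, variable, level) {
  term_mass <- colSums(m) / sum(m)         # share of each term in the whole corpus
  sub <- m[variable == level, , drop = FALSE]
  profiles <- sub / rowSums(sub)           # term proportions within each document
  average <- colSums(sub) / sum(sub)       # average vocabulary of the level
  dev <- sweep(profiles, 2, average, "-")  # deviation from the level average
  rowSums(sweep(dev^2, 2, term_mass, "/")) # weighted sum of squared deviations
}

dist_a <- chisq_to_level(m, variable, "a")

# Keep the ndocs documents closest to the level average.
ndocs <- 2
head(names(sort(dist_a)), ndocs)
```

In this toy data, doc3 ranks first for level "a" because its term proportions (4/6 and 2/6) coincide with the level average (12/18 and 6/18).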

Examples


# NOT RUN {
file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
dtm <- build_dtm(corpus)
characteristic_docs(corpus, dtm, meta(corpus)$Date)

# Also works when terms have been combined
dict <- dictionary(dtm)
dtm2 <- combine_terms(dtm, dict)
characteristic_docs(corpus, dtm2, meta(corpus)$Date)

# }
