runCorpusCa: Correspondence analysis from a tm corpus

Description

Compute a simple correspondence analysis on the document-term matrix of a tm corpus.

Usage

runCorpusCa(corpus, dtm = NULL, variables = NULL, sparsity = 0.9, ...)

Arguments

corpus

A tm corpus.

dtm

An optional document-term matrix to use; if NULL (the default), DocumentTermMatrix is called on corpus to create it.

variables

An optional character vector giving the names of the meta-data variables by which to aggregate the document-term matrix (see Details below).

sparsity

Optional sparsity threshold (between 0 and 1): terms whose sparsity is above this threshold are dropped. See removeSparseTerms from tm.

...

Additional parameters passed to ca.

Value

A ca object as returned by the ca function.

Details

runCorpusCa runs a correspondence analysis (CA) on the document-term matrix of a tm corpus. The matrix is obtained by calling DocumentTermMatrix on corpus, or taken directly from the dtm argument when it is supplied.

If no variable is passed via the variables argument, the CA is run on the full document-term matrix (possibly after dropping sparse terms, see below). If one or more variables are chosen, the CA is based on a stacked table whose rows correspond to the levels of these variables: each cell contains the total number of occurrences of a given term across all documents belonging to that level. A document with a missing value (NA) for a variable is skipped for that variable, but is still taken into account for the other variables, if any.
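The stacked table described above can be illustrated with a minimal base R sketch. It is not the code used by runCorpusCa: the toy counts, the meta-data variables author and section, the helper aggregate_by, and the row labels are all hypothetical and serve only to show the aggregation and the per-variable handling of NA.

```r
# Toy document-term matrix: 4 documents, 3 terms (made-up counts).
dtm <- matrix(c(2, 0, 1,
                1, 1, 0,
                0, 3, 1,
                1, 0, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("tax", "vote", "school")))

# Hypothetical meta-data variables; doc3 has no value for 'section'.
meta <- data.frame(author  = c("A", "B", "A", "B"),
                   section = c("news", "news", NA, "opinion"),
                   stringsAsFactors = FALSE)

# For one variable, sum term counts over the documents of each level,
# skipping documents whose value is NA for that variable only.
aggregate_by <- function(dtm, values, name) {
  keep <- !is.na(values)
  tab <- rowsum(dtm[keep, , drop = FALSE], group = values[keep])
  rownames(tab) <- paste(name, rownames(tab))  # illustrative labels
  tab
}

# Stack the per-variable tables on top of each other.
stacked <- do.call(rbind, lapply(names(meta), function(v) {
  aggregate_by(dtm, meta[[v]], v)
}))
print(stacked)
```

Here doc3 contributes to the row for author A but to no section row, since its section is NA.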

In all cases, variables that have not been selected are added as supplementary rows. If at least one variable is passed, documents are also supplementary rows, while they are active otherwise.

The sparsity argument is passed to removeSparseTerms to drop sparse terms, i.e. terms that occur in only a small share of documents, from the document-term matrix. This is especially useful for large corpora, whose matrices can grow very large and cause ca to use too much memory.
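The effect of the threshold can be sketched in base R, without tm. This assumes that a term's sparsity is the share of documents in which it does not occur and that terms are kept only when this share is strictly below the threshold, which is how removeSparseTerms is understood to behave; the toy matrix and variable names are hypothetical.

```r
# Toy document-term matrix: 4 documents, 3 terms (made-up counts).
dtm <- matrix(c(1, 0, 0,
                2, 0, 1,
                1, 0, 0,
                3, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("common", "rare1", "rare2")))

# Sparsity of each term: share of documents where it does not occur.
term_sparsity <- colMeans(dtm == 0)

# Keep only the terms whose sparsity is below the threshold.
filter_terms <- function(dtm, sparsity) {
  dtm[, colMeans(dtm == 0) < sparsity, drop = FALSE]
}

strict  <- filter_terms(dtm, 0.7)  # drops terms absent from 75% of documents
lenient <- filter_terms(dtm, 0.9)  # the default threshold keeps all three here
print(colnames(strict))
print(colnames(lenient))
```

A lower threshold is therefore stricter: it removes more terms and yields a smaller matrix.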
