create_ifl: Create an Indexed Frequency List

Description

The keyperm package stores frequency lists in a special data structure called an indexed frequency list. Currently, it can only be created from a term-document matrix (tdm) object as implemented in the tm package.

Indexed frequency lists are essentially frequency lists stored in a three-column format, similar to the simple triplet matrix that tm uses internally to store term-document matrices. The first column stores the number of document i, the second the number of term j, and the third the frequency with which term j occurs in document i. Zero occurrences are omitted. All columns contain integers, and the frequency list is sorted by document.
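
The three-column layout can be illustrated with base R alone. This is a minimal sketch and not keyperm's own code: the toy matrix and the column names doc, term and freq are assumptions made for illustration, and the secondary sort by term is a choice of this sketch.

```r
# Toy term-document counts: rows are terms, columns are documents.
counts <- matrix(
  c(2L, 0L, 1L,
    0L, 3L, 0L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("apple", "pear"), c("doc1", "doc2", "doc3"))
)

# Positions of the non-zero cells; zero counts are dropped.
nz <- which(counts > 0L, arr.ind = TRUE)

# Three integer columns: document number, term number, frequency.
# The column names are illustrative, not keyperm's.
triplets <- data.frame(
  doc  = as.integer(nz[, "col"]),
  term = as.integer(nz[, "row"]),
  freq = counts[nz]
)

# Sort by document (and by term within a document).
triplets <- triplets[order(triplets$doc, triplets$term), ]
rownames(triplets) <- NULL
print(triplets)
```

Only three of the six cells are stored here, which is the point of the format for sparse term-document data.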

The returned object is of class indexed_frequency_list. In addition to the actual frequency list, it contains an index for fast access as well as the pre-computed total number of tokens per document and total occurrences per term.
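
How an index and pre-computed totals might be derived from a document-sorted triplet list is sketched below in base R. The index layout used by keyperm is not documented here, so the first-row and last-row scheme and all variable names are assumptions for illustration only.

```r
# Triplet columns, already sorted by document (illustrative data).
doc  <- c(1L, 1L, 2L, 3L, 3L)
term <- c(1L, 2L, 2L, 1L, 3L)
freq <- c(2L, 1L, 3L, 1L, 4L)
n_docs  <- 3L
n_terms <- 3L

# Pre-computed totals: tokens per document and occurrences per term.
tokens_per_doc <- vapply(seq_len(n_docs),
                         function(d) sum(freq[doc == d]), integer(1))
occ_per_term   <- vapply(seq_len(n_terms),
                         function(t) sum(freq[term == t]), integer(1))

# Hypothetical index: first and last row of each document's block.
rows_per_doc <- tabulate(doc, nbins = n_docs)
last_row  <- cumsum(rows_per_doc)
first_row <- last_row - rows_per_doc + 1L

# Look up document 3 directly instead of scanning the whole list.
rows_doc3 <- seq_len(rows_per_doc[3]) + first_row[3] - 1L
print(data.frame(term = term[rows_doc3], freq = freq[rows_doc3]))
```

Because the list is sorted by document, each document occupies one contiguous block of rows, so two integers per document are enough to locate it.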

Usage

create_ifl(
  tdm,
  subset_terms = 1:dim(tdm)[1],
  subset_docs = 1:dim(tdm)[2],
  corpus
)

Value

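A list with class indexed_frequency_list. As described above, it holds the frequency list itself, an index for fast access, and the pre-computed totals of tokens per document and occurrences per term.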

Arguments

tdm: a term-document matrix (tdm) from the tm package. Currently, this is the only supported input, but others may be added in later versions.
subset_terms: vector of terms to be considered. Can be integer (indices) or boolean. Terms that are not included are still counted toward the total number of tokens per document.
subset_docs: vector of documents to be considered. Can be integer (indices) or boolean. Excluded documents do not contribute to the total number of occurrences of a term.
corpus: vector indicating which documents belong to corpus A (first corpus). Can be integer (indices) or boolean. Currently, only comparisons of two corpora are supported.
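
A possible call is sketched below, assuming both tm and keyperm are installed. The toy texts and the variable names are illustrative. The sketch first shows that an integer and a boolean selector can describe the same set of documents, and then passes the boolean form as corpus while leaving subset_terms and subset_docs at their defaults.

```r
# Toy corpus of four short documents (illustrative data).
texts <- c("apple pear apple", "pear plum", "apple plum plum", "pear pear")

# Integer and boolean selectors describing the same documents.
corpus_a_idx <- c(1L, 3L)
corpus_a_lgl <- seq_along(texts) %in% corpus_a_idx
stopifnot(identical(which(corpus_a_lgl), corpus_a_idx))

# The call itself needs tm and keyperm, so it is guarded here.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("keyperm", quietly = TRUE)) {
  # Build a term-document matrix from the toy texts.
  tdm <- tm::TermDocumentMatrix(tm::VCorpus(tm::VectorSource(texts)))
  # Documents 1 and 3 form corpus A; the remaining documents form the other corpus.
  ifl <- keyperm::create_ifl(tdm, corpus = corpus_a_lgl)
  print(class(ifl))
}
```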