The keyperm package stores frequency lists in a special data structure called an indexed frequency list. Currently, it can be created from a term-document matrix (tdm) object as implemented in the tm package.
Indexed frequency lists are essentially frequency lists stored in a three-column format,
similar to the simple triplet matrix used internally by tm to store term-document matrices.
The first column stores the number of the document i, the second the number of the term j, and the third the
frequency with which term j occurs in document i. Zero occurrences are omitted.
All columns contain integers, and the frequency list is sorted by document.
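The three-column layout described above can be illustrated with base R alone. This is a minimal sketch on a toy dense matrix, not the package's internal code; the names tdm_dense, nz and triplets are hypothetical and chosen only for this example.

```r
# Toy term-document matrix: rows are terms, columns are documents
# (the same orientation as a tm term-document matrix).
tdm_dense <- matrix(
  c(2L, 0L, 1L,   # doc1: apple = 2, pear = 0, plum = 1
    0L, 3L, 0L),  # doc2: apple = 0, pear = 3, plum = 0
  nrow = 3, ncol = 2,
  dimnames = list(c("apple", "pear", "plum"), c("doc1", "doc2"))
)

# Positions of all non-zero cells; zero occurrences are dropped.
nz <- which(tdm_dense > 0L, arr.ind = TRUE)

# Three integer columns: document i, term j, frequency.
triplets <- data.frame(
  doc  = as.integer(nz[, "col"]),
  term = as.integer(nz[, "row"]),
  freq = as.integer(tdm_dense[nz])
)

# Sort by document (and by term within a document).
triplets <- triplets[order(triplets$doc, triplets$term), ]
print(triplets)
```

Only three of the six cells are stored, which is why this format suits sparse term-document matrices.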
The object returned is of class indexed_frequency_list. In addition to the actual frequency
list, it contains an index for fast access as well as the pre-computed total number of tokens per
document and total occurrences per term.
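How such totals and an index could be derived from the triplet columns is sketched below. The layout of the index is an assumption made for illustration (the start and end row of each document's block); the actual component names and index structure of an indexed_frequency_list are not given here, so doc_start, doc_end and the other names are hypothetical. The sketch also assumes that every document has at least one entry.

```r
# Triplet columns of a toy frequency list, already sorted by document.
doc  <- c(1L, 1L, 2L)
term <- c(1L, 3L, 2L)
freq <- c(2L, 1L, 3L)
n_docs  <- 2L
n_terms <- 3L

# Total number of tokens per document.
tokens_per_doc <- as.integer(
  tapply(freq, factor(doc, levels = seq_len(n_docs)), sum)
)

# Total occurrences per term across all documents.
occurrences_per_term <- as.integer(
  tapply(freq, factor(term, levels = seq_len(n_terms)), sum)
)

# Assumed index: first and last row of each document's block.
# Because the list is sorted by document, each block is contiguous.
doc_start <- match(seq_len(n_docs), doc)
doc_end   <- c(doc_start[-1] - 1L, length(doc))

# Fast access to the terms of document 1 without scanning the whole list.
rows_doc1 <- doc_start[1]:doc_end[1]
print(term[rows_doc1])
```

Pre-computing these values once avoids recomputing them each time the frequency list is used.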
create_ifl(
  tdm,
  subset_terms = 1:dim(tdm)[1],
  subset_docs = 1:dim(tdm)[2],
  corpus
)

The return value is a list with class indexed_frequency_list.
tdm: a term-document matrix (tdm) from the tm package. Currently, this is the only supported input, but others may be added in later versions.
subset_terms: vector of terms to be considered. Can be integer (indices) or boolean. Terms not included are still counted in the total number of tokens per document.
subset_docs: vector of documents to be considered. Can be integer (indices) or boolean. Excluded documents do not contribute to the total number of occurrences of a term.
corpus: vector indicating which documents belong to corpus A (the first corpus). Can be integer (indices) or boolean. Currently, only comparisons of two corpora are supported.
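A call might look like the following sketch. The four toy documents and the choice of the first two as corpus A are invented for illustration, and the code only runs when both tm and keyperm are installed. The corpus argument is passed as a boolean vector here; the integer form corpus = 1:2 should select the same documents.

```r
# Run only if both packages are available.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("keyperm", quietly = TRUE)) {

  # Four toy documents; the first two are treated as corpus A.
  texts <- c(
    "apple apple plum",
    "apple pear",
    "pear pear plum",
    "plum plum pear"
  )
  docs <- tm::VCorpus(tm::VectorSource(texts))

  # Term-document matrix as implemented in tm (terms in rows).
  tdm <- tm::TermDocumentMatrix(docs)

  # Boolean vector marking the documents of corpus A.
  ifl <- keyperm::create_ifl(tdm, corpus = c(TRUE, TRUE, FALSE, FALSE))

  print(class(ifl))
}
```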