readCorpus: Read in a corpus file.

Description

Converts pre-processed document matrices stored in popular formats to stm format.

Usage

readCorpus(corpus, type = c("dtm", "slam", "Matrix"))

Value

documents: A documents object in our format
vocab: A vocab object if information is available to construct one

Arguments

corpus: An input file or filepath to be processed
type: The type of input file. Several sources are supported; see Details.

Details

This function provides a simple utility for converting other document formats to our own. Briefly: dtm takes a standard matrix as input and converts it to our format. slam converts from the simple_triplet_matrix representation used by the slam package. This is also the representation of corpora in the popular tm package, so it should work in those cases as well.

dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.
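
To make the expected input concrete, the sketch below builds a small dense count matrix and converts it by hand into a list of two-row matrices (word indices in the first row, counts in the second), which is the layout stm uses for documents. The helper dense_to_docs and the toy data are hypothetical and only illustrate the idea; this is not the package's implementation. The readCorpus call is guarded so the block still runs if stm is not installed.

```r
# Toy document-term matrix: 2 documents (rows), 3 vocabulary words (columns).
dtm <- matrix(c(2L, 0L, 1L,
                0L, 3L, 0L),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("apple", "bird", "cat")))

# Hypothetical helper: one 2-row integer matrix per document.
# Row 1 holds the 1-based column index of each nonzero word, row 2 its count.
dense_to_docs <- function(m) {
  lapply(seq_len(nrow(m)), function(i) {
    idx <- which(m[i, ] > 0)
    rbind(as.integer(idx), as.integer(m[i, idx]))
  })
}

docs  <- dense_to_docs(dtm)
vocab <- colnames(dtm)

# With stm installed, the packaged converter can be called on the same matrix.
if (requireNamespace("stm", quietly = TRUE)) {
  out <- stm::readCorpus(dtm, type = "dtm")
  str(out$documents)
}
```

Column names carry the vocabulary here; a matrix without column names gives the converter nothing to build a vocab from.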

slam expects a simple_triplet_matrix from that package.
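
As a rough illustration of the slam route, the following sketch constructs a tiny simple_triplet_matrix and passes it to readCorpus. The toy data and the name out_slam are made up for this example, and the block assumes both the slam and stm packages are installed; it does nothing otherwise.

```r
# Assumes the slam and stm packages are installed; skipped otherwise.
if (requireNamespace("slam", quietly = TRUE) &&
    requireNamespace("stm", quietly = TRUE)) {
  # Toy corpus in triplet form: (row i, column j, count v).
  # Rows are documents and columns are vocabulary words.
  stm_in <- slam::simple_triplet_matrix(
    i = c(1L, 1L, 2L),
    j = c(1L, 3L, 2L),
    v = c(2, 1, 3),
    nrow = 2, ncol = 3,
    dimnames = list(c("doc1", "doc2"), c("apple", "bird", "cat"))
  )

  out_slam <- stm::readCorpus(stm_in, type = "slam")
  str(out_slam$documents)  # inspect the converted documents
  out_slam$vocab           # vocabulary, if it could be constructed
}
```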

Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert using the functionality built for the slam package. This will work for most applicable classes in the Matrix package such as dgCMatrix.

If you are trying to read a .ldac file, see readLdac.

Examples

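if (FALSE) {
# Not run: requires the textir package for the example data.
library(stm)
library(textir)
data(congress109)
# congress109Counts is the count matrix loaded by data(congress109)
out <- readCorpus(congress109Counts, type = "Matrix")
documents <- out$documents
vocab <- out$vocab
}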