readCorpus(corpus, type = c("dtm", "ldac", "slam", "Matrix", "txtorgvocab"))

dtm takes as input a standard matrix and converts it to our format. ldac takes a file path and reads in a document in the sparse format popularized by David Blei's C code implementation of LDA. slam converts from the simple_triplet_matrix representation used by the slam package; this is also the representation of corpora in the popular tm package and should work in those cases.
dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.
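A minimal sketch of the dtm case; the matrix values and vocabulary words below are invented purely for illustration:

```r
library(stm)

# A toy document-term matrix: rows are documents, columns are dictionary words.
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("apple", "banana", "cherry")))

out <- readCorpus(dtm, type = "dtm")
str(out$documents)  # list of documents in stm's index/count format
out$vocab           # character vector of the vocabulary
```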
ldac expects a file name or path that contains a file in Blei's LDA-C format. From his ReadMe:
"The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document. Note that [term_1] is an integer which indexes the
term; it is not a string."
Because R indexes from one, the values of the term indices are incremented by one on import.
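A minimal sketch of the ldac case, writing a tiny two-document corpus in this format to a temporary file and reading it back (the counts are invented for illustration):

```r
library(stm)

# Two documents in Blei's LDA-C format; term indices are zero-based,
# as in the original C code.
ldac_lines <- c("2 0:2 3:1",   # doc 1: term 0 appears twice, term 3 once
                "1 1:4")       # doc 2: term 1 appears four times
tmp <- tempfile(fileext = ".ldac")
writeLines(ldac_lines, tmp)

out <- readCorpus(tmp, type = "ldac")
# On import the zero-based term indices are incremented by one.
str(out$documents)
```

Note that the LDA-C format carries only term indices, so the vocabulary itself must be supplied separately.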
slam expects a simple_triplet_matrix from that package.
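A minimal sketch of the slam case; the triplet entries (document i, term j, count v) are invented for illustration:

```r
library(stm)
library(slam)

# A sparse 2-document, 3-term matrix in simple_triplet_matrix form.
mat <- simple_triplet_matrix(i = c(1, 1, 2),
                             j = c(1, 2, 3),
                             v = c(2, 1, 4),
                             nrow = 2, ncol = 3,
                             dimnames = list(NULL, c("apple", "banana", "cherry")))

out <- readCorpus(mat, type = "slam")
str(out$documents)
```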
Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert using the functionality built for the slam package. This will work for most applicable classes in the Matrix package such as dgCMatrix.
Finally, the txtorgvocab type allows the user to easily read in a vocab file generated by the software txtorg. When working in English it is straightforward to read in files created by txtorg. However, when working in other languages, particularly Chinese and Arabic, there can often be difficulty reading in the files using read.table or read.csv. This function should work well in those circumstances.

See Also: textProcessor, prepDocuments

Examples:

library(textir)
data(congress109)
out <- readCorpus(congress109Counts, type="Matrix")
documents <- out$documents
vocab <- out$vocab