readCorpus(corpus, type = c("dtm", "ldac", "slam", "Matrix", "txtorgvocab"))
dtm takes as input a standard matrix and converts it to our format. ldac takes a file path and reads in a document in the sparse format popularized by David Blei's C code implementation of lda. slam converts from the simple_triplet_matrix representation used by the slam package. This is also the representation of corpora in the popular tm package and should work in those cases.
dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.
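
The row and column layout described for dtm can be sketched with a toy matrix. This is a minimal illustration, not package code: the counts, document names, and words are invented, and the readCorpus call only runs if the stm package is installed. The assumption that the result has documents and vocab elements is taken from the example at the end of this page.

```r
# Toy document-term matrix: 3 documents (rows) by 4 vocabulary words (columns).
# All counts and names here are made up for illustration.
dtm <- matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 1,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("tax", "health", "war", "school"))
)

# Each row is a document and each column is a word in the dictionary.
stopifnot(nrow(dtm) == 3, ncol(dtm) == 4)

# Only call readCorpus() when stm is available.
if (requireNamespace("stm", quietly = TRUE)) {
  out <- stm::readCorpus(dtm, type = "dtm")
  documents <- out$documents
  vocab <- out$vocab
}
```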
ldac expects a file name or path that contains a file in Blei's LDA-C format. From his ReadMe:
"The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string."
Because R indexes from one, the values of the term indices are incremented by one on import.
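
To make the line format and the index shift concrete, here is a small base R sketch that parses a single LDA-C line. The helper parse_ldac_line is hypothetical and written only for this illustration; it is not the parser readCorpus uses. It assumes the term indices in the file start at zero, which is why they are incremented by one.

```r
# Parse one line of LDA-C text into 1-based term indices and counts.
# parse_ldac_line is a hypothetical helper, not part of any package.
parse_ldac_line <- function(line) {
  fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  m <- as.integer(fields[1])                       # [M]: number of unique terms
  pairs <- strsplit(fields[-1], ":", fixed = TRUE) # each pair is term:count
  term <- vapply(pairs, function(p) as.integer(p[1]), integer(1))
  count <- vapply(pairs, function(p) as.integer(p[2]), integer(1))
  stopifnot(length(term) == m)                     # [M] must match the pairs present
  # Shift the (assumed zero-based) indices so they start at one in R.
  list(term = term + 1L, count = count)
}

doc <- parse_ldac_line("3 0:2 4:1 7:5")
doc$term   # 1 5 8
doc$count  # 2 1 5
```

Reading a whole file would amount to applying the same step to every line, for example over the result of readLines.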
slam expects a simple_triplet_matrix from the slam package.
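
A minimal sketch of the slam case, assuming both the slam and stm packages are installed. The triplets below (row index i, column index j, value v) are invented for illustration; each triplet records that document i contains word j exactly v times.

```r
if (requireNamespace("slam", quietly = TRUE)) {
  # Three documents, four words; only the non-zero counts are stored.
  stm_input <- slam::simple_triplet_matrix(
    i = c(1, 1, 2, 3),   # document (row) of each non-zero count
    j = c(1, 3, 2, 1),   # word (column) of each non-zero count
    v = c(2, 1, 3, 1),   # the counts themselves
    nrow = 3, ncol = 4,
    dimnames = list(NULL, c("tax", "health", "war", "school"))
  )

  # Only call readCorpus() when stm is available.
  if (requireNamespace("stm", quietly = TRUE)) {
    out <- stm::readCorpus(stm_input, type = "slam")
  }
}
```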
Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert using the functionality built for the slam package. This will work for most applicable classes in the Matrix package such as dgCMatrix.
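
As a rough sketch of the Matrix case, the following builds a small dgCMatrix and hands it to readCorpus. The counts and column names are invented, and both calls are guarded so the code only runs when the Matrix and stm packages are installed.

```r
if (requireNamespace("Matrix", quietly = TRUE)) {
  # sparseMatrix() with numeric x produces a dgCMatrix.
  counts <- Matrix::sparseMatrix(
    i = c(1, 1, 2, 3),   # document (row) indices
    j = c(1, 3, 2, 1),   # word (column) indices
    x = c(2, 1, 3, 1),   # counts
    dims = c(3, 4),
    dimnames = list(NULL, c("tax", "health", "war", "school"))
  )

  # Only call readCorpus() when stm is available.
  if (requireNamespace("stm", quietly = TRUE)) {
    out <- stm::readCorpus(counts, type = "Matrix")
  }
}
```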
Finally, the txtorgvocab option allows the user to easily read in a vocab file generated by the software txtorg. When working in English it is straightforward to read in files created by txtorg. However, when working in other languages, particularly Chinese and Arabic, there can often be difficulty reading in the files using read.table or read.csv. This function should work well in those circumstances.
See also: textProcessor, prepDocuments
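## Not run:
library(stm)     # provides readCorpus
library(textir)  # provides the congress109 data
data(congress109)
# congress109Counts is a sparse matrix from the Matrix package
out <- readCorpus(congress109Counts, type = "Matrix")
documents <- out$documents
vocab <- out$vocab
## End(Not run)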