prepDocuments(documents, vocab, meta,
              lower.thresh = 1, upper.thresh = Inf,
              subsample = NULL, verbose = TRUE)

The documents, vocab, and meta arguments take the documents, vocabulary, and metadata in the format used by stm (typically the output of textProcessor). Words appearing in no more than lower.thresh documents are dropped; upper.thresh works analogously as an upper bound and defaults to Inf, which does no filtering. subsample, if given an integer, randomly subsamples that many documents and defaults to NULL, which provides no subsampling; note that the output may contain fewer than the requested number of documents if other processing steps remove some of them. verbose controls whether progress information is printed. The returned documents, vocab, and meta are in the stm format and will be the same as the inputs if no documents are removed.

The default lower.thresh = 1 means that words which appear in only one document will be dropped. This is often advantageous, as there is little information in these words but the added cost of including them in the model can be quite large. In many cases it will be helpful to set this threshold considerably higher; if the vocabulary is in excess of 5000 entries, inference can slow quite a bit.
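For instance, a minimal sketch of a more aggressive filtering call, assuming docs, vocab, and meta are the textProcessor outputs created in the example further below; the threshold of 10 is illustrative rather than a recommendation:

# Drop words appearing in 10 or fewer documents (illustrative threshold).
out <- prepDocuments(docs, vocab, meta, lower.thresh = 10)
names(out)          # inspect the returned elements, including any dropped indices
length(out$vocab)   # vocabulary size after filtering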
If words are removed, the function returns a vector of the original indices for the dropped items. If documents are removed, it returns a vector of the removed document indices. Users with accompanying metadata or texts may want to drop those rows from the corresponding objects; a sketch of that realignment follows the example below.

The example below prepares the gadarian data for analysis:

head(gadarian)
#Process the data for analysis.
temp<-textProcessor(documents=gadarian$open.ended.response,metadata=gadarian)
meta<-temp$meta
vocab<-temp$vocab
docs<-temp$documents
out <- prepDocuments(docs, vocab, meta)
docs<-out$documents
vocab<-out$vocab
meta <- out$meta
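If the raw texts are kept in a separate vector, a minimal sketch of realigning them with the prepared documents follows. It assumes the dropped-document indices are returned under the name docs.removed (as in recent versions of stm; check names(out) if yours differs) and that the open-ended responses were carried through as a metadata column:

# Hypothetical texts vector aligned with the documents passed to prepDocuments().
texts <- temp$meta$open.ended.response
if (length(out$docs.removed) > 0) {
  texts <- texts[-out$docs.removed]   # drop rows for removed documents
}
stopifnot(length(texts) == length(out$documents))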