tm package where various user specified options can be selected.

textProcessor(documents, metadata=NULL,
              lowercase=TRUE, removestopwords=TRUE, removenumbers=TRUE,
              removepunctuation=TRUE, stem=TRUE,
              sparselevel=1, language="en",
              verbose=TRUE, onlycharacter=FALSE, striphtml=FALSE,
              customstopwords=NULL)

tm package has a variety o...

metadata: a data.frame or matrix object with number of rows equal to the number of documents and one column per meta-data type. The column names are used to label the metadata. The metada...

language: tm uses the SnowballC stemmer, which as of version 0.5 supports "danish dutch english finnish french german hungarian italian norwegian portuguese romanian russian spanish swedish".

... tm the document term matrix is converted to the stm format using readCorpus.
The processor always strips extra whitespace; all other processing steps are optional. Stemming uses the Snowball stemmers and supports a wide variety of languages. Words in the vocabulary can be dropped due to sparsity and stop word removal. If a document no longer contains any words, it is dropped from the output. Specifying meta-data is a convenient way to make sure the appropriate rows are dropped from the corresponding metadata as well.
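The dropping behaviour described above can be illustrated with a small base R sketch. This is not the package's implementation: the documents, the metadata columns and the stop word list below are all hypothetical, and the sketch only shows why metadata rows must be dropped together with documents that end up empty.

```r
# Toy corpus and metadata (hypothetical values, one metadata row per document).
docs <- c("the cat sat", "of the", "dogs bark")
meta <- data.frame(id = 1:3, treatment = c(1, 0, 1))

# Hypothetical stop word list; textProcessor relies on tm for the real one.
stop_words <- c("the", "of")

# Tokenize on spaces and remove stop words from every document.
tokens <- lapply(strsplit(docs, " ", fixed = TRUE),
                 function(words) words[!words %in% stop_words])

# Documents with no remaining words are dropped, and the same logical
# index is applied to the metadata so the two stay aligned.
keep <- lengths(tokens) > 0
tokens_kept <- tokens[keep]
meta_kept <- meta[keep, , drop = FALSE]

print(tokens_kept)
print(meta_kept)
```

Here the second document consists only of stop words, so it is removed and the metadata row with id 2 is removed with it.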
When the option sparselevel is set to a number other than 1, infrequently appearing words are removed. When terms are removed from the vocabulary, a message is printed to the screen (as long as verbose has not been set to FALSE). The message indicates the number of terms removed (that is, the number of vocabulary entries) as well as the number of tokens removed (appearances of individual words). The function prepDocuments provides additional methods to prune infrequent words, and in general the functionality there should be preferred.
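As a hedged sketch of the recommended route, the snippet below leaves sparselevel at its default and prunes infrequent words with prepDocuments instead. It assumes the stm package is installed and uses its bundled gadarian data; the threshold of 2 is an arbitrary illustrative choice, not a recommendation.

```r
library(stm)

# Process the raw texts without sparsity-based pruning (sparselevel = 1).
processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata = gadarian,
                           verbose = FALSE)

# Prune infrequent words with prepDocuments: lower.thresh = 2 is an
# illustrative value that drops words appearing in 2 or fewer documents.
pruned <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                        lower.thresh = 2, verbose = FALSE)

# Compare vocabulary sizes before and after pruning.
length(processed$vocab)
length(pruned$vocab)

# Documents and metadata rows remain aligned after pruning.
length(pruned$documents) == nrow(pruned$meta)
```

Raising lower.thresh shrinks the vocabulary further and may leave some documents empty, in which case prepDocuments drops them along with their metadata rows.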
We emphasize that this function is a convenience wrapper around the excellent tm package functionality, without which it wouldn't be possible.

See also: readCorpus

library(stm)
head(gadarian)
# Process the data for analysis.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
# Prune infrequent words, keeping documents, vocabulary and metadata aligned.
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta