This function is essentially a wrapper around the tm
package, in which various user-specified options can be selected.
textProcessor(documents, metadata = NULL, lowercase = TRUE,
  removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE,
  stem = TRUE, wordLengths = c(3, Inf), sparselevel = 1, language = "en",
  verbose = TRUE, onlycharacter = FALSE, striphtml = FALSE,
  customstopwords = NULL, onlytxtfiles = TRUE)
The tm
package has a variety of extra readers for ingesting other file formats (.doc, .pdf, .txt, .xml).
A data.frame
or matrix
object with the number of rows equal to the number of documents and one column per metadata type. The column names are used to label the metadata. The metadata do not affect the text processing, but providing the metadata object ensures that if documents are dropped, the corresponding metadata rows are dropped as well.
Words shorter than the minimum word length wordLengths[1]
or longer than the maximum word length wordLengths[2]
are discarded. Defaults to c(3, Inf)
, i.e., a minimum word length of 3 characters.
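As a sketch, the default can be tightened or relaxed by passing a different wordLengths vector (the toy texts are invented for illustration; the stm package must be installed):

```r
library(stm)

texts <- c("a be sea dune whale watching trip",
           "an ox ran far away from everything")

# Keep only words of 4 or more characters instead of the default 3
processed <- textProcessor(documents = texts, wordLengths = c(4, Inf))
```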
tm
uses the SnowballC
stemmer, which as of version 0.5 supports "danish dutch english finnish french german hungarian italian norwegian portuguese romanian russian spanish swedish turkish". These can be specified as any one of the above strings or by their three-letter ISO-639 codes. You can also set language to "na" if you want to leave it deliberately unspecified (see documentation in tm
). Defaults to TRUE
. When reading files from a local directory, the function will skip over any files that don't end in .txt
.
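A brief sketch of selecting a non-default stemming language (the German sentences are invented for illustration; the stm package must be installed):

```r
library(stm)

german_texts <- c("Die Wirtschaft waechst schnell.",
                  "Die Politik bleibt umstritten.")

# Either the full language name or its ISO-639 code may be passed
processed <- textProcessor(documents = german_texts, language = "german")
```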
Once processing in tm
is complete, the document term matrix is converted to the stm
format using readCorpus
. The processor always strips extra white space, but all other processing options are optional. Stemming uses the Snowball stemmers and supports a wide variety of languages. Words in the vocabulary can be dropped due to sparsity and stop word removal. If a document no longer contains any words, it is dropped from the output. Specifying metadata is a convenient way to make sure the appropriate rows are dropped from the corresponding metadata file.
When the option sparselevel
is set to a number other than 1, infrequently appearing words are removed. When a term is removed from the vocabulary, a message will print to the screen (as long as verbose
has not been set to FALSE
). The message indicates the number of terms removed (that is, the number of vocabulary entries) as well as the number of tokens removed (appearances of individual words). The function prepDocuments
provides additional methods to prune infrequent words; in general, the functionality there should be preferred.
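For example, a sketch of the preferred pruning route via prepDocuments, using its lower.thresh argument to drop words appearing in fewer than a given number of documents (the threshold of 5 is chosen arbitrarily here; the stm package must be installed):

```r
library(stm)

processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata = gadarian)

# Drop vocabulary entries appearing in fewer than 5 documents;
# documents emptied by pruning are removed, along with their metadata rows
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 5)
```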
We emphasize that this function is a convenience wrapper around the excellent tm
package, without which it would not be possible.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54.
readCorpus
head(gadarian)
# Process the data for analysis.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta