textProcessor: Process a vector of raw texts

Description

Function that takes in a vector of raw texts (in a variety of languages) and performs basic text processing operations. This function is essentially a wrapper around the tm package in which various user-specified options can be selected.

Usage

textProcessor(documents, metadata=NULL, 
              lowercase=TRUE, removestopwords=TRUE, removenumbers=TRUE, 
              removepunctuation=TRUE, stem=TRUE, 
              sparselevel=.99, language="en", 
              verbose=TRUE)

Arguments

documents

The documents to be processed. A character vector where each entry is the full text of a document. If it has length one, it is assumed to be the filepath of a directory in which each file is a separate document. The tm package has a variety o

metadata

Additional data about the documents. Specifically a data.frame or matrix object with number of rows equal to the number of documents and one column per meta-data type. The column names are used to label the metadata. The metada

lowercase

Whether all words should be converted to lower case. Defaults to TRUE.

removestopwords

Whether stop words should be removed using the SMART stopword list (in English) or the snowball stopword lists (for all other languages). Defaults to TRUE.

removenumbers

Whether numbers should be removed. Defaults to TRUE.

removepunctuation

Whether punctuation should be removed. Defaults to TRUE.

stem

Whether or not to stem words. Defaults to TRUE.

sparselevel

Removes terms where at least sparselevel proportion of the entries are 0. Defaults to .99.

language

Language used for processing. Defaults to English. tm uses the SnowballC stemmer which as of version 0.5 supports "danish dutch english finnish french german hungarian italian norwegian portuguese romanian russian spanish swedish

verbose

If TRUE, prints information as it processes. Defaults to TRUE.
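The effect of sparselevel can be illustrated with a small sketch in base R. This is not the code the function runs internally; the toy matrix, the term names, and the threshold value are all made up for illustration, and the filter simply follows the description of the argument given above.

```r
# Toy document-term matrix: 4 documents (rows) by 3 terms (columns).
# All values here are invented for illustration.
dtm <- matrix(c(1, 0, 0,
                2, 0, 1,
                1, 0, 1,
                3, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("tax", "rare", "vote")))

# Proportion of documents in which each term does not appear.
zero_share <- colMeans(dtm == 0)

# Hypothetical threshold playing the role of the sparselevel argument.
sparselevel <- 0.75

# Drop terms where at least `sparselevel` of the entries are 0.
kept <- dtm[, zero_share < sparselevel, drop = FALSE]
colnames(kept)
```

Here "rare" is absent from three of the four documents, so it is dropped, while "tax" and "vote" are retained. With the default of .99, only terms missing from at least 99 percent of the documents would be removed.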

Value

documents

A list containing the documents in the stm format.

vocab

Character vector of vocabulary.

meta

Data frame or matrix containing the user-supplied metadata for the retained documents.
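As a rough illustration of what the returned objects look like, the sketch below builds a tiny documents list by hand in base R. It assumes the stm format stores each document as a two-row matrix, with vocabulary indices in the first row and word counts in the second. The helper to_stm_doc, the vocabulary, and the tokens are hypothetical and are not part of the package.

```r
# Hypothetical vocabulary and two already-tokenized documents (illustration only).
vocab <- c("economi", "immigr", "job")
tokens <- list(c("job", "immigr", "job"), c("economi"))

# Hypothetical helper: encode one document as a two-row integer matrix,
# with vocabulary indices in row 1 and the matching counts in row 2.
to_stm_doc <- function(words, vocab) {
  counts <- table(factor(words, levels = vocab))
  present <- which(counts > 0)
  rbind(as.integer(present), as.integer(counts[present]))
}

documents <- lapply(tokens, to_stm_doc, vocab = vocab)
documents[[1]]
```

The first document uses vocabulary entries 2 and 3 ("immigr" once, "job" twice), so its matrix has columns (2, 1) and (3, 2).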

Details

This function is designed to provide a convenient and quick way to process a relatively small volume of texts for analysis with the package. It is designed to quickly ingest data in a simple form, such as a spreadsheet where each document sits in a single cell. You can also pass the filepath of a single directory to the documents argument; the function will then recursively read in all the files within the directory, treating each file as a separate document.

Once the text has been processed by tm, the document-term matrix is converted to the stm format using readCorpus. The processor always strips extra white space, but all other processing steps are optional. Stemming uses the Snowball stemmers and supports a wide variety of languages.

Words in the vocabulary can be dropped due to sparsity and stop word removal. If a document no longer contains any words, it is dropped from the output. Specifying metadata is a convenient way to make sure the corresponding rows are dropped from the metadata as well. We emphasize that this function is a convenience wrapper around the excellent tm package, without which it would not be possible.
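The way empty documents and their metadata rows are dropped together can be sketched in base R as follows. This is only an illustration of the idea under made-up data, not the function's actual implementation; the objects processed, metadata, and keep are hypothetical.

```r
# Hypothetical texts after processing: the second document lost all its words.
processed <- c("tax vote", "", "job")
metadata <- data.frame(id = 1:3, treatment = c(1, 0, 1))

# Flag documents that still contain at least one non-whitespace character.
keep <- nzchar(trimws(processed))

# Drop the empty documents and the matching metadata rows in one step,
# so that the two objects stay aligned.
processed <- processed[keep]
metadata <- metadata[keep, , drop = FALSE]
nrow(metadata)
```

Subsetting both objects with the same logical vector keeps row i of the metadata paired with document i, which is why supplying metadata to the function is safer than filtering it separately afterwards.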

References

Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-9.1.

Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54.

Examples

library(stm)
head(gadarian)
# Process the data for analysis.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
# Prepare the processed documents, vocabulary, and metadata for modeling.
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta