textreg-package: Sparse regression package for text that allows for multi-word phrases.

Description

Building on Ifrim's work, but also allowing for regularization of phrases, this package performs sparse regression using greedy coordinate descent. In a nutshell, the textreg package regresses a vector of +1/-1 labels onto raw text. It takes care of converting the text into all of the possible related features, so you can think in terms of the more natural statement of regressing onto "text" in a broad sense.
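To make "all of the possible related features" concrete, the following base R sketch expands a toy corpus into every contiguous phrase of one or two words and builds a binary document-by-phrase matrix next to a +1/-1 label vector. It is an illustration only: the documents, labels, and the helper phrases_in() are made up for this example, and the package itself does this work inside its C++ code rather than through code like this.

```r
# Toy corpus and +1/-1 labels (made up for illustration).
docs   <- c("the tub was hot", "the tub was cold", "hot water in the tub")
labels <- c(+1, -1, +1)

# Hypothetical helper: all distinct contiguous phrases of 1..max_len words
# found in a single document.
phrases_in <- function(doc, max_len = 2) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  out <- character(0)
  for (n in seq_len(min(max_len, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  unique(out)
}

doc_phrases <- lapply(docs, phrases_in)
vocab <- sort(unique(unlist(doc_phrases)))

# Binary matrix: one row per document, one column per phrase,
# 1 if the phrase occurs in the document and 0 otherwise.
X <- t(vapply(
  doc_phrases,
  function(p) as.numeric(vocab %in% p),
  numeric(length(vocab))
))
colnames(X) <- vocab

# Sum of labels over the documents containing each phrase: a crude first look
# at which phrases lean toward the +1 or the -1 class.
phrase_score <- drop(crossprod(X, labels))
```

Even with three short documents and phrases capped at two words, the matrix already has 14 columns, which is why a sparse, regularized fit is the natural tool once phrases of arbitrary length are allowed.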

Details

Implementation-wise, the package is a wrapper for a modified version of the C++ code that Georgiana Ifrim wrote to do this regression. It is also designed to integrate (somewhat) with the tm package, a commonly used R package for handling text. One warning: although this package works with tm, it must generate vectors of character strings to pass to the textreg call, which can be expensive for a large corpus. Alternatively, you can pass a filename to the textreg call, which avoids loading a large corpus into memory and then copying it.

The n-gram package is documented, but it is research code, so gaps and errors are possible; the author would appreciate being notified of anything that looks wrong.

The primary function in this package is the regression call textreg(). It takes a corpus and a labeling vector and returns a textreg.result object that contains the final regression result along with diagnostic information about the fit.
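A minimal sketch of such a call is given here, under stated assumptions: that the corpus may be supplied as a plain character vector with one document per element, and that the corpus and the labeling vector are the first two positional arguments. The toy documents and labels are invented for illustration, and the function's own documentation remains the reference for the actual argument names and tuning parameters.

```r
# Toy corpus and +1/-1 labeling vector (made up for illustration).
docs <- c(
  "the tub was hot",
  "the tub was cold",
  "hot water in the tub",
  "cold water on the floor"
)
labels <- c(1, -1, 1, -1)

# Guard the call so the snippet still runs when textreg is not installed.
if (requireNamespace("textreg", quietly = TRUE)) {
  # Assumption: corpus first, labeling second, all other arguments at defaults.
  res <- textreg::textreg(docs, labels)
  print(res)
} else {
  message("textreg is not installed; skipping the example call.")
}
```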

Start by reading the documentation for the textreg() call, as well as the "bathtub" vignette.

References

Ifrim, G., Bakir, G., & Weikum, G. (2008). Fast logistic regression for text categorization with variable-length n-grams. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 354-362.

Ifrim, G., & Wiuf, C. (2011). Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 708-716.

Jia, J., Miratrix, L., Yu, B., Gawalt, B., El Ghaoui, L., Barnesmoore, L., & Clavier, S. (2014). Concise Comparative Summaries (CCS) of Large Text Corpora with a Human Experiment. The Annals of Applied Statistics, 8(1), 499-529.

Miratrix, L., & Ackerman, R. (2014). A method for conducting text-based sparse feature selection for interpretability of selected phrases.