generateDictionary: Generates dictionary of decisive terms

Description

The routine applies LASSO regularization to the document-term matrix in order to extract decisive terms that have a statistically significant impact on the response variable.
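
For orientation, the penalized least-squares problem that glmnet documents for family="gaussian" can be written as below. The symbols are introduced here for illustration and do not appear elsewhere on this page: \(N\) is the number of documents, \(x_i\) the weighted term vector of document \(i\), \(y_i\) its response, and \(\beta_j\) the coefficient of term \(j\). This is a sketch of the general elastic-net form, not a statement about the routine's internal code.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  \;+\;
  \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
               + \alpha\,\lVert\beta\rVert_1\right]
```

With alpha = 1 the bracket reduces to the LASSO penalty \(\lVert\beta\rVert_1\), which drives many coefficients exactly to zero; presumably the terms whose coefficients remain non-zero are the ones that end up in the dictionary.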

Usage

generateDictionary(x, response, language = "english", alpha = 1,
  s = "lambda.min", family = "gaussian", grouped = FALSE,
  minWordLength = 3, sparsity = 0.9, weighting = function(x)
  tm::weightTfIdf(x, normalize = FALSE), ...)
# S3 method for Corpus
generateDictionary(x, response, language = "english",
  alpha = 1, s = "lambda.min", family = "gaussian", grouped = FALSE,
  minWordLength = 3, sparsity = 0.9, weighting = function(x)
  tm::weightTfIdf(x, normalize = FALSE), ...)
# S3 method for character
generateDictionary(x, response, language = "english",
  alpha = 1, s = "lambda.min", family = "gaussian", grouped = FALSE,
  minWordLength = 3, sparsity = 0.9, weighting = function(x)
  tm::weightTfIdf(x, normalize = FALSE), ...)
# S3 method for data.frame
generateDictionary(x, response, language = "english",
  alpha = 1, s = "lambda.min", family = "gaussian", grouped = FALSE,
  minWordLength = 3, sparsity = 0.9, weighting = function(x)
  tm::weightTfIdf(x, normalize = FALSE), ...)
# S3 method for TermDocumentMatrix
generateDictionary(x, response,
  language = "english", alpha = 1, s = "lambda.min",
  family = "gaussian", grouped = FALSE, minWordLength = 3,
  sparsity = 0.9, weighting = function(x) tm::weightTfIdf(x, normalize =
  FALSE), ...)
# S3 method for DocumentTermMatrix
generateDictionary(x, response,
  language = "english", alpha = 1, s = "lambda.min",
  family = "gaussian", grouped = FALSE, minWordLength = 3,
  sparsity = 0.9, weighting = function(x) tm::weightTfIdf(x, normalize =
  FALSE), ...)

Arguments

x

A character vector, a data.frame, or an object of type Corpus, TermDocumentMatrix or DocumentTermMatrix.

response

Response variable including the given gold standard.

language

Language used for preprocessing operations (default: English).

alpha

Mixing parameter for switching from LASSO regularization (default: alpha=1) to ridge regression (alpha=0). Any continuous value in between yields an elastic net.

s

Value of the parameter \(\lambda\) at which the LASSO is evaluated. The signature under Usage sets s="lambda.min", the value of \(\lambda\) that minimizes the cross-validated error. The alternative s="lambda.1se" selects the largest \(\lambda\) whose cross-validated error lies within one standard error of that minimum; the stronger regularization helps to avoid overfitting and often results in better performance than the minimum value itself.

family

Distribution of the response variable. Default is family="gaussian". For non-negative counts, use family="poisson". For binary variables, use family="binomial". See glmnet for further details.

grouped

Optional parameter for grouped LASSO passed on to glmnet (default: FALSE).

minWordLength

Removes words shorter than the given minimum length (default: 3). This preprocessing is applied when the input is a character vector or a corpus and the document-term matrix is generated inside the routine.

sparsity

A numeric for removing sparse terms in the document-term matrix. The argument sparsity specifies the maximal allowed sparsity. Default is sparsity=0.9; however, this is only applied when the document-term matrix is calculated inside the routine.

weighting

Weighting function applied to the document-term matrix, e.g. term frequency-inverse document frequency (default). For other variants, see DocumentTermMatrix.

...

Additional parameters passed on to other functions, e.g. for preprocessing or to glmnet.
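
To make the sparsity and weighting arguments more concrete, the following base R sketch applies both steps by hand to a toy count matrix. It is an illustration only: the matrix, the threshold of 0.7 (chosen so that the four-document toy example actually drops a term), and all variable names are hypothetical. The formulas follow the behaviour of tm::removeSparseTerms (keep terms whose share of zero entries is below the threshold) and tm::weightTfIdf with normalize = FALSE (raw count times log2 of documents over document frequency) as I understand them, which should be treated as an assumption.

```r
# Toy document-term matrix: rows are documents, columns are terms (raw counts).
dtm <- matrix(c(1, 0, 1,
                1, 0, 0,
                0, 0, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("good", "rare", "bad")))

# Hypothetical threshold for this toy example (the routine's default is 0.9).
sparsity <- 0.7

# Sparsity of a term = share of documents in which the term does not occur.
term_sparsity <- colMeans(dtm == 0)

# Keep only terms whose sparsity stays below the maximal allowed sparsity.
dtm_kept <- dtm[, term_sparsity < sparsity, drop = FALSE]

# Unnormalized tf-idf: raw count times log2(number of documents / document frequency).
doc_freq <- colSums(dtm_kept > 0)
idf <- log2(nrow(dtm_kept) / doc_freq)
tfidf <- sweep(dtm_kept, 2, idf, `*`)

tfidf
```

In this sketch the term "rare" occurs in only one of four documents and is dropped, while the more frequent term "good" receives a smaller weight per occurrence than "bad".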

Value

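Result is a dictionary object that holds the selected terms together with their estimated weights. It can be printed, summarized, plotted, and passed to predict in order to compute sentiment values for documents, as in the examples below.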

References

Pröllochs and Feuerriegel (2015). Generating Domain-Specific Dictionaries Using Bayesian Learning. 23rd European Conference on Information Systems (ECIS 2015).

Examples

# Create a vector of strings
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "This is okay.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)

# Generate dictionary with LASSO regularization
dictionary <- generateDictionary(documents, response)

# Show dictionary
dictionary
summary(dictionary)
plot(dictionary)

# Compute in-sample performance
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)
plotSentimentResponse(sentiment, response)

# Generate new dictionary with tf weighting instead of tf-idf

library(tm)
dictionary <- generateDictionary(documents, response, weighting=weightTf)
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)

# Explicitly select lambda.min from the LASSO estimation
dictionary <- generateDictionary(documents, response, s="lambda.min")
sentiment <- predict(dictionary, documents)
compareToResponse(sentiment, response)

# Generate dictionary without LASSO intercept
dictionary <- generateDictionary(documents, response, intercept=FALSE)
dictionary$intercept
 
## Not run: ------------------------------------
# imdb <- loadImdb()
# 
# # Generate Dictionary
# dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson")
# summary(dictionary_imdb)
# 
# compareDictionaries(dictionary_imdb,
#                     loadDictionaryGI())
#                     
# # Show estimated coefficients with Kernel Density Estimation (KDE)
# plot(dictionary_imdb)
# plot(dictionary_imdb) + xlim(c(-0.1, 0.1))
# 
# # Compute in-sample performance
# pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
# compareToResponse(pred_sentiment, imdb$Rating)
# 
# # Test a different sparsity parameter
# dictionary_imdb <- generateDictionary(imdb$Corpus, imdb$Rating, family="poisson", sparsity=0.99)
# summary(dictionary_imdb)
# pred_sentiment <- predict(dictionary_imdb, imdb$Corpus)
# compareToResponse(pred_sentiment, imdb$Rating)
## ---------------------------------------------
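
The difference between s="lambda.min" and s="lambda.1se" can also be explored directly with glmnet::cv.glmnet, whose result carries both values. The sketch below is not the routine's internal code; it uses a synthetic matrix as a stand-in for a weighted document-term matrix, and the dimensions, term names and coefficients are made up for illustration.

```r
library(glmnet)

set.seed(42)

# Synthetic stand-in for a weighted document-term matrix:
# 100 "documents" (rows) and 20 "terms" (columns).
n <- 100
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("term", seq_len(p))))

# Only the first two terms actually drive the response.
y <- 2 * x[, "term1"] - 1.5 * x[, "term2"] + rnorm(n)

# Cross-validated LASSO (alpha = 1) with a gaussian response.
cvfit <- cv.glmnet(x, y, alpha = 1, family = "gaussian")

# The two candidate values for s.
cvfit$lambda.min
cvfit$lambda.1se

# Coefficients at each choice, with the intercept row dropped.
coef_min <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]
coef_1se <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]

# Terms that survive the regularization under each choice.
names(coef_min)[coef_min != 0]
names(coef_1se)[coef_1se != 0]
```

By construction lambda.1se is never smaller than lambda.min, so it applies at least as much shrinkage; comparing the two sets of surviving terms gives a feel for how the choice of s affects the size of a dictionary.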
