
sentimentr

sentimentr is designed to quickly calculate text polarity sentiment at the sentence level and optionally aggregate by rows or grouping variable(s).

sentimentr is a response to my own needs with sentiment detection that were not addressed by the current R tools. My own polarity function in the qdap package is slower on larger data sets. It is a dictionary lookup approach that tries to incorporate weighting for valence shifters (negation and amplifiers/deamplifiers). Matthew Jockers created the syuzhet package that utilizes dictionary lookups for the Bing, NRC, and Afinn methods. He also utilizes a wrapper for the Stanford coreNLP which uses much more sophisticated analysis. Jockers's dictionary methods are fast but are more prone to error in the case of valence shifters. Jockers addressed these critiques, explaining that the method is good with regard to analyzing general sentiment in a piece of literature. He points to the accuracy of the Stanford detection as well. In my own work I need better accuracy than a simple dictionary lookup: something that considers valence shifters yet optimizes speed, which Stanford's parser does not. This leads to a trade-off of speed vs. accuracy. The equation below describes the dictionary method of sentimentr that may give better results than a dictionary approach that does not consider valence shifters but will likely still be less accurate than Stanford's approach. Simply, sentimentr attempts to balance accuracy and speed.

The Equation

The equation used by the algorithm to assign value to polarity of each sentence first utilizes the sentiment dictionary (Hu and Liu, 2004) to tag polarized words. Each paragraph ($p_i = \{s_1, s_2, ..., s_n\}$), composed of sentences, is broken into element sentences ($s_{i,j} = \{w_1, w_2, ..., w_n\}$) where $w$ are the words within sentences. Each sentence ($s_j$) is broken into an ordered bag of words. Punctuation is removed with the exception of pause punctuation (commas, colons, semicolons), which is considered a word within the sentence. I will denote pause words as $cw$ (comma words) for convenience. We can represent these words in an i,j,k notation as $w_{i,j,k}$. For example, $w_{3,2,5}$ would be the fifth word of the second sentence of the third paragraph. While I use the term paragraph, this merely represents a complete turn of talk. For example, it may be a cell-level response in a questionnaire composed of sentences.
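As a rough illustration of the ordered bag of words described above, the base R sketch below lowercases a sentence, keeps commas, colons, and semicolons as their own tokens, and drops other punctuation. The helper name `to_word_bag` is hypothetical, and this is not the package's internal tokenizer.

```r
# Hypothetical helper (not part of sentimentr): split one sentence into an
# ordered bag of words, keeping pause punctuation (, : ;) as "comma words".
to_word_bag <- function(sentence) {
    x <- tolower(sentence)
    # Pad pause punctuation with spaces so each mark becomes its own token
    x <- gsub("([,:;])", " \\1 ", x)
    # Drop everything except letters, digits, apostrophes, pause marks, spaces
    x <- gsub("[^a-z0-9',:; ]", "", x)
    # Split on runs of spaces and discard empty tokens
    tokens <- strsplit(x, " +")[[1]]
    tokens[nzchar(tokens)]
}

to_word_bag("In fact, I am now the least happy person on the planet.")
```

Here the comma survives as the third token, so it can later act as a boundary for the context cluster.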

The words in each sentence ($w_{i,j,k}$) are searched and compared to a modified version of Hu, M., & Liu, B.'s (2004) dictionary of polarized words. Positive ($w_{i,j,k}^{+}$) and negative ($w_{i,j,k}^{-}$) words are tagged with a $+1$ and $-1$ respectively (or other positive/negative weighting if the user provides the sentiment dictionary). I will denote polarized words as $pw$ for convenience. These will form a polar cluster ($c_{i,j,l}$) which is a subset of the sentence ($c_{i,j,l} \subseteq s_{i,j}$).
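The tagging step can be pictured with a toy lookup. The six-word dictionary below is an assumption made up for illustration; it is not the modified Hu and Liu lexicon that ships with the package.

```r
# Toy polarity dictionary (an assumption for illustration only)
toy_polarity <- c(happy = 1, good = 1, best = 1, sad = -1, bad = -1, hate = -1)

# Words of one sentence, already split into an ordered bag of words
words <- c("then", "i'm", "not", "happy", "at", "all")

# Look each word up; words missing from the dictionary get a value of 0
tags <- unname(toy_polarity[words])
tags[is.na(tags)] <- 0

data.frame(k = seq_along(words), word = words, polarity = tags)

# Positions k of the polarized words (pw) in this sentence
which(tags != 0)
```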

The polarized context cluster ($c_{i,j,l}$) of words is pulled from around the polarized word ($pw$) and defaults to four words before and two words after $pw$ to be considered as valence shifters. The cluster can be represented as ($c_{i,j,l} = \{pw_{i,j,k-nb}, ..., pw_{i,j,k}, ..., pw_{i,j,k+na}\}$), where $nb$ and $na$ are the parameters n.before and n.after set by the user. The words in this polarized context cluster are tagged as neutral ($w_{i,j,k}^{0}$), negator ($w_{i,j,k}^{n}$), amplifier ($w_{i,j,k}^{a}$), or de-amplifier ($w_{i,j,k}^{d}$). Neutral words hold no value in the equation but do affect word count ($n$). Each polarized word is then weighted ($w$) based on the weights from the polarity_dt argument and then further weighted by the function and number of the valence shifters directly surrounding the positive or negative word ($pw$). Pause ($cw$) locations (punctuation that denotes a pause including commas, colons, and semicolons) are indexed and considered in calculating the upper and lower bounds in the polarized context cluster. This is because these marks indicate a change in thought and words prior are not necessarily connected with words after these punctuation marks. The lower bound of the polarized context cluster is constrained to $\max\{pw_{i,j,k-nb}, 1, \max\{cw_{i,j,k} < pw_{i,j,k}\}\}$ and the upper bound is constrained to $\min\{pw_{i,j,k+na}, w_{i,jn}, \min\{cw_{i,j,k} > pw_{i,j,k}\}\}$ where $w_{i,jn}$ is the number of words in the sentence.
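The bounds described above can be sketched in a few lines of base R. The function `cluster_bounds` is a hypothetical helper written for this illustration; it follows the max/min constraints as stated in the text and is not the package's internal code.

```r
# Hypothetical helper: index bounds of the context cluster around the
# polarized word at position k, honoring n.before, n.after, and pause words.
cluster_bounds <- function(words, k, n.before = 4, n.after = 2) {
    pauses <- which(words %in% c(",", ":", ";"))
    # Nearest pause before k (0 if none) and after k (one past the end if none)
    pause_before <- max(c(0, pauses[pauses < k]))
    pause_after <- min(c(length(words) + 1, pauses[pauses > k]))
    c(
        lower = max(k - n.before, 1, pause_before),
        upper = min(k + n.after, length(words), pause_after)
    )
}

words <- c("in", "fact", ",", "i", "am", "now", "the", "least", "happy",
    "person", "on", "the", "planet")

cluster_bounds(words, k = 9)  # "happy" is word 9: window limited by n.before/n.after
cluster_bounds(words, k = 4)  # "i" is word 4: lower bound stops at the comma
```

With the defaults, the window around "happy" spans words 5 to 11, while a word sitting right after the comma cannot reach back past it.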

The core value in the cluster, the polarized word, is acted upon by valence shifters. Amplifiers multiply the polarity by 1.8 by default (1 plus the weight $z$, which defaults to .8). Amplifiers ($w_{i,j,k}^{a}$) become de-amplifiers if the context cluster contains an odd number of negators ($w_{i,j,k}^{n}$). De-amplifiers work to decrease the polarity. Negation ($w_{i,j,k}^{n}$) acts on amplifiers/de-amplifiers as discussed but also flips the sign of the polarized word. Negation is determined by raising $-1$ to the power of the number of negators ($w_{i,j,k}^{n}$) plus $2$. Simply, this is a result of a belief that two negatives equal a positive, three negatives a negative, and so on.
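The parity rule for negation is easy to check directly. The snippet below only evaluates the arithmetic described above for zero to three negators; the variable name `n_negators` is chosen here for illustration.

```r
# Number of negators found in a context cluster
n_negators <- 0:3

# Sign applied to the polarized word: -1 raised to (number of negators + 2)
(-1)^(n_negators + 2)
#> [1]  1 -1  1 -1

# Parity of the negator count: 1 when odd (sign flips), 0 when even
n_negators %% 2
#> [1] 0 1 0 1
```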

The "but" conjunctions (i.e., 'but', 'however', and 'although') also weight the context cluster. A but conjunction before the polarized word ($w_{but\,conjunction}, ..., w_{i,j,k}^{p}$) up-weights the cluster by $1 + z_2 * \{|w_{but\,conjunction}|, ..., w_{i,j,k}^{p}\}$ (.85 is the default weight ($z_2$), where $|w_{but\,conjunction}|$ is the number of but conjunctions before the polarized word). A but conjunction after the polarized word down-weights the cluster by $1 + \{w_{i,j,k}^{p}, ..., |w_{but\,conjunction}| * -1\} * z_2$. This corresponds to the belief that a but gives the next clause greater value while lowering the value placed on the prior clause.
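One possible reading of the but-conjunction weight is sketched below: count the conjunctions before the polarized word, subtract the count after it, and scale by the .85 default. The helper `but_weight` is hypothetical, and counting over the whole sentence rather than only the context cluster is a simplifying assumption, so treat this as an illustration of the arithmetic rather than the package's behavior.

```r
# Hypothetical helper: "but" conjunction weight for the polarized word at
# position k. Assumption: conjunctions are counted over the whole sentence.
but_weight <- function(words, k, z2 = 0.85,
    but_words = c("but", "however", "although")) {
    is_but <- words %in% but_words
    n_before <- sum(is_but[seq_along(words) < k])
    n_after <- sum(is_but[seq_along(words) > k])
    1 + z2 * (n_before - n_after)
}

# A "but" before the polarized word "hate" (word 3) up-weights it
but_weight(c("but", "i", "hate", "really", "bad", "dogs"), k = 3)  # 1.85

# A "but" after the polarized word "like" (word 2) down-weights it
but_weight(c("i", "like", "it", "but", "not", "much"), k = 2)      # 0.15
```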

The researcher may provide a weight ($z$) to be utilized with amplifiers/de-amplifiers (default is .8; de-amplifier weight is constrained to a lower bound of $-1$). Last, these weighted context clusters ($c_{i,j,l}$) are summed ($c'_{i,j}$) and divided by the square root of the word count ($\sqrt{w_{i,jn}}$), yielding an unbounded polarity score ($\delta_{i,j}$) for each sentence.

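$$\delta_{i,j} = \frac{c'_{i,j}}{\sqrt{w_{i,jn}}}$$

Where:

$$c'_{i,j} = \sum{((1 + w_{amp} + w_{deamp}) \cdot w_{i,j,k}^{p}(-1)^{2 + w_{neg}})}$$

$$w_{amp} = \sum{(w_{neg} \cdot (z \cdot w_{i,j,k}^{a}))}$$

$$w_{deamp} = \max(w_{deamp'}, -1)$$

$$w_{deamp'} = \sum{(z(-w_{neg} \cdot w_{i,j,k}^{a} + w_{i,j,k}^{d}))}$$

$$w_{b} = 1 + z_2 * w_{b'}$$

$$w_{b'} = \sum{(|w_{but\,conjunction}|, ..., w_{i,j,k}^{p}, w_{i,j,k}^{p}, ..., |w_{but\,conjunction}| * -1)}$$

$$w_{neg} = \left(\sum{w_{i,j,k}^{n}}\right) \bmod 2$$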
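To make the scoring concrete, here is a minimal sketch for a sentence with a single polarized word and no de-amplifiers. The function `score_sentence` is hypothetical. It also assumes, following the prose description, that amplifiers add their weight when the number of negators is even and act as de-amplifiers when it is odd; that reading is what reproduces the scores printed in the Examples and Annie Swafford sections of this document.

```r
# Hypothetical sketch: score a sentence containing one polarized word.
# Assumptions: no de-amplifiers, no "but" conjunctions, and amplifiers only
# amplify when the negator count is even (otherwise they de-amplify).
score_sentence <- function(polarity, n_words, n_amp = 0, n_neg = 0, z = 0.8) {
    w_neg <- n_neg %% 2
    w_amp <- (1 - w_neg) * z * n_amp
    # Negated amplifiers reduce the weight, floored at -1
    w_deamp <- max(-w_neg * z * n_amp, -1)
    cluster <- (1 + w_amp + w_deamp) * polarity * (-1)^(2 + w_neg)
    cluster / sqrt(n_words)
}

score_sentence(1, n_words = 4)             # "do you like it?"
score_sentence(1, n_words = 5)             # "I am the best friend."
score_sentence(1, n_words = 5, n_amp = 1)  # "Do you really like it?"
score_sentence(1, n_words = 6, n_neg = 1)  # "Then I'm not happy at all."
```

These calls give 0.5, 0.4472136, 0.8049845, and about -0.41, which agree with the values reported for those sentences in the example output.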

To get the mean of all sentences ($s_{i,j}$) within a paragraph ($p_i$), simply take the average sentiment score $p_{i,\delta_{i,j}} = \frac{1}{n} \cdot \sum \delta_{i,j}$.

Installation

To download the development version of sentimentr:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/sentimentr")

Usage

There are two main functions in sentimentr with three helper functions summarized in the table below:

Examples

if (!require("pacman")) install.packages("pacman")
pacman::p_load(sentimentr)

mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)
sentiment(mytext)

##    element_id sentence_id word_count  sentiment
## 1:          1           1          4  0.5000000
## 2:          1           2          6 -2.6781088
## 3:          2           1          5  0.4472136
## 4:          3           1          5  0.8049845
## 5:          3           2          4  0.0000000
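Each element is first split into sentences, which is why elements 1 and 3 produce two rows each. The base R sketch below approximates that step with a single regular expression; `split_sentences` is a hypothetical helper and the pattern is an assumption, not the expression sentimentr actually uses.

```r
# Hypothetical helper: split text after ., ?, or ! when followed by whitespace.
# The lookbehind keeps the end mark attached to its sentence.
split_sentences <- function(x) {
    strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)
}

mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)

# Number of sentences found in each element
lengths(split_sentences(mytext))
#> [1] 2 1 2
```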

To aggregate by element (column cell or vector element) use sentiment_by with by = NULL.

mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)
sentiment_by(mytext)

##    element_id word_count       sd ave_sentiment
## 1:          1         10 2.247262    -1.0890544
## 2:          2          5       NA     0.4472136
## 3:          3          9 0.569210     0.4024922
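The aggregated columns can be reproduced from the sentence-level scores shown earlier: ave_sentiment is the mean of the sentence scores within an element and sd is their standard deviation. The sketch below uses only base R and values copied from the `sentiment(mytext)` output.

```r
# Sentence-level scores copied from the sentiment(mytext) output
sentence_scores <- data.frame(
    element_id = c(1, 1, 2, 3, 3),
    sentiment = c(0.5, -2.6781088, 0.4472136, 0.8049845, 0)
)

# Mean and standard deviation of the sentence scores within each element
ave <- tapply(sentence_scores$sentiment, sentence_scores$element_id, mean)
sds <- tapply(sentence_scores$sentiment, sentence_scores$element_id, sd)

data.frame(element_id = names(ave), sd = as.vector(sds),
    ave_sentiment = as.vector(ave))
```

Element 2 has a single sentence, so its standard deviation is NA, just as in the package output.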

To aggregate by grouping variables use sentiment_by using the by argument.

(out <- with(presidential_debates_2012, sentiment_by(dialogue, list(person, time))))

##        person   time word_count        sd ave_sentiment
##  1:     OBAMA time 1       3598 0.4397613    0.10966120
##  2:    LEHRER time 1        765 0.3493838    0.10941383
##  3:     OBAMA time 3       7241 0.4135144    0.09654523
##  4:     OBAMA time 2       7476 0.3832811    0.08893467
##  5:    ROMNEY time 3       8302 0.3909338    0.08108205
##  6:    ROMNEY time 1       4085 0.3510066    0.06613552
##  7: SCHIEFFER time 3       1445 0.3772378    0.06515716
##  8:   CROWLEY time 2       1672 0.2125288    0.05531121
##  9:    ROMNEY time 2       7534 0.3188779    0.04946325
## 10:  QUESTION time 2        583 0.3255268    0.03334828

Plotting

Plotting at Aggregated Sentiment

plot(out)

Plotting at the Sentence Level

The plot method for the class sentiment uses syuzhet's get_transformed_values combined with ggplot2 to make a reasonable, smoothed plot for the duration of the text based on percentage, allowing for comparison between plots of different texts. This plot gives the overall shape of the text's sentiment. The user can see syuzhet::get_transformed_values for more details.

plot(uncombine(out))

Annie Swafford's Examples

Annie Swafford critiqued Jockers's approach to sentiment and gave the following examples of sentences (ase for Annie Swafford example). Here I test each of Jockers's 3 dictionary approaches (Bing, NRC, Afinn), his Stanford wrapper (note I use my own GitHub Stanford wrapper package based off of Jockers's approach as it works more reliably on my own Windows machine), and my own algorithm with both the default Hu & Liu (2004) polarity lexicon as well as Baccianella, Esuli and Sebastiani's (2010) SentiWord lexicon.

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/sentimentr", "trinker/stansent")
pacman::p_load(syuzhet, qdap, microbenchmark)

ase <- c(
    "I haven't been sad in a long time.",
    "I am extremely happy today.",
    "It's a good day.",
    "But suddenly I'm only a little bit happy.",
    "Then I'm not happy at all.",
    "In fact, I am now the least happy person on the planet.",
    "There is no happiness left in me.",
    "Wait, it's returned!",
    "I don't feel so bad after all!"
)

syuzhet <- setNames(as.data.frame(lapply(c("bing", "afinn", "nrc"),
    function(x) get_sentiment(ase, method=x))), c("bing", "afinn", "nrc"))


left_just(data.frame(
    stanford = sentiment_stanford(ase),
    hu_liu = round(sentiment(ase, question.weight = 0)[["sentiment"]], 2),
    sentiword = round(sentiment(ase, sentiword, question.weight = 0)[["sentiment"]], 2),    
    syuzhet,
    sentences = ase,
    stringsAsFactors = FALSE
), "sentences")

  stanford hu_liu sentiword bing afinn nrc
1      0.5      0      0.27   -1    -2   0
2       -1    0.8      0.65    1     3   1
3     -0.5    0.5      0.32    1     3   1
4      0.5      0         0    1     3   1
5      0.5  -0.41     -0.56    1     3   1
6      0.5   0.06      0.05    1     3   1
7      0.5  -0.38     -0.05    1     2   1
8        0      0     -0.14    0     0  -1
9      0.5   0.38      0.24   -1    -3  -1
  sentences                                              
1 I haven't been sad in a long time.                     
2 I am extremely happy today.                            
3 It's a good day.                                       
4 But suddenly I'm only a little bit happy.              
5 Then I'm not happy at all.                             
6 In fact, I am now the least happy person on the planet.
7 There is no happiness left in me.                      
8 Wait, it's returned!                                   
9 I don't feel so bad after all!                         

Also of interest is the computational time used by each of these methods. To demonstrate this I replicated Annie's examples 100 times and ran microbenchmark for a few iterations (Stanford takes so long I didn't extend to more). Note that if a text needs to be broken into sentence parts, syuzhet has the get_sentences function that uses the openNLP package; this is a time-expensive task. sentimentr uses a much faster regex-based approach that is nearly as accurate in parsing sentences with a much lower computational time. We see that Stanford takes the longest time while sentimentr and syuzhet are comparable depending upon the lexicon used.

ase_100 <- rep(ase, 100)

stanford <- function() {sentiment_stanford(ase_100)}

sentimentr_hu_liu <- function() sentiment(ase_100)
sentimentr_sentiword <- function() sentiment(ase_100, sentiword) 
    
syuzhet_binn <- function() get_sentiment(ase_100, method="bing")
syuzhet_nrc <- function() get_sentiment(ase_100, method="nrc")
syuzhet_afinn <- function() get_sentiment(ase_100, method="afinn")
     
microbenchmark(
    stanford(),
    sentimentr_hu_liu(),
    sentimentr_sentiword(),
    syuzhet_binn(), 
    syuzhet_nrc(),
    syuzhet_afinn(),
    times = 3
)

Unit: milliseconds
                   expr        min         lq       mean     median
             stanford() 19534.8874 19719.3494 19782.8247 19903.8114
    sentimentr_hu_liu()   220.7847   224.1138   226.0406   227.4429
 sentimentr_sentiword()   969.6914   973.4066   979.2458   977.1219
         syuzhet_binn()   356.9010   357.5310   363.1912   358.1610
          syuzhet_nrc()   884.7328   892.4310   914.2375   900.1292
        syuzhet_afinn()   162.0473   162.6307   172.1710   163.2141
         uq        max neval
 19906.7934 19909.7754     3
   228.6686   229.8943     3
   984.0230   990.9240     3
   366.3362   374.5115     3
   928.9898   957.8504     3
   177.2328   191.2516     3
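To put the timings on a common scale, the median column can be divided by the fastest median. The numbers below are copied from the microbenchmark output above; only the arithmetic is new.

```r
# Median timings in milliseconds, copied from the microbenchmark output
median_ms <- c(
    stanford = 19903.8114,
    sentimentr_hu_liu = 227.4429,
    sentimentr_sentiword = 977.1219,
    syuzhet_binn = 358.1610,
    syuzhet_nrc = 900.1292,
    syuzhet_afinn = 163.2141
)

# How many times slower each method is than the fastest one
ratios <- median_ms / min(median_ms)
round(sort(ratios), 1)
```

On this run the Stanford wrapper is more than a hundred times slower than the fastest dictionary method, while the sentimentr and syuzhet variants all fall within a single-digit multiple of each other.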

Contact

You are welcome to:

Install

install.packages('sentimentr')

Monthly Downloads: 3,745

Version: 0.1.0

License: GPL-2

Last Published: August 23rd, 2015