sentimentr
sentimentr is designed to quickly calculate text polarity sentiment at the sentence level and optionally aggregate by rows or grouping variable(s).
sentimentr is a response to my own needs with sentiment detection that were not addressed by the current R tools. My own polarity function in the qdap package is slower on larger data sets. It is a dictionary lookup approach that tries to incorporate weighting for valence shifters (negation and amplifiers/deamplifiers). Matthew Jockers created the syuzhet package that utilizes dictionary lookups for the Bing, NRC, and Afinn methods. He also utilizes a wrapper for the Stanford coreNLP, which uses much more sophisticated analysis. Jockers' dictionary methods are fast but are more prone to error in the case of valence shifters. Jockers addressed these critiques, explaining that the method is good with regard to analyzing general sentiment in a piece of literature. He points to the accuracy of the Stanford detection as well. In my own work I need better accuracy than a simple dictionary lookup; something that considers valence shifters yet optimizes speed, which Stanford's parser does not. This leads to a trade-off of speed vs. accuracy. The equation below describes the dictionary method of sentimentr, which may give better results than a dictionary approach that does not consider valence shifters but will likely still be less accurate than Stanford's approach. Simply, sentimentr attempts to balance accuracy and speed.
The Equation
The equation used by the algorithm to assign value to polarity of each sentence first utilizes the sentiment dictionary (Hu and Liu, 2004) to tag polarized words. Each paragraph (p_i = {s_1, s_2, ..., s_n}) composed of sentences, is broken into element sentences (s_{i,j} = {w_1, w_2, ..., w_n}) where w are the words within sentences. Each sentence (s_j) is broken into an ordered bag of words. Punctuation is removed with the exception of pause punctuations (commas, colons, semicolons), which are considered a word within the sentence. I will denote pause words as cw (comma words) for convenience. We can represent these words in i,j,k notation as w_{i,j,k}. For example, w_{3,2,5} would be the fifth word of the second sentence of the third paragraph. While I use the term paragraph, this merely represents a complete turn of talk. For example, it may be a cell-level response in a questionnaire composed of sentences.
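As a rough illustration of this notation (naive base R splitting for demonstration only, not the parser sentimentr actually uses), a turn of talk can be broken into sentences (j) and words (k):

## Illustration only: naive sentence/word splitting to show the i, j, k indexing
paragraph <- "Do you like it?  But I hate really bad dogs.  I am the best friend."

## j indexes sentences within the turn of talk (p_i)
sentences <- unlist(strsplit(paragraph, "(?<=[.?!])\\s+", perl = TRUE))

## k indexes words within each sentence (s_ij)
words <- lapply(sentences, function(s) {
    tolower(unlist(strsplit(gsub("[.?!]", "", s), "\\s+")))
})

words[[2]][3]  ## w_{i,2,3}: third word of the second sentence ("hate")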
The words in each sentence (w_{i,j,k}) are searched and compared to a modified version of Hu, M., & Liu, B.'s (2004) dictionary of polarized words. Positive (w_{i,j,k}^{+}) and negative (w_{i,j,k}^{-}) words are tagged with a +1 and -1 respectively (or other positive/negative weighting if the user provides the sentiment dictionary). I will denote polarized words as pw for convenience. These will form a polar cluster (c_{i,j,l}) which is a subset of the sentence (c_{i,j,l} ⊆ s_{i,j}).
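A minimal sketch of the tagging step, using a tiny made-up lexicon in place of the full Hu & Liu dictionary:

## Toy polarity lookup: +1/-1 tags for words found in a (made-up) lexicon
lexicon <- data.frame(
    word   = c("hate", "bad", "like", "best", "happy"),
    weight = c(-1, -1, 1, 1, 1)
)

sentence <- c("i", "hate", "really", "bad", "dogs")
tags <- lexicon$weight[match(sentence, lexicon$word)]
tags[is.na(tags)] <- 0     ## words not in the lexicon are neutral
data.frame(word = sentence, tag = tags)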
The polarized context cluster (c_{i,j,l}) of words is pulled from around the polarized word (pw) and defaults to 4 words before and 2 words after pw to be considered as valence shifters. The cluster can be represented as (c_{i,j,l} = {pw_{i,j,k-nb}, ..., pw_{i,j,k}, ..., pw_{i,j,k+na}}), where nb & na are the parameters n.before and n.after set by the user. The words in this polarized context cluster are tagged as neutral (w_{i,j,k}^{0}), negator (w_{i,j,k}^{n}), amplifier (w_{i,j,k}^{a}), or de-amplifier (w_{i,j,k}^{d}). Neutral words hold no value in the equation but do affect word count (n). Each polarized word is then weighted (w) based on the weights from the polarity_dt argument and then further weighted by the function and number of the valence shifters directly surrounding the positive or negative word (pw).

Pause (cw) locations (punctuation that denotes a pause, including commas, colons, and semicolons) are indexed and considered in calculating the upper and lower bounds of the polarized context cluster. This is because these marks indicate a change in thought, and words prior are not necessarily connected with words after these punctuation marks. The lower bound of the polarized context cluster is constrained to max{pw_{i,j,k} - nb, 1, max{cw_{i,j,k} < pw_{i,j,k}}} and the upper bound is constrained to min{pw_{i,j,k} + na, w_{i,jn}, min{cw_{i,j,k} > pw_{i,j,k}}}, where w_{i,jn} is the number of words in the sentence.
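The bound constraints can be sketched directly in base R (an illustration of the formula above, not the package's internal indexing):

## Context cluster bounds for a polarized word at position pw_k in a sentence
## of n_words tokens, with pause marks (counted as words) at positions `commas`
cluster_bounds <- function(pw_k, n_words, commas, n.before = 4, n.after = 2) {
    lower <- max(c(pw_k - n.before, 1, commas[commas < pw_k]))
    upper <- min(c(pw_k + n.after, n_words, commas[commas > pw_k]))
    c(lower = lower, upper = upper)
}

## 9-word sentence, pause marks at positions 3 and 6, polarized word at position 8:
cluster_bounds(pw_k = 8, n_words = 9, commas = c(3, 6))
## lower upper
##     6     9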
The core value in the cluster, the polarized word, is acted upon by valence shifters. Amplifiers increase the polarity by 1.8 (.8 is the default weight (z)). Amplifiers (w_{i,j,k}^{a}) become de-amplifiers if the context cluster contains an odd number of negators (w_{i,j,k}^{n}). De-amplifiers work to decrease the polarity. Negation (w_{i,j,k}^{n}) acts on amplifiers/de-amplifiers as discussed but also flips the sign of the polarized word. Negation is determined by raising -1 to the power of the number of negators (w_{i,j,k}^{n}) plus 2. Simply, this is a result of a belief that two negatives equal a positive, 3 negatives a negative, and so on.
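The negation parity rule can be written out as a short sketch (illustrative only): an even count of negators leaves the sign alone, an odd count flips it.

## (-1) raised to (2 + parity of the negator count) flips the sign only for odd counts
negation_flip <- function(polarity, n_negators) {
    w_neg <- n_negators %% 2          ## parity of the negator count
    polarity * (-1)^(2 + w_neg)
}

negation_flip(1, 0)  ##  1  ("happy")
negation_flip(1, 1)  ## -1  ("not happy")
negation_flip(1, 2)  ##  1  ("not not happy" -- double negative)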
The "but" conjunctions (i.e., 'but', 'however', and 'although') also weight the context cluster. A but conjunction before the polarized word up-weights the cluster by 1.85 (.85 is the default weight (z_2)). A but conjunction after the polarized word down-weights the cluster by 1 - .85 (z_2). The number of occurrences before and after the polarized word are multiplied by 1 and -1 respectively and then summed within the context cluster. It is this value that is multiplied by the weight and added to 1. This corresponds to the belief that a but makes the next clause of greater value while lowering the value placed on the prior clause.
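And a sketch of the but-conjunction weight described above (z_2 = .85 by default; again just an illustration of the arithmetic, not the package code):

## w_b = 1 + z2 * (occurrences before the polarized word minus occurrences after)
but_weight <- function(n_before, n_after, z2 = 0.85) {
    1 + z2 * (n_before * 1 + n_after * -1)
}

but_weight(n_before = 1, n_after = 0)  ## 1.85: clause after 'but' is up-weighted
but_weight(n_before = 0, n_after = 1)  ## 0.15: clause before 'but' is down-weighted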
The researcher may provide a weight (z) to be utilized with amplifiers/de-amplifiers (default is .8; the de-amplifier weight is constrained to a -1 lower bound). Last, these weighted context clusters (c_{i,j,l}) are summed (c'_{i,j}) and divided by the square root of the word count (√w_{i,jn}), yielding an unbounded polarity score (δ_{i,j}) for each sentence.
$$\delta_{i,j} = \frac{c'_{i,j}}{\sqrt{w_{i,jn}}}$$

Where:

$$c'_{i,j} = \sum{((1 + w_{amp} + w_{deamp}) \cdot w_{i,j,k}^{p}(-1)^{2 + w_{neg}})}$$

$$w_{amp} = \sum{(w_{neg} \cdot (z \cdot w_{i,j,k}^{a}))}$$

$$w_{deamp} = \max(w_{deamp'}, -1)$$

$$w_{deamp'} = \sum{(z(-w_{neg} \cdot w_{i,j,k}^{a} + w_{i,j,k}^{d}))}$$

$$w_{b} = 1 + z_2 \cdot w_{b'}$$

$$w_{b'} = \sum{(|w_{but\,conjunction}|, ..., w_{i,j,k}^{p}, w_{i,j,k}^{p}, ..., |w_{but\,conjunction}| \cdot -1)}$$

$$w_{neg} = \left(\sum{w_{i,j,k}^{n}}\right) \bmod 2$$
To get the mean of all sentences (s_{i,j}) within a paragraph (p_i), simply take the average sentiment score: p_{i,δ_{i,j}} = 1/n ⋅ ∑ δ_{i,j}.
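Pulling the last two steps together as a sketch with made-up cluster sums (the real values come from the weighting scheme above):

## delta_ij = c'_ij / sqrt(word count); the cluster sums here are hypothetical
cluster_sums <- c(1.0, -2.0)   ## c'_ij for two sentences of one turn of talk
word_counts  <- c(4, 6)

delta <- cluster_sums / sqrt(word_counts)
delta         ## per-sentence unbounded polarity scores
mean(delta)   ## average sentiment for the paragraph (p_i)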
Installation
To download the development version of sentimentr:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/sentimentr")
Contact
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/sentimentr/issues
- send a pull request on: https://github.com/trinker/sentimentr/
- compose a friendly e-mail to: tyler.rinker@gmail.com
Examples
if (!require("pacman")) install.packages("pacman")
pacman::p_load(sentimentr)
mytext <- c(
'do you like it? But I hate really bad dogs',
'I am the best friend.',
'Do you really like it? I\'m not a fan'
)
sentiment(mytext)
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.5000000
## 2: 1 2 6 -2.6781088
## 3: 2 1 5 0.4472136
## 4: 3 1 5 0.8049845
## 5: 3 2 4 0.0000000
To aggregate by element (column cell or vector element) use sentiment_by with by = NULL.
mytext <- c(
'do you like it? But I hate really bad dogs',
'I am the best friend.',
'Do you really like it? I\'m not a fan'
)
sentiment_by(mytext)
## element_id word_count sd ave_sentiment
## 1: 1 10 2.247262 -1.0890544
## 2: 2 5 NA 0.4472136
## 3: 3 9 0.569210 0.4024922
To aggregate by grouping variables use sentiment_by with the by argument.
(out <- with(presidential_debates_2012, sentiment_by(dialogue, list(person, time))))
## person time word_count sd ave_sentiment
## 1: OBAMA time 1 3598 0.4397613 0.10966120
## 2: LEHRER time 1 765 0.3493838 0.10941383
## 3: OBAMA time 3 7241 0.4135144 0.09654523
## 4: OBAMA time 2 7476 0.3832811 0.08893467
## 5: ROMNEY time 3 8302 0.3909338 0.08108205
## 6: ROMNEY time 1 4085 0.3510066 0.06613552
## 7: SCHIEFFER time 3 1445 0.3772378 0.06515716
## 8: CROWLEY time 2 1672 0.2125288 0.05531121
## 9: ROMNEY time 2 7534 0.3188779 0.04946325
## 10: QUESTION time 2 583 0.3255268 0.03334828
plot(out)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
plot(uncombine(out))
Annie Swafford's Examples
Annie Swafford critiqued Jockers' approach to sentiment and gave the following examples of sentences (ase for Annie Swafford example). Here I test each of Jockers' 3 dictionary approaches (Bing, NRC, Afinn), his Stanford wrapper (note I use my own GitHub Stanford wrapper package based off of Jockers' approach as it works more reliably on my own Windows machine), and my own algorithm with both the default Hu & Liu (2004) polarity lexicon as well as Baccianella, Esuli and Sebastiani's (2010) SentiWord lexicon.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/sentimentr", "trinker/stansent")
pacman::p_load(syuzhet, qdap, microbenchmark)
ase <- c(
"I haven't been sad in a long time.",
"I am extremely happy today.",
"It's a good day.",
"But suddenly I'm only a little bit happy.",
"Then I'm not happy at all.",
"In fact, I am now the least happy person on the planet.",
"There is no happiness left in me.",
"Wait, it's returned!",
"I don't feel so bad after all!"
)
syuzhet <- setNames(as.data.frame(lapply(c("bing", "afinn", "nrc"),
function(x) get_sentiment(ase, method=x))), c("bing", "afinn", "nrc"))
left_just(data.frame(
stanford = sentiment_stanford(ase),
hu_liu = round(sentiment(ase, question.weight = 0)[["sentiment"]], 2),
sentiword = round(sentiment(ase, sentiword, question.weight = 0)[["sentiment"]], 2),
syuzhet,
sentences = ase,
stringsAsFactors = FALSE
), "sentences")
stanford hu_liu sentiword bing afinn nrc
1 0.5 0 0.27 -1 -2 0
2 -1 0.8 0.65 1 3 1
3 -0.5 0.5 0.32 1 3 1
4 0.5 0 0 1 3 1
5 0.5 -0.41 -0.56 1 3 1
6 0.5 0.06 0.05 1 3 1
7 0.5 -0.38 -0.05 1 2 1
8 0 0 -0.14 0 0 -1
9 0.5 0.38 0.24 -1 -3 -1
sentences
1 I haven't been sad in a long time.
2 I am extremely happy today.
3 It's a good day.
4 But suddenly I'm only a little bit happy.
5 Then I'm not happy at all.
6 In fact, I am now the least happy person on the planet.
7 There is no happiness left in me.
8 Wait, it's returned!
9 I don't feel so bad after all!
Also of interest is the time each of these methods takes. Here I replicate Annie's examples 100 times and microbenchmark with only a few iterations (Stanford takes so long I didn't extend to more). Note that if a text needs to be broken into sentence parts, syuzhet has the get_sentences function that uses the openNLP package; this is a time-expensive task. sentimentr uses a much faster regex-based approach that is nearly as accurate in parsing sentences with a much lower computational time. We see that Stanford takes the longest time while sentimentr and syuzhet are comparable, depending upon the lexicon used.
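For reference, sentence splitting with each package's own splitter looks like this (assuming both packages export a get_sentences() function, with sentimentr's version backed by its regex approach):

## Sentence splitting: syuzhet's openNLP-backed splitter vs. sentimentr's regex splitter
txt <- "I haven't been sad in a long time.  I am extremely happy today."
syuzhet::get_sentences(txt)
sentimentr::get_sentences(txt)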
ase_100 <- rep(ase, 100)
stanford <- function() {sentiment_stanford(ase_100)}
sentimentr_hu_liu <- function() sentiment(ase_100)
sentimentr_sentiword <- function() sentiment(ase_100, sentiword)
syuzhet_binn <- function() get_sentiment(ase_100, method="bing")
syuzhet_nrc <- function() get_sentiment(ase_100, method="nrc")
syuzhet_afinn <- function() get_sentiment(ase_100, method="afinn")
microbenchmark(
stanford(),
sentimentr_hu_liu(),
sentimentr_sentiword(),
syuzhet_binn(),
syuzhet_nrc(),
syuzhet_afinn(),
times = 3
)
Unit: milliseconds
                   expr        min         lq       mean     median         uq        max neval
             stanford() 20519.1232 20620.1182 20684.4025 20721.1132 20767.0422 20812.9712     3
    sentimentr_hu_liu()   224.5367   232.9833   238.5421   241.4299   245.5448   249.6597     3
 sentimentr_sentiword()   977.2767   980.9229   987.7338   984.5692   992.9624  1001.3556     3
         syuzhet_binn()   254.8387   293.6495   310.7012   332.4602   338.6324   344.8045     3
          syuzhet_nrc()   787.3683   790.1853   831.2212   793.0022   853.1477   913.2931     3
        syuzhet_afinn()   118.1905   138.8190   149.8055   159.4475   165.6131   171.7787     3