sentimentr (version 2.6.1)

profanity: Compute Profanity Rate

Description

Detect the rate of profanity at the sentence level. This method uses a simple dictionary lookup to find profane words and then computes the rate per sentence. The profanity score ranges between 0 (no profanity used) and 1 (all words used were profane). Note that a single profane phrase counts as just one in the profanity_count column but counts as two words in the word_count column.
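For intuition, the per-sentence score is the number of profane tokens divided by the sentence's word count. A minimal, hand-rolled sketch of that arithmetic in base R (the word list and sentence below are made-up placeholders, not the package internals):

toy_list <- c("darn", "heck")                   # placeholder "profane" words
sent <- c("well", "darn", "that", "is", "bad")  # one tokenized sentence
sum(sent %in% toy_list) / length(sent)          # 1 profane word / 5 words = 0.2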

Usage

profanity(text.var, profanity_list = lexicon::profanity_alvarez, ...)

Arguments

text.var

The text variable. Can be a get_sentences object or a raw character vector, though get_sentences is preferred as it avoids the repeated cost of doing sentence boundary disambiguation every time profanity is run.

profanity_list

An atomic character vector of profane words. The lexicon package has lists that can be used (a short usage sketch follows the argument list), including:

  • lexicon::profanity_alvarez

  • lexicon::profanity_arr_bad

  • lexicon::profanity_banned

  • lexicon::profanity_zac_anger

  • lexicon::profanity_racist

...

ignored.
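A hedged sketch of supplying an alternative lexicon list or a custom atomic character vector (the custom words and the txt input below are illustrative placeholders):

library(sentimentr)

txt <- c("Well darn, that is annoying.", "What a nice day.")  # toy input

## use one of the shipped lexicon lists explicitly
profanity(txt, profanity_list = lexicon::profanity_banned)

## or supply your own character vector of profane words
profanity(txt, profanity_list = c("darn", "heck"))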

Value

Returns a data.table of the following columns (a brief aggregation sketch follows the list):

  • element_id - The id number of the original vector passed to profanity

  • sentence_id - The id number of the sentences within each element_id

  • word_count - Word count

  • profanity_count - Count of the number of profane words

  • profanity - The rate of profane words in the sentence, ranging from 0 (none) to 1 (all words profane)
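For example (a hedged sketch assuming the data.table package is loaded; the input text and profanity_list are placeholders), the sentence-level rows can be rolled back up to the original elements by averaging profanity over element_id:

library(sentimentr)
library(data.table)

txt <- get_sentences(c("Well darn. What a day.", "All good here."))  # toy input
out <- profanity(txt, profanity_list = c("darn"))                    # placeholder list
out[, list(ave_profanity = mean(profanity, na.rm = TRUE)), by = "element_id"]

If your version of sentimentr provides profanity_by, that function offers a similar grouped average directly.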

Examples

# NOT RUN {
bw <- sample(lexicon::profanity_alvarez, 4)
mytext <- c(
   sprintf('do you like this %s?  It is %s. But I hate really bad dogs', bw[1], bw[2]),
   'I am the best friend.',
   NA,
   sprintf('I %s hate this %s', bw[3], bw[4]),
   "Do you really like it?  I'm not happy"
)

## works on a character vector, but this is not the preferred method as it pays
## the repeated cost of sentence boundary disambiguation every time
## `profanity` is run
profanity(mytext)

## preferred method, avoiding that repeated cost
mytext2 <- get_sentences(mytext)
profanity(mytext2)

plot(profanity(mytext2))

brady <- get_sentences(crowdflower_deflategate)
brady_swears <- profanity(brady)
brady_swears
hist(brady_swears$profanity)
sum(brady_swears$profanity > 0)
hist(brady_swears$profanity[brady_swears$profanity > 0])
# }
