
Detect the rate of profanity at the sentence level. This method uses a simple dictionary lookup to find profane words and then computes the rate of profanity per sentence. The profanity score ranges between 0 (no profanity used) and 1 (all words used were profane). Note that a single profane phrase counts as just one in the profanity_count column but counts as two words in the word_count column.
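For instance, a minimal sketch of the per-sentence rate (the sentence and the expected counts below are illustrative and assume 'damn' appears in the default Alvarez list):

## Minimal sketch of the scoring: 6 words, 1 profane word -> profanity = 1/6
## (assumes 'damn' is present in the default Alvarez list)
x <- get_sentences("This dinner was damn good, honestly.")
profanity(x)
## Expected (if 'damn' is matched): word_count = 6, profanity_count = 1, profanity ~ 0.17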
profanity(
  text.var,
  profanity_list = unique(tolower(lexicon::profanity_alvarez)),
  ...
)
text.var: The text variable. Can be a get_sentences object or a raw character vector, though get_sentences is preferred as it avoids the repeated cost of doing sentence boundary disambiguation every time profanity is run.
profanity_list: An atomic character vector of profane words. The lexicon package has lists that can be used, including:
unique(tolower(lexicon::profanity_alvarez))
lexicon::profanity_arr_bad
lexicon::profanity_banned
lexicon::profanity_zac_anger
lexicon::profanity_racist
...: ignored.
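As a hedged sketch of supplying a custom profanity_list (the particular combination below is only illustrative, not a recommendation of any specific list):

## Illustrative custom list: combine two lexicon lists, lower-case, de-duplicate
my_profanity <- unique(tolower(c(
    lexicon::profanity_arr_bad,
    lexicon::profanity_zac_anger
)))
profanity(get_sentences("Some text to score."), profanity_list = my_profanity)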
Returns a data.table of:
element_id - The id number of the original vector passed to profanity
sentence_id - The id number of the sentences within each element_id
word_count - Word count
profanity_count - Count of the number of profane words
profanity - The proportion of profane words in the sentence (between 0 and 1)
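A minimal sketch of rolling the sentence-level rows up to one row per element_id, assuming the columns listed above (illustrative only; profanity_by() is the package's grouped summary):

## Aggregate sentence-level counts back to the original elements
library(data.table)
txt <- c("I hate this damn thing. It is fine otherwise.", "All good here.")
out <- profanity(get_sentences(txt))
out[, list(
    word_count      = sum(word_count, na.rm = TRUE),
    profanity_count = sum(profanity_count, na.rm = TRUE)
), by = "element_id"][, profanity := profanity_count / word_count][]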
Other profanity functions:
profanity_by()
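For example, a hedged sketch of the grouped counterpart (this assumes profanity_by() accepts grouping variables through a by argument, analogous to sentiment_by()):

## Hypothetical grouping column 'speaker'; adjust to your own data
dat <- data.frame(
    speaker = c("a", "a", "b"),
    text    = c("Damn, that hurt.", "All good now.", "A clean sentence."),
    stringsAsFactors = FALSE
)
with(dat, profanity_by(get_sentences(text), by = list(speaker)))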
# NOT RUN {
bw <- sample(unique(tolower(lexicon::profanity_alvarez)), 4)
mytext <- c(
    sprintf('do you like this %s? It is %s. But I hate really bad dogs', bw[1], bw[2]),
    'I am the best friend.',
    NA,
    sprintf('I %s hate this %s', bw[3], bw[4]),
    "Do you really like it? I'm not happy"
)
## Works on a character vector, but this is not the preferred method: it pays
## the repeated cost of sentence boundary disambiguation every time
## `profanity` is run
profanity(mytext)
## Preferred method: pre-split the sentences once to avoid paying that cost repeatedly
mytext2 <- get_sentences(mytext)
profanity(mytext2)
plot(profanity(mytext2))
brady <- get_sentences(crowdflower_deflategate)
brady_swears <- profanity(brady)
brady_swears
## Distribution of profanity proportion for all comments
hist(brady_swears$profanity)
sum(brady_swears$profanity > 0)
## Distribution of proportions for those profane comments
hist(brady_swears$profanity[brady_swears$profanity > 0])
## Combine the crowdflower data sets shipped with sentimentr into one corpus
combo <- combine_data()
combo_sentences <- get_sentences(combo)
racist <- profanity(combo_sentences, profanity_list = lexicon::profanity_racist)
combo_sentences[racist$profanity > 0, ]$text
extract_profanity_terms(
    combo_sentences[racist$profanity > 0, ]$text,
    profanity_list = lexicon::profanity_racist
)
## Remove jerry, que, and illegal from the list
library(textclean)
racist2 <- profanity(
    combo_sentences,
    profanity_list = textclean::drop_element_fixed(
        lexicon::profanity_racist,
        c('jerry', 'illegal', 'que')
    )
)
combo_sentences[racist2$profanity > 0, ]$text
# }