keywords_rake: Keyword identification using Rapid Automatic Keyword Extraction (RAKE)

Description

RAKE is a basic algorithm which tries to identify keywords in text. A keyword is defined as a sequence of words following one another. The algorithm works as follows.

1. Candidate keywords are extracted by looking for contiguous sequences of words which do not contain irrelevant words.
2. A score is calculated for each word which is part of any candidate keyword:
   - among the words of the candidate keywords, the algorithm counts how many times each word occurs and how many times it co-occurs with other words
   - each word gets a score which is the ratio of the word degree (how many times it co-occurs with other words) to the word frequency
3. A RAKE score for the full candidate keyword is calculated by summing the scores of the words which make up the candidate keyword.

The resulting keywords are returned as a data.frame together with their RAKE score.
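
The steps above can be traced by hand on a toy example. The following base R sketch is not the package's implementation; it assumes that the word degree counts co-occurrences with the other words of the same candidate (candidate length minus one), and it skips the aggregation of repeated keywords and the n_min filter. The token vector and all variable names are made up for illustration.

```r
# Toy token sequence (hypothetical); FALSE marks an irrelevant word.
tokens   <- c("deep", "learning", "is", "a", "branch", "of", "machine", "learning")
relevant <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Step 1: candidate keywords are maximal runs of consecutive relevant words.
run_id     <- cumsum(c(TRUE, relevant[-1] != relevant[-length(relevant)]))
candidates <- split(tokens[relevant], run_id[relevant])

# Step 2: per word, frequency and degree.
# ASSUMPTION: degree = number of other words in the same candidate.
words  <- unlist(candidates, use.names = FALSE)
others <- unname(rep(lengths(candidates) - 1, times = lengths(candidates)))
freq   <- vapply(split(others, words), length, integer(1))
degree <- vapply(split(others, words), sum, numeric(1))
word_score <- degree / freq

# Step 3: the score of a candidate is the sum of the scores of its words.
result <- data.frame(
  keyword = vapply(candidates, paste, character(1), collapse = " "),
  ngram   = lengths(candidates),
  rake    = vapply(candidates, function(w) sum(word_score[w]), numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(result)
```

Under this assumption "deep learning" and "machine learning" each score 2, while the single word "branch" scores 0 because it never co-occurs with another relevant word.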

Usage

keywords_rake(x, term, group, relevant = rep(TRUE, nrow(x)), ngram_max = 2,
  n_min = 2, sep = " ")

Arguments

x

a data.frame with one row per term as returned by as.data.frame(udpipe_annotate(...))

term

character string with the name of a column in the data frame x, containing 1 term per row. To be used if x is a data.frame.

group

a character vector with the names of one or several columns from x which indicate, for example, a document id or a sentence id. Keywords are computed within each group so that they do not span sentences or documents.

relevant

a logical vector of length nrow(x), indicating whether the word in the corresponding row of x is relevant or not. This can be used to exclude stopwords from the keyword calculation or to select only nouns and adjectives when finding keywords (for example based on the Parts of Speech upos output from udpipe_annotate).

ngram_max

integer indicating the maximum number of words that there should be in each keyword. Defaults to 2.

n_min

integer indicating how many times a keyword should at least occur in the data in order to be returned. Defaults to 2.

sep

character string with the separator which will be used to paste together the terms which define the keywords. Defaults to a space: ' '.
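
As an illustration of the relevant and group arguments, the sketch below builds the logical vector in two ways on a small hand-made data.frame. The data, the stopword list and the variable names are hypothetical; real input would come from as.data.frame(udpipe_annotate(...)). The call to keywords_rake only runs if the udpipe package is installed.

```r
# Hypothetical hand-made annotation with one row per term.
x <- data.frame(
  doc_id      = c("d1", "d1", "d1", "d1", "d2", "d2", "d2"),
  sentence_id = c(1, 1, 1, 1, 1, 1, 1),
  lemma       = c("the", "quiet", "room", "be", "a", "quiet", "room"),
  upos        = c("DET", "ADJ", "NOUN", "AUX", "DET", "ADJ", "NOUN"),
  stringsAsFactors = FALSE
)

# Option 1: keep only nouns and adjectives, based on the upos column.
relevant_pos <- x$upos %in% c("NOUN", "ADJ")

# Option 2: drop words found in a (hypothetical) stopword list.
stopwords     <- c("the", "a", "be")
relevant_stop <- !(x$lemma %in% stopwords)

# Grouping by document and sentence keeps keywords from crossing sentences.
if (requireNamespace("udpipe", quietly = TRUE)) {
  keywords <- udpipe::keywords_rake(x, term = "lemma",
                                    group = c("doc_id", "sentence_id"),
                                    relevant = relevant_pos,
                                    ngram_max = 2, n_min = 2)
  print(keywords)
}
```

Both options mark the same four rows as relevant here; on real data they usually differ, so the choice depends on whether part-of-speech tags or a stopword list better describes what counts as irrelevant.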

Value

a data.frame with columns keyword, ngram, freq and rake which is ordered from low to high rake

keyword: the keyword
ngram: how many terms are in the keyword
freq: how many times did the keyword occur
rake: the ratio of the degree to the frequency as explained in the description, summed up for all words from the keyword
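
Because the result is a plain data.frame, it can be filtered and sorted with ordinary subsetting. The sketch below uses a stand-in data.frame with the documented columns; the keywords and numbers are placeholders, not output of keywords_rake.

```r
# Stand-in for a keywords_rake() result; all values are placeholders.
keywords <- data.frame(
  keyword = c("quiet room", "room", "city centre", "nice"),
  ngram   = c(2L, 1L, 2L, 1L),
  freq    = c(5L, 12L, 3L, 7L),
  rake    = c(2.0, 0.8, 1.5, 0.3),
  stringsAsFactors = FALSE
)

# Keep multi-word keywords that occur at least 3 times.
multiword <- keywords[keywords$ngram > 1 & keywords$freq >= 3, ]

# Sort explicitly by RAKE score instead of relying on the row order.
multiword <- multiword[order(multiword$rake, decreasing = TRUE), ]
print(multiword)
```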

References

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In Text Mining: Applications and Theory, pp. 1-20. doi:10.1002/9780470689646.ch1.

Examples

library(udpipe)

# Dutch reviews: keywords per document, keeping only nouns and adjectives
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"))
head(keywords)

# French reviews: keywords per sentence, up to 10 terms, joined with "-"
x <- subset(brussels_reviews_anno, language == "fr")
keywords <- keywords_rake(x = x, term = "lemma", group = c("doc_id", "sentence_id"),
                          relevant = x$xpos %in% c("NN", "JJ"),
                          ngram_max = 10, n_min = 2, sep = "-")
head(keywords)