textrank: Textrank - extract relevant sentences

Description

The textrank algorithm is a technique to rank sentences in order of importance.

To find relevant sentences, the textrank algorithm needs 2 inputs: a data.frame (data) with sentences and a data.frame (terminology) containing the tokens which are part of each sentence. Based on these 2 datasets, it calculates the pairwise distance between sentences by computing how many terms overlap (Jaccard distance, implemented in textrank_jaccard). These pairwise distances are then passed on to Google's pagerank algorithm to identify the most relevant sentences.
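As an illustration of the overlap measure described above, here is a minimal base R sketch of a Jaccard computation on two token vectors. The helper name jaccard_sketch and the example tokens are hypothetical; the package's own textrank_jaccard may differ in detail.

```r
# Hypothetical helper (not part of the textrank package):
# share of unique tokens that two sentences have in common.
jaccard_sketch <- function(tokens_a, tokens_b) {
  tokens_a <- unique(tokens_a)
  tokens_b <- unique(tokens_b)
  length(intersect(tokens_a, tokens_b)) / length(union(tokens_a, tokens_b))
}

# Invented example tokens for two sentences
s1 <- c("data", "scientist", "experience", "statistics")
s2 <- c("data", "engineer", "experience")

# 2 shared tokens out of 5 distinct tokens
jaccard_sketch(s1, s2)
# [1] 0.4
```

A larger value means the two sentences share more of their tokens, which is what makes the value usable as an edge weight between sentences.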

If data contains many sentences, it makes sense not to compute all pairwise sentence distances but instead to limit the calculation of the Jaccard distance to the sentence combinations selected by the Minhash algorithm. This is implemented in textrank_candidates_lsh and an example is shown below.
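To see why limiting the candidate pairs matters, the following base R lines count how many sentence pairs a full pairwise comparison needs. The sentence counts are arbitrary illustration values.

```r
# Number of unordered sentence pairs for a few (arbitrary) corpus sizes
n_sentences <- c(10, 100, 1000, 10000)
n_pairs <- choose(n_sentences, 2)
data.frame(n_sentences, n_pairs)
```

The number of pairs grows quadratically with the number of sentences, so reducing the candidate set pays off quickly on larger inputs.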

Usage

textrank(data, terminology, textrank_dist = textrank_jaccard,
  textrank_candidates = textrank_candidates_all(data$textrank_id),
  max = 1000, options_pagerank = list(directed = FALSE), ...)

Arguments

data

a data.frame with 1 row per sentence where the first column is an identifier of a sentence (e.g. textrank_id) and the second column is the raw sentence. See the example.

terminology

a data.frame with one row per token indicating which token is part of each sentence. The first column in this data.frame is the identifier which corresponds to the first column of data, and the second column contains the token which is part of that sentence and which will be passed on to textrank_dist. See the example.

textrank_dist

a function which calculates the distance between 2 sentences, each represented by a vector of tokens. The first 2 arguments of the function are the tokens in sentence1 and sentence2. The function should return a numeric value of length one. The larger the value, the stronger the connection between the 2 vectors. Defaults to the jaccard distance (textrank_jaccard), indicating the percent of common tokens.
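A custom function only has to follow the contract described above: two token vectors in, one number out, larger meaning more strongly connected. A minimal sketch, with overlap_count as a hypothetical name and invented tokens:

```r
# Hypothetical custom distance: the number of unique tokens two sentences share.
# The ... is accepted so that extra arguments passed along do not cause an error.
overlap_count <- function(termsa, termsb, ...) {
  length(intersect(termsa, termsb))
}

# "data" and "team" are shared
overlap_count(c("data", "science", "team"), c("data", "team", "lead"))
# [1] 2
```

Going by the Usage section, such a function would be supplied as textrank(data, terminology, textrank_dist = overlap_count).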

textrank_candidates

a data.frame of candidate sentence to sentence comparisons with columns textrank_id_1 and textrank_id_2 indicating for which combination of sentences we want to compute the Jaccard distance or the distance function as provided in textrank_dist. See for example textrank_candidates_all or textrank_candidates_lsh.
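The expected shape of this data.frame can be sketched in base R by enumerating all unordered pairs of sentence identifiers. This only illustrates the column layout under the assumption of four invented identifiers; it is not the code of textrank_candidates_all.

```r
# Invented sentence identifiers
sentence_ids <- c(1, 2, 3, 4)

# All unordered pairs, one pair per column of the result
pairs <- utils::combn(sentence_ids, 2)

# One row per sentence-to-sentence comparison
candidates <- data.frame(textrank_id_1 = pairs[1, ],
                         textrank_id_2 = pairs[2, ])
candidates
```

Each row names one pair of sentences for which the distance function will be evaluated.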

max

integer limiting the number of sentence-to-sentence combinations to compute. If provided, only this maximum number of rows is taken from textrank_candidates.

options_pagerank

a list of arguments passed on to page_rank

...

arguments passed on to textrank_dist

Value

an object of class textrank which is a list with elements:

sentences: a data.frame with columns textrank_id, sentence and textrank where the textrank is the Google Pagerank importance metric of the sentence
sentences_dist: a data.frame with columns textrank_id_1, textrank_id_2 (the sentence id) and weight which is the result of the computed distance between the 2 sentences
pagerank: the result of a call to page_rank
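To give an intuition for how the textrank score follows from the weights in sentences_dist, here is a minimal base R sketch of PageRank by power iteration on an undirected weighted graph. It is not the page_rank implementation the package calls; the edge weights are invented, and the damping factor of 0.85 and the 100 iterations are assumptions chosen for illustration.

```r
# Invented weighted edges, shaped like sentences_dist
edges <- data.frame(
  textrank_id_1 = c(1, 1, 2),
  textrank_id_2 = c(2, 3, 3),
  weight        = c(0.5, 0.2, 0.1)
)

ids <- sort(unique(c(edges$textrank_id_1, edges$textrank_id_2)))
n <- length(ids)

# Symmetric weight matrix, since the graph is undirected
W <- matrix(0, n, n, dimnames = list(as.character(ids), as.character(ids)))
i <- match(edges$textrank_id_1, ids)
j <- match(edges$textrank_id_2, ids)
W[cbind(i, j)] <- edges$weight
W[cbind(j, i)] <- edges$weight

# Column j holds the probabilities of moving from sentence j to each other sentence
# (assumes every sentence has at least one edge, so no column sums to zero)
P <- sweep(W, 2, colSums(W), "/")

damping <- 0.85            # assumed damping factor
score <- rep(1 / n, n)     # start from a uniform distribution
for (iter in 1:100) {
  score <- (1 - damping) / n + damping * as.vector(P %*% score)
}
names(score) <- as.character(ids)
sort(score, decreasing = TRUE)
```

Sentences that are strongly connected to many other sentences end up with the highest scores, which is why they are reported as the most relevant ones.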

Examples

# NOT RUN {
library(textrank)
data(joboffer)
head(joboffer)

sentences <- unique(joboffer[, c("sentence_id", "sentence")])
cat(sentences$sentence)
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("sentence_id", "lemma"))
head(terminology)

## Textrank for finding the most relevant sentences
tr <- textrank(data = sentences, terminology = terminology)
summary(tr, n = 2)
summary(tr, n = 5, keep.sentence.order = TRUE)

## Using minhash to reduce sentence combinations - relevant if you have a lot of sentences
library(textreuse)
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$sentence_id,
                                      minhashFUN = minhash, bands = 500)
tr <- textrank(data = sentences, terminology = terminology, textrank_candidates = candidates)
summary(tr, n = 2)

## You can also reduce the number of sentence combinations by sampling
tr <- textrank(data = sentences, terminology = terminology, max = 100)
summary(tr, n = 2)
# }
