lexRank: Extractive text summarization with LexRank

Description

Compute LexRank scores for a vector of documents using either the PageRank algorithm or degree centrality. The methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."

Usage

lexRank(text, docId = "create", threshold = 0.2, n = 3, returnTies = TRUE, usePageRank = TRUE, damping = 0.85, continuous = FALSE, sentencesAsDocs = FALSE, removePunc = TRUE, removeNum = TRUE, toLower = TRUE, stemWords = TRUE, rmStopWords = TRUE, Verbose = TRUE)

Arguments

text

A character vector of documents to be cleaned and processed by the LexRank algorithm

docId

A vector of document IDs with length equal to the length of text. If docId == "create" then doc IDs will be created as an index from 1 to n, where n is the length of text.

threshold

The minimum similarity value a sentence pair must have to be represented in the graph where lexRank is calculated.

n

The number of sentences to return as the extractive summary. The function will return the top n lexRanked sentences. See returnTies for handling ties in lexRank.

returnTies

TRUE or FALSE indicating whether or not to return more than n sentence IDs if there is a tie in lexRank. If TRUE, the returned number of sentences will not be limited to n, but rather will include every sentence with a top n score. If FALSE, the returned number of sentences will be <= n. Defaults to TRUE.

usePageRank

TRUE or FALSE indicating whether or not to use the PageRank algorithm for ranking sentences. If FALSE, a sentence's unweighted degree centrality will be used as the rank. Defaults to TRUE.

damping

The damping factor to be passed to the PageRank algorithm. Ignored if usePageRank is FALSE.

continuous

TRUE or FALSE indicating whether or not to use continuous LexRank. Only applies if usePageRank==TRUE. If TRUE, threshold will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to FALSE.

sentencesAsDocs

TRUE or FALSE, indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If TRUE, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).

removePunc

TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE, punctuation will be removed. Defaults to TRUE.

removeNum

TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE, numbers will be removed. Defaults to TRUE.

toLower

TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE, text will be coerced to lowercase. Defaults to TRUE.

stemWords

TRUE or FALSE indicating whether or not to stem resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords

TRUE, FALSE, or character vector of stopwords to remove from tokens. If TRUE, words in tm::stopwords("SMART") will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE.

Verbose

TRUE or FALSE indicating whether or not to cat progress messages to the console while running. Defaults to TRUE.
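
The cleaning arguments above, and the inverse sentence frequency used when sentencesAsDocs is TRUE, can be illustrated with a rough base R sketch. This is not the function's internal code: clean_tokens is a hypothetical helper, the short stop word list stands in for the SMART list, stemming is left out, and the log(N / frequency) weighting is an assumed, textbook form of idf rather than the exact formula the function uses.

```r
# Hypothetical helper (not part of the package): mimics the cleaning switches.
clean_tokens <- function(x, removePunc = TRUE, removeNum = TRUE,
                         toLower = TRUE, stopWords = character(0)) {
  if (toLower) x <- tolower(x)
  if (removePunc) x <- gsub("[[:punct:]]+", " ", x)
  if (removeNum) x <- gsub("[[:digit:]]+", " ", x)
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% stopWords]
}

sentences <- c("This is a test.", "Tests are fun.",
               "Is an exam the same as a test?")

# Tiny illustrative stop word list (an assumption, not the SMART list).
stopWords <- c("is", "a", "an", "the", "are", "as")
tokenList <- lapply(sentences, clean_tokens, stopWords = stopWords)

# Inverse sentence frequency: every sentence counts as one "document".
vocab <- sort(unique(unlist(tokenList)))
sentFreq <- sapply(vocab, function(w) {
  sum(sapply(tokenList, function(tk) w %in% tk))
})
isf <- log(length(tokenList) / sentFreq)
print(isf)
```

Terms that occur in many sentences (here "test") receive a lower weight than terms confined to a single sentence, which is what makes sentence-level frequencies useful when summarizing a single document.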

Value

A 2 column dataframe with columns sentenceId and value. sentenceId contains the ids of the top n sentences in descending order by value. value contains the PageRank score (if usePageRank==TRUE) or degree centrality (if usePageRank==FALSE).
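
To make the two kinds of value concrete, the following base R sketch ranks four sentences from a made-up similarity matrix. It is an illustration under stated assumptions, not the package's implementation: page_rank is a hypothetical power-iteration helper, the similarity values are invented, and the tie rule (keep every row whose value is at least the n-th largest) is one plausible reading of returnTies.

```r
# Made-up symmetric similarity matrix for four sentences.
sim <- matrix(c(1.0, 0.5, 0.1, 0.0,
                0.5, 1.0, 0.3, 0.0,
                0.1, 0.3, 1.0, 0.4,
                0.0, 0.0, 0.4, 1.0), nrow = 4, byrow = TRUE)
diag(sim) <- 0  # ignore self-similarity

# Thresholded, unweighted graph: keep pairs with similarity >= threshold.
threshold <- 0.2
adj <- (sim >= threshold) * 1

# Degree centrality (the usePageRank = FALSE case).
degree <- rowSums(adj)

# Hypothetical helper: PageRank by power iteration on a weight matrix.
page_rank <- function(w, damping = 0.85, iters = 100) {
  n <- nrow(w)
  out <- rowSums(w)
  p <- w / ifelse(out > 0, out, 1)  # row-normalise the weights
  p[out == 0, ] <- 1 / n            # isolated rows jump uniformly
  r <- rep(1 / n, n)
  for (i in seq_len(iters)) {
    r <- (1 - damping) / n + damping * as.vector(r %*% p)
  }
  r
}

thresholded <- page_rank(adj)  # usePageRank = TRUE, continuous = FALSE
weighted <- page_rank(sim)     # continuous = TRUE: threshold is ignored

# Tie handling on the degree scores with n = 1.
ranked <- data.frame(sentenceId = seq_along(degree), value = degree)
ranked <- ranked[order(-ranked$value), ]
n <- 1
cutoff <- sort(ranked$value, decreasing = TRUE)[n]
withTies <- ranked[ranked$value >= cutoff, ]  # like returnTies = TRUE
withoutTies <- head(ranked, n)                # like returnTies = FALSE
```

In this toy graph sentences 2 and 3 share the highest degree, so the tie-preserving selection keeps both rows while the strict selection keeps only one.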

References

http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Examples

lexRank(c("This is a test.","Tests are fun.",
"Do you think the exam will be hard?","Is an exam the same as a test?",
"How many questions are going to be on the exam?"))
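
A further example, offered as a sketch: it assumes the function is provided by the lexRankr package and only uses arguments listed under Usage. It asks for the two highest-ranked sentences by degree centrality, without ties and without progress messages. The call is guarded so the snippet does nothing if the package is not installed.

```r
docs <- c("This is a test.", "Tests are fun.",
          "Do you think the exam will be hard?",
          "Is an exam the same as a test?",
          "How many questions are going to be on the exam?")

# Assumption: lexRank() lives in the lexRankr package.
if (requireNamespace("lexRankr", quietly = TRUE)) {
  top <- lexRankr::lexRank(docs, n = 2, returnTies = FALSE,
                           usePageRank = FALSE, Verbose = FALSE)
  print(top)
}
```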
