grab.fragments: Grab all fragments in a corpus that contain a given phrase.

Description

Search a corpus for the given phrase, which may use a simple wildcard notation. Return snippets of text containing the phrase, with a specified number of characters before and after it. This gives context for how the phrase is used in the documents.

Typical use: frags = grab.fragments( "israel", bigcorp )

The phrase can contain wildcards. A phrase such as 'appl+' matches any word starting with "appl". A phrase such as "big * city" matches any three-word phrase with "big" as the first word and "city" as the third word.
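The wildcard notation above can be read as a small pattern language. The sketch below shows one way such a phrase could be translated into a regular expression using base R only. The helper phrase_to_regex is hypothetical and is not part of the package; it illustrates the notation and makes no claim about how grab.fragments is implemented.

```r
# Hypothetical helper (not part of the package): translate the wildcard
# notation into a regular expression. Words are assumed to contain no
# regex metacharacters other than a trailing "+".
phrase_to_regex <- function(phrase) {
  words <- strsplit(phrase, " +")[[1]]
  parts <- vapply(words, function(w) {
    if (w == "*") {
      "\\w+"                                       # "*" stands for any one word
    } else if (endsWith(w, "+")) {
      paste0(substr(w, 1, nchar(w) - 1), "\\w*")   # "appl+": words starting with "appl"
    } else {
      w                                            # plain word, matched literally
    }
  }, character(1), USE.NAMES = FALSE)
  paste0("\\b", paste(parts, collapse = "\\s+"), "\\b")
}

# "appl+" matches "apple" as a word start, but not the "appl" inside "pineapple"
grepl(phrase_to_regex("appl+"), c("an apple a day", "pineapple"), perl = TRUE)

# "big * city" needs exactly one word between "big" and "city"
grepl(phrase_to_regex("big * city"), c("the big apple city", "big city"), perl = TRUE)
```

Under these assumptions, both calls give TRUE for the first string and FALSE for the second.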

If a pattern matches overlapping phrases, only the first match is returned; the overlapping second match is skipped.
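Base R regular expressions behave in a similar way, which makes the overlap rule easy to see. The sketch below assumes that "test *" corresponds roughly to the regular expression shown; that regular expression is an illustration, not the package's own pattern.

```r
# Assumed regex equivalent of the phrase "test *" (illustration only)
pattern <- "\\btest\\s+\\w+"
doc <- "a test test word"

# gregexpr() finds non-overlapping matches from left to right:
# "test test" is matched first, so the overlapping "test word" is skipped
hits <- regmatches(doc, gregexpr(pattern, doc, perl = TRUE))[[1]]
hits
```

Here hits contains the single match "test test". The candidate "test word" begins inside the first match and is therefore never reported.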

Usage

grab.fragments(phrase, corp, char.before = 80, char.after = char.before, cap.phrase = TRUE, clean = FALSE)

Arguments

phrase

Phrase to find in the corpus. May include the wildcards described above.

corp

A tm corpus.

char.before

Number of characters of the document to include before the phrase, to give context.

char.after

As above, but for the characters after the phrase. Defaults to the value of char.before.

cap.phrase

TRUE if the phrase should be put in ALL CAPS. FALSE if it should be left as is.

clean

TRUE means drop all documents without the phrase from the list. FALSE means leave NULLs in the list.

Value

Fragments in corp that contain the given phrase, as a list of lists. With clean = FALSE, the outer list has length(corp) entries: NULL for documents without the phrase, and a list of fragments for documents with the phrase.
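The nested structure can be handled with ordinary list tools. The sketch below uses a hand-built stand-in for the return value, so it runs without the package. The fragment strings are invented for illustration and are not actual output of grab.fragments.

```r
# Hand-built stand-in for a return value: one entry per document,
# NULL where the phrase was not found (fragment text is invented)
frags <- list(
  list("4321 TEST 1234"),
  NULL,
  list("a TEST for", "a TEST for")
)

# Which documents contain the phrase?
has_hit <- !vapply(frags, is.null, logical(1))
which(has_hit)

# Number of fragments per document (0 for the NULL entries)
lengths(frags)

# Dropping the NULL entries by hand, similar in spirit to clean = TRUE
cleaned <- frags[has_hit]
```

Keeping the NULL entries preserves the alignment between list positions and document positions in the corpus. Dropping them loses that alignment but gives a more compact list.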

Examples

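library( tm )
library( textreg )  # assumption: grab.fragments is provided by the textreg package

# Seven short documents; all but "without the t-word" contain "test"
docs = c( "987654321 test 123456789", "987654321 test test word 123456789",
       "test at start", "a test b", "this is a test", "without the t-word",
       "a test for you and a test for me" )
corpus <- Corpus(VectorSource(docs))

# Find "test" followed by any one word, with 4 characters of context on each side
grab.fragments( "test *", corpus, char.before=4, char.after=4 )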