
textreg (version 0.1.3)

grab.fragments: Grab all fragments in a corpus with given phrase.

Description

Search a corpus for the passed phrase, which may use some wildcard notation. Return snippets of text containing this phrase, with a specified number of characters before and after. This gives context for phrases in documents.

Use like this: frags = grab.fragments( "israel", bigcorp )

Can take phrases such as "appl+", which matches any word starting with "appl". Can also take phrases such as "big * city", which matches any three-word phrase with "big" as the first word and "city" as the third word.
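To make the wildcard notation concrete, here is a minimal sketch that translates a phrase into a regular expression using base R only. The helper name phrase_to_regex and the translation itself are assumptions made for illustration; this is not textreg's internal code, and it does not escape regex metacharacters inside the words.

```r
# Hypothetical helper (not part of textreg): turn the wildcard notation
# described above into a Perl-style regular expression.
phrase_to_regex <- function(phrase) {
  words <- strsplit(phrase, " +")[[1]]
  parts <- vapply(words, function(w) {
    if (w == "*") {
      "\\w+"                                       # "*" stands for any single word
    } else if (endsWith(w, "+")) {
      paste0(substr(w, 1, nchar(w) - 1), "\\w*")   # "appl+": words starting with "appl"
    } else {
      w                                            # plain word, matched literally
    }
  }, character(1))
  paste0("\\b", paste(parts, collapse = "\\s+"), "\\b")
}

# "appl+" matches "apple" and "applied" but not "maple"
grepl(phrase_to_regex("appl+"),
      c("an apple a day", "applied math", "a maple tree"), perl = TRUE)
# TRUE TRUE FALSE

# "big * city" needs exactly one word between "big" and "city"
grepl(phrase_to_regex("big * city"),
      c("the big apple city", "big city", "a big old city"), perl = TRUE)
# TRUE FALSE TRUE
```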

If a pattern matches overlapping phrases, it will return the first but not the second.
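The overlap rule can be illustrated with base R's own non-overlapping matching. This sketch assumes that "test *" behaves like the regular expression below; it shows the general idea and is not a trace of what grab.fragments does internally.

```r
text <- "987654321 test test word 123456789"

# Assumed regex equivalent of the phrase "test *": "test" followed by one word.
pattern <- "\\btest\\s+\\w+"

# gregexpr() resumes searching after the end of each match, so the
# overlapping candidate "test word" is never reported.
hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
hits
# "test test"
```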

Usage

grab.fragments(phrase, corp, char.before = 80, char.after = char.before, cap.phrase = TRUE, clean = FALSE)

Arguments

phrase
Phrase to find in corpus
corp
A tm corpus.
char.before
Number of characters of document to pull before phrase to give context.
char.after
As above, but trailing characters. Defaults to char.before value.
cap.phrase
TRUE if the matched phrase should be put in ALL CAPS, FALSE if it should be left as is.
clean
TRUE means drop all documents without the phrase from the list. FALSE means leave NULLs in the list.
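As a rough picture of what char.before, char.after and cap.phrase control, the following base R sketch cuts a context window around a single literal match. The variable names and the exact character counting are assumptions for illustration; grab.fragments may handle boundaries differently.

```r
text <- "987654321 test 123456789"
char_before <- 4   # plays the role of char.before
char_after  <- 4   # plays the role of char.after

m   <- regexpr("test", text, fixed = TRUE)
pos <- as.integer(m)             # start position of the match
len <- attr(m, "match.length")   # length of the match

# Clamp the window so it never runs past either end of the document.
before  <- substr(text, max(1, pos - char_before), pos - 1)
matched <- substr(text, pos, pos + len - 1)
after   <- substr(text, pos + len, min(nchar(text), pos + len - 1 + char_after))

# Upper-casing the match mimics the effect described for cap.phrase = TRUE.
fragment <- paste0(before, toupper(matched), after)
fragment
# "321 TEST 123"
```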

Value

Fragments in corp that contain the given phrase, as a list of lists. The outer list has length(corp) elements: NULL for documents without the phrase, and a list of fragments for documents with the phrase.
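A result of this shape can be post-processed with base R. The list below is a hand-built stand-in, not real output of grab.fragments, and it assumes each fragment is stored as a character string.

```r
# Hand-made stand-in for a result over three documents (hypothetical values).
frags <- list(
  list("a TEST b"),               # document 1: one fragment
  NULL,                           # document 2: phrase not found
  list("x TEST y", "z TEST w")    # document 3: two fragments
)

# Number of fragments per document; NULL entries count as zero.
hits_per_doc <- lengths(frags)

# Drop documents without the phrase, similar in spirit to clean = TRUE.
with_hits <- Filter(Negate(is.null), frags)

# Flatten everything into a single character vector of fragments.
all_frags <- unlist(with_hits)
```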

Examples

library( tm )
docs = c( "987654321 test 123456789", "987654321 test test word 123456789",
       "test at start", "a test b", "this is a test", "without the t-word",
       "a test for you and a test for me" )
corpus <- Corpus(VectorSource(docs))
grab.fragments( "test *", corpus, char.before=4, char.after=4 )
