This function is the actual workhorse: it calls the boilerpipe Java library directly. It is typically
invoked through the wrapper functions named after the values listed for the parameter exname.
Extractor(exname, content, asText = TRUE, ...)
exname: character specifying the extractor to be used. It can take one of the following values:
ArticleExtractor: A full-text extractor tuned towards news articles.
ArticleSentencesExtractor: A full-text extractor tuned towards extracting sentences from news articles.
CanolaExtractor: A full-text extractor trained on the 'krdwrd' Canola corpus.
DefaultExtractor: A quite generic full-text extractor.
KeepEverythingExtractor: Marks everything as content.
LargestContentExtractor: A full-text extractor that extracts the largest text component of a page.
NumWordsRulesExtractor: A quite generic full-text extractor based solely on the number of words per block.
content: text content or URL, as character
asText: logical; should the specified content be treated as the actual text to extract from (TRUE), or as a URL from which the HTML document is first downloaded and then extracted (FALSE)? Defaults to TRUE.
...: additional parameters
Value: the extracted text, as character
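
The call pattern described above can be sketched as follows. This is a minimal illustration, not taken from the package itself: the sample HTML is hypothetical, the wrapper signature ArticleExtractor(content, ...) is an assumption based on the note that Extractor is usually called through functions named after the exname values, and the calls only run if boilerpipeR and a working Java setup are available.

```r
# Hypothetical sample page, used only for illustration.
html <- paste0(
  "<html><head><title>Sample</title></head><body>",
  "<div>Home | About | Contact</div>",
  "<p>This is the main article text. It runs long enough that most ",
  "extractors should keep it, while the navigation line above is boilerplate.</p>",
  "</body></html>"
)

# Run only when boilerpipeR (which calls the boilerpipe Java library) is installed.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  # Direct call: exname selects the extractor; asText = TRUE means
  # `html` is the document itself rather than a URL.
  direct <- boilerpipeR::Extractor("ArticleExtractor", html, asText = TRUE)

  # Wrapper named after the exname value (signature assumed to be
  # ArticleExtractor(content, ...)).
  wrapped <- boilerpipeR::ArticleExtractor(html)

  print(direct)
  print(wrapped)
}
```

Passing a URL as content together with asText = FALSE would instead have the HTML document downloaded first and extracted afterwards, as described for asText.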