boilerpipeR (version 1.3.2)
Interface to the Boilerpipe Java Library
Description
Generic extraction of the main text content from HTML files; removes ads, sidebars, and headers using the boilerpipe
Java library. The boilerpipe extraction heuristics perform robustly across a wide range of website templates.
Version
1.3.2
1.3
1.2.2
1.1
1.0
Install
install.packages('boilerpipeR')
Monthly Downloads
238
Version
1.3.2
License
Apache License (== 2.0)
Issues
1
Pull Requests
0
Stars
22
Forks
2
Repository
https://github.com/mannau/boilerpipeR
Maintainer
Mario Annau
Last Published
May 19th, 2021
Functions in boilerpipeR (1.3.2)
KeepEverythingExtractor
Marks everything as content.
CanolaExtractor
A full-text extractor trained on the 'krdwrd' Canola corpus (see https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf).
DefaultExtractor
A quite generic full-text extractor.
ArticleExtractor
A full-text extractor which is tuned towards news articles.
boilerpipeR-package
Extract the main content from HTML files
content
WordPress-generated web page (retrieved from the Quantivity blog, https://quantivity.wordpress.com). The content is stored as a character string, ready for extraction.
Extractor
Generic extraction function which calls boilerpipe extractors
LargestContentExtractor
A full-text extractor which extracts the largest text component of a page.
ArticleSentencesExtractor
A full-text extractor which is tuned towards extracting sentences from news articles.
NumWordsRulesExtractor
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
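
The functions listed above can be combined roughly as follows. This is a minimal sketch, not taken from the package documentation: it assumes that each extractor accepts the raw HTML as a character string, that Extractor() takes the extractor name as its first argument and the HTML as its second, and that data(content) loads the bundled page into an object named content. The choice of the three extractors compared is arbitrary.

```r
# Extractors to compare; the names come from the function list above.
extractor_names <- c("DefaultExtractor", "ArticleExtractor", "LargestContentExtractor")

# Guarded so the script does nothing when boilerpipeR (which depends on rJava
# and a working Java runtime) is not installed.
if (requireNamespace("boilerpipeR", quietly = TRUE)) {
  library(boilerpipeR)

  # Bundled example page; assumed to load as a character object named `content`.
  data(content)

  # Call one extractor directly on the raw HTML string.
  article_text <- ArticleExtractor(content)
  cat(substr(article_text, 1, 300), "\n")

  # Assumed signature: Extractor(<extractor name>, <HTML string>).
  # Run each extractor by name and count the characters it keeps.
  kept_chars <- sapply(extractor_names, function(name) {
    nchar(Extractor(name, content))
  })
  print(kept_chars)
}
```

Comparing the character counts gives a quick impression of how aggressively each extractor strips a given page; which extractor suits a particular site is something to check on sample pages rather than assume.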