Note: there is a newer version (1.3) of this package.

tm.plugin.webmining (version 0.9)

Retrieve structured, textual data from various web sources

Description

tm.plugin.webmining facilitates text retrieval from feed formats such as XML (RSS, Atom) and JSON. Direct retrieval from HTML is also supported. Because most (news) feeds contain only a small fraction of the original text, tm.plugin.webmining also retrieves and extracts the full text from the original source.
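The feed-then-fetch idea in the description can be illustrated without the package or a network connection. The following is only a base-R sketch under stated assumptions: the RSS fragment is made-up data, `extract_tag` is a hypothetical helper (not a function of tm.plugin.webmining), and regular expressions stand in for the proper XML parsing a real feed would need.

```r
# A tiny RSS fragment standing in for a real news feed (made-up data).
rss <- paste0(
  "<rss><channel>",
  "<item><title>First story</title>",
  "<link>http://example.com/a</link>",
  "<description>Short teaser A</description></item>",
  "<item><title>Second story</title>",
  "<link>http://example.com/b</link>",
  "<description>Short teaser B</description></item>",
  "</channel></rss>"
)

# Hypothetical helper: return the text inside every <tag>...</tag> pair.
# Regex extraction is only adequate for this toy input; a real feed
# should go through an XML parser.
extract_tag <- function(xml, tag) {
  pattern <- sprintf("<%s>(.*?)</%s>", tag, tag)
  hits <- regmatches(xml, gregexpr(pattern, xml, perl = TRUE))[[1]]
  gsub(sprintf("</?%s>", tag), "", hits)
}

# One row per feed item: the feed carries only a teaser plus a link.
items <- data.frame(
  title  = extract_tag(rss, "title"),
  link   = extract_tag(rss, "link"),
  teaser = extract_tag(rss, "description"),
  stringsAsFactors = FALSE
)
```

In this sketch the `link` column is what a second step would download in order to recover the full article text, since the feed itself only carries the teaser.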

Install

install.packages('tm.plugin.webmining')

Monthly Downloads

15

Version

0.9

License

GPL-3

Maintainer

Mario Annau

Last Published

December 21st, 2012

Functions in tm.plugin.webmining (0.9)

Get feed data from NYTimes Article Search (http://developer.nytimes.com/docs/read/article_search_api).

Get feed data from Twitter Search API (https://dev.twitter.com/docs/api/1/get/search).

Read Web Content and respective Link Content from feedurls.

Build up the query string for a feed query.

Wrapper/convenience function to ensure the correct encoding on different platforms.

GoogleReaderSource

Retrieve feeds through the Google Reader API.

tm.plugin.webmining-package

Retrieve structured, textual data from various web sources

GoogleFinanceSource

Get feed Meta Data from Google Finance.

GoogleBlogSearchSource

Get feed data from Google Blog Search (http://www.google.com/blogsearch).

Extract main content from TextDocuments.

GoogleNewsSource

Get feed data from Google News Search (http://news.google.com/).

Remove non-ASCII characters from Text.

YahooNewsSource

Get feed data from Yahoo! News (http://news.yahoo.com/).

Enclose Text Content in HTML tags

WebCorpus retrieved from Yahoo! News for the search term "Microsoft" through the YahooNewsSource. Length of retrieved corpus is 20.

trimWhiteSpaces

Trim White Spaces from Text Document.

Retrieve Empty Corpus Elements through $postFUN.

YahooInplaySource

Get News from Yahoo Inplay.

Read content from WebXMLSource/WebHTMLSource/WebJSONSource.

WebCorpus constructor function.

WebCorpus retrieved from the Google Reader API for the R-Bloggers blog consisting only of meta data (no main content available). Length of retrieved corpus is 1000.

extractHTMLStrip

Simply strip HTML Tags from Document

Copy of RCurl:::getURL() including a little bugfix for the .encoding parameter.

auth.google.reader

Authentication and token retrieval from the Google Reader web service.

Update/Extend WebCorpus with new feed items.

Update WebXMLSource/WebHTMLSource/WebJSONSource

Get main content for corpus items, specified by links.

extractContentDOM

Extract Main HTML Content from DOM

ReutersNewsSource

Get feed data from Reuters News RSS feed channels. Reuters provides numerous feed

YahooFinanceSource

Get feed data from Yahoo! Finance.
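Several entries in the index above describe text clean-up steps: stripping HTML tags, removing non-ASCII characters, and trimming white space. The following base-R sketch shows what such a chain might look like. All three function names are hypothetical and the bodies are assumptions for illustration, not the package's own implementations.

```r
# Hypothetical helper: replace every HTML tag with a space.
# A regex is a rough approximation; it does not handle scripts,
# comments, or malformed markup.
strip_html_tags <- function(x) gsub("<[^>]+>", " ", x)

# Hypothetical helper: drop bytes that cannot be represented in ASCII.
drop_non_ascii <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Hypothetical helper: collapse runs of white space, then trim both ends.
trim_white_space <- function(x) {
  gsub("^\\s+|\\s+$", "", gsub("\\s+", " ", x))
}

raw   <- "  <p>Caf\u00e9 prices <b>rise</b></p>\n"
clean <- trim_white_space(drop_non_ascii(strip_html_tags(raw)))
```

Tags are replaced with a space rather than an empty string so that words separated only by markup do not run together; the final step then collapses the extra spaces.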