A newer version (0.1.2) of this package is available.

newscatcheR

Programmatically collect normalized news from (almost) any website using R

newscatcheR is an R clone of the Python package newscatcher.

The package provides a dataset of news sites and their RSS feeds, and two functions that wrap tidyRSS to fetch the feed of a given site. It also provides a function that lists the news sources in the dataset for a given top-level domain.

Installation

You can install the released version of newscatcheR from CRAN with:

install.packages("newscatcheR")

And the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("discindo/newscatcheR")

Or

# install.packages("devtools")
devtools::install_github("discindo/newscatcheR")

Overview

get_news(website) returns the contents of the RSS feed of a website.

library(newscatcheR)
# adding a small time delay to avoid simultaneous posts to the API
Sys.sleep(3)
get_news("news.ycombinator.com")
#> GET request successful. Parsing...
#> Warning: Predicate functions must be wrapped in `where()`.
#> 
#>   # Bad
#>   data %>% select(is.character)
#> 
#>   # Good
#>   data %>% select(where(is.character))
#> 
#> ℹ Please update your code.
#> This message is displayed once per session.
#> # A tibble: 30 x 10
#>    feed_title feed_link feed_description feed_pub_date       item_title
#>    <chr>      <chr>     <chr>            <dttm>              <chr>     
#>  1 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 Ask HN: H…
#>  2 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 SimRefine…
#>  3 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 Why Is th…
#>  4 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 Mental We…
#>  5 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 People tr…
#>  6 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 First pho…
#>  7 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 A History…
#>  8 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 Germany, …
#>  9 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 Julia as …
#> 10 Hacker Ne… https://… Links for the i… 2020-06-05 12:26:26 DeepFaceD…
#> # … with 20 more rows, and 5 more variables: item_link <chr>,
#> #   item_description <chr>, item_pub_date <dttm>, item_category <list>,
#> #   item_comments <chr>
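
Since the result is a regular tibble, it can be trimmed with base R before further analysis. The following is a minimal sketch, assuming the column names shown in the output above (item_title, item_link, item_pub_date); latest_items is a hypothetical helper and not part of newscatcheR.

```r
# Hypothetical helper (not part of newscatcheR): keep the n most recent
# items of a feed table. Assumes the columns item_title, item_link and
# item_pub_date exist, as in the get_news() output shown above.
latest_items <- function(feed, n = 5) {
  feed <- as.data.frame(feed, stringsAsFactors = FALSE)
  # Sort newest first by publication time
  ord <- order(feed$item_pub_date, decreasing = TRUE)
  cols <- c("item_title", "item_link", "item_pub_date")
  head(feed[ord, cols, drop = FALSE], n)
}

# Usage (requires network access and the newscatcheR package):
# latest_items(get_news("news.ycombinator.com"), n = 3)
```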

get_headlines(website) returns just the headlines of the website’s RSS feed.

library(newscatcheR)
# adding a small time delay to avoid simultaneous posts to the API
Sys.sleep(3)  
get_headlines("news.ycombinator.com")
#> GET request successful. Parsing...
#>                                                           feed_entries$item_title
#> 1                          Ask HN: How do I reach making $1-1.5k/mo in 13 months?
#> 2                                                           SimRefinery Recovered
#> 3                                     Why Is the Human Brain So Efficient? (2018)
#> 4                                                                   Mental Wealth
#> 5     People try to do right by each other, no matter the motivation, study finds
#> 6                                       First photo of HS2 tunnel boring machines
#> 7                                                      A History of Clojure [pdf]
#> 8            Germany, France launch Gaia-X platform in bid for ‘tech sovereignty’
#> 9                                                       Julia as a CLI Calculator
#> 10      DeepFaceDrawing Generates Photorealistic Portraits from Freehand Sketches
#> 11      Ask HN: Are my expectations on code quality and professionalism too high?
#> 12                                            The Go Compiler Needs to Be Smarter
#> 13                                                Unker Non-Linear Writing System
#> 14            Signal app downloads spike as US protesters seek message encryption
#> 15           Synthetic red blood cells mimic natural ones, and have new abilities
#> 16                                                   The Beauty of Unix Pipelines
#> 17                                                                  Kids and Time
#> 18                          555 timer teardown: inside the most popular IC (2016)
#> 19                                  Currents: Have Meaningful Discussions at Work
#> 20                                        Words that don't translate into English
#> 21                                                        Homoiconicity Revisited
#> 22                                               Containers from first principles
#> 23 Ask HN: Have you ever gone without a computer or phone for an extended period?
#> 24                                                    Why Sleep Deprivation Kills
#> 25                                                        macOS in QEMU in Docker
#> 26                                                         The Illiac IV Computer
#> 27                                  Cryo-electron microscopy breaks a key barrier
#> 28                                                          Emacs as Email Client
#> 29                        In a photo of a black hole, a possible key to mysteries
#> 30                         WeChat permabans account for using wrongthink password

tld_sources(tld) returns the rows of the provided dataset of news sites that have the given top-level domain.

library(newscatcheR)
tld_sources("it")
#> # A tibble: 10 x 2
#>    url                  rss_endpoint                                      
#>    <chr>                <chr>                                             
#>  1 ansa.it              http://www.ansa.it/web/ansait_web_rss_homepage.xml
#>  2 thinkandbuild.it     https://www.thinkandbuild.it/feed/                
#>  3 wasproject.it        http://www.wasproject.it/w/en/blog-2/feed         
#>  4 corriere.it          http://www.corriere.it/rss/homepage.xml           
#>  5 gazzetta.it          http://www.gazzetta.it/rss/calcio.xml             
#>  6 ilfattoquotidiano.it http://www.ilfattoquotidiano.it/feed/             
#>  7 lastampa.it          http://www.lastampa.it/redazione/rss_home.xml     
#>  8 punto-informatico.it http://punto-informatico.it/fader/pixml.xml       
#>  9 repubblica.it        http://www.repubblica.it/rss/homepage/rss2.0.xml  
#> 10 tgcom.mediaset.it    http://www.tgcom.mediaset.it/rss/cronaca.xml
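
To illustrate the idea behind a lookup by top-level domain, here is a rough sketch in base R. It is not the package's actual implementation: feeds is a small stand-in for the bundled dataset (two rows copied from the output above plus a made-up example.com row), and filter_by_tld is a hypothetical name.

```r
# Stand-in for the packaged dataset. The first two rows come from the
# tld_sources("it") output above; the example.com row is invented.
feeds <- data.frame(
  url = c("ansa.it", "corriere.it", "example.com"),
  rss_endpoint = c(
    "http://www.ansa.it/web/ansait_web_rss_homepage.xml",
    "http://www.corriere.it/rss/homepage.xml",
    "https://example.com/feed.xml"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose url ends in ".<tld>"
filter_by_tld <- function(table, tld) {
  table[endsWith(table$url, paste0(".", tld)), , drop = FALSE]
}

filter_by_tld(feeds, "it")
```

Matching on the suffix with a leading dot avoids false hits on domains that merely end in the same letters.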

Use Case

This package can be convenient if you need to fetch news from various websites for further analysis and you don’t want to search manually for the URL of their RSS feeds.

Assuming we have the news sites we want to follow:

sites <- c("bbc.com", "spiegel.de", "washingtonpost.com")

We can get a list of data frames with:

library(newscatcheR)
lapply(sites, get_news)
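
When fetching several feeds in one go, a single unreachable or unlisted site would otherwise abort the whole lapply call. The sketch below shows one possible way to guard against that; fetch_all and its fetch and delay arguments are hypothetical and not part of newscatcheR. It assumes get_news signals an R error on failure, which may not hold for every failure mode.

```r
# Hypothetical wrapper (not part of newscatcheR): fetch each site in turn,
# pause between requests, and skip sites whose fetch raises an error.
# The default for fetch is only evaluated when the function is called.
fetch_all <- function(sites, fetch = newscatcheR::get_news, delay = 1) {
  results <- lapply(sites, function(site) {
    Sys.sleep(delay)  # small pause between requests
    tryCatch(
      fetch(site),
      error = function(e) {
        message("Skipping ", site, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- sites
  # Drop the sites that failed, keeping the names of the rest
  Filter(Negate(is.null), results)
}

# Usage (requires network access and the newscatcheR package):
# news <- fetch_all(c("bbc.com", "spiegel.de", "washingtonpost.com"))
```

The result is a list named by site, so a particular feed can be retrieved with, for example, news[["bbc.com"]].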

Version: 0.0.2
License: MIT + file LICENSE
Maintainer: Novica Nakov
Last Published: June 5th, 2020
Monthly Downloads: 184

Functions in newscatcheR (0.0.2)

tld_sources: A helper function to explore news sources by country (or other TLD)
get_headlines: A helper function to get just the headlines of the feed
rss_table: RSS table from Python package newscatcher
get_news: Get news