rvest

Overview

rvest helps you scrape information from web pages. It is designed to work with magrittr so that common web scraping tasks are easy to express, and it takes inspiration from libraries like Beautiful Soup.

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.7

cast <- lego_movie %>%
  html_nodes("#titleCast .primary_photo img") %>%
  html_attr("alt")
cast
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"     "Alison Brie"    
#>  [5] "David Burrows"   "Anthony Daniels" "Charlie Day"     "Amanda Farinos" 
#>  [9] "Keith Ferguson"  "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#> [1] "https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"

Installation

Install the release version from CRAN:

install.packages("rvest")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("tidyverse/rvest")

Key functions

Once you have read an HTML document with read_html(), you can:

  • Select parts of a document using CSS selectors: html_nodes(doc, "table td") (or, if you're a glutton for punishment, use XPath selectors with html_nodes(doc, xpath = "//table//td")). If you haven't heard of selectorgadget, make sure to read vignette("selectorgadget") to learn about it. The first sketch after this list shows selection and extraction together.

  • Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).

  • (You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_name().)

  • Parse tables into data frames with html_table().

  • Extract, modify and submit forms with html_form(), set_values() and submit_form().

  • Detect and repair encoding problems with guess_encoding() and repair_encoding().

  • Navigate around a website as if you're in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I'd love your feedback.) The second sketch below walks through a form submission inside a session.
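
A minimal sketch of the selection and extraction helpers, run against a small in-memory page; the markup (and therefore the output shown) is made up for illustration:

library(rvest)

# A small page parsed from a string, so no network access is needed
page <- read_html(
  "<html><body>
     <table>
       <tr><th>title</th><th>year</th></tr>
       <tr><td>The Lego Movie</td><td>2014</td></tr>
     </table>
     <a class='more' href='/sequel'>Sequel</a>
   </body></html>"
)

# Select nodes with a CSS selector, then extract text, an attribute and the tag name
page %>% html_nodes("a.more") %>% html_text()
#> [1] "Sequel"
page %>% html_nodes("a.more") %>% html_attr("href")
#> [1] "/sequel"
page %>% html_nodes(xpath = "//a[@class = 'more']") %>% html_name()
#> [1] "a"

# Parse every <table> on the page: returns a list with one data frame
# (columns title and year)
page %>% html_nodes("table") %>% html_table()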

To see examples of these functions in use, check out the demos.
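
The form and session helpers mentioned above follow the same piped style. A rough sketch, using placeholder values (the URL, the q field name, the .result selector and the "Next" link text are all made up, not from a real site):

library(rvest)

# Placeholder URL: substitute the page that actually hosts the form
session <- html_session("http://example.com/search")

# Take the first form on the page and fill in one of its fields
# ("q" is a placeholder field name)
form <- html_form(session)[[1]]
form <- set_values(form, q = "lego movie")

# Submit the form within the session, then scrape the response
results <- submit_form(session, form)
results %>% html_nodes(".result a") %>% html_text()

# follow_link() and back() behave like clicking around in a browser
results %>% follow_link("Next") %>% back()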

Code of Conduct

Please note that the rvest project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Package details

  • Version: 0.3.6
  • License: GPL-3
  • Monthly downloads: 566,894
  • Last published: July 25th, 2020

Functions in rvest (0.3.6)

  • minimal_html: Generate a minimal html5 page.
  • pluck: Extract elements of a list by position.
  • rvest-package: rvest: Easily Harvest (Scrape) Web Pages.
  • submit_form: Submit a form back to the server.
  • set_values: Set values in a form.
  • session_history: History navigation tools.
  • html_tag: html_tag.
  • xml: Work with xml.
  • %>%: Pipe operator.
  • html_text: Extract attributes, text and tag name from html.
  • html: Parse an HTML page.
  • html_form: Parse forms in a page.
  • jump_to: Navigate to a new url.
  • html_table: Parse an html table into a data frame.
  • html_nodes: Select nodes from an HTML document.
  • encoding: Guess and repair faulty character encoding.
  • google_form: Make link to google form given id.
  • html_session: Simulate a session in an html browser.