rvest v0.3.2

by Hadley Wickham

Easily Harvest (Scrape) Web Pages

Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.

Readme

rvest

rvest helps you scrape information from web pages. It is designed to work with magrittr so that common web scraping tasks are easy to express, and it is inspired by libraries like Beautiful Soup.

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")

rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.8

cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"

poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#> [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"

Overview

The most important functions in rvest are:

  • Create an HTML document from a URL, a file on disk, or a string containing HTML with read_html().

  • Select parts of a document using CSS selectors: html_nodes(doc, "table td") (or, if you're a glutton for punishment, use XPath selectors with html_nodes(doc, xpath = "//table//td")). If you haven't heard of selectorgadget, make sure to read vignette("selectorgadget") to learn about it.

  • Extract components with html_tag() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).

  • (You can also use rvest with XML files: parse with read_xml() or the older xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_tag().)

  • Parse tables into data frames with html_table().

  • Extract, modify and submit forms with html_form(), set_values() and submit_form().

  • Detect and repair encoding problems with guess_encoding() and repair_encoding().

  • Navigate around a website as if you're in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I'd love your feedback.)
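The selection and extraction steps listed above can be tried without a network connection. The following is a minimal sketch, not an official example: the HTML snippet and the variable names are invented for illustration, and it assumes read_html() is given a literal HTML string rather than a URL.

```r
library(rvest)

# Hypothetical HTML snippet, invented for this illustration only.
doc <- read_html('<html><body>
  <h1 id="title">Sample page</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>')

# Select the link nodes with a CSS selector.
links <- doc %>% html_nodes("ul li a")

# Extract the text inside each node and the value of one attribute.
link_text <- links %>% html_text()
link_href <- links %>% html_attr("href")

# The same nodes selected with an XPath expression instead.
link_xpath <- doc %>%
  html_nodes(xpath = "//ul//a") %>%
  html_text()
```

Here the CSS and XPath selectors pick out the same two links, so link_text and link_xpath should agree.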

To see examples of these functions in use, check out the demos.
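As a small offline illustration of html_table(), the sketch below parses a table from a literal HTML string. The table contents and variable names are made up for the example. It assumes that html_table() applied to a set of nodes returns a list with one element per table.

```r
library(rvest)

# Hypothetical table markup, invented for this illustration only.
page <- read_html('<table>
  <tr><th>name</th><th>year</th></tr>
  <tr><td>alpha</td><td>2014</td></tr>
  <tr><td>beta</td><td>2016</td></tr>
</table>')

# Select every <table> node and parse each one into a data frame.
tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Take the first (and here only) parsed table.
first_table <- tables[[1]]
```

The header row made of th cells is used for the column names, so first_table should have the columns name and year.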

Installation

Install the release version from CRAN:

install.packages("rvest")

Or the development version from GitHub:

# install.packages("devtools")
devtools::install_github("hadley/rvest")

Inspirations

Functions in rvest

Name             Description
google_form      Make link to google form given id
html_text        Extract attributes, text and tag name from html.
html_form        Parse forms in a page.
html_tag         html_tag
html_table       Parse an html table into a data frame.
encoding         Guess and repair faulty character encoding.
jump_to          Navigate to a new url.
html             Parse an HTML page.
html_nodes       Select nodes from an HTML document
html_session     Simulate a session in an html browser.
minimal_html     Generate a minimal html5 page.
session_history  History navigation tools
xml              Work with xml.
%>%              Pipe operator
submit_form      Submit a form back to the server.
set_values       Set values in a form.
pluck            Extract elements of a list by position.

Details

Encoding UTF-8
License GPL-3
LazyData true
VignetteBuilder knitr
RoxygenNote 5.0.1
URL https://github.com/hadley/rvest
BugReports https://github.com/hadley/rvest/issues
NeedsCompilation no
Packaged 2016-06-16 16:20:43 UTC; hadley
Repository CRAN
Date/Publication 2016-06-17 08:57:12
