rvest v0.3.6
Easily Harvest (Scrape) Web Pages
Wrappers around the 'xml2' and 'httr' packages to
make it easy to download, then manipulate, HTML and XML.
Overview
rvest helps you scrape information from web pages. It is designed to work with magrittr, so that common web scraping tasks can be expressed as pipelines, and is inspired by libraries like Beautiful Soup.
library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
#> [1] 7.7
cast <- lego_movie %>%
  html_nodes("#titleCast .primary_photo img") %>%
  html_attr("alt")
cast
#> [1] "Will Arnett" "Elizabeth Banks" "Craig Berry" "Alison Brie"
#> [5] "David Burrows" "Anthony Daniels" "Charlie Day" "Amanda Farinos"
#> [9] "Keith Ferguson" "Will Ferrell" "Will Forte" "Dave Franco"
#> [13] "Morgan Freeman" "Todd Hansen" "Jonah Hill"
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#> [1] "https://m.media-amazon.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_UX182_CR0,0,182,268_AL_.jpg"
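The selectors in the example above depend on the markup of a live page, so the output may differ when the page changes. As a minimal offline sketch of the same pipeline pattern, the snippet below parses an inline HTML string instead. The markup, the `#cast` id, the `.photo` class and the actor names are all made up for illustration and do not come from IMDb.

```r
library(rvest)

# Hypothetical page fragment, invented for this sketch (not IMDb markup).
page <- read_html('
  <div id="cast">
    <strong><span>7.7</span></strong>
    <img class="photo" alt="Actor One" src="one.jpg">
    <img class="photo" alt="Actor Two" src="two.jpg">
  </div>')

# Select the rating node, take its text and convert it to a number.
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# Select the images and read one attribute from each of them.
cast <- page %>%
  html_nodes("#cast .photo") %>%
  html_attr("alt")
```

Because the input is a string, this runs without network access, which makes it a convenient way to try out a selector before pointing it at a real page.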
Installation
Install the release version from CRAN:
install.packages("rvest")
Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/rvest")
Key functions
Once you have read an HTML document with read_html(), you can:
- Select parts of a document using CSS selectors: html_nodes(doc, "table td") (or, if you’re a glutton for punishment, use XPath selectors with html_nodes(doc, xpath = "//table//td")). If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget") to learn about it.
- Extract components with html_name() (the name of the tag), html_text() (all text inside the tag), html_attr() (contents of a single attribute) and html_attrs() (all attributes).
- (You can also use rvest with XML files: parse with xml(), then extract components using xml_node(), xml_attr(), xml_attrs(), xml_text() and xml_name().)
- Parse tables into data frames with html_table().
- Extract, modify and submit forms with html_form(), set_values() and submit_form().
- Detect and repair encoding problems with guess_encoding() and repair_encoding().
- Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), forward(), submit_form() and so on. (This is still a work in progress, so I’d love your feedback.)
To see examples of these functions in use, check out the demos.
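As a rough illustration of the selection, extraction and table-parsing functions listed above, here is a small sketch that works on an inline HTML string. The table contents, the `#films` id and the link targets are hypothetical and exist only for this example.

```r
library(rvest)

# Hypothetical document, invented for this sketch.
doc <- read_html('
  <table id="films">
    <tr><th>title</th><th>year</th></tr>
    <tr><td><a href="/a" class="link">Film A</a></td><td>2001</td></tr>
    <tr><td><a href="/b" class="link">Film B</a></td><td>2002</td></tr>
  </table>')

# The same cells, selected once with CSS and once with XPath.
cells_css   <- html_nodes(doc, "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")

# Extract components from the selected links.
links      <- html_nodes(doc, "td a")
link_tags  <- html_name(links)          # tag name of each node
link_text  <- html_text(links)          # text inside each node
link_href  <- html_attr(links, "href")  # a single attribute
link_attrs <- html_attrs(links)         # all attributes, one entry per node

# html_table() on a node set returns a list with one table per node.
films <- html_table(html_nodes(doc, "#films"))[[1]]
```

The first row of the table contains only th cells, so html_table() should use it as the header row and return a two-row table with the columns title and year.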
Inspirations
- Python: RoboBrowser, Beautiful Soup.
Code of Conduct
Please note that the rvest project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Functions in rvest
Name | Description
minimal_html | Generate a minimal html5 page.
pluck | Extract elements of a list by position.
rvest-package | rvest: Easily Harvest (Scrape) Web Pages
submit_form | Submit a form back to the server.
set_values | Set values in a form.
session_history | History navigation tools
html_tag | html_tag
xml | Work with xml.
%>% | Pipe operator
html_text | Extract attributes, text and tag name from html.
html | Parse an HTML page.
html_form | Parse forms in a page.
jump_to | Navigate to a new url.
html_table | Parse an html table into a data frame.
html_nodes | Select nodes from an HTML document
encoding | Guess and repair faulty character encoding.
google_form | Make link to google form given id
html_session | Simulate a session in an html browser.
Details
License | GPL-3 |
URL | http://rvest.tidyverse.org/, https://github.com/tidyverse/rvest |
BugReports | https://github.com/tidyverse/rvest/issues |
VignetteBuilder | knitr |
Encoding | UTF-8 |
Language | en-US |
LazyData | true |
RoxygenNote | 7.1.1 |
NeedsCompilation | no |
Packaged | 2020-07-20 14:06:08 UTC; hadley |
Repository | CRAN |
Date/Publication | 2020-07-25 21:50:02 UTC |
suggests | covr, knitr, png, rmarkdown, spelling, stringi (>= 0.3.1), testthat |
imports | httr (>= 0.5), magrittr, selectr |
depends | R (>= 3.2), xml2 |
Contributors | RStudio |