read_html: Static web scraping (with xml2)

Description

read_html() works by performing an HTTP request and then parsing the HTML it receives using the xml2 package. This is "static" scraping because it operates only on the raw HTML file. While this works for most sites, in some cases you will need to use read_html_live() if the parts of the page you want to scrape are dynamically generated with JavaScript.

Generally, we recommend using read_html() if it works: it is faster and more robust because it has fewer external dependencies (i.e. it doesn't rely on the Chrome web browser being installed on your computer).
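
To make the static/dynamic distinction concrete, here is a minimal sketch. It parses a small literal HTML string (a made-up page, not taken from any real site) in which a script would fill in a `<div>` once a browser runs it. Because read_html() does not execute JavaScript, that element should come back empty. Passing literal markup instead of a URL relies on xml2::read_html() accepting a string of HTML as input.

```r
library(rvest)

# Hypothetical page: the <div> is only populated when a browser runs the script.
html <- '<html><body>
  <h1>Static heading</h1>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Loaded by JS";</script>
</body></html>'

page <- read_html(html)

# Present in the raw HTML, so static scraping finds it
static_text <- page |> html_element("h1") |> html_text2()

# Would be generated by JavaScript, so the parsed element has no text
dynamic_text <- page |> html_element("#app") |> html_text2()
```

If the content you need looks like `dynamic_text` here (present in the browser but empty in the parsed document), that is the situation where read_html_live() is worth trying.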

Usage

read_html(
  x,
  encoding = "",
  ...,
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
)

Arguments

x

Usually a string representing a URL. See xml2::read_html() for other options.

encoding

Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default.
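
As an illustration of the encoding argument, the sketch below builds a tiny document as ISO-8859-1 bytes with no encoding declaration, then tells read_html() how to decode it. The document content and the variable names are invented for the example. It assumes that a raw vector is an acceptable input for x, as described in xml2::read_html().

```r
library(rvest)

# Build Latin-1 bytes for a tiny document that declares no encoding.
# "\u00e9" is e-acute; in ISO-8859-1 it is the single byte 0xE9.
latin1_bytes <- iconv(
  "<html><body><p>caf\u00e9</p></body></html>",
  from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE
)[[1]]

# Supply the encoding explicitly because the document does not declare one
page <- read_html(latin1_bytes, encoding = "ISO-8859-1")
word <- page |> html_element("p") |> html_text2()
word
```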

...

Additional arguments passed on to methods.

options

Set parsing options for the libxml2 parser. Zero or more of:

RECOVER

recover on errors

NOENT

substitute entities

DTDLOAD

load the external subset

DTDATTR

default DTD attributes

DTDVALID

validate with the DTD

NOERROR

suppress error reports

NOWARNING

suppress warning reports

PEDANTIC

pedantic error reporting

NOBLANKS

remove blank nodes

SAX1

use the SAX1 interface internally

XINCLUDE

Implement XInclude substitution

NONET

Forbid network access

NODICT

Do not reuse the context dictionary

NSCLEAN

remove redundant namespaces declarations

NOCDATA

merge CDATA as text nodes

NOXINCNODE

do not generate XINCLUDE START/END nodes

COMPACT

compact small text nodes; no modification of the tree allowed afterwards (will possibly crash if you try to modify the tree)

OLD10

parse using XML-1.0 before update 5

NOBASEFIX

do not fixup XINCLUDE xml:base uris

HUGE

relax any hardcoded limit from the parser

OLDSAX

parse using SAX2 interface before 2.7.0

IGNORE_ENC

ignore internal document encoding hint

BIG_LINES

Store big lines numbers in text PSVI field
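
The options argument takes a character vector of the flag names listed above. As a rough sketch, the call below keeps the defaults shown in Usage and adds NONET so that the parser is not allowed to reach the network. The HTML string is invented for the example, and the choice of NONET is only an illustration, not a recommendation from the package authors.

```r
library(rvest)

# Hypothetical input: a short list supplied as literal HTML
html <- "<html><body><ul>\n  <li>a</li>\n  <li>b</li>\n</ul></body></html>"

# Defaults from Usage, plus NONET to forbid network access while parsing
offline_doc <- read_html(
  html,
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE", "NONET")
)

items <- offline_doc |> html_elements("li") |> html_text2()
items
```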

Examples

library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a CSS selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars |> html_elements("section")
films

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films |>
  html_element("h2") |>
  html_text2()
title

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films |>
  html_element("h2") |>
  html_attr("data-id") |>
  readr::parse_integer()
episode