decapitated (version 0.3.0)

chrome_read_html: Read a URL via headless Chrome and return the raw or rendered <body> innerHTML DOM elements

Description

Read a URL via headless Chrome and return the raw or rendered <body> innerHTML DOM elements

Usage

chrome_read_html(url, render = TRUE, prime = TRUE, work_dir = NULL,
  chrome_bin = Sys.getenv("HEADLESS_CHROME"))

Arguments

url

URL to read from

render

if TRUE then return an xml_document, else the raw HTML (invisibly)

prime

If TRUE, preliminary URL retrieval requests are sent to "prime" the headless Chrome cache. This seems to be necessary primarily on recent versions of macOS. If numeric, that many "prime" requests are sent ahead of the capture request. If FALSE, no priming requests are sent.

work_dir

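Directory used for headless Chrome working files. See the section "Working around headless Chrome & OS security restrictions".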

chrome_bin

the path to Chrome (auto-set from HEADLESS_CHROME environment variable)

Working around headless Chrome & OS security restrictions

Security restrictions on various operating systems and OS configurations can cause headless Chrome execution to fail. As a result, headless Chrome operations should use a special directory for decapitated package operations. You can pass this in as work_dir. If work_dir is NULL a .rdecapdata directory will be created in your home directory and used for the data, crash dumps and utility directories for Chrome operations.

Testing on various macOS 10.13 systems showed that tempdir() does not always meet these requirements, because Chrome sets some unusual file attributes during some of its file operations.

If you pass in a work_dir, it must be one that does not violate OS security restrictions or headless Chrome will not function.
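
As a minimal sketch of supplying your own work_dir, the helper below creates a dedicated directory and returns its normalized path. make_work_dir() and the ".decap_work" name are hypothetical and not part of decapitated. Placing the directory under the home directory is an assumption that mirrors the package default described above. It does not guarantee that the location satisfies your operating system's security restrictions.

```r
# Hypothetical helper (not part of decapitated): create a dedicated
# working directory if it does not exist yet and return its full path.
# `base` and `name` are illustrative defaults, not package settings.
make_work_dir <- function(base = path.expand("~"), name = ".decap_work") {
  path <- file.path(base, name)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path)
}

# NOT RUN {
# chrome_read_html("https://www.r-project.org/", work_dir = make_work_dir())
# }
```

The commented call needs a working headless Chrome installation, so it is left unevaluated here.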

Examples

# NOT RUN {
chrome_read_html("https://www.r-project.org/")
# }
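
The arguments described above can be combined as in the following sketch. chrome_bin_ok() and read_both() are hypothetical helpers, not part of decapitated. The only package function used is chrome_read_html(), with the render and prime arguments as documented on this page. Nothing is fetched when the block is sourced, because read_both() is only defined, not called.

```r
# Hypothetical helper (not part of decapitated): TRUE when `path` is a
# non-empty string that names an existing file.
chrome_bin_ok <- function(path = Sys.getenv("HEADLESS_CHROME")) {
  nzchar(path) && file.exists(path)
}

# Hypothetical wrapper: read a URL both ways, stopping early if the
# Chrome binary cannot be found.
read_both <- function(url) {
  if (!chrome_bin_ok()) {
    stop("HEADLESS_CHROME is not set or does not point to a file")
  }
  list(
    # rendered <body> as an xml_document, after two priming requests
    rendered = decapitated::chrome_read_html(url, render = TRUE, prime = 2),
    # raw HTML, with no priming requests
    raw = decapitated::chrome_read_html(url, render = FALSE, prime = FALSE)
  )
}

# NOT RUN {
# res <- read_both("https://www.r-project.org/")
# }
```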