textpress (version 1.1.0)

read_urls: Read content from URLs

Description

Fetches each URL and returns a structured data frame with one row per node (headings, paragraphs, list items). Like read_csv() or read_html(), it brings an external resource into R. It follows fetch_urls() or fetch_wiki_urls() in the pipeline: fetch gets the locations, read gets the text.

Usage

read_urls(x, cores = 1, detect_boilerplate = TRUE, remove_boilerplate = TRUE)

Value

A data frame with columns url, h1_title, date, type, node_id, parent_heading, and text, plus is_boilerplate when boilerplate is detected but not removed.

Arguments

x

A character vector of URLs.

cores

Number of cores for parallel requests (default 1).

detect_boilerplate

Logical. If TRUE, detect boilerplate nodes (e.g. sign-up prompts, related-links blocks).

remove_boilerplate

Logical. Only used when detect_boilerplate is TRUE. If TRUE, boilerplate rows are dropped from the result; if FALSE, they are kept and flagged in an is_boilerplate column.
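
The interaction of the two flags can be sketched on a mock result (the column names match the Value section above; the rows themselves are hypothetical):

```r
# Hypothetical output of read_urls(x, detect_boilerplate = TRUE,
# remove_boilerplate = FALSE): boilerplate rows are kept and flagged.
nodes <- data.frame(
  url = "https://example.com/post",
  text = c("An actual paragraph.", "Sign up for our newsletter!"),
  is_boilerplate = c(FALSE, TRUE)
)

# Dropping the flagged rows by hand reproduces remove_boilerplate = TRUE.
content <- subset(nodes, !is_boilerplate)
nrow(content)  # 1
```

Keeping the flag around is useful when you want to inspect what the detector classified as boilerplate before discarding it.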

Details

Wikipedia pages are handled with high-fidelity selectors: content is extracted from div.mw-parser-output, and the h2/h3/h4 heading hierarchy is preserved. Use the parent_heading column to see which section each node belongs to. The “External links” section and rows with empty text are omitted.

Examples

if (FALSE) {
# Fetch search-result URLs for a query, then read the first three pages.
urls <- fetch_urls("R programming", n_pages = 1)$url
nodes <- read_urls(urls[1:3], cores = 1)
}