Fetches each URL and returns a structured data frame with one row per page node
(headings, paragraphs, list items). Like read_csv() or read_html(), it brings
an external resource into R. It follows fetch_urls() or fetch_wiki_urls()
in the pipeline: fetch gets locations, read gets text.
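The fetch-then-read handoff can be sketched as follows. fetch_wiki_urls() is named above, but the reading function documented here is not named on this page, so read_nodes() below is a hypothetical stand-in, and the argument to fetch_wiki_urls() is invented for illustration:

```r
# Hypothetical pipeline sketch: fetch_wiki_urls() is mentioned in this
# documentation; read_nodes() is a placeholder name for the function
# documented on this page.
urls  <- fetch_wiki_urls("R (programming language)")  # fetch: get locations
nodes <- read_nodes(urls,
                    cores              = 2,           # read: get text
                    detect_boilerplate = TRUE,
                    remove_boilerplate = FALSE)
# One row per node (headings, paragraphs, list items), with an
# is_boilerplate column because remove_boilerplate = FALSE.
```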
Value
A data frame with columns url, h1_title, date, type, node_id, parent_heading, text, and, optionally, is_boilerplate.
Arguments
x
A character vector of URLs.
cores
Number of cores for parallel requests (default 1).
detect_boilerplate
Logical. Whether to flag boilerplate nodes (e.g. sign-up prompts, related-links blocks).
remove_boilerplate
Logical. If detect_boilerplate is TRUE, remove boilerplate rows; if FALSE, keep them and add an is_boilerplate column.
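To illustrate how the two boilerplate arguments interact: with detect_boilerplate = TRUE and remove_boilerplate = FALSE, flagged rows are kept and can be filtered afterwards. A minimal base-R sketch on a toy data frame shaped like the documented return value (all values here are invented):

```r
# Toy stand-in for the returned data frame; columns follow the
# documentation above, rows are made up for illustration.
nodes <- data.frame(
  url            = "https://example.org/a",
  node_id        = 1:3,
  type           = c("heading", "paragraph", "paragraph"),
  text           = c("Intro", "Real content.", "Sign up for our newsletter!"),
  is_boilerplate = c(FALSE, FALSE, TRUE)
)

# Equivalent of remove_boilerplate = TRUE, applied after the fact:
clean <- nodes[!nodes$is_boilerplate, ]
nrow(clean)  # 2 rows survive
```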
Details
Wikipedia pages are handled with high-fidelity selectors: content is read from
div.mw-parser-output and the h2/h3/h4 heading hierarchy. Use parent_heading to see
which section each node belongs to. The “External links” section and
rows with empty text are omitted.
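The parent_heading column makes it straightforward to recover the section structure, for example by collapsing all paragraph text under each heading. A small base-R sketch on invented data shaped like the documented output:

```r
# Invented rows with the documented parent_heading / type / text columns.
nodes <- data.frame(
  parent_heading = c("History", "History", "Usage"),
  type           = "paragraph",
  text           = c("Founded in 1993.", "Grew quickly.", "Widely used.")
)

# One string of body text per section:
sections <- tapply(nodes$text, nodes$parent_heading, paste, collapse = " ")
sections[["History"]]  # "Founded in 1993. Grew quickly."
```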