MazamaCoreUtils (version 0.6.2)

html_getLinks: Extract links from an HTML page

Description

Parse an HTML page and return all <a href="...">...</a> links as a data frame.

Usage

html_getLinks(url = NULL, relative = TRUE)

html_getLinkNames(url = NULL)

html_getLinkUrls(url = NULL, relative = TRUE)

Value

html_getLinks() returns a tibble with linkName and linkUrl columns.

html_getLinkNames() returns a character vector of link names.

html_getLinkUrls() returns a character vector of link URLs.

Arguments

url

URL or local file path of an HTML page.

relative

Logical specifying whether to return relative URLs. If FALSE, relative URLs are converted to absolute URLs using url as the base.

Details

The returned data frame contains the human-readable link text in linkName and the href value in linkUrl. This is useful for extracting links from index pages, including web-accessible directories that list downloadable files.

Wrapper functions html_getLinkNames() and html_getLinkUrls() return the corresponding columns as character vectors.
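
As a minimal sketch (assuming the URL is reachable and the page contains relative href values), the wrappers amount to pulling the corresponding columns out of the tibble returned by html_getLinks():

library(MazamaCoreUtils)

url <- "https://www2.census.gov/geo/tiger/GENZ2019/shp/"

# Full tibble of links on the page
links <- html_getLinks(url)

# Per the Details above, the wrappers return the individual columns
identical(html_getLinkNames(url), links$linkName)   # should be TRUE
identical(html_getLinkUrls(url), links$linkUrl)     # should be TRUE

# With relative = FALSE, a relative href such as "file.zip" (hypothetical name)
# is resolved against `url` as the base, e.g.
# "https://www2.census.gov/geo/tiger/GENZ2019/shp/file.zip"
html_getLinkUrls(url, relative = FALSE)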

Examples

if (FALSE) {

library(MazamaCoreUtils)

# Web-accessible directory listing US Census shapefiles
url <- "https://www2.census.gov/geo/tiger/GENZ2019/shp/"

# View the index page in a browser
browseURL(url)

# Extract every link on the page as a tibble
dataLinks <- html_getLinks(url)

# Keep only the county-level shapefiles
dataLinks <-
  dataLinks %>%
  dplyr::filter(stringr::str_detect(linkName, "us_county"))

head(dataLinks, 10)

# Character-vector versions of the link names and (absolute) URLs
html_getLinkNames(url)
html_getLinkUrls(url, relative = FALSE)
}
