MazamaCoreUtils (version 0.4.4)

html_getLinks: Find all links in an html page

Description

Parses an HTML page to extract all <a href="...">...</a> links and returns them in a dataframe, where linkName is the human-readable link text and linkUrl is the href portion. By default (relative = TRUE), this function returns relative URLs.

This is especially useful for extracting data from an index page that lists the contents of a web-accessible directory.

Wrapper functions html_getLinkNames() and html_getLinkUrls() return the appropriate columns as vectors.
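
To illustrate the shape of the result, here is a minimal sketch of anchor extraction using the xml2 package on an inlined HTML string. It is not the package's actual implementation: the helper extract_links() and the sample page are hypothetical, and only the linkName and linkUrl column names come from the description above.

```r
library(xml2)

# Hypothetical directory-index page, inlined so the sketch needs no network access
html <- '<html><body>
  <a href="cb_2019_us_county_500k.zip">cb_2019_us_county_500k.zip</a>
  <a href="cb_2019_us_state_500k.zip">cb_2019_us_state_500k.zip</a>
  <a href="/geo/tiger/">Parent Directory</a>
</body></html>'

# Hypothetical helper (an assumption, not part of MazamaCoreUtils):
# collect every <a> element that has an href attribute
extract_links <- function(html_text) {
  doc <- xml2::read_html(html_text)
  anchors <- xml2::xml_find_all(doc, "//a[@href]")
  data.frame(
    linkName = xml2::xml_text(anchors),          # human-readable link text
    linkUrl = xml2::xml_attr(anchors, "href"),   # href exactly as written in the page
    stringsAsFactors = FALSE
  )
}

links <- extract_links(html)

# The vector forms correspond to the individual columns
linkNames <- links$linkName
linkUrls <- links$linkUrl
print(links)
```

In this sketch the href values are returned exactly as they appear in the page, which is the relative form for a typical directory index.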

Usage

html_getLinks(url = NULL, relative = TRUE)

html_getLinkNames(url = NULL)

html_getLinkUrls(url = NULL, relative = TRUE)

Arguments

url

URL or file path of an html page.

relative

Logical; if TRUE (the default), relative URLs are returned.
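
As a rough sketch of the difference between relative and absolute URLs, the snippet below resolves relative hrefs against the URL of the page they came from using xml2::url_absolute(). This is an assumption about how absolute URLs could be derived, not a statement of what html_getLinks() does internally; the relative paths are illustrative.

```r
library(xml2)

# URL of the index page the links were read from
base_url <- "https://www2.census.gov/geo/tiger/GENZ2019/shp/"

# Illustrative relative hrefs: one sibling file, one root-relative path
relative_urls <- c("cb_2019_us_county_500k.zip", "/geo/tiger/")

# Resolve each relative href against the base URL
absolute_urls <- xml2::url_absolute(relative_urls, base_url)
print(absolute_urls)
```

A sibling file name is appended to the directory of the base URL, while a path starting with "/" is resolved against the host root.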

Value

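A dataframe with linkName and linkUrl columns. The wrapper functions html_getLinkNames() and html_getLinkUrls() return the corresponding column as a vector.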

Examples

# NOT RUN {
library(MazamaCoreUtils)

# US Census 2019 shapefiles
dataLinks <- html_getLinks("https://www2.census.gov/geo/tiger/GENZ2019/shp/")

dataLinks <- dataLinks %>%
  dplyr::filter(stringr::str_detect(linkName, "us_county"))
head(dataLinks, 10)

# }
