
Read in the content from a .html file. This is generalized, reading in all body text. For finer control the user should utilize the xml2 and rvest packages.
read_html(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)
read_xml(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)
file: The path to the .html file.
skip: The number of lines to skip.
remove.empty: logical. If TRUE, empty elements in the vector are removed.
trim: logical. If TRUE, the leading/trailing white space is removed.
...: Other arguments passed to xml2::read_html().
Returns a character vector.
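As a brief illustration of the arguments above (the path is a placeholder, not a file shipped with the package):

## Skip the first 2 lines, drop empty elements, and trim leading/trailing
## white space (remove.empty and trim are shown at their default values).
txt <- read_html("path/to/page.html", skip = 2, remove.empty = TRUE, trim = TRUE)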
The xpath is taken from Tony Breyal's response on StackOverflow: https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page/3195926#3195926
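For the finer control mentioned in the description, a minimal sketch using xml2 directly might look like the following; the path is a placeholder and the xpath shown is illustrative, not the exact expression used internally:

library(xml2)

doc   <- xml2::read_html("path/to/page.html")   # note: xml2's read_html, not textreadr's
nodes <- xml_find_all(doc, "//body//p")          # illustrative xpath; adjust to the elements needed
txt   <- xml_text(nodes, trim = TRUE)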
## Not run:
html_dat <- read_html(
    system.file("docs/textreadr_creed.html", package = "textreadr")
)
## End(Not run)

## Not run:
url <- "http://www.talkstats.com/index.php"
file <- download(url)
(txt <- read_html(url))
(txt <- read_html(file))
## End(Not run)