textreadr (version 0.9.0)

read_html: Read in .html Content

Description

Reads the content of a .html file. The function is general purpose: it reads in all body text of the page. For finer control over which elements are extracted, use the xml2 and rvest packages directly.
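
As a rough illustration of the finer control available from xml2 and rvest, the sketch below pulls only the paragraph text out of a small page. The HTML string and the object names (html, doc, paras) are made up for this example, and it assumes rvest 1.0.0 or later (for html_elements); it is not part of textreadr.

```r
library(xml2)
library(rvest)

# Hypothetical page, supplied as a literal string rather than a file path
html <- "<html><body>
<h1>Heading</h1>
<p>First paragraph.</p>
<div><p>Second paragraph with a <a href='#'>link</a>.</p></div>
</body></html>"

# Parse the document with xml2
doc <- xml2::read_html(html)

# Select only the <p> elements with a CSS selector, then extract their text
paras <- rvest::html_text(rvest::html_elements(doc, "p"), trim = TRUE)
paras
```

Swapping the "p" selector for another CSS selector (or using xml2::xml_find_all with an XPath expression) narrows or widens what is returned.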

Usage

read_html(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

Arguments

file

The path to the .html file.

skip

The number of lines to skip.

remove.empty

logical. If TRUE, empty elements in the vector are removed.

trim

logical. If TRUE, leading/trailing white space is removed.

...

Other arguments passed to read_html.
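
To make the effect of skip, remove.empty, and trim concrete, here is a minimal base R sketch of that kind of post-processing on a character vector. The helper clean_text and the sample vector raw are hypothetical, and the order of the steps is an assumption; this is not textreadr's actual implementation.

```r
# Hypothetical raw text elements, as they might come out of an HTML parse
raw <- c("  Title  ", "", "\n", "First paragraph. ", "   ", "Second paragraph.")

# Hypothetical helper that mimics the documented options (not textreadr's code)
clean_text <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE) {
    # Drop the first `skip` elements
    if (skip > 0) x <- x[-seq_len(skip)]
    # Strip leading/trailing white space
    if (trim) x <- trimws(x)
    # Remove elements that are empty or contain only white space
    if (remove.empty) x <- x[!grepl("^\\s*$", x)]
    x
}

cleaned <- clean_text(raw)
cleaned_skip <- clean_text(raw, skip = 1)
```

With the defaults, only the three non-empty strings remain, stripped of surrounding white space; with skip = 1 the first element is dropped before cleaning.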

Value

Returns a character vector.

References

The xpath is taken from Tony Breyal's response on StackOverflow: http://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page/3195926#3195926
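
For readers curious what such an XPath does, the sketch below applies an expression of the kind given in that answer using xml2. The exact expression textreadr uses internally is not shown on this page, so treat the xpath string, the sample HTML, and the variable names here as assumptions for illustration.

```r
library(xml2)

# Hypothetical page containing style and script content that should be ignored
html <- "<html><head><style>p {color: red;}</style></head>
<body><h1>Title</h1><p>Some <b>body</b> text.</p>
<script>var x = 1;</script></body></html>"

doc <- xml2::read_html(html)

# Select every text node that is not inside script, style, noscript, or form
xpath <- paste0(
    "//text()[not(ancestor::script)][not(ancestor::style)]",
    "[not(ancestor::noscript)][not(ancestor::form)]"
)
nodes <- xml2::xml_find_all(doc, xpath)

# Convert to a character vector, trim white space, and drop empty elements
txt <- trimws(xml2::xml_text(nodes))
txt <- txt[txt != ""]
txt
```

Each text node becomes its own element, so inline markup such as <b> splits a sentence across several elements of the result.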

Examples

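# NOT RUN {
library(textreadr)

# Read the example .html file that ships with the package
html_dat <- read_html(
    system.file("docs/textreadr_creed.html", package = "textreadr")
)
# }
# NOT RUN {
# Read the same page twice: straight from the URL and from a downloaded copy
url <- "http://www.talkstats.com/index.php"
file <- download(url)     # fetch the page with textreadr's download()
(txt <- read_html(url))   # read directly from the URL
(txt <- read_html(file))  # read from the value returned by download()
# }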
