textreadr (version 0.5.1)

read_html: Read in .html Content

Description

Reads in the content of a .html file. The extraction is general-purpose: it reads in all body text. For finer control, use the xml2 and rvest packages.
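As a rough illustration of the finer control mentioned above, the following sketch uses xml2 and rvest directly to pull only the paragraph elements from a document. The inline HTML string and the variable names are made up for the example, and it assumes rvest 1.0.0 or later (for html_elements and html_text2); it is not how textreadr itself extracts text.

```r
library(xml2)
library(rvest)

# A small, made-up HTML document supplied as a literal string.
page <- xml2::read_html(
    "<html><body><h1>Title</h1><p>First.</p><p>Second.</p></body></html>"
)

# Select only the <p> nodes with a CSS selector, then extract their text.
paras <- rvest::html_text2(rvest::html_elements(page, "p"))
paras
```

Selecting nodes explicitly lets the caller leave out headings, navigation, or other body text that a general-purpose reader would include.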

Usage

read_html(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

Arguments

file
The path to the .html file.
skip
The number of lines to skip.
remove.empty
logical. If TRUE, empty elements in the vector are removed.
trim
logical. If TRUE, the leading/trailing white space is removed.
...
Other arguments passed to read_html.

Value

Returns a character vector.
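
The arguments and return value described above can be tried on a small file. This is a minimal sketch, assuming textreadr is installed; the temporary file and its contents are invented for the example, and no output is shown because the exact elements returned depend on how the body text is split.

```r
library(textreadr)

# Write a small, made-up HTML file to a temporary location.
tmp <- tempfile(fileext = ".html")
writeLines(c(
    "<html><body>",
    "<h1>Title</h1>",
    "<p>  First paragraph.  </p>",
    "<p>Second paragraph.</p>",
    "</body></html>"
), tmp)

# Defaults: empty elements removed, white space trimmed.
txt <- read_html(tmp)

# Keep empty elements and surrounding white space for comparison.
txt_raw <- read_html(tmp, remove.empty = FALSE, trim = FALSE)

# Skip one line, per the skip argument.
txt_skip <- read_html(tmp, skip = 1)

# Each result is a character vector.
is.character(txt)
```

Comparing txt with txt_raw is a quick way to see what remove.empty and trim change for a given file.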

References

The xpath is taken from Tony Breyal's response on StackOverflow: http://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page/3195926#3195926

Examples

html_dat <- read_html(
    system.file("docs/textreadr_creed.html", package = "textreadr")
)

## Not run: ------------------------------------
# url <- "http://www.talkstats.com/index.php"
# file <- download(url)
# (txt <- read_html(url))
# (txt <- read_html(file))
## ---------------------------------------------
