htm2txt (version 2.0.0)

htm2txt: Convert an HTML document to plain text by removing all HTML tags

Description

Converts an HTML document to plain text by removing all HTML tags.

Usage

htm2txt(htm, merge = TRUE, list = "\t\\* ", pagebreak = "----------")

Arguments

htm

one or more R objects containing HTML tags, to be converted to plain text.

merge

if TRUE, multiple R objects are treated as lines of a single HTML document and merged into one string.

list

a string (regular expression) that replaces each <li> tag, serving as the number or bullet marker for list items.

pagebreak

a string (regular expression) that replaces each <hr> tag, which marks a thematic change in the content or a page break.

Value

the plain text converted from the HTML document.
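
To make the arguments above concrete, here is a minimal base-R sketch of the general idea: substitute markers for <li> and <hr>, then drop every remaining tag. It is not the package's implementation. The function name strip_tags_sketch, the newline placement around markers, and joining merged inputs with "\n" are all assumptions made for illustration.

```r
# Hypothetical helper, NOT part of htm2txt: a rough base-R approximation
# of tag stripping with configurable list and page-break markers.
strip_tags_sketch <- function(htm, merge = TRUE,
                              list = "\t* ", pagebreak = "----------") {
  # Assumption: merged inputs are joined with newlines, like lines of one file.
  if (merge) htm <- paste(htm, collapse = "\n")

  # Replace list-item and horizontal-rule tags before stripping other tags.
  # \\b keeps <li> from matching longer tag names such as <link>.
  txt <- gsub("<li\\b[^>]*>", paste0("\n", list), htm,
              ignore.case = TRUE, perl = TRUE)
  txt <- gsub("<hr\\b[^>]*>", paste0("\n", pagebreak, "\n"), txt,
              ignore.case = TRUE, perl = TRUE)

  # Drop every remaining tag, then trim leading and trailing newlines only
  # (tabs are kept so that an indented list marker survives).
  txt <- gsub("<[^>]+>", "", txt, perl = TRUE)
  gsub("^\n+|\n+$", "", txt, perl = TRUE)
}

cat(strip_tags_sketch("<li>point1<li>point2<hr>", list = "\t> "), "\n")
print(strip_tags_sketch(c("<p>Hello!</p>", "<p>World!</p>"), merge = FALSE))
```

Because the markers are passed to gsub as replacement strings, a backslash in list or pagebreak is treated as an escape character in this sketch. The sketch also does not decode HTML entities or skip <script> and <style> content, which a real converter would need to handle.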

Examples

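# NOT RUN {
# Strip all tags from a complete HTML document
text <- htm2txt("<html><body>html texts</body></html>")
# Pass several objects without merging them into one string
text <- htm2txt(c("<p>Hello!</p>", "<p>World!</p>"), merge = FALSE)
# Use custom replacements for <li> and <hr> tags
text <- htm2txt("<li>point1<li>point2<hr>", list = "\t> ", pagebreak = "\\*     \\*     \\*")
# }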