stylo (version 0.5.7)

delete.markup: Delete HTML or XML tags

Description

Function for removing markup tags (e.g. HTML, XML) from a string of characters. All XML markup is assumed to be compliant with the TEI guidelines (http://www.tei-c.org/).

Usage

delete.markup(input.text, markup.type = "plain")

Arguments

input.text
a character string or character vector containing the markup tags to be deleted.
markup.type
any of the following values: plain (nothing will happen), html (all tags will be deleted, as well as the HTML header), xml (the TEI header, all strings between <note> </note> tags, and all the tags will be deleted), ...
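
The xml cleaning described above can be approximated in base R. The following is only a rough sketch, not the package's code: it assumes that note elements are dropped together with their content before the remaining tags are stripped, and the helper name strip_xml_sketch is hypothetical.

```r
# Hypothetical helper: a base-R approximation of xml-style cleaning.
# Assumption: <note> elements are removed with their content first,
# then every remaining tag is stripped and its text content is kept.
strip_xml_sketch <- function(x) {
  x <- paste(x, collapse = " ")                     # join lines into one string
  x <- gsub("<note.*?</note>", "", x, perl = TRUE)  # drop notes and their text
  x <- gsub("<[^>]*>", "", x)                       # drop all remaining tags
  x
}

xml_cleaned <- strip_xml_sketch(
  "Gallia<note>Gallia: Gaul.</note> est omnis <emph>divisa</emph> in partes tres"
)
print(xml_cleaned)
```

The order of the two substitutions matters in this sketch: stripping the tags first would leave the note text behind in the output.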

Details

This function needs to be used carefully: while a document formatted in compliance with the TEI guidelines will be parsed flawlessly, cleaning up an arbitrary HTML page harvested from the web may have side effects, e.g. footers, disclaimers, and similar text outside the main content will not be removed.
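
The side effect mentioned above can be illustrated without the package. This is a minimal sketch under stated assumptions: strip_tags_sketch is a hypothetical stand-in that only removes anything shaped like a tag, and the page fragment is invented for illustration. It is not how delete.markup is implemented.

```r
# Hypothetical stand-in for tag stripping; NOT the stylo implementation.
strip_tags_sketch <- function(x) {
  x <- paste(x, collapse = " ")   # join lines into one string
  gsub("<[^>]*>", "", x)          # remove anything that looks like a tag
}

# Invented HTML fragment: one content paragraph and one footer
page <- c(
  "<body>",
  "<p>Gallia est omnis <i>divisa</i> in partes tres</p>",
  "<div class=\"footer\">Terms of use apply.</div>",
  "</body>"
)

html_cleaned <- strip_tags_sketch(page)

# The tags are gone, but the footer text is still part of the result
footer_survives <- grepl("Terms of use apply.", html_cleaned, fixed = TRUE)
print(footer_survives)
```

Tag removal alone cannot tell content from page boilerplate, so text such as footers has to be filtered out separately before or after cleaning.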

See Also

load.corpus, txt.to.words, txt.to.words.ext, txt.to.features

Examples

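library(stylo)

# HTML input: the <i> tags are deleted
delete.markup("Gallia est omnis <i>divisa</i> in partes tres",
              markup.type = "html")

# TEI-style XML input containing a <note> and an <emph> element
delete.markup("Gallia<note>Gallia: Gaul.</note> est omnis 
           <emph>divisa</emph> in partes tres", markup.type = "xml")

# TEI-style drama input containing a <speaker> element
delete.markup("<speaker>Hamlet</speaker>Words, words, words...",
              markup.type = "xml.drama")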
