
stylo (version 0.7.5)

delete.markup: Delete HTML or XML tags

Description

Function for removing markup tags (e.g. HTML, XML) from a string of characters. All XML markup is assumed to be compliant with the TEI guidelines (https://tei-c.org/).

Usage

delete.markup(input.text, markup.type = "plain")

Arguments

input.text

a character string or character vector containing the markup tags to be deleted.

markup.type

one of the following values: plain (nothing will happen), html (all <tags> will be deleted, as well as the HTML header), xml (the TEI header, all strings between <note> </note> tags, and all the tags will be deleted), xml.drama (as above; additionally, speakers' names, i.e. the strings within each pair of <speaker> </speaker> tags, will be deleted), xml.notitles (as above; additionally, all chapter/section (sub)titles, i.e. the strings within each pair of <head> </head> tags, will be deleted).
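
The effect of each value can be approximated in base R. The sketch below is not stylo's implementation: strip_markup_sketch and its internals are hypothetical names, the regular expressions are a rough approximation, and "as above" is assumed to mean "as for xml" in both the xml.drama and xml.notitles cases.

```r
# Hypothetical helper, NOT stylo's implementation: a rough base-R
# approximation of what each markup.type value is described as removing.
strip_markup_sketch <- function(text, type = "plain") {
  type <- match.arg(type, c("plain", "html", "xml", "xml.drama", "xml.notitles"))
  if (type == "plain") return(text)            # nothing happens

  # Collapse a vector into one string so elements spanning lines can match.
  text <- paste(text, collapse = " ")

  # Remove <tag ...> ... </tag> together with its content (non-greedy).
  drop_element <- function(x, tag) {
    gsub(paste0("(?s)<", tag, "\\b[^>]*>.*?</", tag, ">"), "", x, perl = TRUE)
  }

  if (type == "html") {
    text <- drop_element(text, "head")         # HTML header
  } else {
    text <- drop_element(text, "teiHeader")    # TEI header
    text <- drop_element(text, "note")         # editorial notes
    # Assumption: each extended mode adds exactly one element to plain xml.
    if (type == "xml.drama")    text <- drop_element(text, "speaker")
    if (type == "xml.notitles") text <- drop_element(text, "head")
  }

  gsub("<[^>]+>", "", text)                    # finally, drop remaining tags
}

strip_markup_sketch("Gallia<note>Gallia: Gaul.</note> est omnis", "xml")
strip_markup_sketch("<speaker>Hamlet</speaker>Words, words, words...", "xml.drama")
```

Removing whole elements before stripping the remaining tags matters: if the tags were stripped first, the contents of the notes, speaker labels and titles would be left behind in the text.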

Author

Maciej Eder, Mike Kestemont

Details

This function needs to be used carefully: while a document formatted in compliance with the TEI guidelines will be parsed flawlessly, cleaning up an arbitrary HTML page harvested from the web may leave unwanted text behind, e.g. footers, disclaimers, etc. will not be removed.
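
The HTML caveat can be illustrated with a small base-R sketch. It does not call delete.markup itself: the page content and variable names are invented for illustration, and the two gsub() calls only approximate the described behaviour (drop the header, then drop all tags).

```r
# Invented example page: one sentence of real content plus a footer.
page <- c(
  "<html><head><title>Caesar</title></head><body>",
  "<p>Gallia est omnis divisa in partes tres.</p>",
  "<div class=\"footer\">Copyright notice. All rights reserved.</div>",
  "</body></html>"
)

one <- paste(page, collapse = " ")

# Approximation of the html mode: remove the header, then every tag.
no_header <- gsub("(?s)<head\\b[^>]*>.*?</head>", "", one, perl = TRUE)
cleaned   <- trimws(gsub("<[^>]+>", "", no_header))

# The tags are gone, but the footer text is still part of the result.
print(cleaned)
```

Since the footer is ordinary text once its tags are removed, boilerplate of this kind has to be cut out before the markup is stripped, for instance by deleting the relevant elements by hand.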

See Also

load.corpus, txt.to.words, txt.to.words.ext, txt.to.features

Examples

  delete.markup("Gallia est omnis <i>divisa</i> in partes tres", 
           markup.type = "html")

  delete.markup("Gallia<note>Gallia: Gaul.</note> est omnis 
           <emph>divisa</emph> in partes tres", markup.type = "xml")

  delete.markup("<speaker>Hamlet</speaker>Words, words, words...", 
           markup.type = "xml.drama")
