tm (version 0.6-1)

readXML: Read In an XML Document

Description

Return a function which reads in an XML document. The structure of the XML document is described with a specification.

Usage

readXML(spec, doc)

Arguments

spec
A named list of lists, each containing two components. The constructed reader maps each list entry to the content or a metadatum of the text document, as specified by the name of the entry. Valid names include content to access the document's content; other names are mapped to metadata entries.
doc
An (empty) document of some subclass of TextDocument.

Value

A function with the following formals: elem, language, and id. The function returns doc augmented by the information parsed from the XML file in elem$content, as described by spec. The arguments language and id are used as fallbacks: language if no corresponding metadata entry is found in elem$content, and id if no corresponding metadata entry is found in elem$content and elem$uri is NULL.
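
The fallback rule for language and id can be sketched in plain R as follows. This is only an illustration of the precedence described here, not the code used by tm: the helpers resolve_language() and resolve_id() are hypothetical, and the use of elem$uri as the identifier when it is non-NULL is an assumption inferred from the wording above.

```r
# Hypothetical helpers, not part of tm: they only illustrate the precedence.

resolve_language <- function(parsed_language, language) {
  # Prefer a language found in the XML; otherwise use the fallback argument.
  if (length(parsed_language) > 0) parsed_language else language
}

resolve_id <- function(parsed_id, uri, id) {
  if (length(parsed_id) > 0) return(parsed_id)  # metadata entry found in the XML
  if (!is.null(uri)) return(uri)                # assumption: the URI is used if present
  id                                            # last resort: the id argument
}

# No id in the XML and no URI: the id argument is used.
fallback_id <- resolve_id(character(0), NULL, "doc1")

# No id in the XML, but a URI is available (assumed to take precedence over id).
uri_id <- resolve_id(character(0), "file.xml", "doc1")

# An id parsed from the XML always wins.
parsed_id <- resolve_id("http://example.org/1", "file.xml", "doc1")

# No language in the XML: the language argument is used.
fallback_language <- resolve_language(character(0), "en")
```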

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
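
The function-generator pattern can be illustrated with a minimal base R sketch. The generator make_reader() and the shape of its return value are hypothetical and much simpler than readXML; the point is only that the returned function has a fixed signature while still seeing spec through lexical scoping.

```r
# Hypothetical generator, not part of tm.
make_reader <- function(spec) {
  # The returned closure has the fixed formals (elem, language, id),
  # yet it can still read `spec` from the enclosing environment.
  function(elem, language, id) {
    list(content  = elem$content,
         fields   = names(spec),
         language = language,
         id       = id)
  }
}

reader <- make_reader(spec = list(heading = list("node", "/item/title")))

# The signature does not mention spec ...
reader_formals <- names(formals(reader))

# ... but the closure still uses it when called.
result <- reader(list(content = "<item/>"), language = "en", id = "doc1")
```

Here reader_formals holds the three formal names elem, language, and id, and result$fields holds the names taken from the captured spec.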

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Vignette 'Extensions: How to Handle Custom File Formats', and XMLSource.

Examples

library("tm")

# Build a reader for Gmane mailing list items (the XML package must be installed).
readGmane <-
    readXML(spec = list(author = list("node", "/item/creator"),
                        content = list("node", "/item/description"),
                        # Parse the timestamp of the item into a date-time object.
                        datetimestamp = list("function", function(node)
                            strptime(sapply(XML::getNodeSet(node, "/item/date"),
                                            XML::xmlValue),
                                     format = "%Y-%m-%dT%H:%M:%S",
                                     tz = "GMT")),
                        description = list("unevaluated", ""),
                        heading = list("node", "/item/title"),
                        id = list("node", "/item/link"),
                        origin = list("unevaluated", "Gmane Mailing List Archive")),
            doc = PlainTextDocument())