Read In an XML Document
Return a function which reads in an XML document. The structure of the XML document is described with a specification.
spec
- A named list of lists each containing two components. The constructed reader will map each list entry to the content or a metadatum of the text document as specified by the named list entry. Valid names include content to access the document's content, and character strings which are mapped to metadata entries.
Each list entry must consist of two components: the first must be a string describing the type of the second argument, and the second is the specification entry. Valid combinations are:
type = "node", spec = "XPathExpression"
- The XPath expression spec extracts information from an XML node.
type = "attribute", spec = "XPathExpression"
- The XPath expression spec extracts information from an attribute of an XML node.
type = "function", spec = function(tree) ...
- The function spec is called with a tree representation of the read-in XML document (as delivered by xmlInternalTreeParse from package XML) as its first argument.
type = "unevaluated", spec = "String"
- The character vector spec is returned without modification.
doc
- An (empty) document of some subclass of TextDocument.
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature. The returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
A function with the following formals:
elem
- a named list with the component content which must hold the document to be read in.
language
- a string giving the language.
id
- a character giving a unique identifier for the created text document.
The function returns doc augmented by the parsed information as described by spec out of the XML file in elem$content. The arguments language and id are used as fallback: language if no corresponding metadata entry is found in the XML file, and id if no corresponding metadata entry is found in the XML file.
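The generator pattern and the fallback rule described above can be sketched in base R. This is a simplified illustration, not the implementation used by package tm: make_reader is a hypothetical name, a plain list stands in for the text document, and only the "unevaluated" entry type is handled so that no XML parser is required.

```r
# Hypothetical stand-in for readXML(): make_reader() and its internals are
# illustrative names, not part of package tm.
make_reader <- function(spec, doc) {
  # spec and doc are captured by the returned closure (lexical scoping).
  function(elem, language, id) {
    result <- doc
    for (name in names(spec)) {
      entry <- spec[[name]]
      # Only the "unevaluated" type is handled here, so no XML parser is needed.
      if (identical(entry[[1]], "unevaluated"))
        result[[name]] <- entry[[2]]
    }
    # Fall back to the arguments when the specification yields no value.
    if (is.null(result[["language"]])) result[["language"]] <- language
    if (is.null(result[["id"]])) result[["id"]] <- id
    result
  }
}

reader <- make_reader(
  spec = list(origin = list("unevaluated", "Example Archive")),
  doc = list()  # a plain list stands in for an empty text document
)
parsed <- reader(elem = list(content = "<item/>"), language = "en", id = "doc-1")
str(parsed)
```

Because the specification here defines neither language nor id, both values in parsed come from the arguments passed to the reader, while origin comes from the captured specification.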
See Reader for basic information on the reader infrastructure employed by package tm, and the vignette 'Extensions: How to Handle Custom File Formats'.
library(tm)

# Reader for Gmane mailing list items: each entry maps an XPath expression,
# a function, or a fixed value to the content or a metadata field.
readGmane <- readXML(
  spec = list(
    author = list("node", "/item/creator"),
    content = list("node", "/item/description"),
    # Parse the date node(s) into a GMT date-time.
    datetimestamp = list("function", function(node)
      strptime(sapply(XML::getNodeSet(node, "/item/date"), XML::xmlValue),
               format = "%Y-%m-%dT%H:%M:%S", tz = "GMT")),
    description = list("unevaluated", ""),
    heading = list("node", "/item/title"),
    id = list("node", "/item/link"),
    origin = list("unevaluated", "Gmane Mailing List Archive")),
  doc = PlainTextDocument())