readMail: Read In an E-Mail Document

Description

Return a function which reads in an electronic mail document.

Usage

readMail(DateFormat = character())

Value

A function with the following formals:

elem: a named list with the component content which must hold the document to be read in.
language: a string giving the language.
id: a character giving a unique identifier for the created text document.

The function returns a MailDocument representing the text and metadata extracted from elem$content. The argument id is used as fallback if no corresponding metadata entry is found in elem$content.

Arguments

DateFormat: A character vector giving date-time formats for the “Date” header field in the mail document. By default, the “basic” formats of RFC 5322 are tried.
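As an illustration of what "trying" several date-time formats can look like, here is a minimal base R sketch. The helper parse_first is hypothetical and is not part of tm.plugin.mail; the package's actual parsing logic may differ. Numeric-only formats are used so that the result does not depend on the locale.

```r
## Hypothetical helper (not from tm.plugin.mail): return the first
## successful parse of x against a vector of strptime() formats.
parse_first <- function(x, formats) {
    for (f in formats) {
        d <- strptime(x, format = f, tz = "UTC")
        ## strptime() yields NA when the string does not match the format.
        if (!is.na(d)) return(as.POSIXct(d))
    }
    as.POSIXct(NA)
}

formats <- c("%d/%m/%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S")
## The first format fails on this input, the second one matches.
stamp <- parse_first("2020-12-24 10:15:00", formats)
format(stamp, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

A vector passed as DateFormat to readMail would play the role of formats here.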

Author

Ingo Feinerer and Kurt Hornik

Details

Formally, this function is a function generator: it returns a function (which reads in a mail document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the “Date” header format) via lexical scoping.
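The function generator pattern can be sketched in base R as follows. The name make_date_reader and the returned list are assumptions made for illustration; they only mimic the signature described above and are not the package's implementation.

```r
## Hypothetical generator: the returned function has the fixed formals
## (elem, language, id) but still sees DateFormat via lexical scoping.
make_date_reader <- function(DateFormat = "%Y-%m-%d") {
    function(elem, language, id) {
        list(id = id,
             language = language,
             ## DateFormat is looked up in the enclosing environment.
             date = as.Date(elem$content[1L], format = DateFormat))
    }
}

reader <- make_date_reader("%d/%m/%Y")
doc <- reader(list(content = "24/12/2020"), language = "en", id = "doc1")
format(doc$date)
```

The caller of reader never passes the date format; it was fixed when the generator was called.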

In version 0.3.0 of the tm.plugin.mail package, the reader code was switched to use the Python email library via the CRAN package reticulate. Compared to previous versions, this makes it possible to

handle textual message bodies in character sets other than US-ASCII and the use of base64 or quoted-printable transfer encodings (RFC 2045)
handle non-US-ASCII text data in message header fields (RFC 2047)
correctly handle the metadata in structured header fields (RFC 5322)

For messages using the Multipurpose Internet Mail Extensions (MIME), the texts extracted from the messages are the (suitably decoded) bodies when using the ‘text/plain’ or ‘text/html’ content types, or the body parts using these types when using ‘multipart/mixed’ or ‘multipart/alternative’ (see RFC 2046 for more information). Non-MIME messages are treated like ‘text/plain’. The extracted texts are represented as character vectors whose length is the number of extracted body parts and whose names give the MIME subtype ("plain" or "html").

This allows text mining applications to flexibly handle HTML content “as appropriate” by filtering on the names of the content of the MailDocument objects.
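Filtering on the names can be sketched with a plain named character vector standing in for the content of a MailDocument. The vector parts and the helper pick_text are hypothetical, and the regular expression is only a crude tag stripper, not a real HTML parser.

```r
## Hypothetical stand-in for the content of a multipart/alternative message.
parts <- c(plain = "Hello world", html = "<p>Hello <b>world</b></p>")

## Prefer the plain text parts; otherwise fall back to de-tagged HTML.
pick_text <- function(parts) {
    plain <- parts[names(parts) == "plain"]
    if (length(plain)) return(unname(plain))
    unname(gsub("<[^>]+>", "", parts[names(parts) == "html"]))
}

pick_text(parts)
pick_text(c(html = "<p>Hi</p>"))
```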

In case the Python processing fails or its results cannot be transferred to R (in particular, when text body parts contain embedded NULs), the reader falls back to simple header field processing appropriate for unstructured headers, and/or extracting no text. Information about problems is provided in the problems element of the metadata.

Examples


require("tm")
newsgroup <- system.file("mails", package = "tm.plugin.mail")
news <- VCorpus(DirSource(newsgroup),
                readerControl = list(reader = readMail))
inspect(news)
## Use the high-level content and metadata accessors from package 'NLP':
require("NLP")
content(news[[2]])
meta(news[[2]])
## Processed header fields of the message.
meta(news[[2]])$header
