Formally, this function is a function generator: it returns a
function (which reads in a mail document) with a well-defined
signature, and the returned function can still access the arguments
passed to the generator (e.g., the “Date” header format) via
lexical scoping.
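
The function-generator pattern can be sketched as follows. This is a Python analogue of the R idiom, not the package's code: the names `make_mail_reader` and `reader`, and the default format string, are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def make_mail_reader(date_format="%a, %d %b %Y %H:%M:%S %z"):
    """Return a reader with a fixed signature that remembers date_format."""

    def reader(lines):
        # date_format is not a parameter of reader(); it is looked up in the
        # enclosing scope, which is what lexical scoping provides.
        for line in lines:
            if line == "":
                break  # a blank line ends the header block
            if line.lower().startswith("date:"):
                value = line.split(":", 1)[1].strip()
                return datetime.strptime(value, date_format)
        return None  # no Date header found

    return reader


# Two readers with the same signature but different captured formats.
read_default = make_mail_reader()
read_iso = make_mail_reader("%Y-%m-%d %H:%M:%S")
parsed = read_iso(["Subject: demo", "Date: 2020-01-02 03:04:05", "", "body"])
```

Both readers take the same single argument, so calling code can use them interchangeably while each keeps its own date format.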
In version 0.3.0 of the tm.plugin.mail package, the reader code
was switched to use the Python email library via the CRAN package
reticulate. Compared to previous versions, this makes it possible to
handle textual message bodies in character sets other than US-ASCII
and the use of base64 or quoted-printable transfer encodings
(RFC 2045); to handle non-US-ASCII text data in message header fields
(RFC 2047); and to correctly handle the metadata in structured header
fields (RFC 5322).
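
The three capabilities can be seen directly in the Python email library. The sketch below uses a made-up message and the library's `email.policy.default` policy; whether the package uses the same policy or calls is an assumption.

```python
from email import message_from_bytes
from email.policy import default

# Made-up message: RFC 2047 encoded headers, ISO-8859-1 body sent as
# quoted-printable (RFC 2045), and a structured From header (RFC 5322).
raw = (
    b"From: =?ISO-8859-1?Q?J=F6rg?= <joerg@example.org>\r\n"
    b"Subject: =?UTF-8?B?R3LDvMOfZQ==?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Gr=FC=DFe aus K=F6ln\r\n"
)

msg = message_from_bytes(raw, policy=default)

subject = str(msg["Subject"])      # encoded words are decoded
sender = msg["From"].addresses[0]  # structured header parsed into an Address
body = msg.get_content()           # transfer encoding and charset are undone
```

Here `sender.display_name` and `sender.addr_spec` give the decoded name and the bare address separately, which is the kind of structured metadata a plain line-based header reader cannot provide.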
For messages using the Multipurpose Internet Mail Extensions (MIME),
the texts extracted from the messages are the (suitably decoded)
bodies when the content type is ‘text/plain’ or ‘text/html’, or the
body parts of these types when the content type is ‘multipart/mixed’
or ‘multipart/alternative’ (see RFC 2046 for more information).
Non-MIME messages are treated like ‘text/plain’.
The extracted texts are represented as a character vector whose
length is the number of extracted body parts and whose names give the
MIME subtypes ("plain" or "html").
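
The extraction rule can be approximated with the Python email library as below. The helper `extract_text_parts` is hypothetical, and a list of (subtype, text) pairs stands in for the named character vector; the sketch does not try to distinguish inline text from text attachments.

```python
from email import message_from_string
from email.policy import default

raw = """\
From: a@example.org
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii

Hello

--XYZ
Content-Type: text/html; charset=us-ascii

<p>Hello</p>

--XYZ--
"""


def extract_text_parts(msg):
    """Collect (subtype, decoded text) for every text/plain or text/html part."""
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        # A message without MIME headers reports text/plain by default.
        if part.get_content_type() in ("text/plain", "text/html"):
            parts.append((part.get_content_subtype(), part.get_content()))
    return parts


parts = extract_text_parts(message_from_string(raw, policy=default))
names = [name for name, _ in parts]

# A non-MIME message yields a single "plain" part.
legacy = extract_text_parts(
    message_from_string("Subject: old\n\nplain body\n", policy=default)
)
```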
This allows text mining applications to flexibly handle HTML content
“as appropriate” by filtering on the names of the content of the
MailDocument objects.
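
One possible filtering policy is sketched below: prefer the plain parts and only fall back to HTML with the tags stripped. It works on (subtype, text) pairs as a stand-in for the named content vector; `preferred_text` and `html_to_text` are hypothetical helpers, and this is one policy among many an application might choose.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the character data of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.chunks)


def preferred_text(parts):
    """Use the 'plain' parts if there are any, otherwise stripped 'html'."""
    plain = [text for name, text in parts if name == "plain"]
    if plain:
        return "\n".join(plain)
    return "\n".join(html_to_text(text) for name, text in parts if name == "html")
```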
In case the Python processing fails or its results cannot be
transferred to R (in particular, when text body parts contain embedded
NULs), the reader falls back to simple header field processing
appropriate for unstructured headers, and/or extracts no text.
Information about problems is provided in the problems element of the
metadata.
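
The fallback behaviour can be illustrated with the following sketch. It is not the package's implementation: `read_with_fallback`, the result keys, and the explicit NUL check (standing in for the failed transfer to R) are assumptions, and only single-part messages are handled.

```python
from email import message_from_bytes
from email.policy import compat32, default


def read_with_fallback(raw):
    """Parse raw message bytes, recording problems instead of failing."""
    result = {"headers": {}, "content": [], "problems": []}
    msg = message_from_bytes(raw, policy=default)

    try:
        # Structured header processing (decoded, parsed header values).
        result["headers"] = {name: str(value) for name, value in msg.items()}
    except Exception as exc:
        # Fall back to treating every header as an unstructured string.
        result["problems"].append(f"headers: {exc}")
        simple = message_from_bytes(raw, policy=compat32)
        result["headers"] = {name: str(value) for name, value in simple.items()}

    try:
        text = msg.get_content()
        if "\x00" in text:
            # R character strings cannot hold NULs, so such text is dropped.
            raise ValueError("embedded NUL in text body")
        result["content"] = [text]
    except Exception as exc:
        result["problems"].append(f"content: {exc}")

    return result


bad = read_with_fallback(b"Subject: x\r\n\r\nab\x00cd\r\n")
good = read_with_fallback(b"Subject: y\r\n\r\nhello\r\n")
```

For the first message the headers are still returned, the content is empty, and the reason is recorded under `problems`, mirroring the behaviour described above.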