readPDF: Read In a PDF Document

Description

Returns a function that reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.

Usage

readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

engine

a character string for the preferred PDF extraction engine (see Details).

control

a list of control options for the engine with the named components info and text (see Details).

Value

A function with the following formals:

elem: a named list with the component uri which must hold a valid file name.
language: a string giving the language.
id: Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

Details

Formally, readPDF is a function generator: it returns a function (which reads in a PDF document) with a well-defined signature, and that function can still access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
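The generator pattern described above can be sketched in base R. This is only an illustration: make_reader is a hypothetical name, and the returned list stands in for the real document object; it is not the code used by readPDF.

```r
# Hypothetical generator mirroring the shape of readPDF(); not tm's actual code.
make_reader <- function(engine = c("pdftools", "xpdf"), control = list()) {
  engine <- match.arg(engine)
  # The returned function has fixed formals (elem, language, id) but still
  # sees `engine` and `control` through its enclosing environment.
  function(elem, language, id) {
    list(uri = elem$uri, language = language,
         engine = engine, control = control)
  }
}

reader <- make_reader("xpdf", control = list(text = "-layout"))
res <- reader(elem = list(uri = "file:///tmp/example.pdf"),
              language = "en", id = "id1")

names(formals(reader))  # the signature every reader shares
res$engine              # the engine captured when the reader was created
```

Because every reader has the same formals, code that consumes readers does not need to know which engine or control options were chosen when the reader was created.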

Available PDF extraction engines are as follows.

"pdftools": (default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf": command-line pdfinfo and pdftotext executables, which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.foolabs.com/xpdf/) PDF viewer or by the Poppler (http://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler": Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript": Ghostscript using pdf_info.ps and ps2ascii.ps.
"Rcampdf": Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom": custom user-provided extraction engine.
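Since the engines listed above depend on different packages or executables, it can help to pick one based on what is installed. The following is a minimal sketch using base R only; pick_engine is a hypothetical helper, and the order of preference is an assumption rather than anything prescribed by tm.

```r
# Hypothetical helper: choose an engine from what appears to be installed.
pick_engine <- function() {
  if (nzchar(system.file(package = "pdftools"))) {
    # system.file() returns "" when the package is not installed.
    "pdftools"
  } else if (all(nzchar(Sys.which(c("pdfinfo", "pdftotext"))))) {
    # Sys.which() returns "" for executables that are not on the PATH.
    "xpdf"
  } else {
    # Fallback; whether Ghostscript itself is installed is not checked here.
    "ghostscript"
  }
}

engine <- pick_engine()
```

The result can then be passed on as readPDF(engine = engine).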

Control parameters for engine "xpdf" are as follows.

info: a character vector specifying options passed to the pdfinfo executable.
text: a character vector specifying options passed to the pdftotext executable.
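As an illustration of these control parameters, the sketch below passes a single option to pdftotext. It assumes that the installed pdftotext accepts the -layout option (which asks it to keep the physical layout of the page); check the manual of your own installation before relying on it.

```r
# Options for the pdftotext executable. "-layout" is assumed to be
# supported by the installed pdftotext.
xpdf_control <- list(text = "-layout")

# Only build the reader when tm is available. No PDF is read at this
# point; the executables are needed once the reader is applied to a file.
if (requireNamespace("tm", quietly = TRUE)) {
  reader <- tm::readPDF(engine = "xpdf", control = xpdf_control)
}
```

Leaving out the info component keeps the default (NULL), so pdfinfo is called without extra options.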

Control parameters for engine "custom" are as follows.

info: a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text: a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
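The contract for a custom engine can be sketched as follows. This is one possible implementation, not the one tm uses for its built-in engines: it assumes that package pdftools is installed and that pdf_info returns a list with the components keys and created. The names key_or_empty, custom_info and custom_text are hypothetical.

```r
# Hypothetical helper: pick a metadata key or fall back to an empty string.
key_or_empty <- function(keys, name) {
  value <- keys[[name]]
  if (is.null(value)) "" else as.character(value)
}

# Custom `info` function: accepts a file path as first argument and returns
# a named list with the five required components.
# Assumes pdftools::pdf_info() provides `keys` and `created`.
custom_info <- function(file) {
  meta <- pdftools::pdf_info(file)
  list(Author       = key_or_empty(meta$keys, "Author"),
       CreationDate = as.POSIXlt(meta$created),
       Subject      = key_or_empty(meta$keys, "Subject"),
       Title        = key_or_empty(meta$keys, "Title"),
       Creator      = key_or_empty(meta$keys, "Creator"))
}

# Custom `text` function: accepts a file path as first argument and returns
# a character vector (one element per page with pdftools).
custom_text <- function(file) pdftools::pdf_text(file)

# With tm installed, the two functions would be passed on like this:
# reader <- tm::readPDF(engine = "custom",
#                       control = list(info = custom_info, text = custom_text))
```

Falling back to an empty string for missing keys is a design choice of this sketch, made so that every component is always a character string even when the PDF lacks that metadata field.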

Examples

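# NOT RUN {
library(tm)
# Build a file URI for the vignette PDF shipped with tm.
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Create a reader with the default engine and apply it to a single element.
pdf <- readPDF()(elem = list(uri = uri), language = "en", id = "id1")
# Print the first element of the extracted text.
cat(content(pdf)[1])
# Use a reader inside a corpus; mode = "" leaves reading the file to the reader.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))
# }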