tm (version 0.7-7)

readPDF: Read In a PDF Document

Description

Return a function that reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.

Usage

readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

engine

a character string for the preferred PDF extraction engine (see Details).

control

a list of control options for the engine with the named components info and text (see Details).

Value

A function with the following formals:

elem

a named list with the component uri which must hold a valid file name.

language

a string giving the language.

id

Not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

Details

Formally, this function is a function generator: it returns a function (which reads in a PDF document) with a well-defined signature, and the returned function can access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
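The function-generator pattern described above can be sketched in base R. This is only an illustration of lexical scoping, not how tm implements readPDF; make_reader and its prefix argument are hypothetical names introduced for this example.

```r
# make_reader() is a hypothetical generator, not part of tm.
make_reader <- function(prefix = "") {
  # The returned function has the reader signature (elem, language, id)
  # and still sees 'prefix' through lexical scoping.
  function(elem, language, id) {
    lines <- readLines(elem$uri, warn = FALSE)
    list(content = paste0(prefix, lines), language = language, id = id)
  }
}

# Small temporary text file to read from.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

reader <- make_reader(prefix = "> ")
doc <- reader(elem = list(uri = tmp), language = "en", id = "doc1")
doc$content
```

The option is supplied once, when the reader is generated, and every later call to the reader reuses it without passing it again.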

Available PDF extraction engines are as follows.

"pdftools"

(default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.

"xpdf"

command-line pdfinfo and pdftotext executables, which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.foolabs.com/xpdf/) PDF viewer or by the Poppler (http://poppler.freedesktop.org/) PDF rendering library.

"Rpoppler"

Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.

"ghostscript"

Ghostscript using pdf_info.ps and ps2ascii.ps.

"Rcampdf"

Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.

"custom"

custom user-provided extraction engine.

Control parameters for engine "xpdf" are as follows.

info

a character vector specifying options passed to the pdfinfo executable.

text

a character vector specifying options passed to the pdftotext executable.
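As a minimal sketch of the "xpdf" control parameters, the following builds a reader that passes extra options to pdftotext. The options shown (-layout, and -enc UTF-8) are ordinary pdftotext flags chosen here for illustration; the name xpdf_control is hypothetical, and the call is guarded in case tm is not installed.

```r
# Options for the command-line tools; info = NULL passes nothing extra
# to pdfinfo. Each option and its value is a separate vector element.
xpdf_control <- list(info = NULL,
                     text = c("-layout", "-enc", "UTF-8"))

if (requireNamespace("tm", quietly = TRUE)) {
  # Creating the reader does not run pdftotext yet; the executables are
  # only needed once the reader is applied to a PDF file.
  reader <- tm::readPDF(engine = "xpdf", control = xpdf_control)
}
```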

Control parameters for engine "custom" are as follows.

info

a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).

text

a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
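The contract for the "custom" engine might be satisfied as in the sketch below. Both my_info and my_text are hypothetical names. As a simplifying assumption, my_info fills the required fields from file-system information rather than from the PDF itself, and my_text assumes that package pdftools is installed.

```r
# Hypothetical metadata extractor: returns the five required components,
# using placeholder values instead of real PDF metadata.
my_info <- function(path) {
  list(Author       = "",
       CreationDate = as.POSIXlt(file.mtime(path)),  # must be POSIXlt
       Subject      = "",
       Title        = basename(path),
       Creator      = "")
}

# Hypothetical content extractor: must return a character vector.
my_text <- function(path) {
  if (!requireNamespace("pdftools", quietly = TRUE))
    stop("this sketch assumes that package pdftools is installed")
  pdftools::pdf_text(path)
}

if (requireNamespace("tm", quietly = TRUE)) {
  reader <- tm::readPDF(engine = "custom",
                        control = list(info = my_info, text = my_text))
}
```

A real info function would read these fields from the document itself. The point here is only the shape of the return values.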

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

# NOT RUN {
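library(tm)

# URI of the PDF vignette shipped with package tm
uri <- paste0("file://",
              system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Prefer pdftools if it is installed; otherwise fall back to Ghostscript
engine <- if(nzchar(system.file(package = "pdftools"))) {
    "pdftools"
} else {
    "ghostscript"
}
# Generate a reader and apply it to a single document
reader <- readPDF(engine)
pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
cat(content(pdf)[1])
# Use a reader when building a corpus (this call requires Ghostscript)
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))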
# }
