Return a function which reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.
readPDF(engine = c("pdftools", "xpdf", "Rpoppler",
                   "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))
engine: a character string for the preferred PDF extraction engine (see Details).
control: a list of control options for the engine with the named
  components info and text (see Details).
A function with the following formals:
elem: a named list with the component uri which must
  hold a valid file name.
language: a string giving the language.
id: Not used.
The function returns a PlainTextDocument representing the text
and metadata extracted from elem$uri.
Formally, this function is a function generator: it returns a function
(which reads in a text document) with a well-defined signature, and that
returned function can access the arguments passed to the generator (e.g., the
preferred PDF extraction engine and control options) via lexical scoping.
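The generator pattern described above can be illustrated in base R without loading tm. The following is only a minimal sketch: make_reader and the list it returns are hypothetical names chosen for illustration and are not part of tm.

```r
# Hypothetical generator mimicking the shape of readPDF(); this is not tm code.
make_reader <- function(engine = c("pdftools", "xpdf"),
                        control = list(info = NULL, text = NULL)) {
  engine <- match.arg(engine)  # resolved once, when the generator is called
  # The returned function has the fixed reader signature and still sees
  # `engine` and `control` through lexical scoping.
  function(elem, language, id) {
    list(uri = elem$uri, language = language,
         engine = engine, control = control)
  }
}

reader <- make_reader(engine = "xpdf", control = list(text = "-layout"))
res <- reader(elem = list(uri = "doc.pdf"), language = "en", id = "id1")
res$engine
```

The caller of the returned function supplies only elem, language, and id; the engine choice travels with the closure.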
Available PDF extraction engines are as follows.
"pdftools": (default) Poppler PDF rendering library
  as provided by the functions pdf_info and
  pdf_text in package pdftools.
"xpdf": command-line pdfinfo and
  pdftotext executables, which must be installed and accessible on
  your system. Suitable utilities are provided by the Xpdf
  (http://www.foolabs.com/xpdf/) PDF viewer or by the
  Poppler (http://poppler.freedesktop.org/) PDF rendering
  library.
"Rpoppler": Poppler PDF rendering library as
  provided by the functions PDF_info and
  PDF_text in package Rpoppler.
"ghostscript": Ghostscript using pdf_info.ps and
  ps2ascii.ps.
"Rcampdf": Perl CAM::PDF PDF manipulation library
  as provided by the functions pdf_info and pdf_text
  in package Rcampdf, available from the repository at
  http://datacube.wu.ac.at.
"custom": custom user-provided extraction engine.
Control parameters for engine "xpdf" are as follows.
info: a character vector specifying options passed to
  the pdfinfo executable.
text: a character vector specifying options passed to
  the pdftotext executable.
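As a hedged illustration of these control parameters, the sketch below passes the pdftotext option -layout through the text component. The variable names xpdf_control and have_tools are made up for this example, and the reader is only built when tm and both executables are found.

```r
# Options are plain character vectors, one element per command-line token.
xpdf_control <- list(info = NULL,        # no extra pdfinfo options
                     text = "-layout")   # ask pdftotext to preserve layout

# Only build the reader when tm and both executables are available.
have_tools <- all(nzchar(Sys.which(c("pdfinfo", "pdftotext"))))
if (requireNamespace("tm", quietly = TRUE) && have_tools) {
  reader <- tm::readPDF(engine = "xpdf", control = xpdf_control)
}
```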
Control parameters for engine "custom" are as follows.
info: a function extracting metadata from a PDF.
  The function must accept a file path as first argument and must return a
  named list with the components Author (as character string),
  CreationDate (of class POSIXlt), Subject (as
  character string), Title (as character string), and Creator
  (as character string).
text: a function extracting content from a PDF. The function must accept
  a file path as first argument and must return a character vector.
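The contract for a custom engine might be sketched as follows. Everything here is an assumption made for illustration: custom_info and custom_text are hypothetical names, the metadata is filled from the file system rather than parsed from the PDF, and text extraction is delegated to pdftools only if that package happens to be installed.

```r
# Placeholder metadata extractor: fills the required components from the
# file system rather than from the PDF itself (an assumption for illustration).
custom_info <- function(file, ...) {
  list(Author = NA_character_,
       CreationDate = as.POSIXlt(file.mtime(file)),
       Subject = NA_character_,
       Title = basename(file),
       Creator = NA_character_)
}

# Placeholder text extractor: delegates to pdftools when it is installed,
# otherwise returns an empty character vector.
custom_text <- function(file, ...) {
  if (requireNamespace("pdftools", quietly = TRUE)) {
    pdftools::pdf_text(file)
  } else {
    character()
  }
}

custom_control <- list(info = custom_info, text = custom_text)
# With tm installed, the reader could then be created with:
# reader <- tm::readPDF(engine = "custom", control = custom_control)
```

A real info function would read the five components from the document itself; the placeholder only shows the required names and types.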
Reader for basic information on the reader infrastructure
employed by package tm.
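# Not run:
library(tm)
# Build a file URI for the PDF vignette shipped with tm.
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# Create a reader with the default engine and apply it to a single element.
pdf <- readPDF()(elem = list(uri = uri), language = "en", id = "id1")
cat(content(pdf)[1])
# Use the Ghostscript engine as the reader for a corpus.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))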