tm (version 0.6-1)

readPDF: Read In a PDF Document

Description

Return a function which reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.

Usage

readPDF(engine = c("xpdf", "Rpoppler", "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

engine
a character string for the preferred PDF extraction engine (see Details).
control
a list of control options for the engine with the named components info and text (see Details).

Value

A function with the following formals:

elem
a named list with the component uri, which must hold a valid file name.
language
a string giving the language.
id
not used.

The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.

Details

Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, while the returned function can still access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
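
The function-generator pattern can be illustrated without any PDF machinery. The following is a minimal sketch in base R, not the tm implementation: make_reader and its prefix argument are hypothetical names chosen for illustration. Only the formals elem, language, and id mirror the reader signature described on this page.

```r
# Hypothetical generator; make_reader() is NOT part of tm.
make_reader <- function(prefix = "") {
    # 'prefix' is captured by the returned function via lexical scoping,
    # in the same way readPDF() captures 'engine' and 'control'.
    function(elem, language, id) {
        list(content = paste0(prefix, elem$content),
             language = language,
             id = id)
    }
}

# The generator is called once with its options ...
reader <- make_reader(prefix = "> ")

# ... and the returned function is then called with the fixed reader signature.
doc <- reader(elem = list(content = "hello"), language = "en", id = "id1")
doc$content
names(formals(reader))
```

The returned function always has the same three formals, whatever options were given to the generator. This is what lets a reader be handed to code that knows nothing about those options.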

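Available PDF extraction engines are as follows.

"xpdf"
(default) the command line pdfinfo and pdftotext executables, which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf PDF viewer or by the Poppler PDF rendering library.
"Rpoppler"
the Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript"
Ghostscript using pdf_info.ps and ps2ascii.ps.
"Rcampdf"
the Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf.
"custom"
a custom user-provided extraction engine.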

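Control parameters for engine "xpdf" are as follows.

info
a character vector specifying options passed over to the pdfinfo executable.
text
a character vector specifying options passed over to the pdftotext executable.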

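Control parameters for engine "custom" are as follows.

info
a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text
a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.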
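
A custom engine might be wired up as sketched below. The sketch rests on two assumptions: that the info function returns a named list with the components Author, CreationDate (POSIXlt), Subject, Title, and Creator, and that the text function returns a character vector, each taking a file path as first argument. The helpers my_info and my_text are hypothetical. my_info is only a stub that fills in placeholder metadata rather than parsing the PDF, and my_text delegates to the pdftotext executable.

```r
# Hypothetical metadata helper: a stub, NOT a real PDF metadata parser.
# It returns the five components assumed to be required by a custom engine.
my_info <- function(path) {
    list(Author = "unknown",
         CreationDate = as.POSIXlt(file.info(path)$mtime, tz = "GMT"),
         Subject = "",
         Title = basename(path),
         Creator = "my_info")
}

# Hypothetical text helper: delegates to the pdftotext command line tool
# and returns its standard output as a character vector (one line each).
my_text <- function(path) {
    system2("pdftotext", c(shQuote(path), "-"), stdout = TRUE)
}

# Only run if both tm and pdftotext are available on this system.
if (requireNamespace("tm", quietly = TRUE) && nzchar(Sys.which("pdftotext"))) {
    library(tm)
    reader <- readPDF(engine = "custom",
                      control = list(info = my_info, text = my_text))
    uri <- sprintf("file://%s",
                   system.file(file.path("doc", "tm.pdf"), package = "tm"))
    pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
    head(content(pdf))
}
```

In practice the stub would be replaced by a function that reads the real document metadata. The point of the sketch is only the shape of the two functions handed over in control.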

See Also

Reader for basic information on the reader infrastructure employed by package tm.

Examples

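# Build a file:// URI pointing at the PDF vignette shipped with tm.
uri <- sprintf("file://%s", system.file(file.path("doc", "tm.pdf"), package = "tm"))
# The default "xpdf" engine needs the pdfinfo and pdftotext executables.
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
    # Generate a reader, then call it directly with the reader signature.
    pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                     language = "en",
                                                     id = "id1")
    content(pdf)[1:13]
}
# Use a generated reader (here with the Ghostscript engine) when building a corpus.
VCorpus(URISource(uri, mode = ""),
        readerControl = list(reader = readPDF(engine = "ghostscript")))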
