tm (version 0.5-10)

readPDF: Read In a PDF Document

Description

Return a function which reads in a portable document format (PDF) document extracting both its text and its meta data.

Usage

readPDF(engine = c("xpdf", "Rpoppler", "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))

Arguments

engine
a character string for the preferred PDF extraction engine (see Details).
control
a list of control options for the engine with the named components info and text (see Details).

Value

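  • A function with the signature elem, language, id, where elem is a list whose component uri holds the location of the PDF file, language is a string giving the document language (e.g., "en"), and id is a document identifier. The function returns a PlainTextDocument representing the text and meta data extracted from elem$uri.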

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
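The generator pattern can be illustrated without tm. The following is a minimal sketch in base R; make_reader, its upper option, and the list it returns are hypothetical names chosen for illustration and are not part of tm.

```r
# Hypothetical generator: NOT part of tm, for illustration only.
make_reader <- function(engine = c("xpdf", "custom"),
                        control = list(upper = FALSE)) {
    engine <- match.arg(engine)
    # The returned function has a fixed signature (elem, language, id) ...
    function(elem, language, id) {
        # ... but still sees 'engine' and 'control' via lexical scoping.
        text <- elem$content
        if (isTRUE(control$upper)) text <- toupper(text)
        list(content = text, language = language, id = id, engine = engine)
    }
}

reader <- make_reader(engine = "custom", control = list(upper = TRUE))
doc <- reader(elem = list(content = "some text"), language = "en", id = "id1")
doc$content
doc$engine
```

The caller of reader never passes engine or control; both are captured when make_reader is called, which is the same mechanism that lets the function returned by readPDF see its engine and control options.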

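Available PDF extraction engines are as follows.

  • xpdf: (default) the command line executables pdfinfo and pdftotext, which must be installed and accessible on your system.
  • Rpoppler: the Poppler PDF rendering library, as provided by package Rpoppler.
  • ghostscript: Ghostscript.
  • Rcampdf: the PDF toolkit provided by package Rcampdf.
  • custom: a custom user-provided extraction engine.

Control parameters for engine "xpdf" are as follows.

  • info: a character vector of options passed to the pdfinfo executable.
  • text: a character vector of options passed to the pdftotext executable.

Control parameters for engine "custom" are as follows.

  • info: a function extracting the meta data from a PDF file.
  • text: a function extracting the text content from a PDF file.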
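A custom engine might be wired up roughly as follows. This is a sketch under assumptions: my_info and my_text are hypothetical names, the set of meta data components returned by my_info (Author, CreationDate, Subject, Title, Creator) is assumed rather than taken from this page, and neither function actually parses the PDF.

```r
# Hypothetical extraction functions for engine = "custom";
# the names my_info and my_text are illustrative, not part of tm.
my_info <- function(file, ...) {
    # Meta data derived from the file system only; a real engine would
    # read the PDF's document information instead.
    list(Author = NA_character_,
         CreationDate = as.POSIXlt(file.info(file)$mtime),
         Subject = NA_character_,
         Title = basename(file),
         Creator = NA_character_)
}

my_text <- function(file, ...) {
    # Placeholder: returns an empty character vector, no PDF parsing.
    character(0)
}

# Assumes an installed tm whose readPDF() has an 'engine' argument.
if (requireNamespace("tm", quietly = TRUE) &&
    "engine" %in% names(formals(tm::readPDF))) {
    reader <- tm::readPDF(engine = "custom",
                          control = list(info = my_info, text = my_text))
}
```

Both functions take the file path as their first argument, so they can be exercised on their own before being handed to readPDF.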

See Also

getReaders to list available reader functions.

Examples

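## Locate the PDF vignette shipped with tm
uri <- system.file(file.path("doc", "tm.pdf"), package = "tm")
## Use the default "xpdf" engine only if both executables are on the search path
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
    ## Generate a reader (passing -layout to pdftotext) and apply it directly
    pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                     language = "en",
                                                     id = "id1")
    pdf[1:13]
}
## Build a corpus from the same file using the Ghostscript engine
Corpus(URISource(uri),
       readerControl = list(reader = readPDF(engine = "ghostscript")))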
