getPDF: Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF extracts the text of PDF files and returns word-occurrence data.frames.
It requires XPDF to be installed (http://www.foolabs.com/xpdf/download.html),
and uses the parallel package to perform the computation in parallel.
Usage
getPDF(
myPDFs,
minword = 1,
maxword = 20,
minFreqWord = 1,
pathToPdftotext = ""
)
Value
A list of lists, each containing a word-occurrence data.frame and the corresponding file name.
Arguments
- myPDFs
A character vector containing PDF file names.
- minword
An integer specifying the minimum number of letters a word must have
to be included in the returned data.frame.
- maxword
An integer specifying the maximum number of letters a word may have
to be included in the returned data.frame.
- minFreqWord
An integer specifying the minimum frequency a word must have to be
included in the returned data.frame.
- pathToPdftotext
A character string giving an alternative path to the XPDF
pdftotext program; see the Details section.
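The effect of minword, maxword and minFreqWord can be sketched in base R as follows. This is only an illustration: filter_words is a hypothetical helper, not a function of the package, and it assumes that all three bounds are inclusive and that the result has columns named word and freq, neither of which is stated on this page.

```r
# Hypothetical helper (not part of the package): shows how the three
# filtering arguments might act on a vector of extracted words.
# Assumption: all bounds are inclusive.
filter_words <- function(words, minword = 1, maxword = 20, minFreqWord = 1) {
  # Keep words whose number of letters lies within [minword, maxword].
  len <- nchar(words)
  kept <- words[len >= minword & len <= maxword]
  if (length(kept) == 0) {
    return(data.frame(word = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Count occurrences of each remaining word.
  counts <- table(kept)
  d <- data.frame(word = names(counts), freq = as.integer(counts),
                  stringsAsFactors = FALSE)
  # Drop words that occur fewer than minFreqWord times.
  d <- d[d$freq >= minFreqWord, , drop = FALSE]
  # Most frequent words first, ties broken alphabetically.
  d <- d[order(-d$freq, d$word), , drop = FALSE]
  rownames(d) <- NULL
  d
}

words <- c("gene", "a", "gene", "protein", "of", "of", "of")
res <- filter_words(words, minword = 2, maxword = 20, minFreqWord = 2)
print(res)
```

With these inputs, "a" is removed by minword = 2 and "protein" by minFreqWord = 2, so only "of" and "gene" remain.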
Details
getPDF uses the XPDF pdftotext program to extract the
content of each PDF file into a TXT file. If pdftotext is not in the
PATH, provide the full path to the program in the pathToPdftotext
parameter instead.
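The lookup described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: find_pdftotext and pdf_to_txt are hypothetical helpers, and writing the TXT file next to the PDF is an assumed convention. The only external behaviour relied on is the pdftotext command-line form `pdftotext PDF-file text-file`.

```r
# Hypothetical helper (not part of the package): resolve the pdftotext
# executable, preferring an explicit path over the PATH lookup.
find_pdftotext <- function(pathToPdftotext = "") {
  if (nzchar(pathToPdftotext)) {
    if (!file.exists(pathToPdftotext)) {
      stop("pdftotext not found at: ", pathToPdftotext)
    }
    return(pathToPdftotext)
  }
  # Sys.which() returns "" when the program is not on the PATH.
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) {
    stop("pdftotext is not in the PATH; supply pathToPdftotext")
  }
  exe
}

# Hypothetical helper: convert one PDF to a TXT file and return its path.
# Assumption: the TXT file is written next to the PDF.
pdf_to_txt <- function(pdf, pathToPdftotext = "") {
  exe <- find_pdftotext(pathToPdftotext)
  txt <- paste0(tools::file_path_sans_ext(pdf), ".txt")
  # system2() returns the exit status of the command.
  status <- system2(exe, args = c(shQuote(pdf), shQuote(txt)))
  if (status != 0) {
    stop("pdftotext failed for: ", pdf)
  }
  txt
}
```

Checking the explicit path first means a user-supplied location always takes precedence over whatever copy of pdftotext happens to be on the PATH.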
Examples
if (FALSE) {
getPDF(myPDFs = "mypdf.pdf")
}