getPDF: Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF extracts the text of PDF files and returns word-occurrence data.frames.
It requires XPDF (http://www.foolabs.com/xpdf/download.html)
and uses the parallel package for parallel computation.
Usage
getPDF(myPDFs, minword = 1, maxword = 20, minFreqWord = 1,
pathToPdftotext = "")
Arguments
myPDFs
A character vector containing PDF file names.
minword
An integer specifying the minimum number of letters a word must
have to be kept in the returned data.frame.
maxword
An integer specifying the maximum number of letters a word may
have to be kept in the returned data.frame.
minFreqWord
An integer specifying the minimum frequency a word must reach to
be kept in the returned data.frame.
pathToPdftotext
A character string giving an alternative path to the XPDF
pdftotext program; see the Details section.
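The effect of the three filtering arguments can be illustrated with a minimal base R sketch. It is not the package's implementation: wordOccurrenceSketch is a hypothetical helper, and the tokenisation rule (lower-casing, splitting on non-letters) is an assumption made only for this illustration.

```r
# Hypothetical helper, NOT part of the package: shows how minword, maxword,
# and minFreqWord could act on text that has already been extracted.
wordOccurrenceSketch <- function(txt, minword = 1, maxword = 20,
                                 minFreqWord = 1) {
  # Assumed tokenisation: lower-case, then split on anything but a-z
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  words <- words[nzchar(words)]

  # Keep words whose number of letters lies within [minword, maxword]
  len <- nchar(words)
  words <- words[len >= minword & len <= maxword]

  # Count occurrences and drop words below the frequency threshold
  counts <- table(words)
  keep <- as.vector(counts >= minFreqWord)
  res <- data.frame(word = as.character(names(counts))[keep],
                    freq = as.integer(counts)[keep],
                    stringsAsFactors = FALSE)

  # Most frequent words first, ties broken alphabetically
  res[order(-res$freq, res$word), , drop = FALSE]
}

txt <- "The cat and the dog saw the other cat"
res <- wordOccurrenceSketch(txt, minword = 3, minFreqWord = 2)
print(res)
```

Raising minword or minFreqWord shrinks the table, which may help when the result is used to draw a word cloud.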
Value
A list of lists, each containing a word-occurrence data.frame and the corresponding file name.
Details
getPDF uses the XPDF pdftotext program to extract the
content of PDF files into a TXT file. If pdftotext is not in the
PATH, provide the full path to the program through
the pathToPdftotext parameter.
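One way to check whether pdftotext is reachable before calling getPDF is sketched below, using only base R. The install location stored in myPdftotext is a hypothetical example, not a path taken from this documentation, and the getPDF calls are left commented out so that the snippet runs without the package.

```r
# Sys.which() returns "" when the program is not found in the PATH
pdftotextOnPath <- unname(Sys.which("pdftotext"))

# Hypothetical install location, for illustration only; adjust to your system
myPdftotext <- "C:/Program Files/xpdf/bin64/pdftotext.exe"

if (nzchar(pdftotextOnPath)) {
  message("pdftotext found in the PATH: ", pdftotextOnPath)
  # getPDF(myPDFs = "mypdf.pdf")
} else if (file.exists(myPdftotext)) {
  message("Using pdftotext at: ", myPdftotext)
  # getPDF(myPDFs = "mypdf.pdf", pathToPdftotext = myPdftotext)
} else {
  message("pdftotext not found; install XPDF or set pathToPdftotext")
}
```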
Examples
# NOT RUN {
getPDF(myPDFs = "mypdf.pdf")
# }