getPDF: Extract text from PDF files and return a word-occurrence data.frame.
Description
getPDF returns a word-occurrence data.frame from PDF files. It needs XPDF (http://www.foolabs.com/xpdf/download.html) in order to run, and uses the parallel package to perform parallel computation.
Usage
getPDF(myPDFs, minword = 1, maxword = 20, minFreqWord = 1,
pathToPdftotext = "")
Arguments
myPDFs
A character vector containing PDF file names.
minword
An integer specifying the minimum number of letters per word in the returned data.frame.
maxword
An integer specifying the maximum number of letters per word in the returned data.frame.
minFreqWord
An integer specifying the minimum word frequency in the returned data.frame.
pathToPdftotext
A character string containing an alternative path to the XPDF pdftotext program; see the Details section.
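The effect of the three filtering arguments can be pictured with a small base-R sketch. This is not the code used by getPDF: filterWords is a hypothetical helper, and the tokenisation (lower-casing, splitting on anything that is not a letter) and the column names word and freq are assumptions made only for illustration.

```r
# Hypothetical helper illustrating minword, maxword and minFreqWord.
# Not the package's implementation; tokenisation rules are an assumption.
filterWords <- function(txt, minword = 1, maxword = 20, minFreqWord = 1) {
  # lower-case the text, then split on anything that is not a letter
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  words <- words[nchar(words) > 0]
  # keep words whose length lies between minword and maxword (inclusive)
  words <- words[nchar(words) >= minword & nchar(words) <= maxword]
  # count occurrences and keep words seen at least minFreqWord times
  freq <- table(words)
  freq <- freq[freq >= minFreqWord]
  out <- data.frame(word = as.character(names(freq)),
                    freq = as.integer(freq),
                    stringsAsFactors = FALSE)
  # most frequent words first, ties broken alphabetically
  out[order(-out$freq, out$word), , drop = FALSE]
}

res <- filterWords("The cat and the dog and the bird",
                   minword = 3, minFreqWord = 2)
print(res)
```

With these settings only "the" (3 occurrences) and "and" (2 occurrences) remain, because every other word appears once.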
Value
A list of lists, each containing a word-occurrence data.frame and the corresponding file name.
Details
getPDF uses the XPDF pdftotext program to extract the content of PDF files into a TXT file. If pdftotext is not in the PATH, an alternative is to provide the full path of the program in the pathToPdftotext parameter.
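Before calling getPDF, it can be useful to check whether pdftotext is reachable. The following is a minimal sketch using base R only; findPdftotext is a hypothetical helper, not part of the package, and it assumes the program is named pdftotext on the PATH.

```r
# Hypothetical helper: returns the location of pdftotext, or "" if not found.
findPdftotext <- function(pathToPdftotext = "") {
  if (nzchar(pathToPdftotext)) {
    # an explicit path was supplied: accept it only if the file exists
    if (file.exists(pathToPdftotext)) return(pathToPdftotext)
    return("")
  }
  # otherwise look the program up on the PATH
  found <- unname(Sys.which("pdftotext"))
  if (is.na(found)) "" else found
}

exe <- findPdftotext()
if (nzchar(exe)) {
  message("pdftotext found at: ", exe)
} else {
  message("pdftotext not found; supply its full path via pathToPdftotext")
}
```

If the lookup fails, the full path of the program can be supplied through pathToPdftotext as described above.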
Examples
# NOT RUN {
getPDF(myPDFs = "mypdf.pdf")
# }