textreadr (version 0.9.0)

read_pdf: Read a Portable Document Format (PDF) file into R

Description

A wrapper for pdf_text to read PDFs into R.

Usage

read_pdf(file, skip = 0, remove.empty = TRUE, trim = TRUE,
  ocr = TRUE, ...)

Arguments

file

A path to a PDF file.

skip

Integer; the number of lines of the data file to skip before beginning to read data.

remove.empty

logical. If TRUE, empty elements in the vector are removed.

trim

logical. If TRUE, the leading/trailing white space is removed.

ocr

logical. If TRUE, documents for which pdf_text returns no text are re-read using OCR via the ocr function. This creates temporary .png files and takes considerably longer to run.

...

Other arguments passed to pdf_text.
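
The effect of skip, trim, and remove.empty can be pictured with plain base R on a character vector of lines. This is only a sketch: the sample lines are invented, and the order in which the steps are applied is an assumption, so the package's internals may differ.

```r
# Invented sample lines standing in for text pulled from a PDF.
lines <- c("Header line", "  first  ", "", "second")

# skip = 1: drop the first line (assumed to happen before the other steps).
skip <- 1
kept <- if (skip > 0) lines[-seq_len(skip)] else lines

# trim = TRUE: strip leading/trailing white space.
kept <- trimws(kept)

# remove.empty = TRUE: drop elements that are empty strings.
kept <- kept[kept != ""]

kept
```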

Value

Returns a data.frame with the page number (page_id), line number (element_id), and the text.
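
As an illustration of working with this structure, the sketch below builds a mock data.frame with the three columns named above and collapses the lines of each page into one string. The rows are invented, including the assumption that element_id restarts on each page; only the column names come from this page.

```r
# Mock of the returned structure; values are made up for illustration.
pdf_dat <- data.frame(
    page_id    = c(1L, 1L, 2L),
    element_id = c(1L, 2L, 1L),   # assumed to restart per page
    text       = c("First line", "Second line", "Line on page two"),
    stringsAsFactors = FALSE
)

# Collapse the lines of each page into a single string, one element per page.
page_text <- vapply(
    split(pdf_dat$text, pdf_dat$page_id),
    paste, character(1), collapse = " "
)

page_text[["1"]]
```

The result is a named character vector keyed by page_id, which can be convenient when downstream tools expect one document per page.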

See Also

readPDF

Examples

# NOT RUN {
pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)

pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)

# }
# NOT RUN {
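library(textshape)
system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
    read_pdf(skip = 1) %>%                          # skip the first line
    `[[`('text') %>%                                # keep only the text column
    head(-1) %>%                                    # drop the last element
    textshape::combine() %>%                        # collapse into one string
    gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%     # join capitals split by a space with "_"
    strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%    # split before each "NAME:" tag
    `[[`(1) %>%
    textshape::split_transcript()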
# }
# NOT RUN {
## An image-based .pdf file returns nothing.  Using the tesseract package as
## a backend for OCR overcomes this problem.

## Non-ocr
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = FALSE
)

## OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = TRUE
)
# }
