textreadr (version 0.5.1)

read_pdf: Read a Portable Document Format into R

Description

A wrapper for pdftools::pdf_text that reads PDFs into R.

Usage

read_pdf(file, skip = 0, remove.empty = TRUE, trim = TRUE, ...)

Arguments

file
A path to a PDF file.
skip
Integer; the number of lines of the data file to skip before beginning to read data.
remove.empty
Logical. If TRUE, empty elements in the vector are removed.
trim
Logical. If TRUE, leading/trailing white space is removed.
...
Other arguments passed to pdf_text.
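
The effect of trim and remove.empty can be illustrated in base R. This is only a sketch of the behavior described above, not the package's implementation: the input vector is made up, and the order in which read_pdf applies the two steps is an assumption.

```r
# Hypothetical lines, standing in for text extracted from one PDF page
lines <- c("  Title  ", "", "Body text ", "   ")

# trim = TRUE: strip leading/trailing white space from each element
trimmed <- trimws(lines)

# remove.empty = TRUE: drop elements that are empty after trimming
kept <- trimmed[trimmed != ""]

print(kept)
```

With both options set to FALSE, the elements would be kept as extracted, including the blank ones.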

Value

Returns a data.frame with columns for the page number (page_id), the line number (element_id), and the text (text).
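
A result of this shape can be handled with ordinary data.frame tools. The sketch below builds a mock data.frame with the documented column names instead of reading a real PDF; the values, and the assumption that element_id restarts on each page, are illustrative only.

```r
# Mock result with the documented columns; all values are made up
pdf_dat <- data.frame(
    page_id = c(1L, 1L, 2L),
    element_id = c(1L, 2L, 1L),  # assumed to restart on each page
    text = c("First line", "Second line", "Page two line"),
    stringsAsFactors = FALSE
)

# Keep only the lines from page 1
page_one <- pdf_dat[pdf_dat$page_id == 1, "text"]

# Collapse each page's lines into one string per page
page_text <- tapply(pdf_dat$text, pdf_dat$page_id, paste, collapse = " ")

print(page_one)
print(page_text)
```

Collapsing by page_id is convenient when later steps work on whole pages instead of individual lines.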

See Also

readPDF

Examples

pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)

pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)

## Not run: ------------------------------------
# library(textshape)
# system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
#     read_pdf(1) %>%
#     `[[`('text') %>%
#     head(-1) %>%
#     textshape::combine() %>%
#     gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%
#     strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%
#     `[[`(1) %>%
#     textshape::split_transcript()
## ---------------------------------------------