textreadr (version 0.3.0)

read_pdf: Read a Portable Document Format (PDF) File into R

Description

A wrapper for pdftools::pdf_text that reads the text of a PDF file into R.

Usage

read_pdf(file, skip = 0)

Arguments

file
A path to a PDF file.
skip
Integer; the number of lines of the data file to skip before beginning to read data.

Value

Returns a data.frame with three columns: the page number (page_id), the line number (element_id), and the text of that line (text).
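
As a minimal sketch of working with a result of this shape, the base R snippet below builds a small stand-in data.frame and then groups and filters its lines by page. The rows are invented for illustration, and only the three column names come from the description above. It does not call read_pdf itself.

```r
# Hypothetical stand-in for the data.frame that read_pdf() returns.
# Column names follow the Value section; the rows are invented.
pdf_dat <- data.frame(
    page_id    = c(1L, 1L, 2L),
    element_id = c(1L, 2L, 3L),
    text       = c("First line", "Second line", "Third line"),
    stringsAsFactors = FALSE
)

# Collapse the lines of each page into a single string per page.
page_text <- tapply(pdf_dat$text, pdf_dat$page_id, paste, collapse = " ")

# Keep only the rows that belong to page 1.
page_one <- pdf_dat[pdf_dat$page_id == 1L, ]
```

With real output from read_pdf, the same two idioms should apply unchanged, provided the columns are named as described.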

See Also

readPDF

Examples

pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)

pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)

## Not run: 
# library(magrittr)  # provides the %>% pipe used below
# library(textshape)
# system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
#     read_pdf(1) %>%
#     `[[`('text') %>%
#     head(-1) %>%
#     textshape::combine() %>%
#     gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%
#     strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%
#     `[[`(1) %>%
#     textshape::split_transcript()
## End(Not run)