textreadr (version 1.2.0)

read_document: Generic Function to Read in a Document

Description

Generic function to read in a .pdf, .txt, .html, .rtf, .docx, or .doc file.
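A generic reader of this kind typically chooses a format-specific reader from the file extension. The sketch below illustrates that idea in base R only. pick_reader() is a hypothetical helper, not part of textreadr, and the mapping from extension to reader name (including readLines for .txt) is an assumption made for illustration, not the package's actual dispatch code.

```r
# Hypothetical helper, NOT part of textreadr: choose a reader name
# from the file extension (case-insensitive).
pick_reader <- function(file) {
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    pdf  = "read_pdf",
    html = "read_html",
    docx = "read_docx",
    doc  = "read_doc",
    rtf  = "read_rtf",
    txt  = "readLines",   # assumed: plain text read line by line
    stop("Unsupported file extension: ", ext)
  )
}

# Illustrative file names only; no files are opened.
readers <- vapply(
  c("report.PDF", "notes.txt", "interview.docx"),
  pick_reader,
  character(1),
  USE.NAMES = FALSE
)
print(readers)
```

Returning the reader's name rather than calling it keeps the sketch runnable without any of the format-specific dependencies installed.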

Usage

read_document(
  file,
  skip = 0,
  remove.empty = TRUE,
  trim = TRUE,
  combine = FALSE,
  format = FALSE,
  ocr = TRUE,
  ...
)

Arguments

file

The path to a .pdf, .txt, .html, .rtf, .docx, or .doc file.

skip

The number of lines to skip.

remove.empty

logical. If TRUE, empty elements in the vector are removed.

trim

logical. If TRUE, leading and trailing white space is removed.

combine

logical. If TRUE, the vector is concatenated into a single string via textshape::combine().

format

For .doc files only. Logical. If TRUE, the output keeps .doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword.

ocr

logical. If TRUE, .pdf documents from which pdftools::pdf_text() extracts no text are re-run using OCR via the tesseract::ocr() function. This creates temporary .png files and requires much more compute time.

...

Value

Returns a base::list() of string base::vector()s.
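The effect of skip, remove.empty, trim, and combine can be approximated on a plain character vector. The sketch below uses base R only. clean_lines() is a hypothetical helper, not part of textreadr. The order of the steps, and joining with a single space in place of textshape::combine(), are assumptions made for illustration.

```r
# Hypothetical helper, NOT part of textreadr: mimics the documented
# post-processing arguments on a character vector of lines.
clean_lines <- function(x, skip = 0, remove.empty = TRUE, trim = TRUE,
                        combine = FALSE) {
  # Drop the first `skip` elements (assumed to happen first).
  if (skip > 0) x <- utils::tail(x, -skip)
  # Strip leading and trailing white space.
  if (trim) x <- trimws(x)
  # Remove zero-length strings.
  if (remove.empty) x <- x[nzchar(x)]
  # Collapse to one string (assumed separator: a single space).
  if (combine) x <- paste(x, collapse = " ")
  x
}

raw <- c("Header line", "  First sentence.  ", "", "Second sentence.")

cleaned  <- clean_lines(raw, skip = 1)
combined <- clean_lines(raw, skip = 1, combine = TRUE)

print(cleaned)
print(combined)
```

In this sketch, elements that contain only white space are dropped only when trim is also TRUE, because trimming runs before the empty-element check. Whether read_document() behaves the same way is not stated in this page.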

Examples

# NOT RUN {
## .pdf
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf",
    package = "textreadr")
read_document(pdf_doc)

## .html
html_doc <- system.file("docs/textreadr_creed.html", package = "textreadr")
read_document(html_doc)

## .docx
docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx",
    package = "textreadr")
read_document(docx_doc)

## .doc
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc",
    package = "textreadr")
read_document(doc_doc)

## .txt
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
read_document(txt_doc)

## .pptx 
pptx_doc <- system.file('docs/Hello_World.pptx', package = "textreadr")
read_document(pptx_doc)

# }
# NOT RUN {
## .rtf
rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)
read_document(rtf_doc)
# }
# NOT RUN {
## URLs
read_document('http://www.talkstats.com/index.php')
# }