textreadr (version 0.9.0)

read_document: Generic Function to Read in a Document

Description

Generic function to read in a .pdf, .txt, .html, .rtf, .docx, or .doc file.

Usage

read_document(file, skip = 0, remove.empty = TRUE, trim = TRUE,
  combine = FALSE, format = FALSE, ocr = TRUE, ...)

Arguments

file

The path to a .pdf, .txt, .html, .rtf, .docx, or .doc file.

skip

The number of lines to skip.

remove.empty

logical. If TRUE, empty elements in the vector are removed.

trim

logical. If TRUE, leading/trailing white space is removed.

combine

logical. If TRUE, the vector is concatenated into a single string via the combine function.

format

For .doc files only. Logical. If TRUE the output will keep doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword.

ocr

logical. If TRUE, .pdf documents from which pdf_text extracts no text are re-run using OCR via the ocr function. This creates temporary .png files and requires much more compute time.

...

Other arguments passed to read_pdf, read_html, read_docx, read_doc, or readLines.
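
The effect of skip, remove.empty, trim, and combine can be sketched on an in-memory character vector using base R only. This is an illustration of the documented behavior, not the package's implementation: post_process is a hypothetical helper, and the order of the steps and the single-space separator used for combine are assumptions.

```r
# Hypothetical helper (not part of textreadr): mimics the documented
# post-processing arguments on a character vector of lines.
post_process <- function(lines, skip = 0, remove.empty = TRUE, trim = TRUE,
                         combine = FALSE) {
  # Drop the first `skip` lines
  if (skip > 0) lines <- lines[-seq_len(skip)]
  # Strip leading/trailing white space from each element
  if (trim) lines <- trimws(lines)
  # Remove elements that are empty strings
  if (remove.empty) lines <- lines[nzchar(lines)]
  # Collapse to one string (assumed separator: a single space)
  if (combine) lines <- paste(lines, collapse = " ")
  lines
}

raw <- c("Title line", "  first paragraph  ", "", "second paragraph")
kept <- post_process(raw, skip = 1)
joined <- post_process(raw, skip = 1, combine = TRUE)
print(kept)
print(joined)
```

With the defaults, the result stays a vector with one element per retained line; setting combine = TRUE yields a single string.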

Value

Returns a list of string vectors.

Examples

library(textreadr)

## .pdf
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf",
    package = "textreadr")
read_document(pdf_doc)

## .html
html_doc <- system.file("docs/textreadr_creed.html", package = "textreadr")
read_document(html_doc)

## .docx
docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx",
    package = "textreadr")
read_document(docx_doc)

## .doc
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc",
    package = "textreadr")
read_document(doc_doc)

## .txt
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
read_document(txt_doc)

## .rtf (downloaded from GitHub; requires internet access)
rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)
read_document(rtf_doc)

## URLs
read_document('http://www.talkstats.com/index.php')
