textreadr (version 0.5.1)

read_document: Generic Function to Read in a Document

Description

Generic function to read in a .pdf, .txt, .html, .docx, or .doc file.
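
As a rough illustration of the idea, a wrapper of this kind can choose a reader from the file extension. The sketch below is not the package's implementation: pick_reader is a hypothetical helper, and the mapping of .txt to readLines is assumed from the readers named under Arguments.

```r
# Hypothetical helper (not part of textreadr): return the name of the
# reader that would handle a file, based only on its extension.
pick_reader <- function(file) {
  ext <- tolower(tools::file_ext(file))
  switch(ext,
    pdf  = "read_pdf",
    html = "read_html",
    docx = "read_docx",
    doc  = "read_doc",
    txt  = "readLines",
    stop("Unsupported file type: ", ext)  # fallback for any other extension
  )
}

pick_reader("Yasmine_Interview_Transcript.docx")
```

Matching on the lower-cased extension keeps the lookup case-insensitive, so "report.PDF" and "report.pdf" are treated alike.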

Usage

read_document(file, combine = FALSE, format = FALSE, ...)

Arguments

file
The path to a .pdf, .txt, .html, .docx, or .doc file.
combine
Logical. If TRUE, the vector is concatenated into a single string via combine.
format
For .doc files only. Logical. If TRUE the output will keep doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword.
...
Other arguments passed to read_pdf, read_html, read_docx, read_doc, or readLines.

Value

Returns a list of string vectors.
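
To see the effect of combine on the result, the following minimal sketch reads a small temporary .txt file both ways. The file contents and variable names are illustrative only, and the code assumes textreadr is installed. Based on the argument description, combine = TRUE is expected to give a single string.

```r
library(textreadr)

# Write a small two-line text file to a temporary location (illustrative data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line.", "Second line."), tmp)

pieces <- read_document(tmp)                  # default: text kept as separate pieces
single <- read_document(tmp, combine = TRUE)  # concatenated into one string

length(single)

unlink(tmp)  # remove the temporary file
```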

Examples

## .pdf
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf",
    package = "textreadr")
read_document(pdf_doc)

## .html
html_doc <- system.file("docs/textreadr_creed.html", package = "textreadr")
read_document(html_doc)

## .docx
docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx",
    package = "textreadr")
read_document(docx_doc)

## .doc
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc",
    package = "textreadr")
read_document(doc_doc)

## .txt
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
read_document(txt_doc)

## Not run: ------------------------------------
# ## URLs
# read_document('http://www.talkstats.com/index.php')
## ---------------------------------------------
