rock (version 0.5.4)

doc_to_txt: Convert a document (.docx, .pdf, etc) to a plain text file

Description

This is a thin wrapper around textreadr::read_document() that also writes the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post).
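To illustrate what writing UTF-8 correctly can involve, here is a minimal base-R sketch. It is not taken from the rock source and may differ from what doc_to_txt() does internally; the helper name write_utf8 is hypothetical. It converts the text to UTF-8 and writes the bytes unchanged, so R does not re-encode them to the native encoding on the way out.

```r
# Hypothetical helper (not part of rock): write a character vector to a
# file as UTF-8, regardless of the platform's native encoding.
write_utf8 <- function(x, path) {
  # Make sure every element is encoded (and marked) as UTF-8
  x <- enc2utf8(x)
  # Open the connection without any re-encoding ...
  con <- file(path, open = "w", encoding = "native.enc")
  on.exit(close(con), add = TRUE)
  # ... and write the UTF-8 bytes as they are
  writeLines(x, con = con, useBytes = TRUE)
  invisible(path)
}

# Round trip: write two lines with non-ASCII characters, then read them back
lines_in <- c("na\u00efve caf\u00e9", "\u65e5\u672c\u8a9e")
tmp <- tempfile(fileext = ".txt")
write_utf8(lines_in, tmp)
lines_out <- readLines(tmp, encoding = "UTF-8")
```

Reading the file back with encoding = "UTF-8" marks the strings as UTF-8, so lines_out can be compared directly with lines_in.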

Usage

doc_to_txt(
  input,
  output = NULL,
  encoding = rock::opts$get("encoding"),
  newExt = NULL,
  preventOverwriting = rock::opts$get("preventOverwriting"),
  silent = rock::opts$get("silent"),
  skip = 0,
  remove.empty = TRUE,
  trim = TRUE,
  combine = FALSE,
  format = FALSE,
  ocr = TRUE,
  ...
)

Arguments

input

The path to the input file.

output

The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and the extension will be replaced with extension.

encoding

The encoding to use when writing the text file.

newExt

The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.

preventOverwriting

Whether to prevent overwriting existing files.

silent

Whether to be silent or chatty.

skip

The number of lines to skip (see textreadr::read_document()).

remove.empty

If TRUE, empty elements in the vector are removed (see textreadr::read_document()).

trim

If TRUE, the leading/trailing white space is removed (see textreadr::read_document()).

combine

If TRUE, the vector is concatenated into a single string using textshape::combine() (see textreadr::read_document()).

format

For .doc files only. Logical. If TRUE the output will keep doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword (see textreadr::read_document()).

ocr

If TRUE, .pdf documents from which pdftools::pdf_text() extracts no text will be re-run using OCR via the tesseract::ocr() function. This creates temporary .png files and takes considerably longer to compute (see textreadr::read_document()).

Value

The converted source, as a character vector.

Examples

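# NOT RUN {
# Convert the example .docx file shipped with the rock package
# and print the resulting character vector
print(
  rock::doc_to_txt(
    input = system.file(
      "extdata/doc-to-test.docx", package = "rock"
    )
  )
)
# }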
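
The following sketch shows how the output argument might be used to write the converted text to a file as well as return it. It assumes that rock and textreadr are installed; the output location (a file in tempdir()) and the choice of preventOverwriting = FALSE are illustrative, not taken from the package documentation. The call is skipped when the example file cannot be found.

```r
# Locate the example document shipped with rock ("" if rock is not installed)
input <- system.file("extdata/doc-to-test.docx", package = "rock")

# Illustrative output location: a plain text file in the session's temp dir
output <- file.path(tempdir(), "doc-to-test.txt")

converted <- NULL
if (nzchar(input) && requireNamespace("textreadr", quietly = TRUE)) {
  converted <- rock::doc_to_txt(
    input = input,
    output = output,
    preventOverwriting = FALSE,  # allow re-running the example
    silent = TRUE
  )
  # The converted source is returned as a character vector
  print(utils::head(converted))
}
```

Alternatively, leaving output = NULL and passing newExt writes the text to a file with the same name as input but with newExt as its extension, as described under Arguments.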
