This is a thin wrapper around textreadr::read_document() that also writes the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post).
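As a rough illustration of what writing UTF-8 correctly can involve in R, the sketch below uses only base R. It is not taken from the rock source and is not necessarily the approach from the blog post mentioned above; write_utf8() is a hypothetical helper introduced here for illustration.

```r
# Hypothetical helper, not part of rock: write a character vector as UTF-8.
write_utf8 <- function(text, path) {
  # Binary mode avoids platform-specific line-ending translation.
  con <- file(path, open = "wb")
  on.exit(close(con), add = TRUE)
  # enc2utf8() converts the strings to UTF-8; useBytes = TRUE stops
  # writeLines() from re-encoding them into the native encoding.
  writeLines(enc2utf8(text), con = con, sep = "\n", useBytes = TRUE)
  invisible(path)
}

tmp <- tempfile(fileext = ".txt")
write_utf8(c("na\u00efve caf\u00e9", "second line"), tmp)

# Read the file back, declaring the strings as UTF-8.
read_back <- readLines(tmp, encoding = "UTF-8")
```

Converting explicitly and writing bytes keeps the result independent of the session's locale, which is where most encoding problems with text files in R tend to come from.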
doc_to_txt(
  input,
  output = NULL,
  encoding = rock::opts$get("encoding"),
  newExt = NULL,
  preventOverwriting = rock::opts$get("preventOverwriting"),
  silent = rock::opts$get("silent"),
  skip = 0,
  remove.empty = TRUE,
  trim = TRUE,
  combine = FALSE,
  format = FALSE,
  ocr = TRUE,
  ...
)
input: The path to the input file.
output: The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and the extension will be replaced with extension.
encoding: The encoding to use when writing the text file.
newExt: The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.
preventOverwriting: Whether to prevent overwriting existing files.
silent: Whether to be silent or chatty.
skip: The number of lines to skip (see textreadr::read_document()).
remove.empty: If TRUE, empty elements in the vector are removed (see textreadr::read_document()).
trim: If TRUE, the leading/trailing white space is removed (see textreadr::read_document()).
combine: If TRUE, the vector is concatenated into a single string using textshape::combine() (see textreadr::read_document()).
format: For .doc files only. Logical. If TRUE, the output will keep doc formatting (e.g., bold, italics, underlined). This corresponds to the -f flag in antiword (see textreadr::read_document()).
ocr: If TRUE, .pdf documents from which pdftools::pdf_text() extracts no text will be re-run using OCR via the tesseract::ocr() function. This will create temporary .png files and will require a much longer compute time (see textreadr::read_document()).
...: Other arguments passed to textreadr::read_pdf(), textreadr::read_html(), textreadr::read_docx(), textreadr::read_doc(), or base::readLines() (by textreadr::read_document()).
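The interplay between input, output and newExt described above can be sketched in base R. This is an illustration of the documented rules only, not the rock implementation: resolve_output() is a hypothetical helper, and it assumes that newExt is given without a leading dot, that nothing is written when both output and newExt are NULL, and that "txt" is used when output is a directory and no extension is given.

```r
# Hypothetical helper (not part of rock): derive the file to write to
# from the rules described for `output` and `newExt`.
resolve_output <- function(input, output = NULL, newExt = NULL) {
  base <- tools::file_path_sans_ext(basename(input))
  if (is.null(output)) {
    # Assumption: with neither output nor newExt, no file is written.
    if (is.null(newExt)) return(NULL)
    # Same location and name as the input, but with the new extension.
    return(file.path(dirname(input), paste0(base, ".", newExt)))
  }
  if (dir.exists(output)) {
    # Existing directory: reuse the input's file name.
    # Assumption: fall back to "txt" when no extension is given.
    ext <- if (is.null(newExt)) "txt" else newExt
    return(file.path(output, paste0(base, ".", ext)))
  }
  # Otherwise `output` is taken to be a full path and file name.
  output
}

resolve_output("interviews/case1.docx", newExt = "txt")
# "interviews/case1.txt"
```

A check such as file.exists() on the resolved path is the natural place for a preventOverwriting-style guard, although how rock itself implements that guard is not shown here.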
The converted source, as a character vector.
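To make the effect of trim, remove.empty and combine on such a character vector concrete, here is a base-R approximation. It only mimics the documented behaviour; the order of the steps and the exact result of textshape::combine() inside textreadr may differ, and the sample lines are made up for illustration.

```r
# Made-up sample of what reading a document might yield, one element per line.
lines <- c("  First line  ", "", "Second line", "   ")

# trim = TRUE: strip leading and trailing white space.
trimmed <- trimws(lines)

# remove.empty = TRUE: drop elements that are empty strings.
non_empty <- trimmed[nzchar(trimmed)]

# combine = TRUE: collapse into a single string
# (an approximation of textshape::combine()).
combined <- paste(non_empty, collapse = " ")

print(non_empty)
print(combined)
```

With the defaults (trim = TRUE, remove.empty = TRUE, combine = FALSE), the result corresponds to non_empty here: one cleaned element per non-empty line.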
# NOT RUN {
print(
  rock::doc_to_txt(
    input = system.file(
      "extdata/doc-to-test.docx", package = "rock"
    )
  )
);
# }