If an input file did not exist, could not be downloaded, was a directory, or
Tika could not process it, the result will be as.character(NA) for that file.
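
A minimal sketch of checking for failed files. It uses a hand-made character
vector in place of real output, so the file names and text below are
hypothetical; in real use, the result vector would come from the function call
itself.

```r
# Hypothetical stand-in for the returned character vector:
# one element per input file, NA where the file could not be processed.
input  <- c("report.pdf", "missing.docx", "notes.txt")
result <- c("Quarterly report text", NA_character_, "Some notes")

# Identify the inputs that failed and keep only the successful results.
failed    <- input[is.na(result)]
succeeded <- result[!is.na(result)]

print(failed)
```

Because the result has one element per input, indexing the input vector with
is.na(result) recovers exactly which files need attention.
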
By default, output = "text", which produces plain text with no metadata. Some
formatting is preserved in this case using tabs, newlines and spaces.
Setting output to either "xml" or the shortcut "x" will produce a strict form
of HTML known as XHTML, with metadata in the head node and formatted text in
the body.
Content retains more formatting with "xml". For example, a Word or Excel table
will become an HTML table, with table data as text in td elements. The "html"
option and its shortcut "h" seem to produce the same result as "xml".
Parse XHTML output with xml2::read_html.
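
A rough sketch of pulling metadata and table cells out of XHTML with xml2. The
XHTML string is hand-written to resemble the structure described above (the
metadata in the head, a table in the body); it is an assumption for
illustration, and real Tika output will contain different and more numerous
fields.

```r
library(xml2)

# Hypothetical XHTML, not actual Tika output.
xhtml <- '<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta name="Content-Type" content="application/pdf"/><title>Example</title></head>
<body><table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table></body>
</html>'

# read_html() accepts a literal string as well as a path or URL.
doc <- read_html(xhtml)

# Metadata: name/content pairs from the meta elements in the head.
meta_nodes <- xml_find_all(doc, "//head/meta")
metadata <- setNames(xml_attr(meta_nodes, "content"),
                     xml_attr(meta_nodes, "name"))

# Table data: the text of every td element.
cells <- xml_text(xml_find_all(doc, "//td"))

print(metadata)
print(cells)
```

Using the HTML parser means the XPath expressions do not need namespace
prefixes, which keeps queries such as "//td" short.
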
Setting output to "jsonRecursive" or its shortcut "J" produces a tree
structure in JSON. Metadata fields are at the top level. The XHTML or plain
text will be found in the X-TIKA:content field. By default the text is XHTML.
This can be changed to plain text like this: output=c("jsonRecursive","text")
or output=c("J","t"). This syntax is meant to mirror Tika's. Parse the JSON
with jsonlite::fromJSON.
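
A small sketch of reading the content field after parsing. The JSON string is
hand-written and hypothetical; it assumes the output is an array with one
object per document, and real output carries many more metadata fields.

```r
library(jsonlite)

# Hypothetical JSON resembling output with plain text content.
json <- '[{"Content-Type": "application/pdf",
           "X-TIKA:content": "Hello from the document body."}]'

parsed <- fromJSON(json)

# The field names contain characters that are not valid in plain R names,
# so index with [[ ]] and the quoted name.
content_type <- parsed[["Content-Type"]]
text <- parsed[["X-TIKA:content"]]

print(text)
```
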
If output_dir is specified, then the converted files will also be saved to
this directory. It's best to use an empty directory because Tika may overwrite
existing files. Tika seems to add an extra file extension to each file to
reduce the chance of overwriting, but an empty directory is still the safer
choice. The file locations within the output_dir maintain the same general
path structure as the input files. Downloaded files have a path similar to the
`tempdir()` that R uses. The original paths are now relative to output_dir.
File names are appended with .txt for the default plain text, but the
extension can be .json, .xml, or .html depending on the output setting. One
way to get a list of the processed files is to use list.files with
recursive=TRUE.
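
A minimal sketch of listing converted files. The directory layout and file
names here are hypothetical and are created by hand purely to stand in for an
output_dir that has already been written to.

```r
# Hypothetical stand-in for a populated output directory.
output_dir <- file.path(tempdir(), "tika-output-example")
dir.create(file.path(output_dir, "docs", "reports"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("converted text",
           file.path(output_dir, "docs", "reports", "q1.pdf.txt"))
writeLines("converted text",
           file.path(output_dir, "docs", "notes.docx.txt"))

# List every converted file, including those in nested directories.
# Paths are returned relative to output_dir.
processed <- list.files(output_dir, recursive = TRUE)
print(processed)
```

Passing full.names = TRUE as well returns paths that can be handed directly to
readLines or similar functions.
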
If output_dir is not specified, files are saved to a volatile temp directory
named by tempdir() and will be deleted when R shuts down. If this function
will be run on very large batches repeatedly, these temporary files can be
cleaned up every time by adding cleanup=TRUE.