fulltext (version 0.1.6)

pdfx: PDF-to-XML conversion of scientific articles using pdfx

Description

Uses a web service provided by Utopia at http://pdfx.cs.man.ac.uk/. Beware: this can be quite slow. pdfx posts the PDF from your machine to the web service, pdfx_html takes the output of pdfx and returns an HTML version of the extracted text, and pdfx_targz writes a tar.gz version of the extracted text. This will not work with PDFs that are scans of text or that consist mostly of images.
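Because each call uploads the file to a remote service and can be slow, it may be worth checking locally that a path really points to a PDF before calling pdfx. The following is a minimal sketch in base R. The helper is_pdf_file is hypothetical and not part of fulltext. It only tests for the "%PDF-" file signature, so it cannot tell a text PDF from a scanned or image-only one.

```r
# Hypothetical helper (not part of fulltext): returns TRUE when the file
# exists, is not a directory, and starts with the PDF signature "%PDF-".
is_pdf_file <- function(path) {
  if (!file.exists(path) || dir.exists(path)) return(FALSE)
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Demonstration with two throwaway files: one with a PDF-style header,
# one containing plain text only.
pdf_like <- tempfile(fileext = ".pdf")
writeLines("%PDF-1.4", pdf_like)
not_pdf <- tempfile(fileext = ".pdf")
writeLines("plain text", not_pdf)

is_pdf_file(pdf_like)  # header matches the PDF signature
is_pdf_file(not_pdf)   # header does not match
```

A check like this could be used to filter a vector of paths (for example with vapply) before passing them as the file argument.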

Usage

pdfx(file, what = "parsed", ...)
pdfx_html(input, ...)
pdfx_targz(input, write_path, ...)

Arguments

file
(character) Path to a file, or files, on your machine. Required.
what
(character) One of "parsed" or "text". Default: "parsed".
...
Further args passed to GET. These aren't named, so pass them directly, e.g. verbose() or timeout(3).
input
Output from the pdfx function.
write_path
Path to write the tarball to.
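When building a value for write_path with tempfile(), note that base R appends fileext verbatim, so the leading dot has to be supplied. The sketch below uses only base R; the "pdfx_" prefix is an arbitrary choice for illustration.

```r
# tempfile() pastes fileext onto the generated name as-is, so the
# extension must include its leading dot.
tarfile <- tempfile(pattern = "pdfx_", fileext = ".tar.gz")
no_dot  <- tempfile(pattern = "pdfx_", fileext = "tar.gz")

# Check whether each path ends in a proper ".tar.gz" extension.
grepl("\\.tar\\.gz$", tarfile)
grepl("\\.tar\\.gz$", no_dot)

# The parent directory of a tempfile() path is the session temp directory,
# which already exists, so the path is ready to be written to.
dir.exists(dirname(tarfile))
```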

Value

pdfx returns XML parsed to an xml_document, pdfx_html returns HTML, and pdfx_targz writes a tar.gz file to disk.

Examples

## Not run: 
# path <- system.file("examples", "example2.pdf", package = "fulltext")
# pdfx(file = path)
# 
# out <- pdfx(file = path)
# pdfx_html(out)
# 
# out <- pdfx(file = path)
# tarfile <- tempfile(fileext = ".tar.gz")
# pdfx_targz(input = out, write_path = tarfile)
# ## End(Not run)