fulltext (version 0.1.6)

ft_extract: Extract text from a single pdf document

Description

ft_extract attempts to make it easy to extract text from PDFs using a variety of extraction tools. Inputs can be either paths to PDF files or the output of ft_get (class ft_data).

Usage

ft_extract(x, which = "xpdf", ...)
## S3 method for class 'gs_char'
print(x, ...)
## S3 method for class 'xpdf_char'
print(x, ...)

Arguments

x
Path to a pdf file, or an object of class ft_data, the output from ft_get
which
One of gs or xpdf (default).
...
Further arguments passed on to the extraction tool, e.g. xpdf command-line flags (see Details and Examples).

Value

An object of class gs_char or xpdf_char, depending on the value of which.

Details

For xpdf, you can pass additional options via flags. See Examples. Right now, you can't pass options to Ghostscript if you're using the gs option.

xpdf installation: See http://www.foolabs.com/xpdf/download.html for instructions on how to download and install xpdf. For OSX, you can also get xpdf via homebrew.

ghostscript installation: See http://www.ghostscript.com/doc/9.16/Install.htm for instructions on how to download and install ghostscript.
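
Because both backends are external programs, it can help to confirm they are on the PATH before calling ft_extract. The following is a minimal sketch using only base R. The helper name check_extract_tools is hypothetical and not part of fulltext, and it assumes the executables are called pdftotext (xpdf) and gs (Ghostscript on Unix-like systems; Windows builds typically use a different name such as gswin64c).

```r
# Hypothetical helper (not part of fulltext): report which extraction
# backends appear to be installed by looking for their executables on PATH.
# Assumed executable names: "pdftotext" for xpdf, "gs" for Ghostscript.
check_extract_tools <- function() {
  exes <- c(xpdf = "pdftotext", gs = "gs")
  paths <- Sys.which(exes)          # "" for executables that are not found
  found <- unname(paths) != ""
  setNames(found, names(exes))      # label results by the `which` value
}

tools <- check_extract_tools()
print(tools)
```

A FALSE entry suggests that the corresponding value of which will probably fail until the tool is installed or added to the PATH.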

Examples

## Not run: 
# path <- system.file("examples", "example1.pdf", package = "fulltext")
# 
# (res_xpdf <- ft_extract(path)) # xpdf is the default
# (res_xpdf <- ft_extract(path, "xpdf"))
# (res_gs <- ft_extract(path, "gs"))
# 
# # pass on options to xpdf
# ## preserve layout from pdf
# ft_extract(path, "xpdf", "-layout")
# ## preserve table structure as much as possible
# ft_extract(path, "xpdf", "-table")
# ## last page to convert is page 2
# ft_extract(path, "xpdf", "-l 2")
# ## first page to convert is page 3
# ft_extract(path, "xpdf", "-f 3")
# 
# # use on output of ft_get() to extract pdf to text
# ## arxiv
# res <- ft_get('cond-mat/9309029', from = "arxiv")
# res2 <- ft_extract(res)
# res$arxiv$data
# res2$arxiv$data
# res2$arxiv$data$data[[1]]$data
# 
# ## biorxiv
# res <- ft_get('10.1101/012476')
# res2 <- ft_extract(res)
# res$biorxiv$data
# res2$biorxiv$data
# res2$biorxiv$data$data[[1]]$data
# ## End(Not run)
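
Since the examples above each extract one document at a time, a small wrapper can be used to loop over several paths. This is only a sketch: extract_many is a hypothetical helper, not part of fulltext, and it takes the extractor as an argument so that it makes no assumptions about how ft_extract behaves on failure.

```r
# Hypothetical helper (not part of fulltext): apply a single-file extractor
# to several PDF paths, recording a failure as NULL instead of stopping.
extract_many <- function(paths, extractor, ...) {
  results <- lapply(paths, function(p) {
    tryCatch(
      extractor(p, ...),
      error = function(e) {
        warning("extraction failed for ", p, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  names(results) <- basename(paths)  # label each result by file name
  results
}

# Intended use, assuming fulltext and xpdf are installed and the
# (placeholder) files exist:
# texts <- extract_many(c("a.pdf", "b.pdf"), fulltext::ft_extract, which = "xpdf")
```

Results are named by file name, so two paths with the same base name would receive duplicate names; use the full paths as names if that matters.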
