fulltext (version 0.1.6)

ft_get: Get full text

Description

ft_get is a one-stop shop for fetching the full text of articles, as either XML or PDF. There is specific support for PLOS via the rplos package, Entrez via the rentrez package, and arXiv via the aRxiv package. For other publishers, helper functions work out the links to full text based on user input. See Details for help on how to use this function.

Usage

ft_get(x, from = NULL, plosopts = list(), bmcopts = list(), entrezopts = list(), elifeopts = list(), cache = FALSE, backend = "rds", path = "~/.fulltext", ...)

Arguments

x
Identifiers for papers: DOIs (or other IDs) given as a list of character strings or as a character vector, OR an object of class ft, as returned from ft_search
from
Source to query. Optional.
plosopts
PLOS options. See plos_fulltext
bmcopts
BMC options. See bmc_xml
entrezopts
Entrez options. See entrez_search and entrez_fetch
elifeopts
eLife options
cache
(logical) Whether to cache results. If cache=TRUE, the raw XML (or whatever format the article is in) is written to disk, then read back from disk when further manipulations are done on the data. See also cache
backend
(character) One of rds, rcache, or redis
path
(character) Path to local folder. If the folder doesn't exist, we create it for you.
...
Further args passed on to GET
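
The cache, backend and path arguments describe a write-to-disk, read-back-later pattern. The following is only a minimal base R sketch of that idea using the rds format; it is not the package's internal code, and the directory, file name and XML string are placeholders made up for illustration.

```r
# Illustration only: an rds-style cache in base R, NOT fulltext's internals.
# The folder, file name and content below are hypothetical placeholders.
cache_dir <- file.path(tempdir(), "fulltext-cache-demo")

# Mirror the documented behaviour of `path`: create the folder if missing.
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)

raw_xml <- "<article><title>Example</title></article>"  # placeholder content
cache_file <- file.path(cache_dir, "example.rds")

saveRDS(raw_xml, cache_file)      # write the raw text to disk once
from_disk <- readRDS(cache_file)  # later steps read it back from disk

identical(raw_xml, from_disk)
```

Using tempdir() keeps the sketch from touching the real ~/.fulltext folder.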

Value

An S3 object of class ft_data, with one element per publisher. The result is split by publisher because the full-text format is consistent within a publisher, which should make downstream text mining easier, since each publisher's content may need different processing steps.
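
To make the per-publisher layout concrete, here is a small sketch that builds a stand-in list by hand and walks over it. The stand-in is an assumption for illustration: the field name dois and the overall shape are hypothetical, and a real result from ft_get carries more information than this.

```r
# Hypothetical stand-in for an ft_data result: a named list with one
# element per publisher. Field names here are assumptions, not the real spec.
res <- structure(
  list(
    hindawi       = list(dois = "10.1155/2014/292109"),
    f1000research = list(dois = "10.12688/f1000research.6522.1")
  ),
  class = "ft_data"
)

# Elements are reached by publisher name, as with res$hindawi in the Examples.
res$hindawi$dois

# Count DOIs per publisher; unclass() lets vapply treat it as a plain list.
n_per_publisher <- vapply(unclass(res), function(p) length(p$dois), integer(1))

# Publisher-specific processing can then branch on the element name.
for (publisher in names(res)) {
  cat(publisher, ":", n_per_publisher[[publisher]], "DOI(s)\n")
}
```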

Notes on specific publishers

  • arXiv - The IDs passed are not actually DOIs, though they look similar. Because we can't determine unambiguously that the IDs passed in are from arXiv.org, you must always set the from parameter for arXiv.

Details

There are various ways to use ft_get:
  • Pass in only DOIs and leave the from parameter NULL. This route first queries the Crossref API for the publisher of each DOI, then uses the appropriate method to fetch full text from that publisher. If no publisher is found for a DOI, a message tells you so.
  • Pass in DOIs (or other publication IDs) and set the from parameter. This route skips the extra Crossref API call needed to determine the publisher for each DOI, so it is faster. We go straight to getting full text based on the publisher.
  • Use ft_search to search for articles, then pass that output to this function, which uses the information in that object. This behaves like the previous option in that each DOI comes with publisher info, so we know how to get full text for each DOI.
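
The difference between the first two routes can be sketched in base R. This is an illustration under stated assumptions, not the package's code: resolve_source is a hypothetical helper, and the local prefix table stands in for the Crossref lookup that ft_get actually performs. The prefixes come from the example DOIs on this page.

```r
# Hypothetical lookup table standing in for the Crossref API query.
prefix_to_publisher <- c(
  "10.1371" = "plos",
  "10.7717" = "peerj",
  "10.7554" = "elife",
  "10.1186" = "bmc"
)

# Hypothetical helper: pick a source for each DOI.
resolve_source <- function(dois, from = NULL) {
  # Route 2: `from` is given, so no publisher lookup is needed.
  if (!is.null(from)) {
    return(setNames(rep(from, length(dois)), dois))
  }
  # Route 1: look up the publisher from the DOI prefix (text before "/").
  prefixes <- sub("/.*$", "", dois)
  sources <- unname(prefix_to_publisher[prefixes])
  not_found <- is.na(sources)
  if (any(not_found)) {
    message("No publisher found for: ",
            paste(dois[not_found], collapse = ", "))
  }
  setNames(sources, dois)
}

resolve_source("10.1371/journal.pone.0086169")        # lookup route
resolve_source("10.7717/peerj.228", from = "entrez")  # explicit source
resolve_source("10.9999/unknown.1")                   # made-up DOI, not found
```

The third route behaves like the second here, because an ft_search result already records where each DOI came from.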

Note that some publishers are available via Entrez, but often not for recent articles, where "recent" may mean anything from a few months to a year or so. For such articles, make sure to specify the publisher, or else you'll get back no data.

Examples

## Not run: 
# # If you just have DOIs and don't know the publisher
# ## PLOS
# ft_get('10.1371/journal.pone.0086169')
# ## PeerJ
# ft_get('10.7717/peerj.228')
# ## eLife
# ft_get('10.7554/eLife.03032')
# ## BMC
# ft_get(c('10.1186/2049-2618-2-7', '10.1186/2193-1801-3-7'))
# ## FrontiersIn
# res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009'))
# ## Hindawi - via Entrez
# res <- ft_get(c('10.1155/2014/292109','10.1155/2014/162024','10.1155/2014/249309'))
# ## F1000Research - via Entrez
# ft_get('10.12688/f1000research.6522.1')
# ## Two different publishers via Entrez - retains publisher names
# res <- ft_get(c('10.1155/2014/292109', '10.12688/f1000research.6522.1'))
# res$hindawi
# res$f1000research
# ## Pensoft
# ft_get('10.3897/zookeys.499.8360')
# ### you'll need to specify the publisher for a DOI from a recent publication
# ft_get('10.3897/zookeys.515.9332', from = "pensoft")
# ## Copernicus
# out <- ft_get(c('10.5194/angeo-31-2157-2013', '10.5194/bg-12-4577-2015'))
# out$copernicus
# ## arXiv - only pdf, you have to pass in the from parameter
# res <- ft_get(x='cond-mat/9309029', from = "arxiv", cache=TRUE, backend="rds")
# res %>% ft_extract
# ## bioRxiv - only pdf
# res <- ft_get(x='10.1101/012476')
# res$biorxiv
# ## Karger Publisher
# ft_get('10.1159/000369331')
# ## CogentOA Publisher
# ft_get('10.1080/23311916.2014.938430')
# ## MDPI Publisher
# ft_get('10.3390/nu3010063')
# ft_get('10.3390/nu7085279')
# ft_get(c('10.3390/nu3010063', '10.3390/nu7085279')) # not working, only getting 1
# 
# # If you know the publisher, give DOI and publisher
# ## by default, PLOS gives back XML
# ft_get('10.1371/journal.pone.0086169', from='plos')
# ## you can instead get json
# ft_get('10.1371/journal.pone.0086169', from='plos', plosopts=list(wt="json"))
# 
# (dois <- searchplos(q="*:*", fl='id',
#    fq=list('doc_type:full',"article_type:\"research article\""), limit=5)$data$id)
# ft_get(dois, from='plos')
# ft_get(c('10.7717/peerj.228','10.7717/peerj.234'), from='entrez')
# 
# # elife
# ft_get('10.7554/eLife.04300', from='elife')
# ft_get(c('10.7554/eLife.04300', '10.7554/eLife.03032'), from='elife')
# ## search for elife papers via Entrez
# dois <- ft_search("elife[journal]", from = "entrez")
# ft_get(dois)
# 
# # bmc
# ft_get('http://www.microbiomejournal.com/content/download/xml/2049-2618-2-7.xml', from='bmc')
# urls <- c('http://www.biomedcentral.com/content/download/xml/1471-2393-14-71.xml',
#  'http://www.springerplus.com/content/download/xml/2193-1801-3-7.xml',
#  'http://www.microbiomejournal.com/content/download/xml/2049-2618-2-7.xml')
# ft_get(urls, from='bmc')
# 
# # Frontiers in Pharmacology (publisher: Frontiers)
# doi <- '10.3389/fphar.2014.00109'
# ft_get(doi, from="entrez")
# 
# # Hindawi Journals
# ft_get(c('10.1155/2014/292109','10.1155/2014/162024','10.1155/2014/249309'), from='entrez')
# res <- ft_search(query='ecology', from='crossref', limit=50,
#                  crossrefopts = list(filter=list(has_full_text = TRUE,
#                                                  member=98,
#                                                  type='journal-article')))
# 
# out <- ft_get(res$crossref$data$DOI[1:20], from='entrez')
# 
# # Frontiers Publisher - Frontiers in Aging Neuroscience
# res <- ft_get("10.3389/fnagi.2014.00130", from='entrez')
# res$entrez
# 
# # Search entrez, get some DOIs
# (res <- ft_search(query='ecology', from='entrez'))
# res$entrez$data$doi
# ft_get(res$entrez$data$doi[1], from='entrez')
# ft_get(res$entrez$data$doi[1:3], from='entrez')
# 
# # Caching
# res <- ft_get('10.1371/journal.pone.0086169', from='plos', cache=TRUE, backend="rds")
# 
# # Search entrez, and pass to ft_get()
# (res <- ft_search(query='ecology', from='entrez'))
# ft_get(res)
# ## End(Not run)
