# fulltext
Get full text articles from lots of places.

Check out the fulltext manual to get started.
rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of `fulltext` is to integrate these packages to create a single interface to many data sources.

`fulltext` makes it easy to do text-mining by supporting the following steps:
- Search for articles - `ft_search()`
- Fetch articles - `ft_get()`
- Get links for full text articles (XML, PDF) - `ft_links()`
- Extract text from articles / convert formats - `ft_extract()`
- Collect all texts into a data.frame - `ft_table()`
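The steps above chain together. A minimal sketch, assuming network access and using the `entrez` source; the `res$entrez$data$doi` accessor follows the pattern shown in the examples later in this README:

```r
library('fulltext')

# search for articles, then fetch full text for the DOIs found
res <- ft_search(query = "ecology", from = "entrez", limit = 3)
out <- ft_get(res$entrez$data$doi, from = "entrez")
```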
Previously supported use cases, extracted out to other packages:

- Collecting the article sections you actually need - moved to the pubchunks package
- Fetching supplementary data from papers - moved to the suppdata package
It's easy to go from the outputs of `ft_get()` to text-mining packages such as tm and quanteda.
Data sources in `fulltext` include:

- Crossref - via the rcrossref package
- Public Library of Science (PLOS) - via the rplos package
- Biomed Central
- arXiv - via the aRxiv package
- bioRxiv - via the biorxivr package
- PMC/PubMed via Entrez - via the rentrez package
- Many more are supported via the above sources (e.g., Royal Society Open Science is available via PubMed)
- More will be added as publishers open up and as we have time. See the master list here
Authentication: a number of publishers require authentication via API key, and some use even more draconian processes involving IP address checks. We are working on supporting the various authentication schemes of different publishers, but of course all the OA content is already easily available. See the Authentication section in `?fulltext-package` after loading the package.
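API keys are typically supplied as environment variables. A hedged sketch: `ENTREZ_KEY` is the variable the rentrez package documents; other publishers use their own variable names, which are listed in the Authentication section of `?fulltext-package`:

```r
# Set an API key for the current R session. For persistence across
# sessions, put the equivalent line in your ~/.Renviron file instead.
Sys.setenv(ENTREZ_KEY = "your-entrez-api-key")
```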
We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd
## Installation

Stable version from CRAN:

```r
install.packages("fulltext")
```

Development version from GitHub:

```r
devtools::install_github("ropensci/fulltext")
```
Load the library:

```r
library('fulltext')
```
## Search

`ft_search()` - get metadata on a search query.

```r
ft_search(query = 'ecology', from = 'crossref')
#> Query:
#>   [ecology]
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 201140; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
```
## Get full text links

`ft_links()` - get links for articles (XML and PDF).

```r
res1 <- ft_search(query = 'biology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 5
#> [IDs] ID_31472450 ID_30692680 ID_30656621 ID_29887338 ID_28674916 ...
```

Or pass in DOIs directly:

```r
ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 5
#> [IDs] ID_31472450 ID_30692680 ID_30656621 ID_29887338 ID_28674916 ...
```
## Get full text

`ft_get()` - get full or partial text of articles.

```r
ft_get('10.7717/peerj.228')
#> <fulltext text>
#> [Docs] 1
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext
#> [IDs] 10.7717/peerj.228 ...
```
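`ft_get()` stores article files in the disk cache shown in `[Source]` above; `ft_collect()` reads them into memory so downstream functions can work with the text. A minimal sketch:

```r
x <- ft_get('10.7717/peerj.228')
x <- ft_collect(x)  # read the cached file(s) into the object
```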
## Extract chunks

```r
library(pubchunks)
library(magrittr)  # for the %>% pipe

x <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
x %>% ft_collect() %>% pub_chunks("publisher") %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
#>
#> $elife$`10.7554/eLife.32763`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
```
Get multiple fields at once:

```r
x %>% ft_collect() %>% pub_chunks(c("doi", "publisher")) %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd      elife
#>
#> $elife$`10.7554/eLife.32763`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd      elife
```
Pull out the data.frames:

```r
x %>%
  ft_collect() %>%
  pub_chunks(c("doi", "publisher", "author")) %>%
  pub_tabularize() %>%
  .$elife
#> $`10.7554/eLife.03032`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd                  Ya
#>   authors.surname authors.given_names.1 authors.surname.1 authors.given_names.2
#> 1            Zhao                 Jimin               Lin               Beiying
#>   authors.surname.2 authors.given_names.3 authors.surname.3
#> 1                Xu                  Sida                Hu
#>   authors.given_names.4 authors.surname.4 authors.given_names.5
#> 1                   Xue             Zhang                Ligang
#>   authors.surname.5 .publisher
#> 1                Wu      elife
#>
#> $`10.7554/eLife.32763`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd             Natasha
#>   authors.surname authors.given_names.1 authors.surname.1 authors.given_names.2
#> 1          Mhatre                Robert            Malkin                Rittik
#>   authors.surname.2 authors.given_names.3 authors.surname.3
#> 1               Deb                Rohini      Balakrishnan
#>   authors.given_names.4 authors.surname.4 .publisher
#> 1                Daniel            Robert      elife
```
## Extract text from PDFs

There will be cases in which some results you find via `ft_search()` have full text available in XML or other machine-readable formats, while others are open access but available only as PDFs. This package includes a series of convenience functions to help extract text from PDFs, both locally and remotely.

Locally, using code adapted from the tm package, with two PDF-to-text parsing backends:

```r
pdf <- system.file("examples", "example2.pdf", package = "fulltext")
ft_extract(pdf)
#> <document>/Library/Frameworks/R.framework/Versions/3.6/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18
```
## Interoperability with other packages downstream

```r
cache_options_set(path = (td <- 'foobar'))
#> $cache
#> [1] TRUE
#>
#> $backend
#> [1] "ext"
#>
#> $path
#> [1] "/Users/sckott/Library/Caches/R/foobar"
#>
#> $overwrite
#> [1] FALSE

res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), type = "pdf")

library(readtext)
x <- readtext::readtext(file.path(cache_options_get()$path, "*.pdf"))

library(quanteda)
quanteda::corpus(x)
#> Corpus consisting of 2 documents and 0 docvars.
```
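`ft_table()` (listed in the overview above but not demonstrated here) gathers previously downloaded articles from the cache into a single data.frame ready for text-mining. A hedged sketch; see `?ft_table` for the actual columns returned:

```r
# After one or more ft_get() calls have populated the cache:
df <- ft_table()
head(df)
```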
## Contributors

- [Scott Chamberlain](https://github.com/sckott)
- [Will Pearse](https://github.com/willpearse)
- [Katrin Leinweber](https://github.com/katrinleinweber)
## Meta

- Please report any issues or bugs.
- License: MIT
- Get citation information for `fulltext` in R: `citation(package = 'fulltext')`
- Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.