Note: a newer version (2.0) of this package is available.
  _____     .__  .__   __                   __
_/ ____\_ __|  | |  |_/  |_  ____ ___  ____/  |_
\   __\  |  \  | |  |\   __\/ __ \\  \/  /\   __\
 |  | |  |  /  |_|  |_|  | \  ___/ >    <  |  |
 |__| |____/|____/____/__|  \___  >__/\_ \ |__|
                                \/      \/

Get full text articles from lots of places

Check out the fulltext manual to get started.


rOpenSci has a number of R packages to get either full text, metadata, or both from various publishers. The goal of fulltext is to integrate these packages to create a single interface to many data sources.

fulltext makes it easy to do text-mining by supporting the following steps:

  • Search for articles - ft_search
  • Fetch articles - ft_get
  • Get links for full text articles (xml, pdf) - ft_links
  • Extract text from articles / convert formats - ft_extract
  • Collect all texts into a data.frame - ft_table
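
As a rough end-to-end sketch of the steps above, the function below chains search, fetch, and tabulation. The name fetch_texts and its arguments are illustrative and not part of fulltext. The sketch assumes that ft_get() accepts the object returned by ft_search(), and that ft_table() called with no arguments reads from the current cache directory; check ?ft_get and ?ft_table for your installed version before relying on either.

```r
# Hypothetical helper (not part of fulltext): search, download, then tabulate.
fetch_texts <- function(query, from = "plos", limit = 3) {
  # Step 1: search a single source for matching articles.
  res <- fulltext::ft_search(query = query, from = from, limit = limit)

  # Step 2: download full text for the hits. Assumption: ft_get() accepts
  # a search result directly; files are written to the fulltext cache.
  fulltext::ft_get(res)

  # Step 3: collect cached texts into a data.frame. Assumption: the default
  # path is the cache directory, so previously cached files are included too.
  fulltext::ft_table()
}

# Usage (needs the fulltext package installed and network access):
# texts <- fetch_texts("ecology")
# str(texts)
```

Wrapping the calls in a function keeps the sketch free of side effects until it is called, which is convenient because every step touches either the network or the on-disk cache.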

Previously supported use cases, extracted out to other packages:

  • Collect bits of articles that you actually need - moved to package pubchunks
  • Supplementary data from papers has been moved to the suppdata package

It's easy to go from the outputs of ft_get to text-mining packages such as tm and quanteda.

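Data sources in fulltext include:

  • PLoS
  • BMC
  • Crossref
  • Entrez
  • arXiv
  • bioRxiv
  • Europe PMC
  • Scopus
  • Microsoft Academic
  • Further publishers available via Pubmed
  • We will add more as publishers open up and as we have time. See the master list.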

Authentication: A number of publishers require authentication via API key, and some use even stricter processes that check IP addresses. We are working on supporting the various authentication schemes for different publishers, but all the open access (OA) content is already easily available. See the Authentication section in ?fulltext-package after loading the package.

We'd love your feedback. Let us know what you think in the issue tracker.

Article full text formats by publisher: https://github.com/ropensci/fulltext/blob/master/vignettes/formats.Rmd

Installation

Stable version from CRAN

install.packages("fulltext")

Development version from GitHub

devtools::install_github("ropensci/fulltext")

Load library

library('fulltext')

Search

ft_search() - get metadata on a search query.

ft_search(query = 'ecology', from = 'crossref')
#> Query:
#>   [ecology] 
#> Found:
#>   [PLoS: 0; BMC: 0; Crossref: 201140; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] 
#> Returned:
#>   [PLoS: 0; BMC: 0; Crossref: 10; Entrez: 0; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0]
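
The printed summary reports counts per source. To work with those counts programmatically, something like the sketch below may help. It runs on a mock object rather than a live query: the shape (a named list with one element per source, each holding found and data) is an assumption inferred from the printed output and from the res1$entrez$data$doi access used later in this document. The helper summarise_sources and the placeholder DOIs are hypothetical.

```r
# Mock stand-in for an ft_search() result (assumed shape, not real data):
# one list element per source, each with a `found` count and a `data` table.
mock_res <- list(
  crossref = list(
    found = 201140L,
    data = data.frame(
      doi = sprintf("10.0000/placeholder.%d", 1:10),  # placeholder DOIs
      stringsAsFactors = FALSE
    )
  ),
  plos = list(found = 0L, data = NULL)
)

# Hypothetical helper: one row per source with found and returned counts.
summarise_sources <- function(res) {
  data.frame(
    source = names(res),
    found = vapply(res, function(s) as.integer(s$found), integer(1),
                   USE.NAMES = FALSE),
    # NROW(NULL) is 0, so sources without data count as zero returned.
    returned = vapply(res, function(s) NROW(s$data), integer(1),
                      USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

counts <- summarise_sources(mock_res)
print(counts)
```

If the assumed shape holds, the same helper can be applied to a real search result in place of mock_res.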

Get full text links

ft_links() - get links for articles (xml and pdf).

res1 <- ft_search(query = 'biology', from = 'entrez', limit = 5)
ft_links(res1)
#> <fulltext links>
#> [Found] 5 
#> [IDs] ID_31472450 ID_30692680 ID_30656621 ID_29887338 ID_28674916 ...

Or pass in DOIs directly

ft_links(res1$entrez$data$doi, from = "entrez")
#> <fulltext links>
#> [Found] 5 
#> [IDs] ID_31472450 ID_30692680 ID_30656621 ID_29887338 ID_28674916 ...

Get full text

ft_get() - get full or partial text of articles.

ft_get('10.7717/peerj.228')
#> <fulltext text>
#> [Docs] 1 
#> [Source] ext - /Users/sckott/Library/Caches/R/fulltext 
#> [IDs] 10.7717/peerj.228 ...

Extract chunks

library(pubchunks)
x <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
x %>% ft_collect() %>% pub_chunks("publisher") %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife
#> 
#> $elife$`10.7554/eLife.32763`
#>                          publisher .publisher
#> 1 eLife Sciences Publications, Ltd      elife

Get multiple fields at once

x %>% ft_collect() %>% pub_chunks(c("doi","publisher")) %>% pub_tabularize()
#> $elife
#> $elife$`10.7554/eLife.03032`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd      elife
#> 
#> $elife$`10.7554/eLife.32763`
#>                   doi                        publisher .publisher
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd      elife

Pull out the data.frames

x %>%
  ft_collect() %>% 
  pub_chunks(c("doi", "publisher", "author")) %>%
  pub_tabularize() %>%
  .$elife
#> $`10.7554/eLife.03032`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.03032 eLife Sciences Publications, Ltd                  Ya
#>   authors.surname authors.given_names.1 authors.surname.1 authors.given_names.2
#> 1            Zhao                 Jimin               Lin               Beiying
#>   authors.surname.2 authors.given_names.3 authors.surname.3
#> 1                Xu                  Sida                Hu
#>   authors.given_names.4 authors.surname.4 authors.given_names.5
#> 1                   Xue             Zhang                Ligang
#>   authors.surname.5 .publisher
#> 1                Wu      elife
#> 
#> $`10.7554/eLife.32763`
#>                   doi                        publisher authors.given_names
#> 1 10.7554/eLife.32763 eLife Sciences Publications, Ltd             Natasha
#>   authors.surname authors.given_names.1 authors.surname.1 authors.given_names.2
#> 1          Mhatre                Robert            Malkin                Rittik
#>   authors.surname.2 authors.given_names.3 authors.surname.3
#> 1               Deb                Rohini      Balakrishnan
#>   authors.given_names.4 authors.surname.4 .publisher
#> 1                Daniel            Robert      elife
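
The per-article data.frames above have different numbers of author columns, so a plain rbind() would fail on them. One way to stack them is to pad missing columns with NA first. The sketch below uses base R only and an abbreviated stand-in for the list shown above; bind_fill is a hypothetical helper, not part of fulltext or pubchunks.

```r
# Abbreviated stand-in for the `$elife` list above: one single-row
# data.frame per DOI, with a different number of author columns in each.
tabs <- list(
  `10.7554/eLife.03032` = data.frame(
    doi = "10.7554/eLife.03032",
    authors.surname = "Zhao",
    authors.surname.1 = "Lin",
    authors.surname.2 = "Xu",
    stringsAsFactors = FALSE
  ),
  `10.7554/eLife.32763` = data.frame(
    doi = "10.7554/eLife.32763",
    authors.surname = "Mhatre",
    authors.surname.1 = "Malkin",
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: pad each data.frame to the union of all column
# names (filling with NA), then stack the rows.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA_character_
    }
    d[all_cols]  # same column order everywhere so rbind() lines up
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  out
}

combined <- bind_fill(tabs)
print(combined)
```

The result has one row per article, with NA where an article has fewer authors than the widest one.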

Extract text from PDFs

Some results from ft_search() have full text available in text, XML, or other machine-readable formats, while others are open access but available only as PDF. This package includes convenience functions to help extract text from PDFs, both locally and remotely.

Locally, ft_extract() uses code adapted from the tm package and two PDF-to-text parsing backends:

pdf <- system.file("examples", "example2.pdf", package = "fulltext")
ft_extract(pdf)
#> <document>/Library/Frameworks/R.framework/Versions/3.6/Resources/library/fulltext/examples/example2.pdf
#>   Title: pone.0107412 1..10
#>   Producer: Acrobat Distiller 9.0.0 (Windows); modified using iText 5.0.3 (c) 1T3XT BVBA
#>   Creation date: 2014-09-18

Interoperability with other packages downstream

cache_options_set(path = (td <- 'foobar'))
#> $cache
#> [1] TRUE
#> 
#> $backend
#> [1] "ext"
#> 
#> $path
#> [1] "/Users/sckott/Library/Caches/R/foobar"
#> 
#> $overwrite
#> [1] FALSE
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), type = "pdf")
library(readtext)
x <- readtext::readtext(file.path(cache_options_get()$path, "*.pdf"))
library(quanteda)
quanteda::corpus(x)
#> Corpus consisting of 2 documents and 0 docvars.

Contributors

Meta

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Version: 1.4.0

License: MIT + file LICENSE

Maintainer: Scott Chamberlain

Last Published: December 13th, 2019

Monthly Downloads: 56

Functions in fulltext (1.4.0)

  • eupmc: Europe PMC utilities
  • ft_browse_sections: This function is defunct.
  • ft_chunks: Extract chunks of data from articles
  • fulltext-defunct: Defunct functions in fulltext
  • fulltext-package: Fulltext search and retrieval of scholarly texts.
  • ft_tabularize: Extract chunks of data from articles
  • ft_providers: Search for information on journals or publishers.
  • pdfx: This function is defunct.
  • ft_links: Get full text links
  • ft_table: Collect metadata and text into a data.frame
  • %>%: Pipe operator
  • ft_extract_corpus: This function is defunct.
  • ft_search: Search for full text
  • ft_collect: Collect article text from local files
  • ft_serialize: Serialize raw text to other formats, including to disk
  • ft_extract: Extract text from a single pdf document
  • ft_get-warnings: fulltext warnings details
  • get_text: This function is defunct.
  • microsoft-internals: Microsoft Academic search
  • ft_get: Download full text articles
  • ft_type_sum: Type summary
  • ft_get_si: This function is defunct.
  • scopus_search: Scopus search
  • ftxt_cache: Inspect and manage cached files
  • tabularize: This function is defunct.
  • cache: Set or get cache options
  • bmc_search: Search for gene sequences available for a species from NCBI.
  • as.ft_data: Coerce directory of papers to ft_data object
  • collect: This function is defunct.
  • biorxiv_search: Biorxiv search
  • cache_file_info: Get information on possibly bad files in your cache
  • chunks: This function is defunct.
  • ft_browse: Browse an article in your default browser
  • ft_abstract: Get abstracts