neonstore

neonstore provides quick access to, and persistent storage of, NEON data tables. It emphasizes simplicity and a clean data provenance trail; see the Provenance section below.

Installation

Install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("cboettig/neonstore")

Quickstart

Discover data products of interest:

library(neonstore)
products <- neon_products()

i <- grepl("Populations", products$themes)
products[i, c("productCode", "productName")]
#> # A tibble: 50 x 2
#>    productCode   productName                                  
#>    <chr>         <chr>                                        
#>  1 DP1.00033.001 Phenology images                             
#>  2 DP1.10003.001 Breeding landbird point counts               
#>  3 DP1.10010.001 Coarse downed wood log survey                
#>  4 DP1.10020.001 Ground beetle sequences DNA barcode          
#>  5 DP1.10022.001 Ground beetles sampled from pitfall traps    
#>  6 DP1.10026.001 Plant foliar physical and chemical properties
#>  7 DP1.10033.001 Litterfall and fine woody debris sampling    
#>  8 DP1.10038.001 Mosquito sequences DNA barcode               
#>  9 DP1.10041.001 Mosquito-borne pathogen status               
#> 10 DP1.10043.001 Mosquitoes sampled from CO2 traps            
#> # … with 40 more rows
 
i <- grepl("bird", products$keywords)
products[i, c("productCode", "productName")]
#> # A tibble: 1 x 2
#>   productCode   productName                   
#>   <chr>         <chr>                         
#> 1 DP1.10003.001 Breeding landbird point counts
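
Note that grepl() is case-sensitive by default, so a search for "bird" would miss a keyword spelled "Birds". The following is a minimal sketch of a case-insensitive search across several columns. It runs on a toy data frame rather than the real neon_products() output; find_products() is a hypothetical helper, not part of neonstore, and the keyword strings are invented for illustration.

```r
# Toy stand-in for the table returned by neon_products(); the keyword strings
# are invented for illustration.
products <- data.frame(
  productCode = c("DP1.10003.001", "DP1.10022.001", "DP1.10043.001"),
  productName = c("Breeding landbird point counts",
                  "Ground beetles sampled from pitfall traps",
                  "Mosquitoes sampled from CO2 traps"),
  keywords = c("Birds, point counts", "beetles, pitfall", "mosquitoes, vectors"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive match against several text columns.
find_products <- function(products, pattern,
                          columns = c("productName", "keywords")) {
  hits <- Reduce(`|`, lapply(columns, function(col) {
    grepl(pattern, products[[col]], ignore.case = TRUE)
  }))
  products[hits, c("productCode", "productName")]
}

birds <- find_products(products, "bird")
print(birds)
```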

Download all data files in the bird survey data product:

library(neonstore)
neon_download("DP1.10003.001")

Now, view your store of NEON products:

neon_store()
#> # A tibble: 7 x 3
#>   product       table                   n_files
#>   <chr>         <chr>                     <int>
#> 1 DP1.10003.001 brd_countdata-expanded      204
#> 2 DP1.10003.001 brd_perpoint-basic          204
#> 3 DP1.10003.001 brd_references-expanded     204
#> 4 DP1.10003.001 EML-                        204
#> 5 DP1.10003.001 readme-                     204
#> 6 DP0.10003.001 validation-                 204
#> 7 DP1.10003.001 variables-                  204

These files persist between sessions, so you only need to download again to retrieve updates. neon_store() can take arguments to filter by product or by a pattern in the table name, e.g. neon_store(table = "brd").

Once you have determined the table of interest, you can read all of its component files into a single data.frame:

neon_read("brd_countdata-expanded")
#> Rows: 164,782
#> Columns: 24
#> Delimiter: ","
#> chr  [19]: uid, namedLocation, domainID, siteID, plotID, plotType, pointID, eventID, targe...
#> dbl  [ 3]: pointCountMinute, observerDistance, clusterSize
#> lgl  [ 1]: clusterCode
#> dttm [ 1]: startDate
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 164,782 x 24
#>    uid   namedLocation domainID siteID plotID plotType pointID
#>    <chr> <chr>         <chr>    <chr>  <chr>  <chr>    <chr>  
#>  1 ad84… BART_025.bir… D01      BART   BART_… distrib… C1     
#>  2 2115… BART_025.bir… D01      BART   BART_… distrib… C1     
#>  3 0592… BART_025.bir… D01      BART   BART_… distrib… C1     
#>  4 8e5a… BART_025.bir… D01      BART   BART_… distrib… C1     
#>  5 9b07… BART_025.bir… D01      BART   BART_… distrib… C1     
#>  6 145f… BART_025.bir… D01      BART   BART_… distrib… B1     
#>  7 f70e… BART_025.bir… D01      BART   BART_… distrib… B1     
#>  8 648b… BART_025.bir… D01      BART   BART_… distrib… B1     
#>  9 2295… BART_025.bir… D01      BART   BART_… distrib… B1     
#> 10 cc6d… BART_025.bir… D01      BART   BART_… distrib… A1     
#> # … with 164,772 more rows, and 17 more variables: startDate <dttm>,
#> #   eventID <chr>, pointCountMinute <dbl>, targetTaxaPresent <chr>,
#> #   taxonID <chr>, scientificName <chr>, taxonRank <chr>, vernacularName <chr>,
#> #   family <chr>, nativeStatusCode <chr>, observerDistance <dbl>,
#> #   detectionMethod <chr>, visualConfirmation <chr>, sexOrAge <chr>,
#> #   clusterSize <dbl>, clusterCode <lgl>, identifiedBy <chr>

Two other functions access additional API endpoints that may also be of interest. neon_sites() returns a data.frame of site information, including site descriptions and the ecological domain that each site falls into:

neon_sites()
#> # A tibble: 81 x 11
#>    siteCode siteName siteDescription siteType siteLatitude siteLongitude
#>    <chr>    <chr>    <chr>           <chr>           <dbl>         <dbl>
#>  1 ABBY     Abby Ro… Abby Road       RELOCAT…         45.8        -122. 
#>  2 ARIK     Arikare… Arikaree River  CORE             39.8        -102. 
#>  3 BARC     Barco L… Barco Lake      CORE             29.7         -82.0
#>  4 BARR     Utqiaġv… Utqiaġvik       RELOCAT…         71.3        -157. 
#>  5 BART     Bartlet… Bartlett Exper… RELOCAT…         44.1         -71.3
#>  6 BIGC     Upper B… Upper Big Creek RELOCAT…         37.1        -119. 
#>  7 BLAN     Blandy … Blandy Experim… RELOCAT…         39.0         -78.0
#>  8 BLDE     Blackta… Blacktail Deer… CORE             45.0        -111. 
#>  9 BLUE     Blue Ri… Blue River      RELOCAT…         34.4         -96.6
#> 10 BLWA     Black W… Black Warrior … RELOCAT…         32.5         -87.8
#> # … with 71 more rows, and 5 more variables: stateCode <chr>, stateName <chr>,
#> #   domainCode <chr>, domainName <chr>, dataProducts <list>

Lastly, neon_products() returns a table listing all NEON products, which may be useful for data discovery or for additional metadata about any given product:

neon_products()
#> # A tibble: 181 x 23
#>    productCodeLong productCode productCodePres… productName productDescript…
#>    <chr>           <chr>       <chr>            <chr>       <chr>           
#>  1 NEON.DOM.SITE.… DP1.00001.… NEON.DP1.00001   2D wind sp… Two-dimensional…
#>  2 NEON.DOM.SITE.… DP1.00002.… NEON.DP1.00002   Single asp… Air temperature…
#>  3 NEON.DOM.SITE.… DP1.00003.… NEON.DP1.00003   Triple asp… Air temperature…
#>  4 NEON.DOM.SITE.… DP1.00004.… NEON.DP1.00004   Barometric… Barometric pres…
#>  5 NEON.DOM.SITE.… DP1.00005.… NEON.DP1.00005   IR biologi… Infrared temper…
#>  6 NEON.DOM.SITE.… DP1.00006.… NEON.DP1.00006   Precipitat… Precipitation i…
#>  7 NEON.DOM.SITE.… DP1.00007.… NEON.DP1.00007   3D wind sp… Three-dimension…
#>  8 NEON.DOM.SITE.… DP1.00010.… NEON.DP1.00010   3D wind at… Measurement of …
#>  9 NEON.DOM.SITE.… DP1.00013.… NEON.DP1.00013   Wet deposi… Total dissolved…
#> 10 NEON.DOM.SITE.… DP1.00014.… NEON.DP1.00014   Shortwave … Total, direct b…
#> # … with 171 more rows, and 18 more variables: productStatus <chr>,
#> #   productCategory <chr>, productHasExpanded <lgl>,
#> #   productScienceTeamAbbr <chr>, productScienceTeam <chr>,
#> #   productPublicationFormatType <chr>, productAbstract <chr>,
#> #   productDesignDescription <chr>, productStudyDescription <chr>,
#> #   productBasicDescription <chr>, productExpandedDescription <chr>,
#> #   productSensor <chr>, productRemarks <chr>, themes <chr>, changeLogs <list>,
#> #   specs <list>, keywords <chr>, siteCodes <list>

Design Details / comparison to neonUtilities

neonstore is not meant as a replacement for the neonUtilities package developed by NEON staff. neonUtilities performs a range of product-specific data querying, parsing, and data manipulation beyond what is provided by NEON’s API or web interface. neonUtilities also provides other utilities for working with NEON data beyond the scope of the NEON API or the data download/ingest process. While this processing is undoubtedly useful, it may make it difficult to compare results or analyses based on data downloaded and accessed using the neonUtilities R package with analyses based on data accessed directly from the web interface, the API, or another tool (or even a different release of neonUtilities).

By contrast, neonstore aims to do far less: it merely automates the download of individual NEON data files. Whereas neonUtilities by default “stacks” these raw files into single tables and discards the raw data, neonstore keeps only the raw files in the store and stacks the individual tables “on demand” using neon_read(). neon_read() is a thin wrapper around the vroom package (Hester & Wickham, 2020), which uses the ALTREP mechanism in R to provide very fast reads of rectangular text data and trivially handles the case of a single table being broken across many files. Some NEON tables are not entirely consistent in their use of columns across the individual site-month files, so neon_read() transparently checks for this, reading in groups of files that share all matching columns with vroom before binding the groups together. Keeping the raw files makes it easier to trace an analysis back to the original input data, and easier to update input data files without either downloading and stacking the whole data product from scratch or keeping track of some previously downloaded data file.
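
The group-then-bind idea described above can be illustrated with a small sketch. This is not the neon_read() implementation: it uses base R's read.csv() in place of vroom, the helper read_grouped() is hypothetical, and the three CSV files are toy data written to a temporary directory. Files are grouped by their header line, each group is read and stacked, and the groups are then bound together with missing columns filled with NA.

```r
# Write three small CSV files to a temporary directory; the third has an
# extra column, mimicking an inconsistent site-month file.
dir <- tempfile("neon_sketch_")
dir.create(dir)
write.csv(data.frame(siteID = "BART", count = 1:2),
          file.path(dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(siteID = "HARV", count = 3:4),
          file.path(dir, "b.csv"), row.names = FALSE)
write.csv(data.frame(siteID = "ABBY", count = 5L, remarks = "windy"),
          file.path(dir, "c.csv"), row.names = FALSE)
files <- list.files(dir, pattern = "[.]csv$", full.names = TRUE)

# Hypothetical helper: group files by header line, read each group, then bind
# the groups, filling columns that a group lacks with NA.
read_grouped <- function(files) {
  headers <- vapply(files, function(f) readLines(f, n = 1), character(1))
  groups <- split(files, headers)
  tables <- lapply(groups, function(group) {
    parts <- lapply(group, read.csv, stringsAsFactors = FALSE)
    do.call(rbind, unname(parts))
  })
  all_cols <- unique(unlist(lapply(tables, names)))
  filled <- lapply(tables, function(tbl) {
    for (col in setdiff(all_cols, names(tbl))) tbl[[col]] <- NA
    tbl[all_cols]
  })
  out <- do.call(rbind, unname(filled))
  rownames(out) <- NULL
  out
}

stacked <- read_grouped(files)
print(stacked)
```

Grouping by header first means each group can be read in one pass with a consistent column layout, so the NA-filling step only has to run once per group rather than once per file.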

A few other differences are also worth noting.

  • neonstore aims to provide persistent storage, writing raw data files to the appropriate app directory for your operating system (see rappdirs, Ratnakumar et al., 2016). More details about this can be found in Provenance, below.
  • neon_download() provides clean and concise progress bars for the two key processes involved: querying the API to obtain download URLs (which involves no large data transfer but counts against API rate limiting, see below), and the actual file downloads.
  • neon_download() will verify the integrity of file downloads against the MD5 hashes provided.
  • neon_download() will omit downloads of any existing data files in the local store.
  • You can request multiple products at once using vector notation, though API rate limiting may interfere with large requests.
  • neon_download() uses curl::curl_download() instead of the downloader package used in neonUtilities; the latter can be finicky on Windows and older versions of R.
  • neonstore has slightly lighter dependencies: only vroom and httr, and packages already used by one of those two (curl, openssl).
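
The hash verification and skip-existing behaviour listed above can be sketched with base R alone. This is an illustration under stated assumptions, not the logic neon_download() actually uses: needs_download() is a hypothetical helper, and expected_md5 stands in for the MD5 hash that accompanies a download URL.

```r
# Hypothetical helper: decide whether a file still needs to be fetched.
# `expected_md5` stands in for the hash reported alongside a download URL.
needs_download <- function(path, expected_md5) {
  if (!file.exists(path)) return(TRUE)
  actual <- unname(tools::md5sum(path))
  !identical(actual, expected_md5)
}

# Demonstrate with a local file standing in for a downloaded one.
path <- tempfile(fileext = ".csv")
writeLines("siteID,count", path)
good_hash <- unname(tools::md5sum(path))

intact  <- needs_download(path, good_hash)            # present, hash matches
corrupt <- needs_download(path, "not-the-real-hash")  # present, hash differs
absent  <- needs_download(tempfile(), good_hash)      # not in the store
print(c(intact = intact, corrupt = corrupt, absent = absent))
```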

As with neonUtilities, you can optionally include site and date filters, e.g. to request only records more recent than a certain date. Doing so preserves API quota and improves speed (see Note on API limits, below). Note that neonUtilities is far more widely tested and has extensive error handling tailored to individual data products.

Provenance

Because neonstore only stores raw data products as returned from the NEON API, it can easily determine which files have already been downloaded and download only new files, without requiring the user to specify particular dates. (It must still query the API for all the metadata in the requested date range.) This same modular approach also makes it easy to track data provenance, an essential element of reproducibility when comparing results across other analyses of the NEON data.

We can list precisely which component files are being read in by neon_read() by consulting neon_index():

raw_files <- neon_index(table = "brd_countdata-expanded", hash="md5")
raw_files
#> # A tibble: 204 x 9
#>    product  site  table    type   ext   month  timestamp path          hash     
#>    <chr>    <chr> <chr>    <chr>  <chr> <chr>  <chr>     <chr>         <chr>    
#>  1 DP1.100… BART  brd_cou… expan… csv   2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  2 DP1.100… BART  brd_cou… expan… csv   2016-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  3 DP1.100… BART  brd_cou… expan… csv   2017-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  4 DP1.100… BART  brd_cou… expan… csv   2018-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  5 DP1.100… BART  brd_cou… expan… csv   2019-… 20191205… /tmp/RtmpKHJ… hash://m…
#>  6 DP1.100… HARV  brd_cou… expan… csv   2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  7 DP1.100… HARV  brd_cou… expan… csv   2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  8 DP1.100… HARV  brd_cou… expan… csv   2016-… 20191107… /tmp/RtmpKHJ… hash://m…
#>  9 DP1.100… HARV  brd_cou… expan… csv   2017-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 10 DP1.100… HARV  brd_cou… expan… csv   2018-… 20191107… /tmp/RtmpKHJ… hash://m…
#> # … with 194 more rows

neon_read() is a relatively trivial function that simply passes this file list to vroom::vroom(), a fast, vectorized parser that can easily read in a single table that is broken into many separate files.

Imagine instead that we follow the common pattern of downloading these raw files, stacking and possibly cleaning the data, and saving only this derived product while discarding the individual files. Now imagine that a second researcher, at some later date, queries the API over the same reported range of dates and sites and uses the same software package to stack the tables, only to discover that the resulting table is somehow different from ours (e.g. by comparing file hashes). Pinpointing the source of the discrepancy would be challenging and labor-intensive.

In contrast, the same detective work would be easy with the neonstore file list. We can confirm whether the API has returned the same number of raw files with the same names and, better, verify the integrity of the contents by comparing hashes of the files now being returned with those recorded by neon_index(). In this way, we could determine whether any additional files had been included and pinpoint any files that may have changed.

As such, users might want to store the neon_index() data.frame for the table(s) they have used as part of their analysis, including the individual file hashes. One can also generate a zip of all the data files for archival purposes. (Note that NEON is an Open Data provider, see LICENCE.)

write.csv(raw_files, "index.csv")
zip("brd_countdata.zip", raw_files$path)
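
The comparison of an archived index against a fresh one, described above, might look something like the following sketch. It assumes only that each index is a data frame with path and hash columns, as in raw_files; compare_index() is a hypothetical helper, and the file names and hash strings are invented.

```r
# Hypothetical helper: compare an archived index with a freshly generated one.
# Files are matched by base name so the two stores may live in different
# directories.
compare_index <- function(old, new) {
  old_hash <- setNames(old$hash, basename(old$path))
  new_hash <- setNames(new$hash, basename(new$path))
  shared <- intersect(names(old_hash), names(new_hash))
  list(
    added   = setdiff(names(new_hash), names(old_hash)),
    removed = setdiff(names(old_hash), names(new_hash)),
    changed = shared[old_hash[shared] != new_hash[shared]]
  )
}

# Toy indexes with invented file names and hashes.
old <- data.frame(path = c("/store/a.csv", "/store/b.csv"),
                  hash = c("aaa", "bbb"),
                  stringsAsFactors = FALSE)
new <- data.frame(path = c("/other/a.csv", "/other/b.csv", "/other/c.csv"),
                  hash = c("aaa", "zzz", "ccc"),
                  stringsAsFactors = FALSE)

diffs <- compare_index(old, new)
print(diffs)
```

In practice the old index could be read back from the archived index.csv with read.csv().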

Data citation

Always remember to cite your data sources! neonstore knows how to generate the appropriate citation for the data in your local store (or any specific product).

neon_citation()
#> National Ecological Observatory Network (2020). "Data Products:
#> NEON.DP1.10003.001 NEON.DP0.10003.001 . Provisional data downloaded
#> from http://data.neonscience.org on 12 Aug 2020."

Note on API limits

The NEON API now rate-limits requests. Using a personal token will increase the number of requests you can make; see NEON's documentation on API rate limiting for directions on registering for a token. Then pass this token in the .token argument of neon_download() or, for frequent use, add the token as an environment variable, NEON_DATA, to the .Renviron file in your home directory.
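
A minimal sketch of reading the token from the environment follows. The variable name is taken from the text above and should be checked against the documentation of your installed version; get_neon_token() is a hypothetical helper and the token value is a placeholder.

```r
# Hypothetical helper: read the token from an environment variable, returning
# NA when the variable is unset or empty. The default name follows the text
# above; confirm it for your installed version.
get_neon_token <- function(var = "NEON_DATA") {
  token <- Sys.getenv(var, unset = NA_character_)
  if (is.na(token) || !nzchar(token)) NA_character_ else token
}

# For a single session the variable can be set from R. A line of the form
# NEON_DATA=<your token> in ~/.Renviron makes it persistent instead.
Sys.setenv(NEON_DATA = "example-token-not-real")
token <- get_neon_token()

# The token would then be passed on explicitly, e.g.
# neon_download("DP1.10003.001", .token = token)
```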

neon_download() must first query the API for each NEON site that collects the product, for each month in which the product is collected. (It would be much more efficient on the NEON server if the API could take queries of the form /data/<product>/<site> and pool the results, rather than requiring a separate query for each month of sampling!)
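
To get a feel for how quickly these queries add up, the sketch below counts one query per site per month. It is a rough upper bound under that assumption, since the real number depends on which site-months actually have data; both helpers are hypothetical.

```r
# Hypothetical helper: list the months ("YYYY-MM") between two dates, inclusive.
months_between <- function(start, end) {
  format(seq(as.Date(start), as.Date(end), by = "month"), "%Y-%m")
}

# Upper bound on per-month queries: one per site per month.
count_requests <- function(sites, start, end) {
  length(sites) * length(months_between(start, end))
}

months <- months_between("2019-01-01", "2019-12-01")
n <- count_requests(c("BART", "HARV", "ABBY"), "2019-01-01", "2019-12-01")
print(n)
```

Three sites over twelve months already amount to 36 queries, which is why narrowing the site and date range helps stay within the rate limit.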

Install

install.packages('neonstore')

Monthly Downloads

887

Version

0.2.1

License

MIT + file LICENSE

Maintainer

Carl Boettiger

Last Published

August 29th, 2020

Functions in neonstore (0.2.1)

neon_download: Download NEON data products into a local store
neon_read: Read in NEON tabular data
neon_download_s3: Download requested NEON files from an S3 bucket
neon_dir: Default directory for persistent NEON file store
neon_filename_parser: NEON filename parser
neon_citation: Generate the appropriate citation for your data
neon_import: Import a previously exported zip archive of raw NEON files
neon_sites: Table of all NEON sites
neon_store: Show tables that have been downloaded to the NEON store
neon_index: Show information about all files downloaded to the local store
neon_export: Export local NEON store as a zip archive
neon_products: Table of all NEON Data Products