neonstore
neonstore provides quick access and persistent storage of NEON data
tables. neonstore emphasizes simplicity and a clean data provenance
trail; see the Provenance section below.
Installation
Install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("cboettig/neonstore")
Quickstart
Discover data products of interest:
library(neonstore)
products <- neon_products()
i <- grepl("Populations", products$themes)
products[i, c("productCode", "productName")]
#> # A tibble: 50 x 2
#> productCode productName
#> <chr> <chr>
#> 1 DP1.00033.001 Phenology images
#> 2 DP1.10003.001 Breeding landbird point counts
#> 3 DP1.10010.001 Coarse downed wood log survey
#> 4 DP1.10020.001 Ground beetle sequences DNA barcode
#> 5 DP1.10022.001 Ground beetles sampled from pitfall traps
#> 6 DP1.10026.001 Plant foliar physical and chemical properties
#> 7 DP1.10033.001 Litterfall and fine woody debris sampling
#> 8 DP1.10038.001 Mosquito sequences DNA barcode
#> 9 DP1.10041.001 Mosquito-borne pathogen status
#> 10 DP1.10043.001 Mosquitoes sampled from CO2 traps
#> # … with 40 more rows
i <- grepl("bird", products$keywords)
products[i, c("productCode", "productName")]
#> # A tibble: 1 x 2
#> productCode productName
#> <chr> <chr>
#> 1 DP1.10003.001 Breeding landbird point counts
Download all data files in the bird survey data product:
library(neonstore)
neon_download("DP1.10003.001")
Now, view your store of NEON products:
neon_store()
#> # A tibble: 7 x 3
#> product table n_files
#> <chr> <chr> <int>
#> 1 DP1.10003.001 brd_countdata-expanded 204
#> 2 DP1.10003.001 brd_perpoint-basic 204
#> 3 DP1.10003.001 brd_references-expanded 204
#> 4 DP1.10003.001 EML- 204
#> 5 DP1.10003.001 readme- 204
#> 6 DP0.10003.001 validation- 204
#> 7 DP1.10003.001 variables- 204
These will persist between sessions, so you only need to download once
or to retrieve updates. neon_store() can take arguments to filter by
product or pattern in table name, e.g. neon_store(table = "brd").
Once you determine the table of interest, you can read all the
component tables into a single data.frame:
neon_read("brd_countdata-expanded")
#> Rows: 164,782
#> Columns: 24
#> Delimiter: ","
#> chr [19]: uid, namedLocation, domainID, siteID, plotID, plotType, pointID, eventID, targe...
#> dbl [ 3]: pointCountMinute, observerDistance, clusterSize
#> lgl [ 1]: clusterCode
#> dttm [ 1]: startDate
#>
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 164,782 x 24
#> uid namedLocation domainID siteID plotID plotType pointID
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ad84… BART_025.bir… D01 BART BART_… distrib… C1
#> 2 2115… BART_025.bir… D01 BART BART_… distrib… C1
#> 3 0592… BART_025.bir… D01 BART BART_… distrib… C1
#> 4 8e5a… BART_025.bir… D01 BART BART_… distrib… C1
#> 5 9b07… BART_025.bir… D01 BART BART_… distrib… C1
#> 6 145f… BART_025.bir… D01 BART BART_… distrib… B1
#> 7 f70e… BART_025.bir… D01 BART BART_… distrib… B1
#> 8 648b… BART_025.bir… D01 BART BART_… distrib… B1
#> 9 2295… BART_025.bir… D01 BART BART_… distrib… B1
#> 10 cc6d… BART_025.bir… D01 BART BART_… distrib… A1
#> # … with 164,772 more rows, and 17 more variables: startDate <dttm>,
#> # eventID <chr>, pointCountMinute <dbl>, targetTaxaPresent <chr>,
#> # taxonID <chr>, scientificName <chr>, taxonRank <chr>, vernacularName <chr>,
#> # family <chr>, nativeStatusCode <chr>, observerDistance <dbl>,
#> # detectionMethod <chr>, visualConfirmation <chr>, sexOrAge <chr>,
#> # clusterSize <dbl>, clusterCode <lgl>, identifiedBy <chr>
Two other functions access additional API endpoints that may also be of
interest. neon_sites() returns a data.frame of site information,
including site descriptions and the ecological domain that each site
falls into:
neon_sites()
#> # A tibble: 81 x 11
#> siteCode siteName siteDescription siteType siteLatitude siteLongitude
#> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 ABBY Abby Ro… Abby Road RELOCAT… 45.8 -122.
#> 2 ARIK Arikare… Arikaree River CORE 39.8 -102.
#> 3 BARC Barco L… Barco Lake CORE 29.7 -82.0
#> 4 BARR Utqiaġv… Utqiaġvik RELOCAT… 71.3 -157.
#> 5 BART Bartlet… Bartlett Exper… RELOCAT… 44.1 -71.3
#> 6 BIGC Upper B… Upper Big Creek RELOCAT… 37.1 -119.
#> 7 BLAN Blandy … Blandy Experim… RELOCAT… 39.0 -78.0
#> 8 BLDE Blackta… Blacktail Deer… CORE 45.0 -111.
#> 9 BLUE Blue Ri… Blue River RELOCAT… 34.4 -96.6
#> 10 BLWA Black W… Black Warrior … RELOCAT… 32.5 -87.8
#> # … with 71 more rows, and 5 more variables: stateCode <chr>, stateName <chr>,
#> # domainCode <chr>, domainName <chr>, dataProducts <list>
Lastly, neon_products() returns a table listing all NEON products,
which may be useful for data discovery or for additional metadata about
any given product:
neon_products()
#> # A tibble: 181 x 23
#> productCodeLong productCode productCodePres… productName productDescript…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 NEON.DOM.SITE.… DP1.00001.… NEON.DP1.00001 2D wind sp… Two-dimensional…
#> 2 NEON.DOM.SITE.… DP1.00002.… NEON.DP1.00002 Single asp… Air temperature…
#> 3 NEON.DOM.SITE.… DP1.00003.… NEON.DP1.00003 Triple asp… Air temperature…
#> 4 NEON.DOM.SITE.… DP1.00004.… NEON.DP1.00004 Barometric… Barometric pres…
#> 5 NEON.DOM.SITE.… DP1.00005.… NEON.DP1.00005 IR biologi… Infrared temper…
#> 6 NEON.DOM.SITE.… DP1.00006.… NEON.DP1.00006 Precipitat… Precipitation i…
#> 7 NEON.DOM.SITE.… DP1.00007.… NEON.DP1.00007 3D wind sp… Three-dimension…
#> 8 NEON.DOM.SITE.… DP1.00010.… NEON.DP1.00010 3D wind at… Measurement of …
#> 9 NEON.DOM.SITE.… DP1.00013.… NEON.DP1.00013 Wet deposi… Total dissolved…
#> 10 NEON.DOM.SITE.… DP1.00014.… NEON.DP1.00014 Shortwave … Total, direct b…
#> # … with 171 more rows, and 18 more variables: productStatus <chr>,
#> # productCategory <chr>, productHasExpanded <lgl>,
#> # productScienceTeamAbbr <chr>, productScienceTeam <chr>,
#> # productPublicationFormatType <chr>, productAbstract <chr>,
#> # productDesignDescription <chr>, productStudyDescription <chr>,
#> # productBasicDescription <chr>, productExpandedDescription <chr>,
#> # productSensor <chr>, productRemarks <chr>, themes <chr>, changeLogs <list>,
#> # specs <list>, keywords <chr>, siteCodes <list>
Design Details / comparison to neonUtilities
neonstore is not meant as a replacement for the neonUtilities package
developed by NEON staff. neonUtilities performs a range of
product-specific data querying, parsing, and data manipulation beyond
what is provided by NEON’s API or web interface. neonUtilities also
provides other utilities for working with NEON data beyond the scope of
the NEON API or the data download/ingest process. While this processing
is undoubtedly useful, it may make it difficult to compare results or
analyses based on data downloaded and accessed using the neonUtilities
R package with analyses based on data accessed directly from the web
interface, the API, or another tool (or even a different release of
neonUtilities).
By contrast, neonstore aims to do far less. neonstore merely automates
the download of individual NEON data files. In contrast to
neonUtilities, which by default “stacks” these raw files into single
tables and discards the raw data, neonstore preserves only the raw
files in the store, stacking the individual tables “on demand” using
neon_read(). neon_read() is a thin wrapper around the vroom package
(Hester & Wickham, 2020), which uses the altrep mechanism in R to
provide very fast reads of rectangular text data into R, and trivially
handles the case of a single table being broken across many files. Some
NEON tables are not entirely consistent in their use of columns across
the individual site-month files, so neon_read() transparently checks
for this, reading in groups of files sharing all matching columns with
vroom before binding the groups together. This makes it easier to
always trace an analysis back to the original input data, and makes it
easier to update input data files without facing the challenge of
either downloading and stacking the whole data product from scratch
again or having to keep track of some previously downloaded data file.
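The grouping step described above can be sketched in base R. This is
only an illustration under stated assumptions, not the neonstore
implementation: read_grouped() is a hypothetical helper, and read.csv()
stands in for vroom so that the example has no dependencies.

```r
# Hypothetical helper (not part of neonstore): stack CSV files whose
# columns may differ, reading files with identical headers as one group.
read_grouped <- function(paths) {
  # Read just enough of each file to learn its column names
  headers <- lapply(paths, function(p) {
    names(utils::read.csv(p, nrows = 1, check.names = FALSE))
  })
  # Files with the same header get the same key and so the same group
  key <- vapply(headers, paste, character(1), collapse = "|")
  groups <- split(paths, key)

  # Read and stack each group; within a group all columns match
  tables <- lapply(groups, function(g) {
    parts <- lapply(g, utils::read.csv, check.names = FALSE,
                    stringsAsFactors = FALSE)
    do.call(rbind, unname(parts))
  })

  # Bind the groups together, filling columns a group lacks with NA
  all_cols <- unique(unlist(lapply(tables, names)))
  filled <- lapply(tables, function(tb) {
    for (col in setdiff(all_cols, names(tb))) tb[[col]] <- NA
    tb[all_cols]
  })
  out <- do.call(rbind, unname(filled))
  rownames(out) <- NULL
  out
}
```

Rows come back ordered by group rather than by input file order, which
is usually fine for stacked site-month tables.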
A few other differences are also worth noting.
- neonstore aims to provide persistent storage, writing raw data files to the appropriate app directory for your operating system (see rappdirs, Ratnakumar et al. 2016). More details about this can be found in Provenance, below.
- neon_download() provides clean and concise progress bars for the two key processes involved: querying the API to obtain download URLs (which involves no large data transfer but counts against API rate limiting, see below), and the actual file downloads.
- neon_download() will verify the integrity of file downloads against the MD5 hashes provided.
- neon_download() will omit downloads of any existing data files in the local store.
- You can request multiple products at once using vector notation, though API rate limiting may interfere with large requests.
- neon_download() uses curl::curl_download() instead of the downloader package used in neonUtilities, which can be finicky on Windows and older versions of R.
- neonstore has slightly lighter dependencies: only vroom and httr, and packages already used by one of those two (curl, openssl).
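As a rough illustration of the MD5 check mentioned in the list above,
a downloaded file can be compared against a published hash with
tools::md5sum() from base R. verify_md5() is a hypothetical helper
written for this example; neon_download() may well perform the check
differently.

```r
# Hypothetical helper (not part of neonstore): TRUE when the file's MD5
# digest matches the expected hex string, ignoring letter case.
verify_md5 <- function(path, expected) {
  # md5sum() returns a named vector; it is NA if the file cannot be read
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected))
}

# Usage: an empty file has a well-known MD5 digest
f <- tempfile(fileext = ".csv")
file.create(f)
verify_md5(f, "d41d8cd98f00b204e9800998ecf8427e")
```

A missing or unreadable file simply fails the check rather than raising
an error, so a caller can treat both cases as "download again".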
As with neonUtilities, you can optionally include site and date
filters, e.g. to request only records more recent than a certain date.
Doing so will preserve API quota and improve speed (see API limits,
below). neonUtilities is also far more widely tested and has extensive
error handling tailored to individual data products.
Provenance
Because neonstore only stores raw data products as returned from the
NEON API, it can easily determine which files have already been
downloaded, and only download new files without requiring the user to
specify specific dates. (It must still query the API for all the
metadata in the requested date range.) This same modular approach also
makes it easy to track data provenance, an essential element of
reproducibility in comparing results across other analyses of the NEON
data.
We can list precisely which component files are being read in by
neon_read() by consulting neon_index():
raw_files <- neon_index(table = "brd_countdata-expanded", hash="md5")
raw_files
#> # A tibble: 204 x 9
#> product site table type ext month timestamp path hash
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 DP1.100… BART brd_cou… expan… csv 2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 2 DP1.100… BART brd_cou… expan… csv 2016-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 3 DP1.100… BART brd_cou… expan… csv 2017-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 4 DP1.100… BART brd_cou… expan… csv 2018-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 5 DP1.100… BART brd_cou… expan… csv 2019-… 20191205… /tmp/RtmpKHJ… hash://m…
#> 6 DP1.100… HARV brd_cou… expan… csv 2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 7 DP1.100… HARV brd_cou… expan… csv 2015-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 8 DP1.100… HARV brd_cou… expan… csv 2016-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 9 DP1.100… HARV brd_cou… expan… csv 2017-… 20191107… /tmp/RtmpKHJ… hash://m…
#> 10 DP1.100… HARV brd_cou… expan… csv 2018-… 20191107… /tmp/RtmpKHJ… hash://m…
#> # … with 194 more rows
neon_read() is a relatively trivial function that simply passes this
file list to vroom::vroom(), a fast, vectorized parser that can easily
read in a single table that is broken into many separate files.
Imagine instead that we use the common pattern of downloading these raw files, then stacking and possibly cleaning the data, saving only this derived product while discarding the individual files. Now imagine a second researcher who, at some later date, queries the API over the same reported range of dates and sites and uses the same software package to stack the tables, only to discover that the resulting table is somehow different from ours (e.g. by comparing file hashes). Pinpointing the source of the discrepancy would be challenging and labor-intensive.
In contrast, the same detective work would be easy with the neonstore
file list. We can confirm whether the API has returned the same number
of raw files with the same names and, better, can verify the integrity
of the contents by comparing hashes of the files now being returned to
those recorded by neon_index(). In this way, we could determine whether
any additional files had been included, or pinpoint any files that may
have changed.
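The comparison described above might look something like the following
sketch. compare_index() is a hypothetical helper, not a neonstore
function; it assumes two data.frames with the path and hash columns
shown in the neon_index() output, and matches files by base name since
the two stores may live in different directories. The example data are
made up.

```r
# Hypothetical helper (not part of neonstore): report which files were
# added, removed, or changed between two file indexes.
compare_index <- function(old, new) {
  old_name <- basename(old$path)
  new_name <- basename(new$path)
  shared <- intersect(old_name, new_name)
  # Line up the hashes of files present in both indexes
  old_hash <- old$hash[match(shared, old_name)]
  new_hash <- new$hash[match(shared, new_name)]
  list(
    added   = setdiff(new_name, old_name),
    removed = setdiff(old_name, new_name),
    changed = shared[which(old_hash != new_hash)]
  )
}

# Usage with toy indexes (file names and hashes are illustrative only)
ours <- data.frame(path = c("/store/a.csv", "/store/b.csv"),
                   hash = c("h1", "h2"), stringsAsFactors = FALSE)
theirs <- data.frame(path = c("/other/a.csv", "/other/b.csv", "/other/c.csv"),
                     hash = c("h1", "hX", "h3"), stringsAsFactors = FALSE)
compare_index(ours, theirs)
```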
As such, users might want to store the neon_index() data.frame for
the table(s) they have used as part of their analysis, including the
individual file hashes. One can also generate a zip of all the data
files for archival purposes. (Note that NEON is an Open Data provider,
see LICENCE.)
write.csv(raw_files, "index.csv")
zip("brd_countdata.zip", raw_files$path)
Data citation
Always remember to cite your data sources! neonstore knows how to
generate the appropriate citation for the data in your local store (or
any specific product).
neon_citation()
#> National Ecological Observatory Network (2020). "Data Products:
#> NEON.DP1.10003.001 NEON.DP0.10003.001 . Provisional data downloaded
#> from http://data.neonscience.org on 12 Aug 2020."
Note on API limits
The NEON API now rate-limits requests.
Using a personal token will increase the number of requests you can
make. See that link for directions on registering for a token. Then pass
this token in the .token argument of neon_download(), or for frequent
use, add this token as an environmental variable, NEON_DATA, to your
local .Renviron file in your user’s home directory.
neon_download() must first query the API for each NEON site which
collects that product, for each month the product is collected. (It
would be much more efficient on the NEON server if the API could take
queries of the form /data/<product>/<site>, and pool the results,
rather than require each month of sampling separately!)
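To get a feel for how quickly these per-site, per-month queries add up,
here is a back-of-the-envelope sketch. n_queries() is a hypothetical
helper, not part of neonstore, and it gives an upper bound: it assumes
every site collects the product in every month of the range, which is
often not the case.

```r
# Hypothetical helper (not part of neonstore): upper bound on the number
# of metadata queries, assuming one query per site per calendar month.
n_queries <- function(sites, start, end) {
  months <- seq(as.Date(start), as.Date(end), by = "month")
  length(sites) * length(months)
}

# Two sites over five years of monthly sampling
n_queries(c("BART", "HARV"), "2015-01-01", "2019-12-01")
#> [1] 120
```

Narrowing the site list or the date range shrinks this product
directly, which is why filters help preserve API quota.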