
s2dverification (version 2.4.0)

Load: Loads Experimental And Observational Data

Description

This function loads monthly or daily data from a set of specified experimental datasets together with date-corresponding data from a set of specified observational datasets. See parameters 'storefreq', 'sampleperiod', 'exp' and 'obs'. Load() arranges the data in two arrays with a similar format, with the following dimensions:
  1. The number of experimental datasets, determined by the user through the argument 'exp' (for the experimental data array), or the number of observational datasets available for validation (for the observational array), determined as well by the user through the argument 'obs'.
  2. The greatest number of members across all experiments (in the experimental data array) or across all observational datasets (in the observational data array).
  3. The number of starting dates, determined by the user through the 'sdates' argument (the data of each prediction of the model at each starting date must be stored).
  4. The greatest number of lead-times.
  5. The number of latitudes of the zone to consider.
  6. The number of longitudes of the zone to consider.
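As an illustrative sketch of the resulting structure (not runnable as-is: the dataset IDs 'experiment' and 'observation' are placeholders that must exist in a configuration file, like the one built in the Examples section below):

```r
library(s2dverification)

# Hypothetical call; 'experiment' and 'observation' must have matching
# entries in the configuration file for data to actually be found.
data <- Load('tos', exp = c('experiment'), obs = c('observation'),
             sdates = c('19851101', '19901101'), output = 'lonlat')

# The six dimensions above, in order:
# (dataset, member, start date, lead-time, latitude, longitude)
dim(data$mod)
dim(data$obs)
```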

Usage

Load(var, exp = NULL, obs = NULL, sdates, nmember = NULL, nmemberobs = NULL, 
     nleadtime = NULL, leadtimemin = 1, leadtimemax = NULL, 
     storefreq = 'monthly', sampleperiod = 1, lonmin = 0, lonmax = 360, 
     latmin = -90, latmax = 90, output = 'areave', method = 'conservative', 
     grid = NULL, maskmod = vector("list", 15), maskobs = vector("list", 15), 
     configfile = NULL, suffixexp = NULL, suffixobs = NULL, varmin = NULL, 
     varmax = NULL, silent = FALSE, nprocs = NULL, dimnames = NULL)

Arguments

var
Name of the variable to load. Must have matching entries in the list of global mean variables or in the list of 2-dimensional variables in the configuration file. It must also have matching entries, together with each dataset specified in 'exp' and 'obs', in the corresponding dataset tables of the configuration file.
exp
Vector of experimental dataset IDs. Each must have matching entries, together with the variable name specified in 'var', in any of the experimental dataset tables in the configuration file. IMPORTANT: Place first the experiment with the largest number of members.
obs
Vector of observational dataset IDs. Each must have matching entries, together with the variable name specified in 'var', in any of the observational dataset tables in the configuration file. IMPORTANT: Place first the observation with the largest number of members.
sdates
Vector of starting dates of the experimental runs to be loaded following the pattern 'YYYYMMDD'. This argument is mandatory. Ex: c('19601101', '19651101', '19701101')
nmember
Vector with the numbers of members to load from the specified experimental datasets in 'exp'. If not specified, the number of members of the first experimental dataset is detected and replicated for all the experimental datasets. If a single value is specified, it is replicated for all the experimental datasets.
nmemberobs
Vector with the numbers of members to load from the specified observational datasets in 'obs'. If not specified, the number of members of the first observational dataset is detected and replicated for all the observational datasets. If a single value is specified, it is replicated for all the observational datasets.
nleadtime
Largest number of lead-times among experimental datasets. Takes by default the number of lead-times of the first experimental dataset in 'exp'. If 'exp' is NULL this argument won't have any effect (see Load() description).
leadtimemin
Only lead-times greater than or equal to 'leadtimemin' are loaded. Takes the value 1 by default.
leadtimemax
Only lead-times less than or equal to 'leadtimemax' are loaded. Takes the same value as 'nleadtime' by default.
storefreq
Frequency at which the data to be loaded is stored in the file system. Can take the values 'monthly' or 'daily'. By default it takes 'monthly'. Note: data stored at other frequencies with a period which is divisible by a month can be loaded with a proper use of 'leadtimemin', 'leadtimemax' and 'sampleperiod'.
sampleperiod
To load only a subset between 'leadtimemin' and 'leadtimemax' with the period of subsampling 'sampleperiod'. Takes by default value 1 (all lead-times are loaded). See 'storefreq' for more information.
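As a hedged sketch of combining these parameters (the dataset and variable names here are placeholders, not real configuration entries): if daily-frequency files actually contain one value every 5 days, every 5th lead-time can be selected:

```r
library(s2dverification)

# Placeholder names; loads lead-times 1, 6, 11, 16, 21, 26 and 31 only.
data <- Load('tas', exp = c('expA'), obs = NULL,
             sdates = c('19901101'), storefreq = 'daily',
             leadtimemin = 1, leadtimemax = 31, sampleperiod = 5)
```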
lonmin
If a 2-dimensional variable is loaded, values at longitudes lower than 'lonmin' aren't loaded. Must take a value in the range [0, 360] (if negative longitudes are found in the data files, these are translated to this range). It is set to 0 if not specified.
lonmax
If a 2-dimensional variable is loaded, values at longitudes higher than 'lonmax' aren't loaded. Must take a value in the range [0, 360] (if negative longitudes are found in the data files, these are translated to this range). It is set to 360 if not specified.
latmin
If a 2-dimensional variable is loaded, values at latitudes lower than 'latmin' aren't loaded. Must take a value in the range [-90, 90]. It is set to -90 if not specified.
latmax
If a 2-dimensional variable is loaded, values at latitudes higher than 'latmax' aren't loaded. Must take a value in the range [-90, 90]. It is set to 90 if not specified.
output
This parameter determines the format in which the data is arranged in the output arrays. Can take the values 'areave', 'lon', 'lat', 'lonlat'. By default it takes 'areave'.
  • 'areave': Time series of area-averaged variables over the specified domain.
  • 'lon': Time series of meridional averages as a function of longitudes.
  • 'lat': Time series of zonal averages as a function of latitudes.
  • 'lonlat': Time series of 2-dimensional variables.

Value

  • $mod: Model outputs. If output = 'areave', array with dimensions c(nmod/nexp, nmemb/nparam, nsdates, nltime). If output = 'lat', array with dimensions c(nmod/nexp, nmemb/nparam, nsdates, nltime, nlat). If output = 'lon', array with dimensions c(nmod/nexp, nmemb/nparam, nsdates, nltime, nlon). If output = 'lonlat', array with dimensions c(nmod/nexp, nmemb/nparam, nsdates, nltime, nlat, nlon).
  • $obs: Observations. Array with the same dimensions as '$mod' except along the first two. If output = 'areave', array with dimensions c(nobs, nmemb, nsdates, nltime). If output = 'lat', array with dimensions c(nobs, nmemb, nsdates, nltime, nlat). If output = 'lon', array with dimensions c(nobs, nmemb, nsdates, nltime, nlon). If output = 'lonlat', array with dimensions c(nobs, nmemb, nsdates, nltime, nlat, nlon).
  • $lat: Latitudes of the output grid (default: model grid of the first experiment). If 'areave' is selected or a global mean variable is specified, takes the value 0.
  • $lon: Longitudes of the output grid (default: model grid of the first experiment). If 'areave' is selected or a global mean variable is specified, takes the value 0.
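As a sketch of inspecting the returned list, using the pre-processed sample data shipped with the package (the same internal helper used in the Examples section below):

```r
library(s2dverification)

# Load the sample data included in the package, already processed in R.
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- s2dverification:::.LoadSampleData('tos', c('experiment'),
                                                c('observation'), startDates,
                                                output = 'areave',
                                                latmin = 27, latmax = 48,
                                                lonmin = -12, lonmax = 40)

# With output = 'areave' the arrays have 4 dimensions:
# (dataset, member, start date, lead-time)
dim(sampleData$mod)
dim(sampleData$obs)
sampleData$lat   # 0 for area-averaged output
```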

Only a specified variable and a set of starting dates is loaded from each experiment. See parameters 'var' and 'sdates'. Observational data that matches every starting date and lead-time of every experimental dataset is then fetched from the file system (so, if two predictions at two different starting dates overlap, some observational values will be loaded more than once). If no data is found in the file system for an experimental or observational array point, it is filled with an NA value.

If the specified output is 2-dimensional or latitude- or longitude-averaged time series, all the data is interpolated onto a common grid. If the specified output type is area-averaged time series, the data is averaged on the individual grid of each dataset, but can also be averaged after interpolating onto a common grid. See parameters 'grid' and 'method'.

Once the two arrays are filled by calling this function, other functions in the s2dverification package that receive data in this format as input can be executed (e.g. Clim() to compute climatologies, Ano() to compute anomalies, ...).

Load() has many additional parameters to discard values and trim dimensions of the selected variable; masks can even be applied to 2-dimensional variables. See parameters 'nmember', 'nmemberobs', 'nleadtime', 'leadtimemin', 'leadtimemax', 'sampleperiod', 'lonmin', 'lonmax', 'latmin', 'latmax', 'maskmod', 'maskobs', 'varmin', 'varmax'.

The parameters 'exp' and 'obs' are lists of dataset identifiers. To fetch the data in the repository associated to each dataset, Load() reads a configuration file which associates each pair of (dataset ID, variable name) to a corresponding path pattern in which the dataset and variable data are stored. These patterns can contain wildcards and variable tags that will be replaced automatically by Load() with the specified starting dates, member numbers, variable name, etc. Furthermore, each pair can be associated not only to a path but also to other values, such as grid type or maximum and minimum allowed values, that make Load() capable of loading data stored in multiple formats and naming conventions. See '?ConfigFileOpen' and parameters 'configfile', 'suffixexp' and 'suffixobs' in this help page for more information.

Only NetCDF files are supported. OPeNDAP URLs to NetCDF files are also supported. All in all, Load() can load 2-dimensional or global mean variables in any of the following formats:
    • experiments:
      • file per ensemble per starting date (YYYY, MM and DD somewhere in the path)
      • file per member per starting date (YYYY, MM, DD and MemberNumber somewhere in the path; experiments with different numbers of members supported)
    • observations:
      • file per ensemble per month (YYYY and MM somewhere in the path)
      • file per member per month (YYYY, MM and MemberNumber somewhere in the path; observations with different numbers of members supported)
      • file per dataset (no constraints in the path, but the time axes in the file have to be properly defined)

All the data files must contain the target variable under the same name, and that variable must be defined over the time, member, latitude and longitude dimensions, in this order, time being the record dimension. The member dimension is only required in the 'file per member per starting date' format, and the longitude and latitude dimensions only if the variable is 2-dimensional. In the case of a 2-dimensional variable, longitude and latitude variables must also be defined inside the data file, with the same names as the longitude and latitude dimensions respectively. The names of these dimensions (and of the longitude and latitude variables) can be configured in the configuration file (see ?ConfigFileOpen) via the variables DEFAULT_DIM_NAME_LONGITUDES, DEFAULT_DIM_NAME_LATITUDES and DEFAULT_DIM_NAME_MEMBERS. When generating a new configuration file (see ?ConfigFileCreate) these take as defaults 'longitude', 'latitude' and 'ensemble'.

All the data files are expected to hold numeric values representable with 32 bits. Be aware of this when choosing the fill values or infinite values in the datasets to load.

The Load() function returns a named list with four components: 'mod', 'obs', 'lat' and 'lon'. 'mod' is the array that contains the experimental data; 'obs' is the array that contains the observational data; 'lat' and 'lon' are the latitudes and longitudes of the grid onto which the data is interpolated (0 if the loaded variable is a global mean or the output is an area average).

Remaining arguments:

method
Takes by default the value 'conservative'.
grid
If not specified and the selected output type is 'lon', 'lat' or 'lonlat', this parameter takes as default value the grid of the first experimental dataset, which is read from the configuration file. The grid must be supported by 'cdo' tools: rNXxNY or tTRgrid. Ex: 'r96x72'
maskmod
By default all values are kept (all ones). If you are loading maps (i.e. 'lonlat', 'lon' or 'lat' output types) all the data will be interpolated onto the common 'grid'. If you want to specify a mask, you will have to provide it already interpolated onto the common grid (you may use 'cdo' for this purpose). It is not usual to apply different masks on experimental datasets on the same grid, so all the experiment masks are expected to be the same. When loading maps, any masks defined for the observational data will be ignored to make sure the same mask is applied to the experimental and observational data. Warning: list() compulsory even if loading 1 experimental dataset only! Ex: list(array(1, dim = c(num_lons, num_lats)))
maskobs
By default all values are kept (all ones). If you are loading maps (i.e. 'lonlat', 'lon' or 'lat' output types) all the data will be interpolated onto the common 'grid'. If you want to specify a mask, you will have to provide it already interpolated onto the common grid (you may use 'cdo' for this purpose). It is not usual to apply different masks on experimental datasets on the same grid, so all the experiment masks are expected to be the same. When loading maps, any masks defined for the observational data will be ignored to make sure the same mask is applied to the experimental and observational data. Warning: list() compulsory even if loading 1 observational dataset only! Ex: list(array(1, dim = c(num_lons, num_lats)))
configfile
By default the configuration file used at IC3 is used (it is included in the package). Check IC3's configuration file or a configuration file template in the folder 'inst/config' of the package. Check further information on the configuration file mechanism in '?ConfigFileOpen'.
suffixexp
Each pair of experimental dataset ID and variable name can have a suffix associated in the configuration file. If only one suffix is specified, all experimental datasets will take the same suffix. An NA value in the list of suffixes is interpreted as "take the suffix specified in the configuration file". Ex: c('_f6h', '_3hourly')
suffixobs
Each pair of observational dataset ID and variable name can have a suffix associated in the configuration file. If only one suffix is specified, all observational datasets will take the same suffix. An NA value in the list of suffixes is interpreted as "take the suffix specified in the configuration file". Ex: c('_f6h', '_3hourly')
varmin
Takes by default the value specified in the configuration file.
varmax
Takes by default the value specified in the configuration file.
silent
Warnings will be displayed even if 'silent' is set to TRUE. Takes by default the value FALSE.
nprocs
These processes will use shared memory in the processor in which Load() is launched. By default the number of logical cores in the machine will be detected and as many processes as logical cores will be created. A value of 1 won't create parallel processes. When running in multiple processes, if an error occurs in any of the processes, a crash message appears in the R session of the original process but no detail is given about the error. A value of 1 will display all error messages in the original and only R session. Note: the parallel processes create other blocking processes each time they need to compute an interpolation via 'cdo'.
dimnames
The value associated to each name is the actual dimension name in the NetCDF file. The variables in the file that contain the longitudes and latitudes of the data (if the data is a 2-dimensional variable) must have the same name as the longitude and latitude dimensions. By default, these names are taken from the mandatory variables in the configuration file (DEFAULT_DIM_NAME_LONGITUDES, DEFAULT_DIM_NAME_LATITUDES and DEFAULT_DIM_NAME_MEMBERS). If any of those is defined in the 'dimnames' parameter, it takes priority and overwrites the value in the configuration file. Ex: list(longitudes = 'x', latitudes = 'y'). In that example, the dimension 'members' will take the DEFAULT_DIM_NAME_MEMBERS specified in the configuration file.
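As a hedged sketch of using 'maskmod', 'maskobs' and 'dimnames' together (the dataset IDs, grid and dimension names here are placeholders, not real configuration entries): a mask of ones, already on the common grid, must be wrapped in a list() even for a single dataset.

```r
library(s2dverification)

# Placeholder mask matching a hypothetical 'r20x10' common grid.
num_lons <- 20
num_lats <- 10
mask <- array(1, dim = c(num_lons, num_lats))

# Placeholder dataset IDs; 'x'/'y' are hypothetical dimension names
# that override the defaults from the configuration file.
data <- Load('tos', exp = c('experiment'), obs = c('observation'),
             sdates = c('19851101'),
             output = 'lonlat', grid = 'r20x10',
             maskmod = list(mask), maskobs = list(mask),
             dimnames = list(longitudes = 'x', latitudes = 'y'))
```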

Details

The two output arrays have between 2 and 6 dimensions:
  1. Number of experimental/observational datasets.
  2. Number of members.
  3. Number of start dates.
  4. Number of lead-times.
  5. Number of latitudes (optional).
  6. Number of longitudes (optional).

Examples

# Let's assume we want to perform verification with data of a variable
# called 'tos' from a model called 'model' and observed data coming from 
# an observational dataset called 'observation'.
#
# The model was run in the context of an experiment named 'experiment'. 
# It simulated from 1st November in 1985, 1990, 1995, 2000 and 2005 for a 
# period of 5 years time from each starting date. 5 different sets of 
# initial conditions were used so an ensemble of 5 members was generated 
# for each starting date.
# The model generated values for the variables 'tos' and 'tas' in a 
# 3-hourly frequency but, after some initial post-processing, it was 
# averaged over every month.
# The resulting monthly average series were stored in a file for each 
# starting date for each variable with the data of the 5 ensemble members.
# The resulting directory tree was the following:
#   model
#    |--> experiment
#          |--> monthly_mean
#                |--> tos_3hourly
#                |     |--> tos_19851101.nc
#                |     |--> tos_19901101.nc
#                |               .
#                |               .
#                |     |--> tos_20051101.nc 
#                |--> tas_3hourly
#                      |--> tas_19851101.nc
#                      |--> tas_19901101.nc
#                                .
#                                .
#                      |--> tas_20051101.nc
# 
# The observation recorded values of 'tos' and 'tas' at each day of the 
# month over that period but was also averaged over months and stored in 
# a file per month. The directory tree was the following:
#   observation
#    |--> monthly_mean
#          |--> tos
#          |     |--> tos_198511.nc
#          |     |--> tos_198512.nc
#          |     |--> tos_198601.nc
#          |               .
#          |               .
#          |     |--> tos_201010.nc
#          |--> tas
#                |--> tas_198511.nc
#                |--> tas_198512.nc
#                |--> tas_198601.nc
#                          .
#                          .
#                |--> tas_201010.nc
#
# The model data is stored in a file-per-startdate fashion and the
# observational data is stored in a file-per-month, and both are stored in 
# a monthly frequency. The file format is NetCDF.
# Hence all the data is supported by Load() (see details and other supported 
# conventions in ?Load) but first we need to configure it properly.
#
# These data files are included in the package (in the 'sample_data' folder),
# only for the variable 'tos'. They have been interpolated to a very low 
# resolution grid so as to keep the package small enough for CRAN.
# The original grid names (following CDO conventions) for experimental and 
# observational data were 't106grid' and 'r180x89' respectively. The final
# resolutions are 'r20x10' and 'r16x8' respectively. 
# The experimental data comes from the decadal climate prediction experiment 
# run at IC3 in the context of the CMIP5 project. Its name within IC3 local 
# database is 'i00k'. 
# The observational dataset used for verification is the 'ERSST' 
# observational dataset.
#
# The configuration file 'sample.conf' that we will create in the example 
# has the proper entries to load these (see ?LoadConfigFile for details on 
# writing a configuration file). 
#
# The code is not run because it dispatches system calls to 'cdo' and 'nco'
# which is not allowed as per CRAN policies. You can run it in your system 
# though. Instead, the code in 'dontshow' is run, which loads the equivalent
# data already processed in R.
configfile <- paste0(tempdir(), '/sample.conf')
ConfigFileCreate(configfile, confirm = FALSE)
c <- ConfigFileOpen(configfile)
c <- ConfigEditDefinition(c, 'DEFAULT_GRID', 'r20x10', confirm = FALSE)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MIN', '-1e19', confirm = FALSE)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MAX', '1e19', confirm = FALSE)
c <- ConfigAddVar(c, '2d', 'tos')
data_path <- system.file('sample_data', package = 's2dverification')
exp_data_path <- paste0(data_path, '/model/$EXP_NAME$')
obs_data_path <- paste0(data_path, '/$OBS_NAME$')
c <- ConfigAddEntry(c, 'experiments', 'file-per-startdate',
     dataset_name = 'experiment', var_name = 'tos', main_path = exp_data_path,
     file_path = '$STORE_FREQ$_mean/$VAR_NAME$_3hourly/$VAR_NAME$_$START_DATE$.nc')
c <- ConfigAddEntry(c, 'observations', 'file-per-month',
     dataset_name = 'observation', var_name = 'tos', main_path = obs_data_path,
     file_path = '$STORE_FREQ$_mean/$VAR_NAME$/$VAR_NAME$_$YEAR$$MONTH$.nc')
ConfigFileSave(c, configfile, confirm = FALSE)

# Now we are ready to use Load().
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- Load('tos', c('experiment'), c('observation'), startDates, 
                   output = 'areave', latmin = 27, latmax = 48, 
                   lonmin = -12, lonmax = 40, configfile = configfile)
  startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- s2dverification:::.LoadSampleData('tos', c('experiment'), 
                                                c('observation'), startDates,
                                                output = 'areave', 
                                                latmin = 27, latmax = 48, 
                                                lonmin = -12, lonmax = 40)
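
# Once loaded, the arrays can feed other s2dverification functions, as
# mentioned in the description. A sketch (assuming 'sampleData' from the
# code above):
clim <- Clim(sampleData$mod, sampleData$obs)
ano_exp <- Ano(sampleData$mod, clim$clim_exp)
ano_obs <- Ano(sampleData$obs, clim$clim_obs)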
