
s2dverification (version 2.5.0)

Load: Loads Experimental And Observational Data

Description

This function loads monthly or daily data from a set of specified experimental datasets together with date-corresponding data from a set of specified observational datasets. See parameters 'storefreq', 'sampleperiod', 'exp' and 'obs'. A set of starting dates is specified through the parameter 'sdates'. Data of each starting date is loaded for each model. Load() arranges the data in two arrays with a similar format, both with the following dimensions:
  1. The number of experimental datasets determined by the user through the argument 'exp' (for the experimental data array) or the number of observational datasets available for validation (for the observational data array), determined as well by the user through the argument 'obs'.
  2. The greatest number of members across all experiments (in the experimental data array) or across all observational datasets (in the observational data array).
  3. The number of starting dates determined by the user through the 'sdates' argument.
  4. The greatest number of lead-times.
  5. The number of latitudes of the selected zone.
  6. The number of longitudes of the selected zone.
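As an illustration, the shapes of the two returned arrays might look as follows (all the counts below are hypothetical, chosen only for this sketch; they are not values from this page):

```r
# Hypothetical shapes of the two arrays returned by Load(), assuming
# 1 experiment, 5 members, 5 starting dates, 60 lead-times and a
# 10 x 20 latitude-longitude grid:
#
# dim(data$mod)  # c(dataset = 1, member = 5, sdate = 5, ftime = 60, lat = 10, lon = 20)
# dim(data$obs)  # same dimension order, with 'dataset' and 'member' sized
#                # for the observational datasets instead
```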

Usage

Load(var, exp = NULL, obs = NULL, sdates, nmember = NULL, 
     nmemberobs = NULL, nleadtime = NULL, leadtimemin = 1, 
     leadtimemax = NULL, storefreq = 'monthly', sampleperiod = 1, 
     lonmin = 0, lonmax = 360, latmin = -90, latmax = 90, 
     output = 'areave', method = 'conservative', grid = NULL, 
     maskmod = vector("list", 15), maskobs = vector("list", 15), 
     configfile = NULL, varmin = NULL, varmax = NULL, 
     silent = FALSE, nprocs = NULL, dimnames = NULL, 
     remapcells = 2)

Arguments

var
Name of the variable to load. If the variable name inside the files to load is not the same as this, adjust properly the parameters 'exp' and 'obs'. This parameter is mandatory. Ex: 'tas'
exp
This argument can take two formats: a list of lists or a vector of character strings. Each format triggers a different mechanism for locating the requested datasets. The first format is adequate for data you will load only once or occasionally.
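For instance, the list-of-lists format could look as follows. The dataset name and path below are hypothetical; the '$...$' tags are the placeholders that Load() substitutes with the requested dataset name, variable and starting dates (see the configuration-file entries in the Examples section for the same tags):

```r
# Hypothetical example of the list-of-lists format for 'exp'.
# Load() replaces '$EXP_NAME$', '$VAR_NAME$' and '$START_DATE$' with the
# requested dataset name, variable name and each starting date.
exp <- list(
  list(
    name = 'experiment',
    path = paste0('/path/to/experiments/$EXP_NAME$/monthly_mean/',
                  '$VAR_NAME$/$VAR_NAME$_$START_DATE$.nc')
  )
)
```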

Value

  • Load() returns a named list following a structure similar to the one used in the package 'downscaleR'. The components are the following:
    • 'mod' is the array that contains the experimental data. It has the attribute 'dimensions' associated to a vector of strings with the labels of each dimension of the array, in order.
    • 'obs' is the array that contains the observational data. It has the attribute 'dimensions' associated to a vector of strings with the labels of each dimension of the array, in order.
    • 'lat' and 'lon' are the latitudes and longitudes of the grid into which the data is interpolated (0 if the loaded variable is a global mean or the output is an area average). Both have the attribute 'cdo_grid_des' associated with a character string with the name of the common grid of the data, following the CDO naming conventions for grids. The attribute 'projection' is kept for compatibility with 'downscaleR'.
    • 'Variable' has the following components:
      • 'varName', with the short name of the loaded variable as specified in the parameter 'var'.
      • 'level', with information on the pressure level of the variable. Is kept to NULL by now.
      And the following attributes:
      • 'is_standard', kept for compatibility with 'downscaleR', tells if a dataset has been homogenized to standards with 'downscaleR' catalogs.
      • 'units', a character string with the units of measure of the variable, as found in the source files.
      • 'longname', a character string with the long name of the variable, as found in the source files.
      • 'daily_agg_cellfun', 'monthly_agg_cellfun', 'verification_time', kept for compatibility with 'downscaleR'.


  • In the case of loading an area average, the dimensions of the arrays will be only the first 4.
  • Only a specified variable is loaded from each experiment at each starting date. See parameter 'var'. Afterwards, observational data that matches every starting date and lead-time of every experimental dataset is fetched from the file system (so, if two predictions at two different start dates overlap, some observational values will be loaded and kept in memory more than once). If no data is found in the file system for an experimental or observational array point, it is filled with an NA value.
  • If the specified output is 2-dimensional or latitude- or longitude-averaged time series, all the data is interpolated into a common grid. If the specified output type is area-averaged time series, the data is averaged on the individual grid of each dataset, but can also be averaged after interpolating into a common grid. See parameters 'grid' and 'method'.
  • Once the two arrays are filled by calling this function, other functions in the s2dverification package that receive as inputs data formatted in this data structure can be executed (e.g. Clim() to compute climatologies, Ano() to compute anomalies, ...).
  • Load() has many additional parameters to disable values and trim dimensions of the selected variable; masks can even be applied to 2-dimensional variables. See parameters 'nmember', 'nmemberobs', 'nleadtime', 'leadtimemin', 'leadtimemax', 'sampleperiod', 'lonmin', 'lonmax', 'latmin', 'latmax', 'maskmod', 'maskobs', 'varmin', 'varmax'.
  • The parameters 'exp' and 'obs' can take various forms. The most direct form is a list of lists, where each sub-list has the component 'path' associated to a character string with a pattern of the path to the files of a dataset to be loaded. These patterns can contain wildcards and tags that will be replaced automatically by Load() with the specified starting dates, member numbers, variable name, etc. See parameters 'exp' and 'obs' for details.
Only NetCDF files are supported. OPeNDAP URLs to NetCDF files are also supported. Load() can load 2-dimensional or global mean variables in any of the following formats:
    • experiments:
      • file per ensemble per starting date (YYYY, MM and DD somewhere in the path)
      • file per member per starting date (YYYY, MM, DD and MemberNumber somewhere in the path; ensemble experiments with different numbers of members can be loaded in a single Load() call)
  • All the data files must contain the target variable defined over time and potentially over members, latitude and longitude dimensions in any order, time being the record dimension. In the case of a two-dimensional variable, the variables longitude and latitude must be defined inside the data file too, and must have the same names as the dimensions for longitudes and latitudes respectively. The names of these dimensions (and of the longitude and latitude variables) and the name of the members dimension are expected to be 'longitude', 'latitude' and 'ensemble' respectively. However, these names can be adjusted with the parameter 'dimnames', or can be configured in the configuration file (read below in parameters 'exp' and 'obs', or see ?ConfigFileOpen for more information). All the data files are expected to have numeric values representable with 32 bits. Be aware when choosing the fill values or infinite values in the datasets to load.
obs
If 'obs' is not specified or set to NULL, no observational data is loaded.

sdates
Vector of starting dates of the experimental runs to load. This argument is mandatory. Ex: c('19601101', '19651101', '19701101')

nmember
If not specified, the number of members of the first experimental dataset is detected automatically and applied to all the experimental datasets. If a single value is specified, it is applied to all the experimental datasets. Data for each member is fetched from the file system; if not found, it is filled with NA values. An NA value in the 'nmember' list is interpreted as "fetch as many members of each experimental dataset as the number of members of the first experimental dataset". Note: it is recommended to specify the number of members of the first experimental dataset if it is stored in file-per-member format, because there are known issues in the automatic detection of members if the path to the dataset in the configuration file contains shell globbing wildcards such as '*'. Ex: c(4, 9)

nmemberobs
Analogous to 'nmember', but for the observational datasets. Ex: c(1, 5)

nleadtime
If 'exp' is NULL, this argument won't have any effect (see the Description).

leadtimemin
Index of the first lead-time to load. Takes by default the value 1.

leadtimemax
Index of the last lead-time to load. If not specified (NULL), all available lead-times are loaded.

storefreq
By default it takes 'monthly'. Note: data stored in other frequencies with a period which is divisible by a month can be loaded with a proper use of the 'storefreq' and 'sampleperiod' parameters. It can also be loaded if the period is divisible by a day and the observational datasets are stored in a file-per-dataset format or 'obs' is empty.

sampleperiod
Takes by default the value 1 (all lead-times are loaded). See 'storefreq' for more information.

lonmin
Must take a value in the range [-360, 360] (if negative longitudes are found in the data files, these are translated to this range). It is set to 0 if not specified. If 'lonmin' > 'lonmax', data across Greenwich is loaded.

lonmax
Must take a value in the range [-360, 360] (if negative longitudes are found in the data files, these are translated to this range). It is set to 360 if not specified. If 'lonmin' > 'lonmax', data across Greenwich is loaded.

latmin
Must take a value in the range [-90, 90]. It is set to -90 if not specified.

latmax
Must take a value in the range [-90, 90]. It is set to 90 if not specified.

output
Can take the values 'areave', 'lon', 'lat' or 'lonlat':
  • 'areave': time series of area-averaged variables over the specified domain.
  • 'lon': time series of meridional averages as a function of longitudes.
  • 'lat': time series of zonal averages as a function of latitudes.
  • 'lonlat': time series of 2-dimensional fields.
All the loaded data is interpolated into the grid of the first experimental dataset except if 'areave' is selected; in that case the area averages are computed on each dataset's original grid. A common grid different from the first experiment's can be specified through the parameter 'grid'. If 'grid' is specified when selecting the 'areave' output type, all the loaded data is interpolated into the specified grid before calculating the area averages.

method
Takes by default the value 'conservative'. See 'remapcells' for advanced adjustments.

grid
If not specified and the selected output type is 'lon', 'lat' or 'lonlat', this parameter takes as default value the grid of the first experimental dataset, which is read automatically from the source files. The grid must be supported by the 'cdo' tools: rNXxNY or tTRgrid. Ex: 'r96x72'
Advanced: if the output type is 'lon', 'lat' or 'lonlat' and no common grid is specified, the grid of the first experimental or observational dataset is detected and all data is then interpolated onto this grid. If the first experimental or observational dataset's data is found shifted along the longitudes (i.e., there's no value at longitude 0 but at a longitude close to it), the data is re-interpolated to suppress the shift. This has to be done to make sure all the data from all the datasets is properly aligned along longitudes, as there's no option so far in Load() to specify grids starting at longitudes other than 0. This issue doesn't arise when loading in 'areave' mode without a common grid; the data is not re-interpolated in that case.

maskmod, maskobs
Each mask can be defined in 2 formats: a) a matrix with dimensions c(longitudes, latitudes); b) a list with the components 'path' and, optionally, 'nc_var_name'. In format a), the matrix must have the same size as the common grid, or the same size as the grid of the corresponding experimental dataset if the 'areave' output type is specified and no common 'grid' is specified. In format b), the component 'path' must be a character string with the path to a NetCDF mask file, also in the common grid or in the grid of the corresponding dataset if the 'areave' output type is specified and no common 'grid' is specified. If the mask file contains only a single variable, there's no need to specify the component 'nc_var_name'. Otherwise, it must be a character string with the name of the variable inside the mask file that contains the mask values. This variable must be defined over only 2 dimensions, each with length greater than or equal to 1. Whichever the mask format, a value of 1 at a point of the mask keeps the original value at that point, whereas a value of 0 disables it (replaces it with an NA value). By default all values are kept (all ones).
The longitudes and latitudes in the matrix must be in the same order as in the common grid, or as in the original grid of the corresponding dataset when loading in 'areave' mode. You can find out the order of the longitudes and latitudes of a file with 'cdo griddes'. Note that in a common CDO grid defined with the patterns 'tgrid' or 'rx', the latitudes and longitudes are ordered, by definition, from -90 to 90 and from 0 to 360, respectively.
If you are loading maps ('lonlat', 'lon' or 'lat' output types), all the data will be interpolated onto the common 'grid'. If you want to specify a mask, you will have to provide it already interpolated onto the common grid (you may use the 'cdo' tools for this purpose). It is not usual to apply different masks on experimental datasets on the same grid, so all the experiment masks are expected to be the same. Warning: when loading maps, any masks defined for the observational data will be ignored to make sure the same mask is applied to the experimental and observational data. Warning: list() is compulsory even if loading 1 experimental dataset only! Ex: list(array(1, dim = c(num_lons, num_lats)))

configfile
If not specified, the configuration file used at BSC-ES will be used (it is included in the package). Check the BSC's configuration file or a template configuration file in the folder 'inst/config' in the package. Check further information on the configuration file mechanism in ConfigFileOpen().

varmin
Minimum value; loaded values below it are deactivated (replaced by NA). By default no deactivation is performed.

varmax
Maximum value; loaded values beyond it are deactivated (replaced by NA). By default no deactivation is performed.

silent
Takes by default the value FALSE. Warnings will be displayed even if 'silent' is set to TRUE.

nprocs
Number of parallel processes to spawn to load the data. These processes will use shared memory in the processor in which Load() is launched. By default the number of logical cores in the machine will be detected, and as many processes as logical cores will be created. A value of 1 won't create parallel processes. When running in multiple processes, if an error occurs in any of them, a crash message appears in the R session of the original process, but no detail is given about the error; a value of 1 will display all error messages in the original and only R session. Note: the parallel processes create other blocking processes each time they need to compute an interpolation via 'cdo'.

dimnames
Named list where the value associated to each name is the actual dimension name in the NetCDF file. The variables in the file that contain the longitudes and latitudes of the data (if the data is a 2-dimensional variable) must have the same name as the longitude and latitude dimensions. By default, these names are 'longitude', 'latitude' and 'ensemble'. If any of these is defined in the 'dimnames' parameter, it takes priority and overwrites the default value. Ex: list(lon = 'x', lat = 'y'). In that example, the dimension 'member' will take the default value 'ensemble'.

remapcells
The result of an interpolation over a spatial subset can vary if the values surrounding the subset are not present. To better control this process, the width, in number of grid cells, of the surrounding area to be taken into account can be specified with 'remapcells'. A value of 0 will take into account no additional cells, but will generate less traffic between the storage and the R processes that load data. A value beyond the limits in the data files will be automatically truncated to the actual limit. The default value is 2.
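Building on the mask description above, a matrix-format mask could be constructed like this (the 20 x 10 grid size is a made-up assumption for illustration; the matrix must match the actual grid of your data):

```r
# Hypothetical mask in matrix format for a 20 x 10 (lon x lat) grid:
# a value of 1 keeps the original data value, 0 replaces it with NA.
mask <- array(1, dim = c(20, 10))
mask[, 1] <- 0                # disable the first latitude row
maskmod <- list(mask)         # list() is compulsory even for one experiment
```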


  • observations:
    • file per ensemble per month (YYYY and MM somewhere in the path)
    • file per member per month (YYYY, MM and MemberNumber somewhere in the path; obs with different numbers of members supported)
    • file per dataset (no constraints in the path, but the time axes in the file have to be properly defined)
  • 'Datasets' has the following components:
    • 'exp', a named list where the names are the identifying character strings of each experiment in 'exp', each associated to a list with the following components:
      • 'members', a list with the names of the members of the dataset.
      • 'source', a path or URL to the source of the dataset.
  • 'obs', similar to 'exp' but for observational datasets.
  • 'Dates', with the following components:
    • 'start', an array of dimensions (sdate, time) with the POSIX initial date of each forecast time of each starting date.
    • 'end', an array of dimensions (sdate, time) with the POSIX final date of each forecast time of each starting date.
  • 'InitializationDates', a vector of starting dates as specified in 'sdates', in POSIX format.
  • 'when', a time stamp of the date the Load() call to obtain the data was issued.
  • 'source_files', a vector of character strings with complete paths to all the found files involved in the Load() call.
  • 'not_found_files', a vector of character strings with complete paths to not found files involved in the Load() call.
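Putting the components above together, a returned object could be inspected as follows ('data' is assumed to hold the result of a successful Load() call):

```r
# Sketch of inspecting the structure returned by Load(); 'data' is a
# hypothetical result of a successful Load() call.
attr(data$mod, 'dimensions')   # labels of the dimensions of the array
data$Variable$varName          # short name of the loaded variable
data$Datasets$exp              # per-experiment members and source
data$Dates$start               # POSIX initial dates, dims (sdate, time)
data$not_found_files           # paths of any files that were not found
```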

Details

The two output matrices have between 2 and 6 dimensions:
  1. Number of experimental/observational datasets.
  2. Number of members.
  3. Number of start dates.
  4. Number of lead-times.
  5. Number of latitudes (optional).
  6. Number of longitudes (optional).

Examples

# Let's assume we want to perform verification with data of a variable
# called 'tos' from a model called 'model' and observed data coming from 
# an observational dataset called 'observation'.
#
# The model was run in the context of an experiment named 'experiment'. 
# It simulated from 1st November in 1985, 1990, 1995, 2000 and 2005 for a 
# period of 5 years time from each starting date. 5 different sets of 
# initial conditions were used so an ensemble of 5 members was generated 
# for each starting date.
# The model generated values for the variables 'tos' and 'tas' in a 
# 3-hourly frequency but, after some initial post-processing, it was 
# averaged over every month.
# The resulting monthly average series were stored in a file for each 
# starting date for each variable with the data of the 5 ensemble members.
# The resulting directory tree was the following:
#   model
#    |--> experiment
#          |--> monthly_mean
#                |--> tos_3hourly
#                |     |--> tos_19851101.nc
#                |     |--> tos_19901101.nc
#                |               .
#                |               .
#                |     |--> tos_20051101.nc 
#                |--> tas_3hourly
#                      |--> tas_19851101.nc
#                      |--> tas_19901101.nc
#                                .
#                                .
#                      |--> tas_20051101.nc
# 
# The observation recorded values of 'tos' and 'tas' at each day of the 
# month over that period but was also averaged over months and stored in 
# a file per month. The directory tree was the following:
#   observation
#    |--> monthly_mean
#          |--> tos
#          |     |--> tos_198511.nc
#          |     |--> tos_198512.nc
#          |     |--> tos_198601.nc
#          |               .
#          |               .
#          |     |--> tos_201010.nc
#          |--> tas
#                |--> tas_198511.nc
#                |--> tas_198512.nc
#                |--> tas_198601.nc
#                          .
#                          .
#                |--> tas_201010.nc
#
# The model data is stored in a file-per-startdate fashion and the
# observational data is stored in a file-per-month, and both are stored in 
# a monthly frequency. The file format is NetCDF.
# Hence all the data is supported by Load() (see details and other supported 
# conventions in ?Load) but first we need to configure it properly.
#
# These data files are included in the package (in the 'sample_data' folder),
# only for the variable 'tos'. They have been interpolated to a very low 
# resolution grid so as to keep the package small enough for CRAN.
# The original grid names (following CDO conventions) for experimental and 
# observational data were 't106grid' and 'r180x89' respectively. The final
# resolutions are 'r20x10' and 'r16x8' respectively. 
# The experimental data comes from the decadal climate prediction experiment 
# run at IC3 in the context of the CMIP5 project. Its name within IC3 local 
# database is 'i00k'. 
# The observational dataset used for verification is the 'ERSST' 
# observational dataset.
#
# The configuration file 'sample.conf' that we will create in the example 
# has the proper entries to load these (see ?ConfigFileOpen for details on 
# writing a configuration file). 
#
# The code is not run because it dispatches system calls to 'cdo' and 'nco'
# which is not allowed as per CRAN policies. You can run it in your system 
# though. Instead, the code in 'dontshow' is run, which loads the equivalent
# data already processed in R.
configfile <- paste0(tempdir(), '/sample.conf')
ConfigFileCreate(configfile, confirm = FALSE)
c <- ConfigFileOpen(configfile)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MIN', '-1e19', confirm = FALSE)
c <- ConfigEditDefinition(c, 'DEFAULT_VAR_MAX', '1e19', confirm = FALSE)
data_path <- system.file('sample_data', package = 's2dverification')
exp_data_path <- paste0(data_path, '/model/$EXP_NAME$/')
obs_data_path <- paste0(data_path, '/$OBS_NAME$/')
c <- ConfigAddEntry(c, 'experiments', dataset_name = 'experiment', 
     var_name = 'tos', main_path = exp_data_path,
     file_path = '$STORE_FREQ$_mean/$VAR_NAME$_3hourly/$VAR_NAME$_$START_DATE$.nc')
c <- ConfigAddEntry(c, 'observations', dataset_name = 'observation', 
     var_name = 'tos', main_path = obs_data_path,
     file_path = '$STORE_FREQ$_mean/$VAR_NAME$/$VAR_NAME$_$YEAR$$MONTH$.nc')
ConfigFileSave(c, configfile, confirm = FALSE)

# Now we are ready to use Load().
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- Load('tos', c('experiment'), c('observation'), startDates, 
                   output = 'areave', latmin = 27, latmax = 48, 
                   lonmin = -12, lonmax = 40, configfile = configfile)
startDates <- c('19851101', '19901101', '19951101', '20001101', '20051101')
sampleData <- s2dverification:::.LoadSampleData('tos', c('experiment'), 
                                                c('observation'), startDates,
                                                output = 'areave', 
                                                latmin = 27, latmax = 48, 
                                                lonmin = -12, lonmax = 40)
